# String Searching

Authors: Benjamin Qi, Siyong Huang, Dustin Miao, Mihnea Brebenel

Knuth-Morris-Pratt and Z Algorithms (and a few more related topics).

### Prerequisites

Resources | ||||
---|---|---|---|---|

CPC | String Matching, KMP, Tries | |||

CP2 |

# Single String

### A Note on Notation:

For a string $S$:

- $|S|$ denotes the size of string $S$
- $S[i]$ denotes the character at index $i$ starting from $0$
- $S[l:r]$ denotes the substring beginning at index $l$ and ending at index $r$
- $S[:r]$ is equivalent to $S[0:r]$, represents the prefix ending at $r$
- $S[l:]$ is equivalent to $S[l:|S| - 1]$, represents the suffix beginning of $l$.
- $S + T$ denotes concactinating $T$ to the end of $S$. Note that this implies that addition is
**non-commutative**.

## Knuth-Morris-Pratt Algorithm

Resources | ||||
---|---|---|---|---|

cp-algo | ||||

PAPS | ||||

GFG | ||||

TC |

Define an array $\pi_S$ of size $|S|$ such that $\pi_S[i]$ is equal to the length of the longest nontrivial suffix of the prefix ending at position $i$ that coincides with a prefix of the entire string. Formally,

In other words, for a given index $i$, we would like to compute the length of the longest substring that ends at $i$, such that this string also happens to be a prefix of the *entire* string. One such string that satisfies this criteria is the prefix ending at $i$; we will be disregarding this solution for obvious reasons.

For instance, for $S = \text{``abcabcd"}$, $\pi_S = [0, 0, 0, 1, 2, 3, 0]$, and the prefix function of $S = \text{``aabaaab"}$ is $\pi_S = [0, 1, 0, 1, 2, 2, 3]$. In the second example, $\pi_S[4] = 2$ because the prefix of length $2$ ($\text{``ab"})$ is equivalent to the substring of length $2$ that ends at index $4$. In the same way, $\pi_S[6] = 3$ because the prefix of length $3$ ($\text{``abb"}$) is equal to the substring of length $3$ that ends at index $6$. For both of these samples, there is no longer substring that satisfies these criteria.

The purpose of the KMP algorithm is to efficiently compute the $\pi_S$ array in linear time. Suppose we have already computed the $\pi_S$ array for indices $0\dots i$, and need to compute the value for index $i + 1$.

Firstly, note that between $\pi_S[i]$ and $\pi_S[i + 1]$, $\pi_S[i + 1]$ can be at most one greater. This occurs when $S[\pi_S[i]] = S[i + 1]$.

In the example above, $\pi_S[i] = 5$, meaning that the suffix of length $5$ is equivalent to a prefix of length $5$ of the entire string. It follows that if the character at position $5$ of the string is equal to the character at position $i + 1$, then the match is simply extended by a single character. Thus, $\pi_S[i + 1] = \pi_S[i] + 1 = 6$.

In the general case, however, this is not necessarily true. That is to say, $S[\pi_S[i]] \neq S[i + 1]$. Thus, we need to find the largest index $j < \pi_S[i]$ such that the prefix property holds (ie $S[:j - 1] \equiv S[i - j + 1:i]$). For such a length $j$, we repeat the procedure in the first example by comparing characters at indicies $j$ and $i + 1$: if the two are equal, then we can conclude our search and assign $\pi_S[i + 1] = j + 1$, and otherwise, we find the next smallest $j$ and repeat. Indeed, notice that the first example is simply the case where $j$ begins as $\pi_S[i]$.

In the second example above, we let $j = 2$.

The only thing that remains is to be able to efficiently find all the $j$ that we might possibly need. To recap, if the position we're currently at is $j$, to handle transitions we need to find the largest index $k$ that satisfies the prefix property $S[:k - 1] \equiv S[j - k + 1 : j]$. Since $j < i$, this value is simply $\pi_S[j - 1]$, a value that has already been computed. All that remains is to handle the case where $j = 0$. If $S[0] = S[i + 1]$, $\pi_S[i + 1] = 1$, otherwise $\pi_S[i + 1] = 0$.

C++

vector<int> pi(const string &s) {int n = (int)s.size();vector<int> pi_s(n);for (int i = 1, j = 0; i < n; i++) {while (j > 0 && s[j] != s[i]) { j = pi_s[j - 1]; }if (s[i] == s[j]) { j++; }pi_s[i] = j;}return pi_s;}

Python

from typing import Listdef pi(s: str) -> List[int]:n = len(s)pi_s = [0] * nj = 0for i in range(1, n):while j > 0 and s[j] != s[i]:j = pi_s[j - 1]if s[i] == s[j]:j += 1pi_s[i] = jreturn pi_s

**Claim:** The KMP algorithm runs in $\mathcal{O}(n)$ for computing the $\pi_S$ array on a string $S$ of length $n$.

**Proof:** Note that $j$ doesn't actually change through multiple iterations. This is because on iteration $i$, we assign $j = \pi_S[i - 1]$. However, in the previous iteration, we assign $\pi_S[i - 1]$ to be $j$. Furthermore, note that $j$ is always non-negative. In each iteration of $i$, $j$ is only increased by at most $1$ in the if statement. Since $j$ remains non-negative and is only increased a constant amount per iteration, it follows that $j$ can only decrease by at most $n$ times through all iterations of $i$. Since the inner loop is completely governed by $j$, the overall complexity amortizes to $\mathcal{O}(n)$. $\blacksquare$

### Problems

Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|

CSES | Very Easy | ## Show TagsKMP, Z | |||

POI | Easy | ## Show TagsKMP, Strings | |||

Baltic OI | Normal | ## Show TagsKMP, Strings | |||

Old Gold | Hard | ## Show TagsKMP, Strings | |||

POI | Hard | ## Show TagsKMP, Strings | |||

CEOI | Hard | ## Show TagsKMP | |||

POI | Very Hard | ## Show TagsKMP | |||

POI | Very Hard | ## Show TagsKMP |

## Z Algorithm

The **Z-Algorithm** is another linear time string comparison algorithm like KMP,
but instead finds the longest common prefix of a string and all of its suffixes.

Resources | ||||
---|---|---|---|---|

cp-algo | ||||

CPH | ||||

CF |

Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|

YS | Very Easy | ## Show TagsZ | |||

CSES | Very Easy | ## Show TagsKMP, Z | |||

CF | Normal | ## Show TagsDP, Strings | |||

CF | Normal | ## Show TagsZ | |||

CF | Hard |

# Palindromes

## Manacher

Focus Problem – try your best to solve this problem before continuing!

**Manacher's Algorithm** functions similarly to the **Z-Algorithm**. It
determines the longest palindrome centered at each character.

Let's denote $dp_i$ as the maximum diameter of a palindrome centered at $i$. Manacher's algorithm makes use of the previously determined $dp_j$, where $j < i$ incalculating $dp_i$. The main idea is that for a palindrome centered at $i$ with the borders $left$ and $right$ the $dp_j$ ($i < j \le right$) values are - probably - mirrors of the $dp_k$ ($left \le k < i$) values on the left side of the palindrome. Probably because for some $j$ the maximum palindrome might cross the right border. This way the algorithm only considers the palindrome centers that could lead to the expansion of the maximum palindrome.

**Time complexity:** $\mathcal{O}(N)$

C++

#include <bits/stdc++.h>using namespace std;string menacher(string s) {// Preprocess the input so it can handle even length palindromesstring arr;for (int i = 0; i < s.size(); i++) {arr.push_back('#');arr.push_back(s[i]);}

Resources | ||||
---|---|---|---|---|

HR | ||||

Medium | ||||

cp-algo |

### Don't Forget!

If s[l, r] is a palindrome, then s[l+1, r-1] is as well.

Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|

CF | Normal | ## Show TagsStrings | |||

CF | Normal | ## Show TagsStrings | |||

CF | Hard | ## Show TagsPrefix Sums, Strings |

## Palindromic Tree

A **Palindromic Tree** is a tree-like data structure that behaves similarly to
KMP. Unlike KMP, in which the only empty state is $0$, the Palindromic Tree has
two empty states: length $0$, and length $-1$. This is because appending a
character to a palindrome increases the length by $2$, meaning a single
character palindrome must have been created from a palindrome of length $-1$

Resources | ||||
---|---|---|---|---|

CF | ||||

adilet.org |

Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|

APIO | Easy | ||||

CF | Hard | ## Show TagsPrefix Sums, Strings | |||

MMCC | Very Hard |

# Multiple Strings

## Tries

Focus Problem – try your best to solve this problem before continuing!

A trie is a tree-like data structure that stores strings. Each node is a string, and each edge is a character.

The root is the empty string, and every node is represented by the characters along the path from the root to that node. This means that every prefix of a string is an ancestor of that string's node.

C++

#include <bits/stdc++.h>using namespace std;const int NMAX = 5e3;const int WMAX = 1e6;const int MOD = 1e9 + 7;int trie[WMAX][26];int node_count;bool stop[WMAX];

Resources | ||||
---|---|---|---|---|

CPH | ||||

CF | ||||

PAPS |

Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|

COCI | Very Easy | ## Show TagsDFS, Strings, Trie | |||

IOI | Very Easy | ## Show TagsDFS, Strings, Trie | |||

CF | Normal | ## Show TagsMST, Trie | |||

YS | Easy | ## Show TagsGreedy, Trie | |||

Gold | Normal | ## Show TagsStrings, Trie | |||

CF | Normal | ## Show TagsStrings, Trie | |||

COCI | Normal | ## Show TagsTrie | |||

AC | Normal | ||||

CF | Normal | ## Show TagsSmall to Large, Tree, Trie | |||

CF | Normal | ## Show TagsBitmasks, Tree, Trie | |||

IZhO | Hard | ## Show TagsGreedy, Trie | |||

JOI | Hard | ## Show TagsBIT, Trie | |||

CF | Hard | ## Show TagsTree, Trie |

## Aho-Corasick

**Aho-Corasick** is the combination of **trie** and **KMP**. It is essentially a
trie with KMP's "fail" array.

### Warning!

Build the entire trie first, and then run a *BFS* to construct the fail array.

Resources | ||||
---|---|---|---|---|

cp-algo | ||||

CF | ||||

GFG |

Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|

CF | Easy | ## Show TagsStrings | |||

Gold | Normal | ## Show TagsStrings | |||

CF | Normal | ## Show TagsStrings |

### This section is not complete.

1731 Word Combinations -> trie

1732 Finding Borders -> string search

1733 Finding Periods -> string search

1110 Minimal Rotation -> string search

1111 Longest Palindrome -> string search

1112 Required Substring -> string search

### Module Progress:

### Join the USACO Forum!

Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!