PrevNext
Rare
 0/15

String Hashing

Authors: Benjamin Qi, Andi Qu

Contributors: Andrew Wang, Kevin Sheng

Quickly testing equality of substrings with a small probability of failure.

Edit This Page

Tutorial

Implementation

As mentioned in the articles above, there is no need to calculate modular inverses.

C++

class HashedString {
private:
// change M and B if you want
static const long long M = 1e9 + 9;
static const long long B = 9973;
// pow[i] contains B^i % M
static vector<long long> pow;
// p_hash[i] is the hash of the first i characters of the given string

Java

import java.util.*;
public class HashedString {
// Change M and B if you want
public static final long M = (long)1e9 + 9;
public static final long B = 9973;
// pow[i] contains B^i % M
private static ArrayList<Long> pow = new ArrayList<>();

Python

class HashedString:
# Change M and B if you want
M = int(1e9) + 9
B = 9973
# pow[i] contains B^i % M
_pow = [1]
def __init__(self, s: str):
while len(self._pow) < len(s):

This implementation calculates

hsh[i+1]=(x=0iBixS[x])modM\texttt{hsh}[i + 1] = \left(\sum_{x = 0}^i B^{i - x} \cdot S[x]\right) \bmod M

The hash of any particular substring S[a:b]S[a : b] is then calculated as

(x=abBbxS[x])modM=(hsh[b+1]hsh[a]Bba+1)modM\left(\sum_{x = a}^b B^{b - x} \cdot S[x] \right) \bmod M = (\texttt{hsh}[b + 1] - \texttt{hsh}[a] \cdot B^{b - a + 1}) \bmod M

using prefix sums. This is nice because the highest power of BB in that polynomial will always be BbaB^{b - a}.

Collision Probability

In general, when using polynomial hashing modulo a prime modulus MM, the probability of two distinct strings having equal hashing over all possible choices of the base BB can be up to nM\frac{n}{M}, where nn is the length of the longer of the two strings. See rng-58's blog post about hashing (linked above) for how to derive this probability using the Schwarz-Zippel lemma.

Since 109+910^9 + 9 is prime, the probability of collision when using this hash is at most N109+9<104\frac{N}{10^9 + 9} < 10^{-4}, by the Schwarz-Zippel lemma. This means that if you select any two different strings of length at most N=105N=10^5 and a random base modulo 109+910^9 + 9 (e.g. 99739973 in the code), the probability that they hash to the same value is at most 10410^{-4}.

Warning!

Given a set of the hashes of mm distinct strings with length up to nn, the probability of two strings having equal hashes can be up to m2nM\frac{m^2n}{M} by the birthday paradox. Assuming mm and nn are on the order of 10510^5, M=109+7M=10^9+7 is nowhere close to large enough to avoid collisions. Use a larger prime modulus such as 26112^{61}-1 (and do multiplications using 128-bit integers).

Warning!

In contests with open hacking (in particular, Codeforces educational rounds), make sure to choose the base of your polynomial hash randomly, as mentioned here.

C++

In C++, a virtually unhackable way of generating BB in the implementation above is to use a random number generator seeded with a high-precision clock, as described here.

mt19937 rng((uint32_t)chrono::steady_clock::now().time_since_epoch().count());
const ll B = uniform_int_distribution<ll>(0, M - 1)(rng);

Example Problem - Searching For Strings

Focus Problem – try your best to solve this problem before continuing!

Solution - One Hash

Time Complexity: O((N+H)Σ)\mathcal O((|N| + |H|) \cdot \Sigma), where Σ\Sigma is the size of the alphabet.

Failure Probability: O(NH2M)\mathcal O\left(\frac{|N||H|^2}{M}\right)

We'll use a sliding window over HH to find the "matches" with NN.

Since we don't care about relative order when comparing two substrings, we can store frequency tables of the characters in the current window and in NN. When we slide the window, at most two values in that table change. To compare two substrings, we simply compare the 26 values in each table.

If we only needed to count the number of matches, then the above alone would suffice (in fact, IOI 2006 Writing is just that). However, we need to count the distinct permutations of NN in HH, so we need to be a bit more clever.

One way to solve this is by storing the polynomial hashes of each match in a set, since we expect different permutations to have different polynomial hashes. The answer would simply be the size of that set at the end.

Using a relatively small modulus such as M=109+9M=10^9+9 will not pass (see the note above regarding the birthday paradox). Instead, we use M=2611M=2^{61}-1.

C++

#include <bits/stdc++.h>
using namespace std;
using ll = long long;
Code Snippet: HashedString (Click to expand)
int freq_target[26], freq_curr[26];
string n, h;

Solution - Two Hashes

Time Complexity: O((N+H)logM)\mathcal O((|N| + |H|) \log M)

Failure Probability: O(NH2M)\mathcal O\left(\frac{|N||H|^2}{M}\right)

An alternative solution without frequency tables would be to hash the substrings that we're trying to match. Since order doesn't matter, we need to modify our hash function slightly.

In particular, instead of computing the polynomial hash of the substrings, compute the product (B+s1)(B+s2)(B+sk)modM(B + s_1)(B + s_2) \dots (B + s_k) \bmod M as the hash (again, using two modulos). This hash is nice because the relative order of the letters doesn't matter, as multiplication is commutative. Furthermore, as any two strings with different frequency tables map to different polynomials in BB, they hash to the same value with probability at most NM\frac{|N|}{M} over the choice of BB.

Since this hash requires the modular inverse, there's an extra logM\log M factor in the time complexity.

C++

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;
Code Snippet: HashedString (Click to expand)
const auto M = HashedString::M;
const auto B = HashedString::B;
const auto mul = HashedString::mul;
const auto mod_mul = HashedString::mod_mul;

Problems

StatusSourceProblem NameDifficultyTags
CSESVery Easy
Show TagsHashing
SilverEasy
Show TagsHashing
CEOIEasy
Show TagsGreedy, Hashing
CFEasy
Show TagsHashing
CFEasy
Show TagsHashing
GoldNormal
Show TagsBinary Search, Hashing
GoldNormal
Show TagsHashing, Simulation
RMINormal
Show TagsHashing
COCINormal
Show TagsHashing, Probability
COCIHard
Show TagsBinary Search, Hashing
CFHard
Show TagsDP, Hashing
Baltic OIHard
Show TagsHashing
COCIVery Hard
Show TagsDSU, Hashing
COIVery Hard
Show TagsBinary Search, Hashing

Module Progress:

Join the USACO Forum!

Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!

PrevNext