PrevNext
Rare
 0/12

String Hashing

Authors: Benjamin Qi, Andrew Wang, Andi Qu

Quickly test equality of substrings with a small probability of failure.

Tutorial

Resources
CPHgood intro
cp-algocode
PAPSmany applications

Optional

If "small" isn't a satisfying-enough answer for "what's the probability of collision?", then you should check out rng-58's blog post talking about hashing. This blog post talks about the Schwarz-Zippel lemma and how that can be used to calculate the probability of a collision.

It also explains how to hash rooted trees - an uncommon technique, but still useful to know!

Implementation - Single Base

As mentioned in the articles above, there is no need to calculate modular inverses.

C++

const ll M = 1e9 + 9, P = 9973; // Change M and P if you want to
ll pw[100001]; // Stores the powers of P modulo M
int n;
string s;
ll hsh[100001];
void calc_hashes() {
hsh[0] = 1;

Java

public static final long M = (long)1e9 + 9, P = 9973; // Change M and P if you want to
public static long pw[] = new long[100001]; // Stores the powers of P modulo M
public static int n;
public static String s;
public static long hsh[] = new long[100001];
public static void calc_hashes() {
hsh[0] = 1;

This implementation calculates

hsh[i]=(x=0iPixS[x])modM\texttt{hsh}[i] = \left(\sum_{x = 0}^i P^{i - x} \cdot S[x]\right) \bmod M

The hash of any particular substring S[a:b]S[a : b] is then calculated as

(x=abPbxS[x])modM=(hsh[b]hsh[a]Pba+1)modM\left(\sum_{x = a}^b P^{b - x} \cdot S[x] \right) \bmod M = (\texttt{hsh}[b] - \texttt{hsh}[a] \cdot P^{b - a + 1}) \bmod M

using prefix sums. This is nice because the highest power of PP in that polynomial will always be PbaP^{b - a}.

Since 109+910^9 + 9 is prime, the probability of collision when using this hash is at most N109+9<104\frac{N}{10^9 + 9} < 10^{-4}, by the Schwarz-Zippel lemma. This means that if you select any two different strings of length at most NN and a random base modulo 109+910^9 + 9 (e.g. 99739973 in the code), the probability that they hash to the same value is at most 10410^{-4}.

Implementation - Multiple Bases

Resources
CF

regarding CF educational rounds in particular

Benq

It's generally a good idea to use two randomized bases rather than just one to decrease the probability that two random strings hash to the same value.

Focus Problem – read through this problem before continuing!

Solution - Searching For Strings

One Hash

Time Complexity: O((N+H)Σ)\mathcal O((|N| + |H|) \cdot \Sigma), where Σ\Sigma is the size of the alphabet.

We'll use a sliding window over HH to find the "matches" with NN.

Since we don't care about relative order when comparing two substrings, we can store frequency tables of the characters in the current window and in NN. When we slide the window, at most two values in that table change. To compare two substrings, we simply compare the 26 values in each table.

If we only needed to count the number of matches, then the above alone would suffice (in fact, IOI 2006 Writing is just that). However, we need to count the distinct permutations of NN in HH, so we need to be a bit more clever.

One way to solve this is by storing the polynomial hashes of each match in a hashset, since we expect different permutations to have different polynomial hashes. The answer would simply be the size of that hashset at the end.

Since the test data for this particular problem is very strong, we will probably get hash collisions with only one hash. To remedy this, we use two hashes for each match - this significantly decreases the probability of collisions.

Using the base 99739973 with the two modulos 109+910^9 + 9 and 109+710^9 + 7 works for this problem. (Note that using two bases with the same modulo works too.)

C++

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;
const ll P = 9973, M1 = 1e9 + 9, M2 = 1e9 + 7;
int freq_target[26], freq_curr[26];
string n, h;
int main() {

Two Hashes

Time Complexity: O((N+H)logM)\mathcal O((|N| + |H|) \log M)

An alternative solution without frequency tables would be to hash the substrings that we're trying to match. Since order doesn't matter, we need to modify our hash function slightly.

In particular, instead of computing the polynomial hash of the substrings, compute the product (P+s1)(P+s2)(P+sk)modM(P + s_1)(P + s_2) \cdots (P + s_k) \bmod M as the hash (again, using two modulos). This hash is nice because the relative order of the letters doesn't matter, as multiplication is commutative.

Since this hash requires the modular inverse, there's an extra logM\log M factor in the time complexity.

Alternative hashes (e.g. computing the sum (P+s1)2+(P+s2)2++(P+sk)2modM(P + s_1)^2 + (P + s_2)^2 + \dots + (P + s_k)^2 \bmod M) also work for other hashing problems, but the test cases are too strong for that to pass here.

C++

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;
const ll P = 9973, M1 = 1e9 + 9, M2 = 1e9 + 7;
ll inv(ll base, ll MOD) {
ll ans = 1, expo = MOD - 2;
while (expo) {
if (expo & 1) ans = ans * base % MOD;

Problems

StatusSourceProblem NameDifficultyTags
CEOIEasy
Show TagsGreedy, Hashing
CFEasy
Show TagsDP, Hashing
GoldNormal
Show TagsHashing
GoldNormal
Show TagsHashing, Simulation
RMINormal
Show TagsHashing
COCINormal
Show TagsHashing, Probability
COCIHard
Show TagsBinary Search, Hashing
CFHard
Show TagsDP, Hashing
Baltic OIHard
Show TagsHashing
COCIVery Hard
Show TagsDSU, Hashing
COIVery Hard
Show TagsBinary Search, Hashing

Module Progress:

Join the USACO Forum!

Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!

PrevNext