Suffix Array

Construction Initial Transformation Sorting Optimization Implementation LCP Array Construction Application Implementation Burrows-Wheeler Transform Run Enumerate Problems


CF	Suffix Array	Videos & Practice Problems
cp-algo	Suffix Array - Definition & Construction
CPC	11 - Strings (Suffix Array)
CP2	6.6 - Suffix Array and Applications

The suffix array of a string is the sorted array of all possible suffixes of the string. Each suffix is usually represented by its starting index.

For example, if our string were abcbcba, the suffix array would be as follows:

Suffix	String
6	`a`
0	`abcbcba`
5	`ba`
3	`bcba`
1	`bcbcba`
4	`cba`
2	`cbcba`

Construction

Suffix Array

YS - Easy

Focus Problem – try your best to solve this problem before continuing!

The general philosophy behind the construction of a suffix array is to sort incrementally. We first start by comparing the first character of each suffix, and then double the amount we compare until we're comparing the entire length of the string. If this sounds really abstract, don't worry! All the implementation details will be hammered out below.

Initial Transformation

For convenience, let's start by appending a "null character" of sorts to the end of the string. This acts as a tiebreaker and will ensure we never hit the end of a suffix while comparing two suffixes. $ or the space character are possible options, as the ASCII code of either are less than a.

To avoid out of bounds issues, we'll also append the string to itself. Notice that since any comparisons would have stopped at the null character, this has no effect on the order of the suffix array.

After implementing these two transformations, the strings we compare would look like this (still using abcbcba as an example):

Suffix	String
7	`$abcbcba`
6	`a$abcbcba`
0	`abcbcba$abcbcba`
5	`ba$abcbcba`
3	`bcba$abcbcba`
1	`bcbcba$abcbcba`
4	`cba$abcbcba`
2	`cbcba$abcbcba`

The last suffix that starts with $ will always come first; it doesn't particularly matter.

Sorting

As stated earlier, let's first just consider the first character in each substring and sort it:

Suffix	String
7	`$abcbcba`
0	`abcbcba$abcbcba`
6	`a$abcbcba`
1	`bcbcba$abcbcba`
3	`bcba$abcbcba`
5	`ba$abcbcba`
2	`cbcba$abcbcba`
4	`cba$abcbcba`

From here we start doubling the length of string we compare. To get to the point where we efficiently compare suffixes, let's first do some analysis.

Say we just sorted all the suffixes by the first $2$ characters, and we're now comparing the first four characters of suffixes $2$ and $4$ :

Suffix	String
2	`cbcb`
4	`cba$`

If the first $2$ characters of suffix $2$ were already less than or greater than those of suffix $4$ , nothing changes. However, in this case they aren't; we're going to have to compare the right half.

How might we do this? Well, the cool thing is that since we just sorted all suffixes by the first $2$ , we actually have information about the right halves already! The right half of the first four characters of suffix $2$ is just the first two characters of suffix $4$ . Similarly, the right half of suffix $4$ is the first two characters of suffix $6$ .

To compare these half-suffixes, after each iteration of sorting we can assign each half-baked suffix a number representing its position in the array. If two half-baked suffixes are equal, then they would get the same number. These numbers are usually called equivalence classes.

Applying this augmentation to our array sorted by the first character, we would now have:

Eq. Class	Suffix	String
0	7	`$abcbcba`
1	0	`abcbcba$abcbcba`
1	6	`a$abcbcba`
2	1	`bcbcba$abcbcba`
2	3	`bcba$abcbcba`
2	5	`ba$abcbcba`
3	2	`cbcba$abcbcba`
3	4	`cba$abcbcba`

Optimization

Since we're doing $\mathcal{O}(N \log N)$ sorts $\log N$ times, our current algorithm as it stands is $\mathcal{O}(N \log^2 N)$ .

Although this is adequate, it can sometimes be a bit too slow especially when creating a suffix array is often just the first step in many problems. Fortunately, we can fix this! Since everything we're comparing is in the range $[0, N)$ , we can use radix sort to remove a log factor from our complexity.

Implementation

Time Complexity: $\mathcal{O}(N \log N)$

C++

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using std::cout;
using std::endl;
using std::pair;
using std::vector;

A more compact implementation is also below.

Resources
		Benq	Suffix Array w/ LCP	O(N log N)

Java

Warning!

The below implementation will TLE on some test cases due to Java's constant factor.

import java.io.*;
import java.util.*;
import java.util.function.Function;

public class SuffixArray {
	static class Suffix implements Comparable<Suffix> {
		public int start;
		public int leftEq;
		public int rightEq;

It's recommended that you also test your implementation against a brute force solution for many small strings.

Here's a small script that outputs a test case and its answer:

C++

#include <algorithm>
#include <iostream>
#include <vector>

using std::cout;
using std::endl;
using std::vector;

int main() {
	int n = 50;  // adjust n as you please

Java

import java.io.*;
import java.util.*;

public class Generator {
	public static void main(String[] args) throws IOException {
		Random rng = new Random();

		int n = 50;  // adjust n as you please
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < n; i++) { sb.append((char)('a' + rng.nextInt(26))); }

Python

from random import choice
from string import ascii_lowercase

n = 50  # adjust n as you please
str_ = [choice(ascii_lowercase) for _ in range(n)]

print("".join(str_))

suffs = list(range(n))
suffs.sort(key=lambda s: str_[s:])

print(" ".join(str(i) for i in suffs))

LCP Array

The LCP array is a common auxiliary array based on the suffix array that stores the longest common prefix (LCP) between adjacent elements of the suffix array.

Taking our example string from above, such an array would look like this:

LCP	Suffix	String
NA	7	`$abcbcba`
0	0	`abcbcba$abcbcba`
1	6	`a$abcbcba`
0	1	`bcbcba$abcbcba`
3	3	`bcba$abcbcba`
1	5	`ba$abcbcba`
0	2	`cbcba$abcbcba`
2	4	`cba$abcbcba`

Each row contains the LCP between the row's string and that of the previous row.

Construction

An algorithm often used to construct this LCP array starts by calculating the LCP for the suffix starting at $0$ and its previous suffix. It then calculates the LCP for the suffix starting at $1$ and its previous suffix, and so on and so forth. The reason for this calculation order is so that we can apply a clever optimization that is best described through our example.

Say we just finished calculating the LCP for the row of suffix $3$ . Our next step, then, would be to calculate the LCP for suffix $4$ .

One might think that we would have to start comparing the prefixes all over again for this suffix, but we actually don't have to! Remember that we just calculated the LCP of suffixes $3$ and $1$ to be $3$ characters, which means that the suffixes $4$ and $2$ , and all the suffixes between them in the array (despite there not being any in our example) must have at least $2$ characters in common.

What this means is that we can just start comparing from the third character to compute the LCP for the row of suffix $4$ instead of starting all over.

In general, we keep a variable $\texttt{start\_at}$ which keeps the number of common characters we know the current suffix has with its predecessor in the suffix array. This variable tells us how many characters we can skip, as we know they're equal. After each computation of an LCP, we set this variable equal to its value minus one, except when the LCP is already $0$ .

You'll see an implementation of this algorithm in the next section.

Application

Now that we've constructed this LCP array, let's see an application of it.

Number of Substrings

YS - Easy

Focus Problem – try your best to solve this problem before continuing!

A crucial observation to be made for this problem is that every substring can be represented as a prefix of some suffix. We know that suffix $i$ has $N-i$ (recall that $N$ is the length of the string) possible prefixes.

Notice that the number of prefixes of suffix $i$ that are duplicates is just the longest common prefix between $i$ and its predecessor. With this, we can iterate through the LCP array and calculate our answer in one fell swoop.

Implementation

Time Complexity: $\mathcal{O}(N \log N)$

C++

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using std::cout;
using std::endl;
using std::pair;
using std::string;
using std::vector;

Java

Warning!

This solution, like the previous one, will TLE for the same reason stated above.

import java.io.*;
import java.util.*;
import java.util.function.Function;

public class SuffixArray {
 	Code Snippet: Suffix Class (Click to expand)

	public static void main(String[] args) throws IOException {
		BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
		String str = read.readLine() + '$';