Rare
0/6

# Suffix Array

Authors: Benjamin Qi, Kevin Sheng

Quickly sorting suffixes of a string and its applications

### Prerequisites

Resources
CF

Videos & Practice Problems

cp-algo
CPC
CP2

The suffix array of a string is the sorted array of all possible suffixes of the string. Each suffix is usually represented by its starting index.

For example, if our string were abcbcba, the suffix array would be as follows:

SuffixString
6a
0abcbcba
5ba
3bcba
1bcbcba
4cba
2cbcba

## Construction

Focus Problem – try your best to solve this problem before continuing!

The general philosophy behind the construction of a suffix array is to sort incrementally. We first start by comparing the first character of each suffix, and then double the amount we compare until we're comparing the entire length of the string. If this sounds really abstract, don't worry! All the implementation details will be hammered out below.

### Initial Transformation

For convenience, let's start by appending a "null character" of sorts to the end of the string. This acts as a tiebreaker and will ensure we never hit the end of a suffix while comparing two suffixes. $ or the space character are possible options, as the ASCII code of either are less than a. To avoid out of bounds issues, we'll also append the string to itself. Notice that since any comparisons would have stopped at the null character, this has no effect on the order of the suffix array. After implementing these two transformations, the strings we compare would look like this (still using abcbcba as an example): SuffixString 7$abcbcba
6a$abcbcba 0abcbcba$abcbcba
5ba$abcbcba 3bcba$abcbcba
1bcbcba$abcbcba 4cba$abcbcba
2cbcba$abcbcba The last suffix that starts with $ will always come first; it doesn't particularly matter.

### Sorting

As stated earlier, let's first just consider the first character in each substring and sort it:

SuffixString
7$abcbcba 0abcbcba$abcbcba
6a$abcbcba 1bcbcba$abcbcba
3bcba$abcbcba 5ba$abcbcba
2cbcba$abcbcba 4cba$abcbcba

From here we start doubling the length of string we compare. To get to the point where we efficiently compare suffixes, let's first do some analysis.

Say we just sorted all the suffixes by the first $2$ characters, and we're now comparing the first four characters of suffixes $2$ and $4$:

SuffixString
2cbcb
4cba$ If the first $2$ characters of suffix $2$ were already less than or greater than those of suffix $4$, nothing changes. However, in this case they aren't; we're going to have to compare the right half. How might we do this? Well, the cool thing is that since we just sorted all suffixes by the first $2$, we actually have information about the right halves already! The right half of the first four characters of suffix $2$ is just the first two characters of suffix $4$. Similarly, the right half of suffix $4$ is the first two characters of suffix $6$. To compare these half-suffixes, after each iteration of sorting we can assign each half-baked suffix a number representing its position in the array. If two half-baked suffixes are equal, then they would get the same number. These numbers are usually called equivalence classes. Applying this augmentation to our array sorted by the first character, we would now have: Eq. ClassSuffixString 07$abcbcba
10abcbcba$abcbcba 16a$abcbcba
21bcbcba$abcbcba 23bcba$abcbcba
25ba$abcbcba 32cbcba$abcbcba
34cba$abcbcba ### Optimization Since we're doing $\mathcal{O}(N \log N)$ sorts $\log N$ times, our current algorithm as it stands is $\mathcal{O}(N \log^2 N)$. Although this is adequate, it can sometimes be a bit too slow especially when creating a suffix array is often just the first step in many problems. Fortunately, we can fix this! Since everything we're comparing is in the range $[0, N)$, we can use radix sort to remove a log factor from our complexity. ### Implementation Time Complexity: $\mathcal{O}(N \log N)$ C++ #include <algorithm>#include <iostream>#include <string>#include <vector> using std::cout;using std::endl;using std::pair;using std::vector;  A more compact implementation is also below. Resources Benq O(N log N) Java ### Warning! The below implementation will TLE on some test cases due to Java's constant factor. import java.io.*;import java.util.*;import java.util.function.Function; public class SuffixArray { static class Suffix implements Comparable<Suffix> { public int start; public int leftEq; public int rightEq;  It's recommended that you also test your implementation against a brute force solution for many small strings. Here's a small script that outputs a test case and its answer: C++ #include <algorithm>#include <iostream>#include <vector> using std::cout;using std::endl;using std::vector; int main() { int n = 50; // adjust n as you please Java import java.io.*;import java.util.*; public class Generator { public static void main(String[] args) throws IOException { Random rng = new Random(); int n = 50; // adjust n as you please StringBuilder sb = new StringBuilder(); for (int i = 0; i < n; i++) { Python from random import choicefrom string import ascii_lowercase n = 50 # adjust n as you pleasestr_ = [choice(ascii_lowercase) for _ in range(n)] print("".join(str_)) suffs = list(range(n))suffs.sort(key=lambda s: str_[s:]) print(" ".join(str(i) for i in suffs)) ## LCP Array The LCP array is a common auxiliary array based on the suffix array that stores the longest common prefix (LCP) between adjacent elements of the suffix array. Taking our example string from above, such an array would look like this: LCPSuffixString NA7$abcbcba
00abcbcba$abcbcba 16a$abcbcba
01bcbcba$abcbcba 33bcba$abcbcba
15ba$abcbcba 02cbcba$abcbcba
24cba\$abcbcba

Each row contains the LCP between the row's string and that of the previous row.

### Construction

An algorithm often used to construct this LCP array starts by calculating the LCP for the suffix starting at $0$ and its previous suffix. It then calculates the LCP for the suffix starting at $1$ and its previous suffix, and so on and so forth. The reason for this calculation order is so that we can apply a clever optimization that is best described through our example.

Say we just finished calculating the LCP for the row of suffix $3$. Our next step, then, would be to calculate the LCP for suffix $4$.

One might think that we would have to start comparing the prefixes all over again for this suffix, but we actually don't have to! Remember that we just calculated the LCP of suffixes $3$ and $1$ to be $3$ characters, which means that the suffixes $4$ and $2$, and all the suffixes between them in the array (despite there not being any in our example) must have at least $2$ characters in common.

What this means is that we can just start comparing from the third character to compute the LCP for the row of suffix $4$ instead of starting all over.

In general, we keep a variable $\texttt{start\_at}$ which keeps the number of common characters we know the current suffix has with its predecessor in the suffix array. This variable tells us how many characters we can skip, as we know they're equal. After each computation of an LCP, we set this variable equal to its value minus one, except when the LCP is already $0$.

You'll see an implementation of this algorithm in the next section.

### Application

Now that we've constructed this LCP array, let's see an application of it.

Focus Problem – try your best to solve this problem before continuing!

A crucial observation to be made for this problem is that every substring can be represented as a prefix of some suffix. We know that suffix $i$ has $N-i$ (recall that $N$ is the length of the string) possible prefixes.

Notice that the number of prefixes of suffix $i$ that are duplicates is just the longest common prefix between $i$ and its predecessor. With this, we can iterate through the LCP array and calculate our answer in one fell swoop.

### Implementation

Time Complexity: $\mathcal{O}(N \log N)$

C++

#include <algorithm>#include <iostream>#include <string>#include <vector>
using std::cout;using std::endl;using std::pair;using std::string;using std::vector;

Java

### Warning!

This solution, like the previous one, will TLE for the same reason stated above.

import java.io.*;import java.util.*;import java.util.function.Function;
public class SuffixArray { 	Code Snippet: Suffix Class (Click to expand)
public static void main(String[] args) throws IOException {		BufferedReader read =		    new BufferedReader(new InputStreamReader(System.in));

## Burrows-Wheeler Transform

Resources
GFG

could be simpler?

Focus Problem – try your best to solve this problem before continuing!

View Internal Solution

## Run Enumerate

Resources
cp-algo

could be simpler?

Focus Problem – try your best to solve this problem before continuing!

### This section is not complete.

Any help would be appreciated! Just submit a Pull Request on Github.

you can do this easily with a suffix array

## Problems

StatusSourceProblem NameDifficultyTags
PlatinumHard
Show TagsSuffix Array
CFHard
Show TagsSuffix Array

### Join the USACO Forum!

Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!