Learning notes suffix array SA

Suffixes are used for string operations, such as "sort all non empty suffixes of strings in dictionary order from small to large"

Suffix sorting is the process of building suffix arrays (that is, sorting the suffixes, and the array after sorting is called suffix array)

The implementation of suffix sorting mainly depends on multiplication method and Radix sorting

principle

Let's start with suffix sorting

definition:

\(sa[i]: \) position of suffix ranked as \ (i\)

\(rak[i]: \) the ranking of suffixes starting from position \ (i\). For the convenience of description, the suffixes starting from position \ (i\) are hereinafter referred to as "suffix \ (i\)"

\(tp[i]\): the second keyword of cardinality sorting. Its meaning is the same as \ (sa\)

\(tax[i]\): (i\) how many times the element number appears, and the bucket sorted by cardinality

(there is also DC3 algorithm for constructing suffix array (three points?) And create suffix tree, and run two \ (O(n) \) methods of dfs order, but the common method is suffix sorting)

It is easy to observe that \ (rk\) and \ (sa\) are mutually inverse arrays, that is, \ (rk[sa[i]]=i,sa[rk[i]]=i\)

If we sort directly, the complexity will explode because the length of our string is related to the complexity and length during comparison

So we will make an article for everyone

First, sort by cardinality, and arrange all suffixes according to the first character

Then consider multiplication

Then a clever step is that "\ the string formed by the first \ (\frac{w}{2}\) characters of the suffix \ (i\) is a string formed by the last \ (\frac{w}{2}\) characters of the suffix \ (i-\) \(\frac{w}{2}\)"

It should be noted here that we have removed the last part of the suffix \ (i\)

Each time we execute the contents of the loop, there is a partially ordered suffix array (can be expected)

Here, each time you consider multiplying the new digits, you can sort them by cardinality

CODE

#include<bits/stdc++.h>
using namespace std;
#define int long long
#define reg register
namespace yspm{
    inline int read(){
        int res=0,f=1; char k;
        while(!isdigit(k=getchar())) if(k=='-') f=-1;
        while(isdigit(k)) res=res*10+k-'0',k=getchar();
        return res*f;
    }
    const int N=1e6+10;
    int sa[N],rk[N],ton[N],sec[N],n,m,h[N];
    char s[N];
    inline void gb(){
        memset(ton,0,sizeof(ton));
        for(reg int i=1;i<=n;++i) ton[rk[i]]++;
        for(reg int i=1;i<=m;++i) ton[i]+=ton[i-1];
        for(reg int i=n;i>=1;--i) sa[ton[rk[sec[i]]]--]=sec[i];
        return ;
    }
    signed main(){
        scanf("%s",s+1); n=strlen(s+1);
        m=75; for(reg int i=1;i<=n;++i) rk[i]=s[i]-'0'+1,sec[i]=i; gb();
        for(reg int w=1,p=0;p<n;m=p,w<<=1){ 
            p=0; for(reg int i=1;i<=w;++i) sec[++p]=n-w+i;
            for(reg int i=1;i<=n;++i) if(sa[i]>w) sec[++p]=sa[i]-w;
            gb(); swap(rk,sec); rk[sa[1]]=p=1;  
            for(reg int i=2;i<=n;++i){
                if(sec[sa[i]]==sec[sa[i-1]]&&sec[sa[i]+w]==sec[sa[i-1]+w]) rk[sa[i]]=p; else rk[sa[i]]=++p;
            } 
        }
        int k=0;
        for(reg int i=1;i<=n;++i){
            if(rk[i]==1){h[1]=k=0; continue;}
            int j=sa[rk[i]-1];
            if(k>0) --k; while(i+k<=n&&j+k<=n&&s[i+k]==s[j+k]) ++k;
            h[rk[i]]=k;
        } 
        for(reg int i=1;i<=n;++i) printf("%lld ",h[i]); puts("");
        return 0;
    }
}
signed main(){return yspm::main();}

Then the suffix arrays are more important \ (h\) array and \ (height\) array

\[height[i]=LCP(sa[i-1],sa[i]) \]

\The (height\) array has the following properties:

If two subscripts \ (j,k\) meet \ (rk_j<rk_k\)

Then \ (LCP(j,k)=\min\limits_{l=j+1}^k height_l\)

This feeling is more useful

But it's not easy to ask

Redefinition: \ (h_i=height[rk_i]\)

Then the \ (h\) array has the following properties:

\[h_i\ge h_{i-1}-1 \]

Proof of the above two properties? link

Using the properties of \ (h\) array, we can calculate \ (h,height\) in a lower complexity

Example

\(Suffix\ Array\) how to find the number of substrings that are essentially different:

Find the \ (sa\) and \ (height\) arrays, and then

\[ans=\sum _ {i=1}^n n+1-sa_i-height_i \]

How to find any substring \ (lcs\) in suffix array:

Copy the original string in reverse and put it at the end of the original string (characters to be added in the middle)

Then it becomes a \ (lcp\) problem

The problem of finding the sum of some suffixes \ (lcp\)

(it seems that you can use the suffix virtual tree, that is, \ (bzoj\ \ SvT\) is the full name of \ (suffix\ virtual\ tree\), and you can directly create a tree for the topic and then do it every time according to the method of consumption war.)

Just use the suffix array to monotone the stack

Summary of routine questions

\(1.\) dichotomous \ (rk\) or subscript

\(JSOI2015\) in string segmentation, relevant properties are observed, the enumeration quantity is reduced, and then the solution is obtained by dichotomy

\In the (HEOI2016\) string, the binary length is combined with \ (rmq\) and the maintenance of \ (rk\) using \ (rmq\) or the chairman tree to achieve the maximum \ (lcp\) query

\(2.\) difference

In the stock market forecast and Sandy's card, the question involves the trend inquiry. At this time, the difference will be cleverly combined with the corresponding methods

\(3.\) two strings of \ (\cdots\)

Consider duplicating the second string to the next string, and adding a character with a large dictionary order in the middle for maintenance

Generally speaking, suffix array is more of a tool, but it still needs to be maintained by analyzing properties, combining dichotomy, data structure and other means

Posted by lasith on Fri, 03 Jun 2022 15:52:30 +0530