中文 | English
This project implements a suffix array based on the doubling algorithm, supporting efficient string processing and suitable for UTF-8 text. The core part is written in C++ and wrapped as a Python extension module via Cython.
-
Time Complexity:
$\mathcal{O}(n\log n)$ -
Space Complexity:
$\mathcal{O}(n)$ (with a large constant, be cautious with large datasets) - Multilingual Support: Input must be UTF-8 encoded
size()
: Returns the length of the textget_id(suf_rank)
: Given a suffix rank (1-indexed), returns its position in the text (0-indexed)get_suf(suf_rank)
: Given a suffix rank (1-indexed), returns the corresponding suffixget_rank(suf_id)
: Given the left boundary position of a suffix (0-indexed), returns its lexicographical rankget_count(pattern)
: Given a pattern string, returns the number of its occurrences in the textget_prob(prompt)
: Given a text prompt, returns the probability distribution of the next characterget_branch_entropy(prompt)
: Given a text prompt, returns its right branch entropyget_mutual_information(text)
: Given a text, returns its mutual information
It is recommended to use pip for local development installation:
pip install -e .
Or manually build the extension module:
python setup.py build_ext --inplace
The generated library file (e.g., SuffixArray.*.so
) can be placed in your project directory and imported in Python code.
test.ipynb
provides simple test cases. If you encounter unexpected outputs, feel free to open an issue.
- Currently, a linear-time construction algorithm is not used
- The space constant is large, not suitable for ultra-large datasets
- Test cases are limited, mainly for personal study and experimentation
This project is licensed under the MIT License. Free to use and modify.