Symbolic representations of time series have demonstrated their effectiveness in tasks such as motif discovery, clustering, classification, forecasting, and anomaly detection. These methods not only reduce the dimensionality of time series data but also accelerate downstream tasks.
Elsworth and Güttel [Time Series Forecasting Using LSTM Networks: A Symbolic Approach, arXiv, 2020] have shown that symbolic forecasting significantly reduces the sensitivity of Long Short-Term Memory (LSTM) networks to hyperparameter settings. However, deploying machine learning algorithms at the symbolic level, rather than on raw time series data, remains challenging in many practical applications. To support the research community and streamline the application of machine learning to symbolic representations, we developed the `slearn` Python library. It offers APIs for symbolic sequence generation, complexity measurement, and machine learning model training on symbolic data. We illustrate several core features and use cases below.
A key feature of `slearn` is its ability to compute distances between symbolic sequences, enabling similarity or dissimilarity measurements after transformation. The library includes `LZWStringLibrary`, which supports string distance computation based on Lempel-Ziv-Welch (LZW) complexity. This is particularly beneficial for time series classification, clustering, and anomaly detection tasks.
Install the `slearn` package with

```bash
pip install slearn
```

or

```bash
conda install -c conda-forge slearn
```

To check which version is installed, use:

```bash
conda list slearn
```
`slearn` enables the generation of strings of tunable complexity, using LZW compression as a basis for approximating Kolmogorov complexity. It also contains tools for exploring the hyperparameter space of commonly used RNNs as well as novel ones.
The `slearn` library uses `LZWStringLibrary` to compute distances between symbolic sequences. The distance measure is based on LZW complexity, which quantifies the complexity of a string by counting the number of unique substrings in its LZW compression dictionary. The `LZWStringLibrary` class provides a `distance` method for computing the distance between two strings, which can be used to compare symbolic representations of time series. The measure is typically normalized, yielding a similarity score between the two sequences; this is particularly useful when comparing time series that have been transformed into symbolic sequences by methods such as SAX.
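To make the idea concrete, the sketch below re-implements an LZW-complexity-based distance from scratch: it counts the phrases emitted during LZW parsing and combines the counts into a normalized compression distance. This illustrates the underlying principle rather than `slearn`'s exact implementation; the NCD-style normalization is an assumption here, and `LZWStringLibrary.distance` may compute its score differently.

```python
# Illustrative re-implementation of an LZW-complexity-based distance.
# NOTE: a sketch of the idea, not slearn's exact formula; the NCD-style
# normalization below is an assumption for illustration.

def lzw_complexity(s: str) -> int:
    """Count the phrases emitted when LZW-parsing s (dictionary seeded
    with the individual symbols occurring in s)."""
    if not s:
        return 0
    dictionary = set(s)           # initial dictionary: single symbols
    w, count = "", 0
    for c in s:
        wc = w + c
        if wc in dictionary:
            w = wc                # keep extending the current phrase
        else:
            count += 1            # emit the code for w
            dictionary.add(wc)    # grow the dictionary
            w = c
    return count + 1              # emit the final pending phrase

def lzw_distance(x: str, y: str) -> float:
    """NCD-style distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = lzw_complexity(x), lzw_complexity(y)
    return (lzw_complexity(x + y) - min(cx, cy)) / max(cx, cy)

print(lzw_complexity("ACBB"))               # 4, matching the generated-string table below
print(lzw_distance("ACBB", "ACBB"))         # lower score: identical strings
print(lzw_distance("ACBB", "CBACBCABABB"))  # higher score: dissimilar strings
```

Note that identical short strings still receive a nonzero score, since short strings compress poorly; what matters in practice is the relative ordering of the distances.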
```python
from slearn import *

df_strings = lzw_string_library(symbols=3, complexity=[4, 9], random_state=0)
print(df_strings)
```
Output:

```
   nr_symbols  LZW_complexity  length       string
0           3               4       4         ACBB
1           3               9      11  CBACBCABABB
```
The following table summarizes the implemented Symbolic Aggregate Approximation (SAX) variants and the ABBA method for time series representation:
| Algorithm | Time Series Type | Segmentation | Features Extracted | Symbolization | Reconstruction |
|---|---|---|---|---|---|
| SAX | Univariate | Fixed-size segments | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from PAA values |
| SAX-TD | Univariate | Fixed-size segments | Mean (PAA), slope | Mean to symbol, trend suffix ('u', 'd', 'f') | Linear trends from PAA and slopes |
| eSAX | Univariate | Fixed-size segments | Min, mean, max | Three symbols per segment (min, mean, max) | Quadratic interpolation from min, mean, max |
| mSAX | Multivariate | Fixed-size segments | Mean per dimension | One symbol per dimension per segment | Piecewise constant per dimension |
| aSAX | Univariate | Adaptive segments (based on local variance) | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from adaptive segments |
| ABBA | Univariate | Adaptive piecewise linear segments | Length, increment | Clustering (k-means), symbols assigned to clusters | Piecewise linear from cluster centers |
- SAX: Standard SAX with fixed-size segments and mean-based symbolization.
- SAX-TD: Extends SAX with trend information (up, down, flat) per segment.
- eSAX: Enhanced SAX capturing min, mean, and max per segment for smoother reconstruction.
- mSAX: Multivariate SAX, processing each dimension independently.
- aSAX: Adaptive SAX, adjusting segment sizes based on local variance for better representation of variable patterns.
- ABBA: Adaptive Brownian Bridge-based Aggregation, using piecewise linear segmentation and k-means clustering for symbolization (based on https://github.com/nla-group/fABBA); see the usage sketch after the SAX examples below.
The snippet below exercises each variant and reports the reconstruction error:

```python
import numpy as np
from slearn.symbols import *

def test_sax_variant(model, ts, name):
    """Fit a SAX variant, reconstruct the series, and report the RMSE."""
    symbols = model.fit_transform(ts)    # symbolic representation
    recon = model.inverse_transform()    # reconstruction from the symbols
    print(f"{name} reconstructed length: {len(recon)}")
    rmse = np.sqrt(np.mean((ts - recon) ** 2))
    print(f"{name} RMSE: {rmse:.4f}")
    return rmse

# Generate test time series
np.random.seed(42)
t = np.linspace(0, 10, 100)
ts = np.sin(t) + np.random.normal(0, 0.1, 100)  # univariate, main test
ts_multi = np.vstack([np.sin(t), np.cos(t)]).T + np.random.normal(0, 0.1, (100, 2))  # multivariate

sax = SAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax, ts, "SAX")

saxtd = SAXTD(window_size=10, alphabet_size=8)
rmse = test_sax_variant(saxtd, ts, "SAX-TD")

esax = ESAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(esax, ts, "eSAX")

msax = MSAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(msax, ts_multi, "mSAX")

asax = ASAX(n_segments=10, alphabet_size=8)
rmse = test_sax_variant(asax, ts, "aSAX")
```
`slearn` also provides an interface for string distance and similarity metrics, together with their normalized variants, each adhering strictly to its formal definition.
```python
from slearn.dmetric import *

print(damerau_levenshtein_distance("cat", "act"))
print(jaro_winkler_distance("martha", "marhta"))
print(normalized_damerau_levenshtein_distance("cat", "act"))
print(normalized_jaro_winkler_distance("martha", "marhta"))
```
`slearn` currently supports the SAX, ABBA, and fABBA symbolic representations, together with the machine learning classifiers listed below:
| Supported classifier | Parameter call |
|---|---|
| Multi-layer Perceptron | 'MLPClassifier' |
| K-Nearest Neighbors | 'KNeighborsClassifier' |
| Gaussian Naive Bayes | 'GaussianNB' |
| Decision Tree | 'DecisionTreeClassifier' |
| Support Vector Classification | 'SVC' |
| Radial-basis Function Kernel | 'RBF' |
| Logistic Regression | 'LogisticRegression' |
| Quadratic Discriminant Analysis | 'QuadraticDiscriminantAnalysis' |
| AdaBoost Classifier | 'AdaBoostClassifier' |
| Random Forest | 'RandomForestClassifier' |
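To sketch how these classifiers operate on symbolic data, the toy example below encodes a symbolic sequence into lagged windows and trains one of the listed models via scikit-learn directly. The sequence, window size, and integer encoding are made up for illustration; `slearn`'s own wrapper around these estimators may expose a different interface.

```python
# Hedged illustration: next-symbol prediction with one of the listed
# classifiers, calling scikit-learn directly. The toy sequence, window
# size, and integer encoding are made up; slearn's wrapper API may differ.
import numpy as np
from sklearn.neural_network import MLPClassifier

sequence = "ABCABCABDABCABCABDABC"   # toy symbolic sequence (e.g. SAX output)
codes = np.array([ord(c) - ord('A') for c in sequence])

ws = 3                               # predict the next symbol from the last 3
X = np.array([codes[i:i + ws] for i in range(len(codes) - ws)])
y = codes[ws:]

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
clf.fit(X, y)

next_code = clf.predict(codes[-ws:].reshape(1, -1))[0]
print("Predicted next symbol:", chr(next_code + ord('A')))
```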
Our documentation is available online.
This `slearn` implementation is maintained by Roberto Cahuantzi (University of Manchester), Xinye Chen (Charles University Prague), and Stefan Güttel (University of Manchester). If you use `LZWStringLibrary` in your research, or if you find `slearn` useful in your work, please consider citing the paper below. If you have any problems or questions, just drop us an email.
```bibtex
@InProceedings{10.1007/978-3-031-37963-5_53,
  author    = "Cahuantzi, Roberto and Chen, Xinye and G{\"u}ttel, Stefan",
  title     = "A Comparison of LSTM and GRU Networks for Learning Symbolic Sequences",
  booktitle = "Intelligent Computing",
  year      = "2023",
  publisher = "Springer Nature Switzerland",
  pages     = "771--785"
}
```
This project is licensed under the terms of the MIT license.
Contributions to this repo are welcome! We will work through all pull requests and try to merge them into the main branch.
TO DO LIST:
- language modeling functionalities
- comprehensive documentation
- performance optimization