Symbolic representations of time series have demonstrated their effectiveness in tasks such as motif discovery, clustering, classification, forecasting, and anomaly detection. These methods not only reduce the dimensionality of time series data but also accelerate downstream tasks.
Elsworth and Güttel [Time Series Forecasting Using LSTM Networks: A Symbolic Approach, arXiv, 2020] have shown that symbolic forecasting significantly reduces the sensitivity of Long Short-Term Memory (LSTM) networks to hyperparameter settings. However, deploying machine learning algorithms at the symbolic level, rather than on raw time series data, remains challenging in many practical applications. To support the research community and streamline the application of machine learning to symbolic representations, we developed the `slearn` Python library. It offers APIs for symbolic sequence generation, complexity measurement, and machine learning model training on symbolic data. We illustrate several core features and use cases below.
A key feature of `slearn` is its ability to compute distances between symbolic sequences, enabling similarity or dissimilarity measurements after transformation. The library includes `LZWStringLibrary`, which supports string distance computation based on Lempel-Ziv-Welch (LZW) complexity. This is particularly beneficial for time series classification, clustering, and anomaly detection tasks.
Install the `slearn` package with

```bash
pip install slearn
```

or

```bash
conda install -c conda-forge slearn
```

To check which version is installed, use:

```bash
conda list slearn
```
`slearn` enables the generation of strings of tunable complexity, using LZW compression as a basis for approximating Kolmogorov complexity. It also contains tools for exploring the hyperparameter space of commonly used RNNs as well as novel ones.
The `slearn` library uses `LZWStringLibrary` to compute distances between symbolic sequences. The distance measure is based on LZW complexity, which quantifies the complexity of a string by counting the number of unique substrings in its LZW compression dictionary. The `LZWStringLibrary` class provides a `distance` method for computing the distance between two strings, which can be used to compare symbolic representations of time series. The measure is typically normalized, yielding a similarity score between the two sequences; this is particularly useful when comparing time series that have been transformed into symbolic sequences by methods such as SAX.
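To make the idea concrete, the sketch below re-implements an LZW-complexity-based distance from scratch: it counts the phrases emitted during LZW parsing and combines the counts into a normalized compression distance. This illustrates the underlying principle rather than `slearn`'s exact implementation; the NCD-style normalization is an assumption here, and `LZWStringLibrary.distance` may compute its score differently.

```python
# Illustrative re-implementation of an LZW-complexity-based distance.
# NOTE: a sketch of the idea, not slearn's exact formula; the NCD-style
# normalization below is an assumption for illustration.

def lzw_complexity(s: str) -> int:
    """Count the phrases emitted when LZW-parsing s (dictionary seeded
    with the individual symbols occurring in s)."""
    if not s:
        return 0
    dictionary = set(s)           # initial dictionary: single symbols
    w, count = "", 0
    for c in s:
        wc = w + c
        if wc in dictionary:
            w = wc                # keep extending the current phrase
        else:
            count += 1            # emit the code for w
            dictionary.add(wc)    # grow the dictionary
            w = c
    return count + 1              # emit the final pending phrase

def lzw_distance(x: str, y: str) -> float:
    """NCD-style distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = lzw_complexity(x), lzw_complexity(y)
    return (lzw_complexity(x + y) - min(cx, cy)) / max(cx, cy)

print(lzw_complexity("ACBB"))               # 4, matching the generated-string table below
print(lzw_distance("ACBB", "ACBB"))         # lower score: identical strings
print(lzw_distance("ACBB", "CBACBCABABB"))  # higher score: dissimilar strings
```

Note that identical short strings still receive a nonzero score, since short strings compress poorly; what matters in practice is the relative ordering of the distances.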
```python
from slearn import *

df_strings = lzw_string_library(symbols=3, complexity=[4, 9], random_state=0)
print(df_strings)
```
Output:

```
   nr_symbols  LZW_complexity  length       string
0           3               4       4         ACBB
1           3               9      11  CBACBCABABB
```
The following table summarizes the implemented Symbolic Aggregate Approximation (SAX) variants and the ABBA method for time series representation:
| Algorithm | Time Series Type | Segmentation | Features Extracted | Symbolization | Reconstruction |
|---|---|---|---|---|---|
| SAX | Univariate | Fixed-size segments | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from PAA values |
| SAX-TD | Univariate | Fixed-size segments | Mean (PAA), slope | Mean to symbol, trend suffix ('u', 'd', 'f') | Linear trends from PAA and slopes |
| eSAX | Univariate | Fixed-size segments | Min, mean, max | Three symbols per segment (min, mean, max) | Quadratic interpolation from min, mean, max |
| mSAX | Multivariate | Fixed-size segments | Mean per dimension | One symbol per dimension per segment | Piecewise constant per dimension |
| aSAX | Univariate | Adaptive segments (based on local variance) | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from adaptive segments |
| ABBA | Univariate | Adaptive piecewise linear segments | Length, increment | Clustering (k-means), symbols assigned to clusters | Piecewise linear from cluster centers |
- SAX: Standard SAX with fixed-size segments and mean-based symbolization.
- SAX-TD: Extends SAX with trend information (up, down, flat) per segment.
- eSAX: Enhanced SAX capturing min, mean, and max per segment for smoother reconstruction.
- mSAX: Multivariate SAX, processing each dimension independently.
- aSAX: Adaptive SAX, adjusting segment sizes based on local variance for better representation of variable patterns.
- ABBA: Adaptive Brownian Bridge-based Aggregation, using piecewise linear segmentation and k-means clustering for symbolization (based on https://github.com/nla-group/fABBA); see the usage sketch after the SAX examples below.
The snippet below exercises each variant and reports the reconstruction error:

```python
import numpy as np
from slearn.symbols import *

def test_sax_variant(model, ts, name):
    """Fit a SAX variant, reconstruct the series, and report the RMSE."""
    symbols = model.fit_transform(ts)    # symbolic representation
    recon = model.inverse_transform()    # reconstruction from the symbols
    print(f"{name} reconstructed length: {len(recon)}")
    rmse = np.sqrt(np.mean((ts - recon) ** 2))
    print(f"{name} RMSE: {rmse:.4f}")
    return rmse

# Generate test time series
np.random.seed(42)
t = np.linspace(0, 10, 100)
ts = np.sin(t) + np.random.normal(0, 0.1, 100)  # univariate, main test
ts_multi = np.vstack([np.sin(t), np.cos(t)]).T + np.random.normal(0, 0.1, (100, 2))  # multivariate

sax = SAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax, ts, "SAX")

saxtd = SAXTD(window_size=10, alphabet_size=8)
rmse = test_sax_variant(saxtd, ts, "SAX-TD")

esax = ESAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(esax, ts, "eSAX")

msax = MSAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(msax, ts_multi, "mSAX")

asax = ASAX(n_segments=10, alphabet_size=8)
rmse = test_sax_variant(asax, ts, "aSAX")
```
`slearn` also provides an interface for string distance and similarity metrics, together with their normalized variants, each adhering strictly to its formal definition.
```python
from slearn.dmetric import *

print(damerau_levenshtein_distance("cat", "act"))
print(jaro_winkler_distance("martha", "marhta"))
print(normalized_damerau_levenshtein_distance("cat", "act"))
print(normalized_jaro_winkler_distance("martha", "marhta"))
```
`slearn` currently supports the SAX, ABBA, and fABBA symbolic representations, together with the machine learning classifiers listed below:
| Supported classifier | Parameter call |
|---|---|
| Multi-layer Perceptron | 'MLPClassifier' |
| K-Nearest Neighbors | 'KNeighborsClassifier' |
| Gaussian Naive Bayes | 'GaussianNB' |
| Decision Tree | 'DecisionTreeClassifier' |
| Support Vector Classification | 'SVC' |
| Radial-basis Function Kernel | 'RBF' |
| Logistic Regression | 'LogisticRegression' |
| Quadratic Discriminant Analysis | 'QuadraticDiscriminantAnalysis' |
| AdaBoost Classifier | 'AdaBoostClassifier' |
| Random Forest | 'RandomForestClassifier' |
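To sketch how these classifiers operate on symbolic data, the toy example below encodes a symbolic sequence into lagged windows and trains one of the listed models via scikit-learn directly. The sequence, window size, and integer encoding are made up for illustration; `slearn`'s own wrapper around these estimators may expose a different interface.

```python
# Hedged illustration: next-symbol prediction with one of the listed
# classifiers, calling scikit-learn directly. The toy sequence, window
# size, and integer encoding are made up; slearn's wrapper API may differ.
import numpy as np
from sklearn.neural_network import MLPClassifier

sequence = "ABCABCABDABCABCABDABC"   # toy symbolic sequence (e.g. SAX output)
codes = np.array([ord(c) - ord('A') for c in sequence])

ws = 3                               # predict the next symbol from the last 3
X = np.array([codes[i:i + ws] for i in range(len(codes) - ws)])
y = codes[ws:]

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
clf.fit(X, y)

next_code = clf.predict(codes[-ws:].reshape(1, -1))[0]
print("Predicted next symbol:", chr(next_code + ord('A')))
```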
Our documentation is available online.
This `slearn` implementation is maintained by Roberto Cahuantzi (University of Manchester), Xinye Chen (Charles University Prague), and Stefan Güttel (University of Manchester). If you use `LZWStringLibrary` in your research, or if you find `slearn` useful in your work, please consider citing the paper below. If you have any problems or questions, just drop us an email.
```bibtex
@InProceedings{10.1007/978-3-031-37963-5_53,
  author    = "Cahuantzi, Roberto and Chen, Xinye and G{\"u}ttel, Stefan",
  title     = "A Comparison of LSTM and GRU Networks for Learning Symbolic Sequences",
  booktitle = "Intelligent Computing",
  year      = "2023",
  publisher = "Springer Nature Switzerland",
  pages     = "771--785"
}
```
This project is licensed under the terms of the MIT license.
Contributions to this repo are welcome! We will work through all pull requests and try to merge them into the main branch.
TO DO LIST:
- language modeling functionalities
- comprehensive documentation
- performance optimization