slearn: Python package for learning symbolic sequences

Symbolic representations of time series have demonstrated their effectiveness in tasks such as motif discovery, clustering, classification, forecasting, and anomaly detection. These methods not only reduce the dimensionality of time series data but also accelerate downstream tasks. Elsworth and Güttel [Time Series Forecasting Using LSTM Networks: A Symbolic Approach, arXiv, 2020] have shown that symbolic forecasting significantly reduces the sensitivity of Long Short-Term Memory (LSTM) networks to hyperparameter settings. However, deploying machine learning algorithms at the symbolic level—rather than on raw time series data—remains a challenging problem for many practical applications. To support the research community and streamline the application of machine learning to symbolic representations, we developed the slearn Python library. This library offers APIs for symbolic sequence generation, complexity measurement, and machine learning model training on symbolic data. We will illustrate several core features and use cases below.

A key feature of slearn is its ability to compute distances between symbolic sequences, enabling similarity or dissimilarity measurements after transformation. The library includes the LZWStringLibrary, which supports string distance computation based on Lempel-Ziv-Welch (LZW) complexity. This is particularly beneficial for time series classification, clustering, and anomaly detection tasks.

Install the slearn package with pip or conda:

pip

pip install slearn

conda

conda install -c conda-forge slearn

To check which version is installed in a conda environment, use:

conda list slearn

Usage

Generate strings with customized complexity

slearn can generate strings of tunable complexity, using LZW compression as a computable approximation of Kolmogorov complexity. It also contains tools for exploring the hyperparameter space of commonly used RNNs as well as novel ones.

slearn uses the LZWStringLibrary to compute distances between symbolic sequences. The distance is based on LZW complexity, which quantifies the complexity of a string by counting the number of unique substrings in its LZW compression dictionary. The distance method of the LZWStringLibrary class takes two strings and returns a score, typically normalized, that can be used to compare symbolic representations of time series, for example series that have been converted to symbolic sequences with SAX.

The following call generates one string for each requested complexity:

from slearn import *
df_strings = lzw_string_library(symbols=3, complexity=[4, 9], random_state=0)
print(df_strings)

Output:

  nr_symbols LZW_complexity length       string
0          3              4      4         ACBB
1          3              9     11  CBACBCABABB

Symbolic time series representation

The following table summarizes the implemented Symbolic Aggregate Approximation (SAX) variants and the ABBA method for time series representation:

| Algorithm | Time Series Type | Segmentation | Features Extracted | Symbolization | Reconstruction |
|---|---|---|---|---|---|
| SAX | Univariate | Fixed-size segments | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from PAA values |
| SAX-TD | Univariate | Fixed-size segments | Mean (PAA), slope | Mean to symbol, trend suffix ('u', 'd', 'f') | Linear trends from PAA and slopes |
| eSAX | Univariate | Fixed-size segments | Min, mean, max | Three symbols per segment (min, mean, max) | Quadratic interpolation from min, mean, max |
| mSAX | Multivariate | Fixed-size segments | Mean per dimension | One symbol per dimension per segment | Piecewise constant per dimension |
| aSAX | Univariate | Adaptive segments (based on local variance) | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from adaptive segments |
| ABBA | Univariate | Adaptive piecewise linear segments | Length, increment | Clustering (k-means), symbols assigned to clusters | Piecewise linear from cluster centers |

  • SAX: Standard SAX with fixed-size segments and mean-based symbolization.
  • SAX-TD: Extends SAX with trend information (up, down, flat) per segment.
  • eSAX: Enhanced SAX capturing min, mean, and max per segment for smoother reconstruction.
  • mSAX: Multivariate SAX, processing each dimension independently.
  • aSAX: Adaptive SAX, adjusting segment sizes based on local variance for better representation of variable patterns.
  • ABBA: Adaptive Brownian bridge-based symbolic aggregation, using piecewise linear segmentation and k-means clustering for symbolization (based on https://github.com/nla-group/fABBA).
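
import numpy as np
from slearn.symbols import *


def test_sax_variant(model, ts, t, name, is_multivariate=False):
    # Symbolize the series, reconstruct it, and return the reconstruction RMSE.
    # `t` and `is_multivariate` keep the call signature uniform; they are not used here.
    symbols = model.fit_transform(ts)
    recon = model.inverse_transform()
    print(f"{name} reconstructed length: {len(recon)}")
    rmse = np.sqrt(np.mean((ts - recon) ** 2))
    return rmse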

# Generate test time series
np.random.seed(42)
t = np.linspace(0, 10, 100)
ts = np.sin(t) + np.random.normal(0, 0.1, 100)  # Univariate, main test
ts_multi = np.vstack([np.sin(t), np.cos(t)]).T + np.random.normal(0, 0.1, (100, 2))  # Multivariate


sax = SAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(sax, ts, t, "SAX")


saxtd = SAXTD(window_size=10, alphabet_size=8)
rmse = test_sax_variant(saxtd, ts, t, "SAX-TD")

    
esax = ESAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(esax, ts, t, "eSAX")

    
msax = MSAX(window_size=10, alphabet_size=8)
rmse = test_sax_variant(msax, ts_multi, t, "mSAX", is_multivariate=True)

    
asax = ASAX(n_segments=10, alphabet_size=8)
rmse = test_sax_variant(asax, ts, t, "aSAX")
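
For readers who want to see what the plain SAX row of the table amounts to, the following is a minimal standard-library sketch of the usual pipeline: z-normalize, average over fixed-size windows (PAA), map each mean to a letter using breakpoints that split the standard normal distribution into equiprobable bins, and reconstruct a piecewise-constant series from the PAA values. The function name `sax_sketch` and the lowercase-letter alphabet are assumptions for illustration; slearn's SAX class may differ in details such as normalization and symbol naming.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def sax_sketch(ts, window_size, alphabet_size):
    """Return (symbols, reconstruction) for a univariate series (illustrative only)."""
    mu, sigma = fmean(ts), pstdev(ts)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in ts]

    # Piecewise Aggregate Approximation: one mean per fixed-size window.
    segments = [z[i:i + window_size] for i in range(0, len(z), window_size)]
    paa = [fmean(seg) for seg in segments]

    # Breakpoints cutting N(0, 1) into `alphabet_size` equiprobable bins.
    breakpoints = [NormalDist().inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]

    # One symbol per segment: the index of the bin that holds the segment mean.
    symbols = "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)

    # Piecewise-constant reconstruction from the PAA values, on the original scale.
    recon = [mu + sigma * v for v, seg in zip(paa, segments) for _ in seg]
    return symbols, recon


symbols, recon = sax_sketch([0.0] * 4 + [10.0] * 4, window_size=4, alphabet_size=2)
print(symbols)  # ab
```

The other variants in the table change one stage of this pipeline: SAX-TD and eSAX extract extra features per segment, aSAX changes the segmentation, and mSAX repeats the procedure per dimension.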

String distance and similarity metrics

slearn provides string distance and similarity metrics together with normalized versions, each implemented according to its formal definition.

from slearn.dmetric import *

print(damerau_levenshtein_distance("cat", "act"))
print(jaro_winkler_distance("martha", "marhta"))

print(normalized_damerau_levenshtein_distance("cat", "act"))
print(normalized_jaro_winkler_distance("martha", "marhta"))
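
As an illustration of what such a metric computes, here is a small sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions, and transpositions of adjacent characters. The names `osa_distance` and `normalized_osa_distance` are hypothetical, and dividing by the longer string's length is only one common normalization. slearn's functions may implement the unrestricted variant or normalize differently.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_osa_distance(a: str, b: str) -> float:
    """Distance divided by the longer length (assumed normalization), in [0, 1]."""
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


print(osa_distance("cat", "act"))  # 1
```

Swapping "c" and "a" is a single transposition, so the distance is 1, whereas plain Levenshtein distance would charge two substitutions.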

Model support

slearn currently supports the SAX, ABBA, and fABBA symbolic representations and the machine learning classifiers listed below:

| Supported classifier | Parameter call |
|---|---|
| Multi-layer Perceptron | 'MLPClassifier' |
| K-Nearest Neighbors | 'KNeighborsClassifier' |
| Gaussian Naive Bayes | 'GaussianNB' |
| Decision Tree | 'DecisionTreeClassifier' |
| Support Vector Classification | 'SVC' |
| Radial-basis Function Kernel | 'RBF' |
| Logistic Regression | 'LogisticRegression' |
| Quadratic Discriminant Analysis | 'QuadraticDiscriminantAnalysis' |
| AdaBoost classifier | 'AdaBoostClassifier' |
| Random Forest | 'RandomForestClassifier' |
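
A classifier can only learn from a symbolic sequence once the sequence has been turned into supervised samples. The sketch below shows one common way to do this, assuming a sliding window of hypothetical width `ws`: each window of integer-coded symbols is a feature vector and the symbol that follows it is the label. A hand-written 1-nearest-neighbor rule stands in for the classifiers listed above so that the example needs no dependencies. None of the function names are slearn API.

```python
def encode_windows(string, ws):
    """Turn a symbolic string into (window, next-symbol) training pairs.

    Assumes len(string) > ws so that at least one pair exists.
    """
    alphabet = sorted(set(string))
    index = {sym: i for i, sym in enumerate(alphabet)}
    codes = [index[sym] for sym in string]
    X = [codes[i:i + ws] for i in range(len(codes) - ws)]
    y = [codes[i + ws] for i in range(len(codes) - ws)]
    return X, y, codes, alphabet


def predict_1nn(X, y, query):
    """Label of the training window closest to `query` in Hamming distance."""
    def hamming(row):
        return sum(a != b for a, b in zip(row, query))
    best = min(range(len(X)), key=lambda i: hamming(X[i]))
    return y[best]


def forecast(string, ws, steps):
    """Predict `steps` further symbols by feeding each prediction back in."""
    X, y, codes, alphabet = encode_windows(string, ws)
    window = codes[-ws:]
    out = []
    for _ in range(steps):
        nxt = predict_1nn(X, y, window)
        out.append(alphabet[nxt])
        window = window[1:] + [nxt]   # slide the window forward by one symbol
    return "".join(out)


print(forecast("abcabcabcabc", ws=3, steps=3))  # abc
```

Replacing `predict_1nn` with a trained classifier leaves the encoding and the feed-back loop unchanged.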

Our documentation is available.

Citation

This slearn implementation is maintained by Roberto Cahuantzi (University of Manchester), Xinye Chen (Charles University Prague), and Stefan Güttel (University of Manchester). If you use the LZWStringLibrary functionality in your research, or if you find slearn useful in your work, please consider citing the paper below. If you have any problems or questions, please send us an email.

@InProceedings{10.1007/978-3-031-37963-5_53,
    author="Cahuantzi, Roberto
    and Chen, Xinye
    and G{\"u}ttel, Stefan",
    title="A Comparison of LSTM and GRU Networks for Learning Symbolic Sequences",
    booktitle="Intelligent Computing",
    year="2023",
    publisher="Springer Nature Switzerland",
    pages="771--785"
}

License

This project is licensed under the terms of the MIT license.

Contributing

Contributions to this repository are welcome! We will review all pull requests and try to merge them into the main branch.

TO DO LIST:

  • language modeling functionalities
  • comprehensive documentation
  • performance optimization
