The Canonical Piecewise Linear approximability (CaPLa) is a measure for quantifying the efficiency of learned data structures based on piecewise linear approximations (PLAs) on a given dataset.
The CaPLa for a genome is composed of three values: $\alpha^\ast$, $\beta_\mathrm{low}^\ast$, and $\beta_\mathrm{high}^\ast$.
To build the tool, you need CMake (≥ 3.16), a C++ compiler (GCC ≥ 8 or Clang ≥ 5), and Python (≥ 3.10, with pip and venv).
For example, you can use the following commands to set up the environment in a Docker container:
docker run -it ubuntu:24.04 bash
apt-get update && apt-get install -y build-essential cmake python3 python3-pip python3-venv
Then, you can clone the repository and build the tool as follows:
git clone --recursive https://github.com/medvedevgroup/CaPLa.git
cd CaPLa
mkdir build
cd build
cmake ..
make -j 8
The tool capla.sh scans a given folder for genomes with the .fna or .fasta extension and computes the CaPLa for each genome.
It takes as arguments the path to the directory containing the genomes and an optional k-mer size (default: 21).
For example, you can run the tool on the sample genomes already included in the genomes directory as follows:
./scripts/capla.sh genomes/ 21
The results will be saved in a file named CaPLa.csv inside the specified directory.
Each row of the file contains:
- Genome name
- Number of unique $k$-mers
- $k$-mer length
- $\alpha^\ast$
- $\beta_\mathrm{low}^\ast$
- $\beta_\mathrm{high}^\ast$
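As an illustration of how the rows described above might be consumed, here is a minimal Python sketch. It assumes the columns appear in the order listed and that values are comma-separated. The field names, the `has_header` switch, and the sample row are hypothetical and not taken from the tool's actual output.

```python
import csv
import io

# Hypothetical field names, following the column order listed above.
FIELDS = ["genome", "unique_kmers", "k", "alpha", "beta_low", "beta_high"]

def read_capla_rows(text, has_header=False):
    """Parse CSV text into a list of dicts with typed values."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        rows = rows[1:]  # skip a header row, if the file has one (assumption)
    parsed = []
    for row in rows:
        if not row:
            continue  # ignore blank lines
        rec = dict(zip(FIELDS, row))
        rec["unique_kmers"] = int(rec["unique_kmers"])
        rec["k"] = int(rec["k"])
        for key in ("alpha", "beta_low", "beta_high"):
            rec[key] = float(rec[key])
        parsed.append(rec)
    return parsed

# Made-up row, for illustration only.
sample = "genomeA,1000,21,0.5,1.0,2.0\n"
print(read_capla_rows(sample))
```

To read a real results file, pass its contents, for example `read_capla_rows(open("genomes/CaPLa.csv").read())`, after checking whether the file starts with a header row.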
Let
In summary, CaPLa captures the tightest power-law bound on the average number of elements spanned by each segment in the piecewise linear approximation of the
The tool begins by preprocessing each genome: it uses the process_fasta executable to remove non-ACGT characters and create a single string with a single header. It then builds a suffix array for each genome using the mksary executable from the libdivsufsort library.
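The two preprocessing steps can be sketched in Python as follows. This is not the code of process_fasta or mksary: the function names are hypothetical, uppercasing the sequence before filtering is an assumption, and the suffix array is built by naive sorting, which is only practical for tiny inputs (libdivsufsort uses a far more efficient construction).

```python
def clean_fasta(text):
    """Drop header lines and keep only A, C, G, T, joined into one string.

    Assumption: lowercase bases are uppercased before filtering.
    """
    parts = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA header lines
        parts.append("".join(c for c in line.upper() if c in "ACGT"))
    return "".join(parts)

def naive_suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

seq = clean_fasta(">chr1\nACNGT\n>chr2\nacgt\n")
print(seq)                       # ACGTACGT
print(naive_suffix_array("ACGA"))  # [3, 0, 1, 2]
```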
Next, the tool computes the number of segments in the PLA of each genome's rank curve for each error bound using the count_segments executable, which implements O'Rourke's algorithm to find the PLA with the minimal number of segments.
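To give a feel for what counting PLA segments under an error bound means, here is a small Python sketch. It is deliberately not O'Rourke's algorithm: it uses a simpler greedy "shrinking cone" heuristic that anchors each segment at its first point, so it can report more segments than the optimum that count_segments computes. The function name and the representation of the input as points `(xs[i], ys[i])` with strictly increasing `xs` are assumptions made for illustration.

```python
def count_segments_greedy(xs, ys, eps):
    """Count segments of a greedy PLA whose vertical error is at most eps.

    Each segment starts exactly at its first point; the range of feasible
    slopes [lo, hi] shrinks as points are added, and a new segment starts
    when the range becomes empty. xs must be strictly increasing.
    """
    n = len(xs)
    if n == 0:
        return 0
    segments = 1
    x0, y0 = xs[0], ys[0]
    lo, hi = float("-inf"), float("inf")
    for i in range(1, n):
        dx = xs[i] - x0
        # Slopes that keep point i within eps of the line through (x0, y0).
        new_lo = max(lo, (ys[i] - eps - y0) / dx)
        new_hi = min(hi, (ys[i] + eps - y0) / dx)
        if new_lo > new_hi:
            # No single slope fits all points: start a new segment here.
            segments += 1
            x0, y0 = xs[i], ys[i]
            lo, hi = float("-inf"), float("inf")
        else:
            lo, hi = new_lo, new_hi
    return segments

# A zigzag needs two segments with no tolerance, one with eps = 1.
print(count_segments_greedy([0, 1, 2, 3], [0, 1, 0, 1], 0))  # 2
print(count_segments_greedy([0, 1, 2, 3], [0, 1, 0, 1], 1))  # 1
```

Larger error bounds allow each segment to cover more points, so the segment count is non-increasing in `eps` for the optimal PLA; the greedy heuristic follows the same trend in typical cases.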
Finally, the tool finds the CaPLa values for each genome using the find_capla.py script and writes the results to the CaPLa.csv file.
The data used in our paper submission is described in the Reproducibility directory.