This repository provides optimized C implementations of Viterbi decoding for Hidden Markov Models (HMMs). The goal is to help practitioners apply the Viterbi algorithm easily and efficiently.
Paper link: https://arxiv.org/abs/2510.19301
Included Methods | Key Features | Implementations | Usage | Data Generation | Takeaways | Citation
This repository includes our proposed FLASH Viterbi and its beam search variant, FLASH-BS Viterbi.
- FLASH Viterbi: a full-state-space Viterbi with parallel divide-and-conquer dynamic programming.
- FLASH-BS Viterbi: an extension of the above with dynamic Beam Search pruning for improved efficiency.
It also includes baseline Viterbi algorithms for performance and memory comparisons. All implementations are rewritten and optimized in C and are designed for high-performance execution on Linux/Unix-like environments. The baselines include vanilla Viterbi, checkpoint Viterbi, SIEVE-Mp, SIEVE-BS, and SIEVE-BS-Mp.
Key features of FLASH Viterbi and its variant are as follows.
- Fast execution with the help of non-recursive design, pruning, and parallelization.
- Lightweight memory usage with the help of non-recursive design, pruning, and parallelization.
- Adaptive to diverse development scenarios by dynamically tuning internal parameters.
- Hardware-friendly properties due to non-recursive structures, the elimination of BFS traversal, and double-buffered memory schemes.
- POSIX multithreading: Uses `pthread` for efficient parallel computation.
- Divide-and-conquer decoding: Splits the sequence into segments for concurrent computation.
- Log-domain computation: Prevents numerical underflow during probability calculations.
- Performance reporting: Reports both decoding time and memory consumption after execution.
- Implements full dynamic programming over the entire state space.
- Uses chunked parallelism for efficient scaling.
- Performs efficient backtracking after decoding.
- Integrates dynamic Beam Search using a min-heap to maintain the top-B paths.
- Reduces both runtime and memory usage when `B << K`.
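The top-B bookkeeping mentioned above can be sketched with a fixed-capacity min-heap. This is a minimal illustration, not the repository's code: the names `beam_entry`, `beam_heap`, and `beam_offer` are hypothetical, and the caller is assumed to supply storage for B entries. The heap root always holds the weakest kept candidate, so a new candidate only needs to be compared against the root.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not taken from the repository. */
typedef struct { double score; int state; } beam_entry;
typedef struct { beam_entry *data; int size; int cap; } beam_heap; /* cap = beam width B */

static void beam_swap(beam_entry *a, beam_entry *b)
{
    beam_entry tmp = *a; *a = *b; *b = tmp;
}

/* Move entry i up until its parent is not larger (min-heap on score). */
static void beam_sift_up(beam_heap *h, int i)
{
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (h->data[parent].score <= h->data[i].score) break;
        beam_swap(&h->data[parent], &h->data[i]);
        i = parent;
    }
}

/* Move entry i down until both children are not smaller. */
static void beam_sift_down(beam_heap *h, int i)
{
    for (;;) {
        int left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < h->size && h->data[left].score < h->data[smallest].score) smallest = left;
        if (right < h->size && h->data[right].score < h->data[smallest].score) smallest = right;
        if (smallest == i) break;
        beam_swap(&h->data[smallest], &h->data[i]);
        i = smallest;
    }
}

/* Offer a candidate (log-score, state). Returns 1 if it was kept, 0 if pruned. */
int beam_offer(beam_heap *h, double score, int state)
{
    if (h->cap <= 0) return 0;
    if (h->size < h->cap) {               /* beam not full yet: always keep */
        int i = h->size++;
        h->data[i].score = score;
        h->data[i].state = state;
        beam_sift_up(h, i);
        return 1;
    }
    if (score <= h->data[0].score) return 0;  /* not better than the weakest kept entry */
    h->data[0].score = score;                 /* replace the weakest entry */
    h->data[0].state = state;
    beam_sift_down(h, 0);
    return 1;
}
```

Each offer costs O(log B), so scanning all candidates at one time step costs O(N log B) for N candidates, which is where a small beam width pays off.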
All baseline algorithms were originally available as open-source Python implementations that rely heavily on Python-specific libraries. Because C provides finer-grained control over multi-threading and memory management, which makes it more suitable for deployment on resource-constrained edge devices, we implemented our proposed FLASH Viterbi and FLASH-BS Viterbi in C. For a fair comparison, we also re-implemented the following baselines in C and verified that their outputs match those of the original Python versions:
- Vanilla Viterbi: the standard Viterbi algorithm;
- Checkpoint Viterbi: stores intermediate decoding states every $\sqrt{T}$ steps to reduce memory usage;
- SIEVE-Mp: applies a recursive divide-and-conquer strategy to reduce memory usage;
- SIEVE-BS: a SIEVE variant with static beam search;
- SIEVE-BS-Mp: a SIEVE-Mp variant with static beam search.
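For reference, the standard Viterbi recurrence that all of these baselines build on can be sketched in the log domain as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `viterbi_log`, the flat row-major matrix layout, and the parameter `M` (number of observation symbols) are all hypothetical choices made for this example.

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from the repository.
 * logA:  K*K row-major log transition matrix, logA[i*K + j] = log P(j | i)
 * logB:  K*M row-major log emission matrix,   logB[j*M + o] = log P(o | j)
 * logPi: K log initial probabilities
 * ob:    T observation symbols, each in [0, M)
 * Writes the best state path into path[0..T-1] and returns its log-probability,
 * or -INFINITY if memory allocation fails. Requires K > 0 and T > 0. */
double viterbi_log(int K, int M, int T,
                   const double *logA, const double *logB,
                   const double *logPi, const int *ob, int *path)
{
    double *prev = malloc((size_t)K * sizeof *prev);
    double *cur  = malloc((size_t)K * sizeof *cur);
    int *back    = malloc((size_t)T * (size_t)K * sizeof *back);
    if (!prev || !cur || !back) {
        free(prev); free(cur); free(back);
        return -INFINITY;
    }

    /* Initialization: products of probabilities become sums of logs. */
    for (int j = 0; j < K; j++)
        prev[j] = logPi[j] + logB[j * M + ob[0]];

    /* Recurrence: best predecessor for every state at every step. */
    for (int t = 1; t < T; t++) {
        for (int j = 0; j < K; j++) {
            double best = -INFINITY;
            int arg = 0;
            for (int i = 0; i < K; i++) {
                double s = prev[i] + logA[i * K + j];
                if (s > best) { best = s; arg = i; }
            }
            cur[j] = best + logB[j * M + ob[t]];
            back[(size_t)t * K + j] = arg;
        }
        double *tmp = prev; prev = cur; cur = tmp;  /* reuse the two score rows */
    }

    /* Termination and backtracking. */
    int last = 0;
    for (int j = 1; j < K; j++)
        if (prev[j] > prev[last]) last = j;
    double score = prev[last];
    path[T - 1] = last;
    for (int t = T - 1; t > 0; t--)
        path[t - 1] = back[(size_t)t * K + path[t]];

    free(prev); free(cur); free(back);
    return score;
}
```

The `back` table holds K×T entries, which is the memory cost that the checkpoint and SIEVE variants are designed to reduce.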
The unified C implementations substantially improve runtime and memory efficiency across all baselines. For example, under the same condition of
We have placed the C implementations of each baseline algorithm in the Base_line\C implementations directory. Two execution modes are provided:
- Run single code: You can modify the decoding problem scale and algorithm parameters directly in the code and run it.
- Run through Python script: A Python script is also provided to run multiple algorithms on the same problem for performance comparison.
For detailed instructions, refer to the Usage section of this README.
In addition, we provide the original Python implementations of the baselines for reference in the Base_line\Python implementations directory. All baseline algorithms are integrated into Baseline.py, where you can run the code to obtain the time and memory consumption for each algorithm. These metrics are saved in the current folder as a .txt file. Due to the complexity of memory usage in the decoding algorithms, we provide detailed memory consumption, including the memory used by variables during decoding, the memory required for the final output paths, and the maximum memory usage during BFS operations for algorithms involving BFS.
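To give a feel for the decoding-variable memory reported here, the following sketch counts how many K-wide columns a checkpointing scheme with a spacing of about $\sqrt{T}$ keeps alive, compared with the T columns of a full backpointer table. It is a back-of-the-envelope model only: the helper names are hypothetical, and the accounting in `Baseline.py` and in the C code may differ.

```c
/* Hypothetical helpers for a rough memory estimate; not taken from the repository. */

/* Smallest integer s with s*s >= T (returns 0 for T <= 0). */
int ceil_sqrt(int T)
{
    int s = 0;
    while ((long long)s * s < T) s++;
    return s;
}

/* Number of checkpoint columns stored for steps 0..T-1 at the given spacing. */
int checkpoint_count(int T, int interval)
{
    if (T <= 0 || interval <= 0) return 0;
    return (T + interval - 1) / interval;
}

/* Rough peak number of K-wide columns: all checkpoints plus the one
 * segment that is recomputed at a time during backtracking. */
int peak_columns(int T)
{
    int s = ceil_sqrt(T);
    return checkpoint_count(T, s) + s;
}
```

Under this model, T = 100 gives a spacing of 10 and a peak of about 20 columns instead of 100.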
In addition to the software-based implementations of the Viterbi algorithm, hardware-accelerated versions are also available. These implementations are written in Verilog and are designed to run on the Vivado 2019.2 platform.
Hardware implementation of the basic FLASH Viterbi algorithm, performing full state-space decoding.
Hardware implementation of FLASH Viterbi with dynamic Beam Search for faster decoding and lower memory usage.
All programs read data from the `./data/` directory. The files are generated beforehand and follow the naming pattern:
- `A_K{K}_T{T}_prob{p}.txt`: Transition matrix (A)
- `B_K{K}_T{T}_prob{p}.txt`: Emission matrix (B)
- `Pi_K{K}_T{T}_prob{p}.txt`: Initial state probabilities ($\pi$)
- `ob_K{K}_T{T}_prob{p}.txt`: Observation sequence

Here `{T}` corresponds to the observation sequence length (`obserRouteLEN` in the code), and `{p}` denotes the edge probability used during data generation.
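As an illustration of the naming pattern, the sketch below assembles one of these paths with `snprintf`. The function name `build_data_path` is hypothetical and is not part of the repository. The `{p}` field is passed as text because how it is formatted in file names depends on the data generation script.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from the repository.
 * prefix is one of "A", "B", "Pi", "ob"; prob_text is the {p} field exactly
 * as it appears in the file name. Returns 0 on success, -1 if the path does
 * not fit into out or formatting fails. */
int build_data_path(char *out, size_t out_size, const char *dir,
                    const char *prefix, int K, int T, const char *prob_text)
{
    int n = snprintf(out, out_size, "%s/%s_K%d_T%d_prob%s.txt",
                     dir, prefix, K, T, prob_text);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For example, calling it with `"./data"`, `"A"`, `K = 128`, `T = 256`, and `"0.1"` (values chosen only for illustration) yields `./data/A_K128_T256_prob0.1.txt`.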
A unified Python driver run.py is provided to compile, execute, and benchmark all algorithms:
Edit run.py to specify:
- `base_path`: path to C source files.
- `data_path`: path to input data.
- `result_path`: output directory for CSV results.
- `file_names`: list of algorithms to test.
- `parameters`: a list of parameter sets:
  - `K_STATE`: number of states, corresponding to the `{K}` field in the input data file names and matching the $K$ variable used in the paper.
  - `T_STATE`: observation dimension.
  - `obserRouteLEN`: observation sequence length (number of timesteps), corresponding to the `{T}` field in the input data file names and matching the $T$ variable used in the paper.
  - `prob`: edge probability for input data.
  - `MAX_THREADS`: number of threads to use.
  - `BeamSearchWidth`: beam width for pruning.

Then run the driver:

    python3 run.py

`run.py` applies each parameter set in the `parameters` array to each program in the `file_names` array. The results are saved separately by program name, in CSV format, to the directory given by `result_path`. Please place all tested code and `run.py` in the same directory before running.
Note that some parameters have no effect in some programs:
- `parameter['MAX_THREADS']` only works in `FLASH_Viterbi_multithread` and `FLASH_BS_Viterbi_multithread`
- `parameter['BeamSearchWidth']` only works in `FLASH_BS_Viterbi_multithread`, `SIEVE-BS` and `SIEVE-BS-Mp`
- Set the parameters in the source file to match the data you want to test.
- Use the following commands to compile and run the corresponding program:
    gcc -g -pthread FLASH_Viterbi_multithread.c -o FLASH_Viterbi_multithread -lm
    ./FLASH_Viterbi_multithread

    gcc -g -pthread FLASH_BS_Viterbi_multithread.c -o FLASH_BS_Viterbi_multithread -lm
    ./FLASH_BS_Viterbi_multithread

    gcc -g -pthread 'vanilla Viterbi.c' -o 'vanilla Viterbi' -lm
    './vanilla Viterbi'

    gcc -g -pthread 'checkpoint Viterbi.c' -o 'checkpoint Viterbi' -lm
    './checkpoint Viterbi'

    gcc -g -pthread SIEVE-Mp.c -o SIEVE-Mp -lm
    ./SIEVE-Mp

    gcc -g -pthread SIEVE-BS.c `pkg-config --cflags --libs glib-2.0` -o SIEVE-BS -lm
    ./SIEVE-BS

    gcc -g -pthread SIEVE-BS-Mp.c `pkg-config --cflags --libs glib-2.0` -o SIEVE-BS-Mp -lm
    ./SIEVE-BS-Mp

If the transition and emission matrices are too large, specify the stack size at compile time, for example by appending `-Wl,-z,stack-size=268435456` to the compilation command.
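The `-pthread` flag in the commands above is needed because the decoders split the observation sequence into segments and process them on POSIX threads. The sketch below shows only that general pattern: partition `[0, T)` into near-equal contiguous segments and hand each to one thread. All names (`segment_task`, `segment_bounds`, `run_segments`) are hypothetical, and the worker body is a placeholder that sums symbols rather than performing any Viterbi decoding.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-segment task; names are illustrative only. */
typedef struct {
    const int *ob;   /* shared, read-only observation sequence */
    int begin, end;  /* half-open range [begin, end) owned by this thread */
    long result;     /* placeholder output: sum of the symbols in the range */
} segment_task;

/* Bounds of segment idx when T items are split into P near-equal parts. */
void segment_bounds(int T, int P, int idx, int *begin, int *end)
{
    int base = T / P, extra = T % P;  /* the first `extra` segments get one more item */
    *begin = idx * base + (idx < extra ? idx : extra);
    *end   = *begin + base + (idx < extra ? 1 : 0);
}

static void *segment_worker(void *arg)
{
    segment_task *t = arg;
    t->result = 0;
    for (int i = t->begin; i < t->end; i++)
        t->result += t->ob[i];        /* stand-in for the per-segment work */
    return NULL;
}

/* Runs P workers over ob[0..T-1]. Returns the combined placeholder result,
 * or -1 if arguments are invalid, allocation fails, or a thread cannot start. */
long run_segments(const int *ob, int T, int P)
{
    if (P < 1 || T < 0) return -1;
    pthread_t *tid = malloc((size_t)P * sizeof *tid);
    segment_task *task = malloc((size_t)P * sizeof *task);
    if (!tid || !task) { free(tid); free(task); return -1; }

    int started = 0;
    for (int p = 0; p < P; p++) {
        task[p].ob = ob;
        segment_bounds(T, P, p, &task[p].begin, &task[p].end);
        if (pthread_create(&tid[p], NULL, segment_worker, &task[p]) != 0) break;
        started++;
    }

    long total = 0;
    for (int p = 0; p < started; p++) {   /* join only the threads that started */
        pthread_join(tid[p], NULL);
        total += task[p].result;
    }

    free(tid);
    free(task);
    return started == P ? total : -1;
}
```

Each thread writes only to its own `segment_task`, so no locking is needed in this sketch. A real decoder also has to combine information across segment boundaries, which is not modeled here.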
You can use the following scripts to generate synthetic datasets for testing the Viterbi implementations:
This script generates transition (A), emission (B), initial probability (Pi), and observation (ob) files using a sparse random graph.
Usage:
    python data_script.py -s <seed> -n <n_ob> -K <K> -T <T> -b <beam_width> -p <prob>

Arguments:
- `-s`: Random seed
- `-n`: Number of distinct observation symbols
- `-K`: Number of hidden states
- `-T`: Length of observation sequence
- `-b`: Beam width (for reference logging)
- `-p`: Probability of edge existence between states
Output Files:
- `A_K{K}_T{T}_prob{prob}.txt`: Transition matrix
- `B_K{K}_T{T}_prob{prob}.txt`: Emission matrix
- `Pi_K{K}_T{T}_prob{prob}.txt`: Initial state probabilities
- `ob_K{K}_T{T}_prob{prob}.txt`: Observation sequence
This script constructs a directed acyclic graph (DAG) as the HMM transition structure for testing topologically ordered models.
Usage:
    python data_script_dag.py -s <seed> -n <n_ob> -K <K> -T <T>

Arguments:
- `-s`: Random seed
- `-n`: Number of distinct observation symbols
- `-K`: Number of hidden states
- `-T`: Length of observation sequence
Output Files:
- `A_K{K}_T{T}_DAG.txt`: Transition matrix from a DAG
- `B_K{K}_T{T}_DAG.txt`: Emission matrix
- `Pi_K{K}_T{T}_DAG.txt`: Initial state probabilities
- `ob_K{K}_T{T}_DAG.txt`: Observation sequence
- The FLASH_BS_Viterbi implementation is more memory-efficient for large state spaces
- The FLASH_Viterbi implementation may be faster for small state spaces
- Actual performance depends on observation sequence length and available cores
If you find FLASH Viterbi useful, please consider citing our paper.
@inproceedings{deng2026flash,
title={FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems},
author={Deng, Ziheng and Liu, Xue and Jiang, Jiantong and Li, Yankai and Deng, Qingxu and Yang, Xiaochun},
booktitle={IEEE International Conference on Data Engineering (ICDE)},
year={2026},
}