Just another minhash (jam) implementation. A high-performance minhashing tool for genomic sequence similarity analysis, specifically optimized for plasmids and other small genomic elements.
Implements the FracMinHash algorithm for rapid similarity comparisons with enhanced metadata tracking including GC content and sequence length categories tuned for typical plasmid ranges.
Install the latest release from crates.io:
cargo install jam-rs
Or install the development version from git:
cargo install --git https://github.com/St4NNi/jam-rs
- Plasmid-optimized: GC content and length categories specifically tuned for plasmid analysis (30-70% GC, 1-500 kb lengths)
- Fast sketching: Optimized hash functions, with entropy filtering of k-mers to exclude low-complexity regions
- Rich metadata: Enhanced metadata tracking with file index, GC category, and length category for each hash
- Memory-efficient: External sorting for processing datasets larger than available RAM
- LMDB storage: Fast random access and compact representation with dual database structure
- Parallel execution: File-level parallelization with configurable thread count
Multiple scaling methods for different use cases:
- FracMinHash (--fscale): Restricts hash-space to a fraction of u64::MAX / fscale
- Max hashes (--nmax): Limits the maximum number of hashes per sequence (memory control)
- Complexity filtering (--complexity): Only hash sequences with Shannon entropy above threshold (default: 0.0)
- Singleton mode (--singleton): Creates a separate sketch per sequence record
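The two hash-limiting options above can be illustrated with a short Python sketch. This is not jam's implementation: the hash function (BLAKE2b truncated to 64 bits), the helper names hash_kmer and frac_minhash, and the reading of --nmax as "keep the smallest nmax hashes" are assumptions made for illustration, and canonical (reverse-complement) k-mers are ignored.

```python
import hashlib
from typing import List, Optional

U64_MAX = 2**64 - 1


def hash_kmer(kmer: str) -> int:
    """Hypothetical 64-bit hash; jam's own hash function may differ."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(
    seq: str,
    k: int = 21,
    fscale: Optional[int] = None,
    nmax: Optional[int] = None,
) -> List[int]:
    """Return the sorted, de-duplicated k-mer hashes kept for one sequence."""
    # FracMinHash: keep only hashes at or below u64::MAX / fscale.
    threshold = U64_MAX // fscale if fscale else U64_MAX
    hashes = set()
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h <= threshold:
            hashes.add(h)
    kept = sorted(hashes)
    # Assumed meaning of --nmax: cap the sketch at the nmax smallest hashes.
    if nmax is not None:
        kept = kept[:nmax]
    return kept
```

Because the threshold depends only on fscale, two sketches built with the same k and fscale keep the same subset of hash space, which is what makes their hash sets directly comparable.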
$ jam --help
Just another (genomic) minhasher (jam), obviously blazingly fast
Usage: jam [OPTIONS] <COMMAND>
Commands:
sketch Sketch one or more files and write the result to an output file
dist Estimate containment of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
stats Display statistics about an LMDB database
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-s, --silent Silent mode, no (additional) output to stdout Only errors and output files will be printed
-h, --help Print help
-V, --version Print version
Create sketches from FASTA/FASTQ files. Supports single files, multiple files, or directories.
$ jam sketch --help
Sketch one or more files and write the result to an output file
Usage: jam sketch [OPTIONS] --output <OUTPUT> [INPUT]...
Arguments:
[INPUT]... Input file(s), directories, or file with list of files to be hashed
Options:
-o, --output <OUTPUT> Output file (.lmdb will be appended if not present)
-k, --kmer-size <KMER_SIZE> K-mer size, all sketches must have the same size to be compared and below 32 [default: 21]
--fscale <FSCALE> Scale the hash space to a minimum fraction of the maximum hash value (FracMinHash)
--nmax <NMAX> Maximum number of k-mers (per record) to be hashed, top cut-off
--complexity <COMPLEXITY> Complexity cut-off, only hash sequences with complexity above this value This is created via shannon entropy [default: 0.0]
--singleton Create a separate sketch for each sequence record Will increase the size of the output file
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
Examples:
# Basic plasmid sketching
jam sketch plasmid.fasta -o sketch.lmdb
# Directory of plasmid files with explicit k-mer size and 8 threads
jam sketch plasmids/ -o plasmid_db.lmdb -k 21 -t 8
# Large collections with memory limits and complexity filtering
jam sketch large_collection/ -o database.lmdb --nmax 10000 --fscale 1000000 --complexity 1.5
# Separate sketch per plasmid sequence
jam sketch multi_plasmids.fasta -o sketches.lmdb --singleton
Compare sequences against a sketch database. Supports both raw sequence files and pre-computed sketches.
$ jam dist --help
Estimate containment of a (small) sketch against a subset of one or more sketches as database. Requires all sketches to have the same kmer size
Usage: jam dist [OPTIONS] --input <INPUT> --database <DATABASE>
Options:
-i, --input <INPUT> Input sketch or raw sequence file
-d, --database <DATABASE> Database sketch (.lmdb file)
-o, --output <OUTPUT> Output to file instead of stdout
-c, --cutoff <CUTOFF> Cut-off value for similarity/containment [default: 0.0]
--singleton Singleton mode, process each query sequence separately
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
Examples:
# Query plasmid against database
jam dist -i query_plasmid.fasta -d plasmid_db.lmdb -c 0.1 -o results.tsv
# Sketch-to-sketch comparison
jam dist -i query.lmdb -d plasmid_db.lmdb -c 0.05
# Process each sequence separately with singleton mode
jam dist -i multi_query.fasta -d plasmid_db.lmdb --singleton -c 0.1
Display database statistics including hash counts and distribution analysis.
$ jam stats --help
Display statistics about an LMDB database
Usage: jam stats [OPTIONS] --input <INPUT>
Options:
-i, --input <INPUT> Input LMDB database
-s, --short Short summary only
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
Examples:
# Summary statistics
jam stats -i plasmid_db.lmdb --short
# Detailed distributions
jam stats -i plasmid_db.lmdb
Distance results are tab-separated with columns:
query target containment_query_in_target containment_target_in_query jaccard shared_hashes query_hashes target_hashes
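As a rough guide to how these columns relate to each other, the following Python sketch derives them from two sets of hashes. The formulas are the standard set-based definitions of containment and Jaccard similarity; that jam computes them exactly this way, and the helper names compare and to_tsv, are assumptions.

```python
COLUMNS = [
    "query",
    "target",
    "containment_query_in_target",
    "containment_target_in_query",
    "jaccard",
    "shared_hashes",
    "query_hashes",
    "target_hashes",
]


def compare(query_name, target_name, query_hashes, target_hashes):
    """Compute one result row from two collections of hashes."""
    q, t = set(query_hashes), set(target_hashes)
    shared = len(q & t)
    union = len(q | t)
    return {
        "query": query_name,
        "target": target_name,
        # Fraction of the query's hashes that also occur in the target.
        "containment_query_in_target": shared / len(q) if q else 0.0,
        # Fraction of the target's hashes that also occur in the query.
        "containment_target_in_query": shared / len(t) if t else 0.0,
        # Shared hashes divided by the size of the union.
        "jaccard": shared / union if union else 0.0,
        "shared_hashes": shared,
        "query_hashes": len(q),
        "target_hashes": len(t),
    }


def to_tsv(row):
    """Format a row in the column order listed above."""
    return "\t".join(str(row[c]) for c in COLUMNS)
```

The two containment values differ whenever the sketches differ in size: a small plasmid fully contained in a larger one has a query-in-target containment near 1 but a much lower target-in-query containment and Jaccard value.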
Statistics include hash counts, GC content distribution, and sequence length categories optimized for plasmids and small genomic elements.
JAM uses entropy-filtered k-mers to exclude low-complexity regions, stores rich metadata (file index, GC category, length category) with each hash, and employs external sorting for memory-efficient processing of large datasets. The categorization system is specifically tuned for plasmid analysis, with fine-grained bins in typical plasmid size and GC content ranges.
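To make the entropy filter concrete, here is a minimal Python sketch of Shannon entropy over the characters of a sequence. Whether jam evaluates entropy per k-mer or per record, over single nucleotides or longer words, and whether the comparison with the cut-off is strict are all assumptions here; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol of the characters in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    # H = sum over symbols of p * log2(1 / p), with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def passes_complexity(seq: str, cutoff: float = 0.0) -> bool:
    """Assumed filter: keep seq only if its entropy is above the cut-off."""
    return shannon_entropy(seq) > cutoff
```

For a four-letter DNA alphabet the entropy ranges from 0 bits (a homopolymer such as AAAA) to 2 bits (all four bases equally frequent), so under this definition a cut-off of 1.5 would discard strongly biased, repeat-like stretches.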
This project is licensed under the MIT license. See the LICENSE file for more info.
If you have any ideas, suggestions, or issues, please don't hesitate to open an issue and/or PR. Contributions to this project are always welcome! We appreciate your help in making this project better.
This tool is inspired by finch-rs and sourmash. Check them out if you need a more mature ecosystem with well-tested hash functions and more features.