Releases: pachterlab/kb_python
v0.30.0
Credits to @BenjaminDEMAILLE
Implementation Summary: GitHub Issues #295 and #296
Overview
This implementation addresses two feature requests for kb_python:
- Issue #295: Add output mode compatible with Read10X() in Seurat (spliced/unspliced separated as in CellRanger format)
- Issue #296: Enable gzip compression of output files and optional cleanup of BUS files to reduce disk usage
Changes Made
1. New Command-Line Flags (main.py)
Added the following flags to the kb count command:
--cellranger
- Purpose: Convert count matrices to CellRanger-compatible format
- Auto-enabled features:
  - Gzip compression: Automatically compresses output matrices (can be disabled with --no-gzip)
  - CellRanger-style directories: For nac/lamanno workflows, automatically creates spliced/ and unspliced/ subdirectories
- Files created: matrix.mtx.gz, barcodes.tsv.gz, genes.tsv.gz (or features.tsv.gz)
- Usage: kb count --cellranger ... (gzip is automatic)
- Disable gzip: kb count --cellranger --no-gzip ...
--gzip
- Purpose: Manually enable gzip compression (not needed if using --cellranger)
- Default: False (but auto-enabled with --cellranger)
- Usage: kb count --gzip ... (without --cellranger)
--no-gzip
- Purpose: Disable automatic gzip compression when using --cellranger
- Default: False
- Usage: kb count --cellranger --no-gzip ...
--delete-bus
- Purpose: Delete intermediate BUS files after successful count to save disk space
- Files deleted: All .bus files generated during processing
- Default: False
- Usage: kb count --delete-bus ...
- Safety: Only deletes files after successful completion of count operation
CellRanger-Style Output Structure
When using --cellranger with nac/lamanno workflows, the output is automatically organized as:
counts_unfiltered/
  spliced/
    matrix.mtx.gz
    barcodes.tsv.gz
    genes.tsv.gz
  unspliced/
    matrix.mtx.gz
    barcodes.tsv.gz
    genes.tsv.gz
  cellranger_ambiguous/
    matrix.mtx.gz
    barcodes.tsv.gz
    genes.tsv.gz
This structure is compatible with Seurat's Read10X() function and works with both filtered and unfiltered outputs.
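As a rough illustration of the layout above, the following Python sketch builds the list of files one would expect to find after a nac/lamanno run with --cellranger. The helper names `expected_cellranger_files` and `missing_files` are hypothetical and not part of kb_python; the subdirectory and file names are taken from the tree above, and the uncompressed names used for --no-gzip are an assumption.

```python
from pathlib import Path

# Subdirectories and base file names as listed in the tree above.
SUBDIRS = ("spliced", "unspliced", "cellranger_ambiguous")
FILES = ("matrix.mtx", "barcodes.tsv", "genes.tsv")


def expected_cellranger_files(out_dir, counts_dir="counts_unfiltered", gzip=True):
    """Return the paths expected for a nac/lamanno run with --cellranger.

    Hypothetical helper for illustration only; not a kb_python function.
    With gzip=False the ".gz" suffix is dropped (assumed naming).
    """
    suffix = ".gz" if gzip else ""
    base = Path(out_dir) / counts_dir
    return [base / sub / (name + suffix) for sub in SUBDIRS for name in FILES]


def missing_files(out_dir, **kwargs):
    """List the expected files that do not exist on disk."""
    return [p for p in expected_cellranger_files(out_dir, **kwargs) if not p.exists()]
```

Calling `missing_files("output/")` after a run would return an empty list if every expected file is present.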
2. Function Updates
matrix_to_cellranger() (count.py)
- New parameter: gzip: bool = False
- Behavior:
  - When gzip=True, outputs .gz compressed versions of all files
  - Uses compress_gzip() from utils for matrix file compression
  - Uses open_as_text() for text files (barcodes, genes) to handle gzip automatically
- Backward compatible: Default behavior unchanged
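The gzip handling described above can be sketched with the Python standard library alone. The functions `open_text`, `gzip_file`, and `write_barcodes` below are hypothetical stand-ins written for this example; they approximate what open_as_text() and compress_gzip() are described as doing, and are not the actual kb_python or ngs_tools implementations.

```python
import gzip
import shutil


def open_text(path, mode="w"):
    """Open a text file, using gzip transparently when the path ends in .gz.

    Stand-in for the behavior attributed to open_as_text() (assumed).
    """
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def gzip_file(src, dst):
    """Compress src into dst; stand-in for compress_gzip() (assumed)."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def write_barcodes(barcodes, out_dir, use_gzip=False):
    """Write one barcode per line to barcodes.tsv or barcodes.tsv.gz."""
    path = f"{out_dir}/barcodes.tsv" + (".gz" if use_gzip else "")
    with open_text(path, "w") as f:
        for bc in barcodes:
            f.write(bc + "\n")
    return path
```

Keeping `use_gzip=False` as the default mirrors the backward-compatible default noted above.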
count() function (count.py)
- New parameters:
  - gzip: bool = False
  - delete_bus: bool = False
- Changes:
  - Passes gzip parameter to all matrix_to_cellranger() calls
  - Passes gzip parameter to filter_with_bustools()
  - Implements BUS file cleanup at end of function when delete_bus=True
  - Collects all BUS file paths from results and deletes them
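A minimal sketch of the cleanup step described above, assuming the results are a nested dictionary whose string values are file paths. That structure, and the names `collect_bus_paths` and `cleanup_bus`, are assumptions made for this example rather than the actual kb_python code.

```python
import os


def collect_bus_paths(results):
    """Recursively collect string values ending in .bus from a nested dict."""
    paths = []
    for value in results.values():
        if isinstance(value, dict):
            paths.extend(collect_bus_paths(value))
        elif isinstance(value, str) and value.endswith(".bus"):
            paths.append(value)
    return paths


def cleanup_bus(results, delete_bus=False):
    """Delete the BUS files listed in results and return the deleted paths.

    Intended to be the last call in a count function, so that an exception
    raised by an earlier step means no file is removed.
    """
    if not delete_bus:
        return []
    deleted = []
    for path in collect_bus_paths(results):
        if os.path.exists(path):
            os.remove(path)
            deleted.append(path)
    return deleted
```

Placing the call at the very end of the function is one simple way to get the "only after successful completion" guarantee without extra state.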
count_nac() function (count.py)
- New parameters:
  - gzip: bool = False
  - cellranger_style: bool = False
  - delete_bus: bool = False
- Changes:
  - Passes gzip parameter to all matrix_to_cellranger() calls
  - Implements cellranger_style logic:
    - For processed matrices (index 0): creates spliced/ subdirectory
    - For unprocessed matrices (index 1): creates unspliced/ subdirectory
    - For ambiguous matrices: uses default naming
    - Works for both filtered and unfiltered outputs
  - Implements BUS file cleanup at end of function when delete_bus=True
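The directory selection described for cellranger_style can be pictured as a small mapping. In the sketch below, the function name `cellranger_dir` is hypothetical, and the default naming scheme `cellranger_<prefix>` is an assumption extrapolated from the cellranger_ambiguous/ directory shown earlier.

```python
import os

# Mapping used only when cellranger_style is enabled (names from the notes above).
STYLE_NAMES = {"processed": "spliced", "unprocessed": "unspliced"}


def cellranger_dir(counts_dir, prefix, cellranger_style=False):
    """Pick the output subdirectory for one matrix of a nac/lamanno run.

    prefix is one of "processed", "unprocessed", or "ambiguous".
    The "cellranger_<prefix>" default is an assumed naming scheme.
    """
    if cellranger_style and prefix in STYLE_NAMES:
        name = STYLE_NAMES[prefix]
    else:
        name = f"cellranger_{prefix}"
    return os.path.join(counts_dir, name)
```

Because ambiguous matrices are absent from the mapping, they fall through to the default name whether or not cellranger_style is set.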
filter_with_bustools() function (count.py)
- New parameter: gzip: bool = False
- Changes: Passes gzip parameter to matrix_to_cellranger() call
3. Integration in main.py
parse_count() function
- Added logic to auto-enable gzip when --cellranger is used (unless --no-gzip is specified)
- Auto-enables cellranger-style directory structure for nac/lamanno workflows when --cellranger is used
- Passes appropriate parameters to count functions:
  - gzip = (args.cellranger and not args.no_gzip) or args.gzip
  - cellranger_style = args.cellranger (for count_nac)
  - delete_bus = args.delete_bus
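To make the flag resolution concrete, here is a small argparse sketch that reproduces the three expressions listed above. The parser is a stripped-down stand-in, not the real kb count parser, and `build_parser` and `resolve_options` are hypothetical names.

```python
import argparse


def build_parser():
    """Minimal parser holding only the four flags discussed here."""
    parser = argparse.ArgumentParser(prog="kb-count-sketch")
    parser.add_argument("--cellranger", action="store_true")
    parser.add_argument("--gzip", action="store_true")
    parser.add_argument("--no-gzip", dest="no_gzip", action="store_true")
    parser.add_argument("--delete-bus", dest="delete_bus", action="store_true")
    return parser


def resolve_options(args):
    """Apply the resolution expressions quoted in the release notes."""
    return {
        "gzip": bool((args.cellranger and not args.no_gzip) or args.gzip),
        # Only count_nac consumes cellranger_style, per the notes above.
        "cellranger_style": bool(args.cellranger),
        "delete_bus": bool(args.delete_bus),
    }
```

One consequence of the expression as written is that an explicit --gzip takes precedence over --no-gzip when both are passed.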
Usage Examples
Example 1: Standard workflow with automatic gzip compression
kb count -i index.idx -g t2g.txt -x 10XV3 -o output/ \
--cellranger \
sample_R1.fastq.gz sample_R2.fastq.gz
Note: Gzip compression is now automatic! No need for --gzip flag.
Example 2: NAC workflow with automatic CellRanger-style output
kb count -i index.idx -g t2g.txt -c1 cdna_t2c.txt -c2 intron_t2c.txt \
-x 10XV3 -o output/ --workflow=nac \
--cellranger \
sample_R1.fastq.gz sample_R2.fastq.gz
This automatically creates:
- output/counts_unfiltered/spliced/ with spliced matrices (gzipped)
- output/counts_unfiltered/unspliced/ with unspliced matrices (gzipped)
Example 3: Disable gzip if needed
kb count -i index.idx -g t2g.txt -x 10XV3 -o output/ \
--cellranger --no-gzip \
sample_R1.fastq.gz sample_R2.fastq.gz
Example 4: All features combined
kb count -i index.idx -g t2g.txt -c1 cdna_t2c.txt -c2 intron_t2c.txt \
-x 10XV3 -o output/ --workflow=nac \
--cellranger --delete-bus \
sample_R1.fastq.gz sample_R2.fastq.gz
This automatically enables gzip and creates CellRanger-style directories!
Benefits
- Disk Space Savings:
  - Gzip compression reduces matrix file sizes by 70-90%
  - Automatically enabled with --cellranger flag
  - BUS file deletion frees up significant temporary storage
  - Particularly beneficial for large-scale processing pipelines
- Seurat Compatibility:
  - --cellranger automatically creates Seurat-compatible output
  - CellRanger-style directories created automatically for nac/lamanno workflows
  - Spliced/unspliced separation allows easy loading of separate assays
  - Compatible with existing R-based analysis workflows
  - No additional flags needed!
- Standard Compliance:
  - Gzipped outputs align with CellRanger format standards (enabled by default)
  - Scanpy and other tools natively support .gz files
  - No manual compression/decompression needed
- Simplified Usage:
  - Just use --cellranger and everything is configured automatically
  - Gzip compression: automatic
  - CellRanger-style directories (for nac/lamanno): automatic
  - Can disable gzip with --no-gzip if needed
Backward Compatibility
All changes are backward compatible:
- New flags are optional (default: False)
- Existing commands continue to work without modification
- Default behavior unchanged when new flags not specified
- Function signatures extended with optional parameters (defaults maintain old behavior)
Testing Recommendations
- Test gzip compression:
  - Verify .gz files are created
  - Verify compressed files can be read by Seurat/Scanpy
  - Compare file sizes before/after compression
- Test CellRanger-style directories:
  - Verify spliced/ and unspliced/ subdirectories are created
  - Verify matrices contain correct data
  - Test with Seurat's Read10X() function
- Test BUS file deletion:
  - Verify BUS files are deleted after successful run
  - Verify final matrices are still correct
  - Test that deletion doesn't occur if count fails
- Test combinations:
  - All flags together
  - Each flag independently
  - With filtered and unfiltered outputs
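Two of the gzip checks above (the file is a readable gzip stream, and the size comparison) could be automated along these lines. This is a sketch using only the standard library; `is_valid_gzip` and `size_reduction` are hypothetical helper names, and the sketch does not cover reading the matrices with Seurat or Scanpy.

```python
import gzip
import os
import zlib


def is_valid_gzip(path):
    """Return True if path can be fully decompressed as a gzip stream."""
    try:
        with gzip.open(path, "rb") as f:
            # Read in 1 MiB chunks so large matrices are not held in memory.
            while f.read(1 << 20):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def size_reduction(original, compressed):
    """Fraction of disk space saved by compression (original must be non-empty)."""
    return 1.0 - os.path.getsize(compressed) / os.path.getsize(original)
```

Running `size_reduction` on an uncompressed and a compressed copy of the same matrix gives a direct check of the savings on one's own data.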
Files Modified
- /Users/benjamin/kb_python/kb_python/main.py
  - Added --cellranger, --gzip, --no-gzip, and --delete-bus command-line arguments
  - Updated parse_count() to auto-enable gzip and cellranger-style when --cellranger is used
  - Logic: use_gzip = (args.cellranger and not args.no_gzip) or args.gzip
  - Logic: use_cellranger_style = args.cellranger (for nac/lamanno)
- /Users/benjamin/kb_python/kb_python/count.py
  - Updated matrix_to_cellranger() with gzip support
  - Updated count() with gzip and delete_bus support
  - Updated count_nac() with gzip, cellranger_style, and delete_bus support
  - Updated filter_with_bustools() with gzip support
  - Added BUS file cleanup logic to count functions
Notes
- Simplified workflow: --cellranger now automatically enables both gzip and cellranger-style directories
- The cellranger-style flag has been removed as it's redundant (automatic with --cellranger)
- Gzip compression uses existing compress_gzip() utility from ngs_tools
- BUS file deletion only occurs after successful completion of the count operation
- For smartseq3 technology, handles multiple BUS file suffixes correctly
- Use --no-gzip with --cellranger if you need uncompressed output
v0.29.5
v0.29.4
v0.29.3
- Added support for 10xv4
- Added --exact-barcodes to kb count to "correct" barcodes to an on-list using only exact matches (i.e. no mismatches permitted)
- Updated the bustools binary to 0.45.0
- Allow kb count -g None (i.e. not supplying a t2g.txt file), in which case a synthetic one is generated with each target/transcript being its own gene.
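The synthetic transcript-to-gene mapping mentioned in the last item can be pictured as every target being mapped to itself. The sketch below assumes a two-column, tab-separated layout (target, then gene); that layout and the function name `synthetic_t2g_lines` are assumptions for illustration, and the file kb_python actually writes may contain more columns.

```python
def synthetic_t2g_lines(targets):
    """Map each target/transcript to itself as its own gene.

    Returns tab-separated "target<TAB>gene" lines (assumed two-column layout).
    """
    return [f"{target}\t{target}" for target in targets]


def write_synthetic_t2g(targets, path):
    """Write the synthetic mapping to path, one target per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in synthetic_t2g_lines(targets):
            f.write(line + "\n")
    return path
```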
v0.29.2
v0.29.1
Updates since version 0.28.2:
Major:
- Upgraded kallisto to 0.51.1 and bustools to 0.44.1
- Added the lr-kallisto (--long) option and enabled k>31
- Added kb extract
- Added various kallisto binaries (w/ and w/o optimizations; w/ and w/o long k-mer sizes)
Other:
- Allow -i NONE in kb ref to create t2g+fasta but no index
- Various bug fixes (pandas version dependency, adata.X in nac containing total matrix, summing matrices not mishandling scientific notation, etc.)
- Ended support for python 3.7
v0.28.2
v0.28.1
v0.28.0
Implements all the updates detailed in the protocols paper: https://doi.org/10.1101/2023.11.21.568164
- kallisto version 0.50.1
- bustools version 0.43.1
v0.27.3
General
- Bumped ngs-tools>=1.7.3.
ref
- [DEPRECATION] Split index generation using -n has been fully deprecated. (Thanks to @amcdavid for catching a bug)
count
- Fixed a minor issue with --workflow kite:10xFB, where bustools project would be called before bustools correct (the order should be opposite). This fix required a bump to the ngs-tools dependency.
- Support for --workflow lamanno for -x smartseq3.
- [DEPRECATION] Counting using split indices by providing a comma-delimited list to -i has been fully deprecated.
- Support for whitelist (-w option) for bulk, smartseq2 and smartseq3 technologies.
- Added support for -x 10XV3_ULTIMA.