ant-research/long-context-modeling

Milestones

"Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling" (ICML 2025) link

Achieved 1000x length extrapolation, but it is limited by the inability to retrieve at every token: retrieval happens only once every S tokens, so its random-access capability is not flexible enough.

"Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access" (NeurIPS 2025)

Compared to GCA, HSA achieves token-by-token retrieval, but we found that its extrapolation ability is not as strong as GCA's. We recently found that combining HSA with a short sliding window instead of Mamba yields stronger extrapolation capability.

After the release of this work, we attempted to scale up to a larger model and pre-trained it on trillions of tokens. However, we found that its extrapolation capability completely disappeared. Therefore, we strongly recommend using HSA together with SWA. We will soon release a tech report on the HSA+SWA-based 8BA1B MoE architecture, which maintains strong extrapolation ability (16M context) even after pre-training on trillions of tokens.

Core idea of HSA

The core idea of HSA is to perform sparse attention in a manner akin to a mixture of experts (MoE).

Overall, we split the KV cache into fixed-length chunks, each with a summary representation. Each token retrieves its top-k chunks via these summary representations, performs attention over the tokens in each retrieved chunk separately, and then fuses the per-chunk attention outputs using the normalized retrieval scores, as sketched below.
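
The following is a minimal, single-head PyTorch sketch of this chunk-retrieve-attend-fuse pattern, for illustration only. The function name hsa_sketch, the mean-pooled chunk summaries, and the default chunk size and top-k are assumptions, not the repository's actual API; the real implementation is a causal, hardware-aligned Triton kernel.

import torch
import torch.nn.functional as F

def hsa_sketch(q, k, v, chunk_size=64, top_k=4):
    """q, k, v: (T, d). Returns (T, d).

    Illustrative sketch only: ignores causal masking, batching, and multiple
    heads, and uses mean-pooled keys as a stand-in for learned summary tokens.
    """
    T, d = k.shape
    n_chunks = T // chunk_size
    top_k = min(top_k, n_chunks)
    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)

    # 1) One summary representation per fixed-length chunk.
    summaries = k_chunks.mean(dim=1)                       # (n_chunks, d)

    # 2) Each token retrieves its top-k chunks via the summaries,
    #    and the retrieval scores are normalized into fusion weights.
    retrieval_logits = q @ summaries.T                     # (T, n_chunks)
    top_scores, top_idx = retrieval_logits.topk(top_k, dim=-1)
    gates = F.softmax(top_scores, dim=-1)                  # (T, top_k)

    # 3) Attend over the tokens inside each retrieved chunk separately,
    #    then fuse the per-chunk outputs with the gates (MoE-style mixing).
    out = torch.zeros_like(q)
    for i in range(top_k):
        sel_k = k_chunks[top_idx[:, i]]                    # (T, chunk_size, d)
        sel_v = v_chunks[top_idx[:, i]]                    # (T, chunk_size, d)
        attn = F.softmax((q.unsqueeze(1) @ sel_k.transpose(1, 2)) / d ** 0.5, dim=-1)
        chunk_out = (attn @ sel_v).squeeze(1)              # (T, d)
        out = out + gates[:, i : i + 1] * chunk_out
    return out

# Example: x = torch.randn(1024, 64); out = hsa_sketch(x, x, x)  # (1024, 64)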

Results (To be updated for HSA)

All models were pre-trained on contexts of no more than 16K tokens, and all attention spans were limited to no more than 728 tokens. Our model (DRT) achieves 1000x extrapolation on the needle-in-a-haystack task, maintaining high accuracy even at a 16M context length.

Environments

torch==2.4.0, transformers>=4.36.0, triton==3.0.0

pip install -r requirements.txt
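
For reference, the version pins above correspond to a requirements file along these lines (the repository's actual requirements.txt may list additional packages):

torch==2.4.0
transformers>=4.36.0
triton==3.0.0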

Data Preparation

Before pre-training, ensure that the corpus is indexed. Pre-processing script:

Pile: python preprocess/pile_neox.py

Unit tests

Test the Triton kernel:

pytest ops/hsa_triton.py

Pre-training

sh scripts/pretrain_pile/pretrain_model.sh

Contact

If you encounter any problems, please feel free to contact us: aaron.hx AT antgroup.com
