conda create -n neural-sparse-mdbert python=3.9
conda activate neural-sparse-mdbert
pip install -r requirements.txt
To evaluate search relevance or mine hard negatives, run an OpenSearch node on your local machine. It can be accessed at http://localhost:9200 without a username/password (security disabled). For more details, please check the OpenSearch documentation. Here are the steps to start a node without security:
- Follow step 1 and step 2 in the documentation above.
- Modify `/path/to/opensearch-2.16.0/config/opensearch.yml` and add this line: `plugins.security.disabled: true`
- Start a tmux session so OpenSearch won't stop after the terminal is closed: `tmux new -s opensearch`. In the tmux session, run `cd /path/to/opensearch-2.16.0` and then `./bin/opensearch`.
- The service is now running. Run `curl -X GET http://localhost:9200` to test it.
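If you would rather run the same check from Python (for example, inside a data-preparation script), here is a minimal sketch using the requests package; the package is an assumption and not necessarily listed in requirements.txt.
# Minimal sketch of the same connectivity test in Python.
# Assumes the `requests` package is installed (not necessarily in requirements.txt).
import requests

resp = requests.get("http://localhost:9200", timeout=5)
resp.raise_for_status()
print("OpenSearch version:", resp.json()["version"]["number"])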
Here is an example of reproducing an L0-enhanced inf-free model.
python prepare_msmarco_hard_negatives.py
bash run_train_eval.sh configs/config_l0.yaml
Here is an example of fine-tuning the opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini model on the BEIR SciFact dataset.
- Generate training data. Run `python demo_train_data.py` (data parallel) or `torchrun --nproc_per_node=${N_DEVICES} demo_train_data.py` (distributed data parallel) with the configs below.
- This will generate training data with hard negatives at `data/scifact_train.jsonl`.
torchrun --nproc_per_node=${N_DEVICES} demo_train_data.py \
--model_name_or_path opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini \
--inf_free true \
--idf_path idf.json \
--beir_dir data/beir \
--beir_datasets scifact
- Run training. Run `python train_ir.py {config_file}` (data parallel) or `torchrun --nproc_per_node=${N_DEVICES} train_ir.py config.yaml` (distributed data parallel).
  - If training with the infoNCE loss, use `configs/config_infonce.yaml`.
  - If training with ensembled teacher models, use `configs/config_kd.yaml`.
- Run evaluation on the test set.
# evaluate each saved checkpoint on the test set
OUTPUT_DIR="output/test"
for step in {500,1000,1500,2000}
do
torchrun --nproc_per_node=8 evaluate_beir.py \
--model_name_or_path ${OUTPUT_DIR}/checkpoint-${step} \
--inf_free true \
--idf_path idf.json \
--output_dir ${OUTPUT_DIR} \
--log_level info \
--beir_datasets scifact \
--per_device_eval_batch_size 50
done
Training with the infoNCE loss pushes the model to generate higher scores for the positive pairs than for all other pairs. The training_mode should be `infonce`.
python train_ir.py configs/config_infonce.yaml
Run with distributed data parallel:
# the number of GPUs
N_DEVICES=8
torchrun --nproc_per_node=${N_DEVICES} train_ir.py configs/config_infonce.yaml
The data file is a JSONL file; each line is a data sample like this:
{
"query":"xxx xxx xxx",
"pos":"xxxx xxxx xxxx",
"negs": ["xxx", "xxx", "xxx", "xxx"],
}
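For illustration, here is a short sketch that writes one sample in this format to a JSONL file; the file name and the example texts are placeholders, not data shipped with this repository.
# Illustration only: write infoNCE-style training samples to a JSONL file.
# The file name and texts below are placeholders.
import json

samples = [
    {
        "query": "what causes rainbows",
        "pos": "A rainbow is caused by refraction and dispersion of sunlight in water droplets.",
        "negs": [
            "The capital of France is Paris.",
            "Photosynthesis converts light energy into chemical energy.",
        ],
    }
]

with open("data/my_train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")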
To ensemble dense and sparse teachers to generate supervisory signals for knowledge distillation, the training_mode should be `kd-ensemble`. The supervisory signals are generated dynamically during training.
Run with data parallel:
python train_ir.py configs/config_kd.yaml
Run with distributed data parallel:
# the number of GPUs
N_DEVICES=8
torchrun --nproc_per_node=${N_DEVICES} train_ir.py configs/config_kd.yaml
The data file has the same format as training with infoNCE.
For expensive teacher models like LLMs or cross-encoders, we can calculate the scores in advance and store them. To run with pre-computed KD scores, the training_mode should be `kd`.
The data file is a JSONL file; each line is a data sample like this:
{
"query":"xxx xxx xxx",
"docs": ["xxx", "xxx", "xxx", "xxx"],
"scores": [1.0, 5.0, 9.0, 4.4]
}
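As one way to produce such a file, the sketch below scores each (query, doc) pair with a cross-encoder from sentence-transformers and writes the results in this format; the teacher model name, the input file layout, and the file paths are assumptions rather than parts of this repository.
# Hypothetical sketch: pre-compute teacher scores with a cross-encoder and
# store them in the kd data format. Model name and file paths are examples only.
import json

from sentence_transformers import CrossEncoder

teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

with open("data/train_raw.jsonl") as fin, open("data/train_kd.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)        # expects {"query": ..., "docs": [...]}
        pairs = [(sample["query"], doc) for doc in sample["docs"]]
        scores = teacher.predict(pairs)  # one relevance score per (query, doc) pair
        fout.write(json.dumps({
            "query": sample["query"],
            "docs": sample["docs"],
            "scores": [float(s) for s in scores],
        }) + "\n")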