conda create -n neural-sparse-mdbert python=3.9
conda activate neural-sparse-mdbert
pip install -r requirements.txt
To evaluate search relevance or mine hard negatives, run an OpenSearch node on your local machine. It can be accessed at http://localhost:9200 without a username/password (security disabled). For more details, please check the OpenSearch documentation. Here are the steps to start a node without security:
- Follow step 1 and step 2 in the documentation above.
- Modify `/path/to/opensearch-2.16.0/config/opensearch.yml` and add this line: `plugins.security.disabled: true`
- Start a tmux session so OpenSearch won't stop after the terminal is closed: `tmux new -s opensearch`. In the tmux session, run `cd /path/to/opensearch-2.16.0` and then `./bin/opensearch`.
- The service is now running. Run `curl -X GET http://localhost:9200` to test it.
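If you would rather run the same check from Python (for example, inside a data-preparation script), here is a minimal sketch using the requests package; the package is an assumption and not necessarily listed in requirements.txt.
# Minimal sketch of the same connectivity test in Python.
# Assumes the `requests` package is installed (not necessarily in requirements.txt).
import requests

resp = requests.get("http://localhost:9200", timeout=5)
resp.raise_for_status()
print("OpenSearch version:", resp.json()["version"]["number"])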
Here is an example of reproducing an L0-enhanced inf-free model.
python prepare_msmarco_hard_negatives.py
bash run_train_eval.sh configs/config_l0.yaml
Here is an example of fine-tuning the opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini model on the BEIR SciFact dataset.
- Generate training data. Run `python demo_train_data.py` (data parallel) or `torchrun --nproc_per_node=${N_DEVICES} demo_train_data.py` (distributed data parallel) with the configs below.
- This will generate training data with hard negatives at `data/scifact_train.jsonl`.
torchrun --nproc_per_node=${N_DEVICES} demo_train_data.py \
--model_name_or_path opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini \
--inf_free true \
--idf_path idf.json \
--beir_dir data/beir \
--beir_datasets scifact
- Run training. Run `python train_ir.py {config_file}` (data parallel) or `torchrun --nproc_per_node=${N_DEVICES} train_ir.py config.yaml` (distributed data parallel).
  - If training with the infoNCE loss, use `configs/config_infonce.yaml`.
  - If training with ensembled teacher models, use `configs/config_kd.yaml`.
- Run evaluation on the test set.
# evaluate each saved checkpoint on the test set
OUTPUT_DIR="output/test"
for step in {500,1000,1500,2000}
do
torchrun --nproc_per_node=8 evaluate_beir.py \
--model_name_or_path ${OUTPUT_DIR}/checkpoint-${step} \
--inf_free true \
--idf_path idf.json \
--output_dir ${OUTPUT_DIR} \
--log_level info \
--beir_datasets scifact \
--per_device_eval_batch_size 50
done
Training with the infoNCE loss pushes the model to generate higher scores for the positive pairs than for all other pairs. The training_mode should be `infonce`.
python train_ir.py configs/config_infonce.yaml
Run with distributed data parallel:
# the number of GPUs
N_DEVICES=8
torchrun --nproc_per_node=${N_DEVICES} train_ir.py configs/config_infonce.yaml
The data file is a JSONL file; each line is a data sample like this:
{
"query":"xxx xxx xxx",
"pos":"xxxx xxxx xxxx",
"negs": ["xxx", "xxx", "xxx", "xxx"],
}
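For illustration, here is a short sketch that writes one sample in this format to a JSONL file; the file name and the example texts are placeholders, not data shipped with this repository.
# Illustration only: write infoNCE-style training samples to a JSONL file.
# The file name and texts below are placeholders.
import json

samples = [
    {
        "query": "what causes rainbows",
        "pos": "A rainbow is caused by refraction and dispersion of sunlight in water droplets.",
        "negs": [
            "The capital of France is Paris.",
            "Photosynthesis converts light energy into chemical energy.",
        ],
    }
]

with open("data/my_train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")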
To ensemble dense and sparse teachers to generate supervisory signals for knowledge distillation, the training_mode should be `kd-ensemble`. The supervisory signals are generated dynamically during training.
Run with data parallel:
python train_ir.py configs/config_kd.yaml
Run with distributed data parallel:
# the number of GPUs
N_DEVICES=8
torchrun --nproc_per_node=${N_DEVICES} train_ir.py configs/config_kd.yaml
The data file has the same format as training with infoNCE.
For expensive teacher models like LLMs or cross-encoders, we can calculate the scores in advance and store them. To run with pre-computed KD scores, the training_mode should be `kd`.
The data file is a JSONL file; each line is a data sample like this:
{
"query":"xxx xxx xxx",
"docs": ["xxx", "xxx", "xxx", "xxx"],
"scores": [1.0, 5.0, 9.0, 4.4]
}
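As one way to produce such a file, the sketch below scores each (query, doc) pair with a cross-encoder from sentence-transformers and writes the results in this format; the teacher model name, the input file layout, and the file paths are assumptions rather than parts of this repository.
# Hypothetical sketch: pre-compute teacher scores with a cross-encoder and
# store them in the kd data format. Model name and file paths are examples only.
import json

from sentence_transformers import CrossEncoder

teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

with open("data/train_raw.jsonl") as fin, open("data/train_kd.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)        # expects {"query": ..., "docs": [...]}
        pairs = [(sample["query"], doc) for doc in sample["docs"]]
        scores = teacher.predict(pairs)  # one relevance score per (query, doc) pair
        fout.write(json.dumps({
            "query": sample["query"],
            "docs": sample["docs"],
            "scores": [float(s) for s in scores],
        }) + "\n")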