This tool evaluates the accuracy of instruct models on datasets containing hypotheses with and without negation.
```bash
# create conda env:
conda create -n nllm python=3.10
# activate the environment
conda activate nllm
# install poetry
pip install poetry
# install the project dependencies
poetry install --no-root
```

First, launch the vllm server with the desired model:
```bash
model_name=Qwen/Qwen2.5-0.5B-Instruct
port=8000
apikey=makesomethingup
gpu=7

CUDA_VISIBLE_DEVICES=$gpu \
HF_CACHE=.cache/ \
vllm serve $model_name \
--port $port \
--api-key $apikey \
--dtype auto \
--task generate \
--max-model-len 1600 \
--enable-prefix-caching
```

Some parameters might need tweaking, depending on your hardware or the model used.
Optionally, use the following arguments for quantization:

```bash
--quantization bitsandbytes --load-format bitsandbytes
```
Mistral models need these arguments:

```bash
--tokenizer-mode mistral --config-format mistral --load-format mistral
```
If you need a quantized Mistral model, you are out of luck, because you cannot pass `--load-format bitsandbytes` and `--load-format mistral` at the same time.
In that case, you have to quantize the model yourself with quantize.py into a local file. Then run the vllm server with the local path to the quantized model, without any Mistral-specific arguments (see the sketch below).
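A minimal sketch of that workaround, assuming quantize.py takes the source model and an output path as positional arguments (the actual interface may differ; check `python quantize.py --help`) and reusing the `port`/`apikey` variables from above:

```bash
# Hypothetical quantize.py invocation -- its real arguments may differ
mistral_model=mistralai/Mistral-7B-Instruct-v0.3
python quantize.py $mistral_model ./quantized-mistral

# Serve the locally quantized model; no Mistral-specific arguments needed
vllm serve ./quantized-mistral \
--port $port \
--api-key $apikey \
--dtype auto \
--task generate \
--max-model-len 1600 \
--enable-prefix-caching
```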
After the inference server is running, you can launch the script for generating predictions:
```bash
python run.py http://localhost:${port}/v1 ${apikey} ${model_name} nofever-ces.csv ces_prompt.txt --output_dir ./output ; \
python run.py http://localhost:${port}/v1 ${apikey} ${model_name} nofever-eng.csv eng_prompt.txt --output_dir ./output ; \
python run.py http://localhost:${port}/v1 ${apikey} ${model_name} nofever-ukr.csv ukr_prompt.txt --output_dir ./output ; \
python run.py http://localhost:${port}/v1 ${apikey} ${model_name} nofever-deu.csv deu_prompt.txt --output_dir ./output
```

You can see more options with `python run.py --help`.
This script creates 3 output files:

- `./output/Qwen_Qwen2.5-0.5B-Instruct_<timestamp>_P.csv`: CSV with the results on the positive hypotheses, containing `dataset_id`, `predict_token` (True or False), `predicted_polarity` (the polarity of the hypothesis if `predict_token` is True, the opposite polarity if False), and `correct_polarity` (the polarity of the actual correct hypothesis)
- `./output/Qwen_Qwen2.5-0.5B-Instruct_<timestamp>_N.csv`: CSV with the results on the negative hypotheses, same structure as the `*_P.csv`
- `./output/Qwen_Qwen2.5-0.5B-Instruct_<timestamp>_res.json`: JSON object containing the accuracy and other information about the run

`<timestamp>` is the timestamp of the start of the run.
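As a rough sanity check, you can recompute the per-file accuracy directly from a CSV. The sketch below assumes the columns appear in the order listed above and that the file has a header row; the value reported in the `_res.json` may be computed differently, so treat this only as an approximation:

```bash
# Fraction of rows where predicted_polarity ($3) matches correct_polarity ($4);
# assumes a header row and the column order dataset_id,predict_token,predicted_polarity,correct_polarity
awk -F, 'NR > 1 { total++; if ($3 == $4) correct++ }
         END { printf "accuracy: %.3f\n", correct / total }' \
  ./output/Qwen_Qwen2.5-0.5B-Instruct_<timestamp>_P.csv
```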