# Large Language Models as Curriculum Data Generators for Unsupervised Sentence Representation
For each existing model you want to improve, you can use the ChatGPT-generated data in ./data or the LLaMA-generated data in ./data_llama.
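To see what is available, you can list both directories; nothing is assumed here beyond the two paths named above:

```bash
# The ChatGPT-generated corpus lives in ./data and the
# LLaMA-generated corpus in ./data_llama; list both to inspect
# the training files they contain.
ls ./data ./data_llama
```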
We provide the training script ./run_unsup_example.sh, which calls ./train.py to do the actual training. Run it with:

```bash
bash run_unsup_example.sh
```
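The real arguments live in the script itself. As a rough sketch only, the kind of train.py invocation such a script typically wraps is shown below; every flag name and value here is an assumption in the style of Hugging Face training scripts, not taken from this repository, so consult run_unsup_example.sh for the actual configuration:

```bash
# Hypothetical sketch of the train.py call that run_unsup_example.sh
# might wrap. All flags, values, and file names below are assumptions;
# read the script itself for the real arguments.
# Swap ./data for ./data_llama to train on the LLaMA-generated corpus.
python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file ./data/train.txt \
    --output_dir result/my-unsup-model \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --do_train
```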
Before evaluation, please download the evaluation datasets by running:
```bash
cd SentEval/data/downstream/
bash download_dataset.sh
```

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and on downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. You can evaluate any transformers-based pre-trained model with our evaluation code. For example:
```bash
python evaluation.py \
    --model_name_or_path <your model name or path> \
    --pooler cls_before_pooler \
    --task_set sts \
    --mode test
```
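For instance, to sanity-check the pipeline with an off-the-shelf checkpoint from the Hugging Face Hub (any transformers-compatible model id plugs into the same command):

```bash
# Evaluate plain BERT on the STS test sets; only the model id
# changes relative to the documented command above.
python evaluation.py \
    --model_name_or_path bert-base-uncased \
    --pooler cls_before_pooler \
    --task_set sts \
    --mode test
```

In SimCSE-style evaluation code, the cls_before_pooler option selects the [CLS] hidden state without the extra MLP pooler head, which is the usual choice for contrastively trained sentence encoders.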