This repository contains the code for training and merging CoCodeS, presented in *Advancing Conversational Text-to-SQL: Current Landscape and Future Directions with Large Language Models*. It includes scripts to prepare datasets, generate GQR summaries, fine-tune models, run zero-shot inference, and merge model checkpoints.
```shell
conda create -n codes python=3.8.5
conda activate codes
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```

Download the Spider, CoSQL, SParC, and BIRD datasets and place them under `data/sft_data_collections/`:
Layout:

```
data/
  sft_data_collections/
    spider/
      ...
    cosql/
      ...
    sparc/
      ...
    bird/
      ...
```
```shell
python3 prepare_summary_prompts.py
python3 prepare_sft_datasets.py
```

Follow the instructions in the CodeS GitHub repository to set up `accelerate` for fine-tuning, and download the `sic_ckpts` folder from there as well.
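The fine-tuning and inference commands below both take `--table_num` and `--column_num` flags, which cap how many classifier-ranked tables (and columns per table) from the schema item classifier are kept in the prompt. A minimal sketch of that selection step (the data shapes and function name here are illustrative assumptions, not the actual CodeS code):

```python
def filter_schema(table_scores, column_scores, table_num, column_num):
    """Keep the top-`table_num` tables and, for each kept table, its
    top-`column_num` columns, ranked by classifier relevance score.

    table_scores:  {table_name: score}
    column_scores: {table_name: {column_name: score}}
    """
    kept_tables = sorted(table_scores, key=table_scores.get, reverse=True)[:table_num]
    return {
        t: sorted(column_scores[t], key=column_scores[t].get, reverse=True)[:column_num]
        for t in kept_tables
    }

# Toy example: keep the 2 highest-scoring tables and 2 columns each.
table_scores = {"singer": 0.9, "concert": 0.7, "stadium": 0.1}
column_scores = {
    "singer": {"name": 0.8, "age": 0.6, "id": 0.4},
    "concert": {"year": 0.9, "id": 0.3},
    "stadium": {"capacity": 0.5},
}
print(filter_schema(table_scores, column_scores, table_num=2, column_num=2))
# → {'singer': ['name', 'age'], 'concert': ['year', 'id']}
```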
Example: fine-tuning the BIRD CodeS model on GQR CoSQL:

```shell
accelerate launch train_causal_lm.py \
    --per_device_train_batch_size 1 \
    --block_size 4096 \
    --seed 42 \
    --pretrained_model_name_or_path seeklhy/codes-7b-bird \
    --epochs 4 \
    --lr 5e-6 \
    --warmup_ratio 0.05 \
    --checkpointing_steps 100000 \
    --tensorboard_log_dir ./train_logs/codes-7b-spider-cosql-bird \
    --mode sft \
    --output_ckpt_dir ./ckpts/codes-7b-bird-cosql-gqr \
    --text2sql_data_dir ./data/sft_cosql_train_gpt.json \
    --table_num 6 \
    --column_num 10
```

To generate zero-shot summaries:

```shell
python -u prepare_zero_shot_summaries.py \
    --llm_path path/to/ckpt \
    --sic_path ./sic_ckpts/sic_spider \
    --table_num 6 \
    --column_num 10 \
    --max_tokens 4096 \
    --max_new_tokens 256 \
    --output_path results.txt
```

To run zero-shot inference:

```shell
python -u prepare_zero_shot.py \
    --llm_path path/to/ckpt \
    --sic_path ./sic_ckpts/sic_spider \
    --table_num 6 \
    --column_num 10 \
    --max_tokens 4096 \
    --max_new_tokens 256 \
    --output_path results.txt
```

To merge models, use the following:
```shell
python3 merge.py \
    --llm_path1 path/to/ckpt1 \
    --llm_path2 path/to/ckpt2 \
    --save_path path/to/ckpt_merged
```

After this, you can run inference on the merged model.
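The exact merging strategy is defined in `merge.py` itself; if it follows the common weight-averaging baseline, the core operation amounts to linearly interpolating corresponding parameters of the two checkpoints. A torch-free sketch of that idea (parameters are plain float lists here purely for illustration; with PyTorch tensors the inner loop would be a single `alpha * w1 + (1 - alpha) * w2`):

```python
def average_state_dicts(sd1, sd2, alpha=0.5):
    """Interpolate two state dicts parameter-wise: alpha * sd1 + (1 - alpha) * sd2."""
    assert sd1.keys() == sd2.keys(), "checkpoints must share parameter names"
    return {
        name: [alpha * a + (1 - alpha) * b for a, b in zip(sd1[name], sd2[name])]
        for name in sd1
    }

merged = average_state_dicts({"w": [1.0, 2.0]}, {"w": [3.0, 4.0]})
print(merged)  # → {'w': [2.0, 3.0]}
```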