Language Models are Unsupervised Multitask Learners
This is a fork of affjljoo3581's excellent GPT2 implementation, with QA fine-tuning and rudimentary chat capabilities.
Dependencies:
- regex
- tqdm
- torch
- numpy
- matplotlib
- pandas
- Hugging Face Datasets
- Hugging Face Tokenizers
Before pre-training GPT-2, a corpus dataset should be prepared. We recommend building your own corpus with Expanda.
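The training command below also expects a tokenizer file via --tokenizer_path tokenizer.json. A minimal sketch of producing such a file with Hugging Face Tokenizers (listed in the dependencies) is shown here; the vocabulary size and special tokens are illustrative assumptions, not values required by this fork.
# Minimal sketch: train a byte-level BPE tokenizer and save it as tokenizer.json.
# The vocab size and special tokens are assumptions, not requirements of this fork.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.train.txt"],                       # plain-text training corpus
    vocab_size=32000,                                 # hypothetical vocabulary size
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"]  # hypothetical special tokens
)
tokenizer.save("tokenizer.json")                      # file passed via --tokenizer_path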
After preparing the dataset, you can pre-train GPT-2 as follows:
$ pipenv run python -m src.gpt2 train DATA_CORPUS_BUILD_DIR \
--train_corpus corpus.train.txt \
--eval_corpus corpus.test.txt \
--tokenizer_path tokenizer.json \
--save_checkpoint_path ckpt-gpt2.pth \
--save_model_path gpt2-pretrained.pth \
--batch_train 32 \
--batch_eval 32 \
--seq_len 256 \
--total_steps 10000 \
--eval_steps 1000 \
--save_steps 1000 \
--layers 12 \
--use_amp --use_grad_ckpt
To resume training from the last checkpoint file, use the --from_checkpoint [last checkpoint file] option.
If you want to train GPT-2 with multiple GPUs, use the --gpus [number of gpus] option.
The details of the command-line usage are as follows:
usage: gpt2 train [-h] --train_corpus TRAIN_CORPUS --eval_corpus EVAL_CORPUS --tokenizer_path TOKENIZER_PATH [--seq_len SEQ_LEN] [--layers LAYERS] [--heads HEADS] [--dims DIMS] [--rate RATE] [--dropout DROPOUT]
[--batch_train BATCH_TRAIN] [--batch_eval BATCH_EVAL] [--base_lr BASE_LR] [--wd_rate WD_RATE] [--total_steps TOTAL_STEPS] [--eval_steps EVAL_STEPS] [--save_steps SAVE_STEPS] [--save_version_steps SAVE_VERSION_STEPS]
[--save_model_path SAVE_MODEL_PATH] [--save_checkpoint_path SAVE_CHECKPOINT_PATH] [--from_checkpoint FROM_CHECKPOINT] [--from_pretrained FROM_PRETRAINED] [--use_amp] [--use_grad_ckpt] [--gpus GPUS]
corpus_dir
options:
-h, --help show this help message and exit
Corpus and vocabulary:
corpus_dir root directory of corpus files
--train_corpus TRAIN_CORPUS
training corpus file path
--eval_corpus EVAL_CORPUS
evaluation corpus file path
--tokenizer_path TOKENIZER_PATH
tokenizer file path
Model configurations:
--seq_len SEQ_LEN maximum sequence length
--layers LAYERS number of transformer layers
--heads HEADS number of multi-heads in attention layer
--dims DIMS dimension of representation in each layer
--rate RATE increase rate of dimensionality in bottleneck
--dropout DROPOUT probability that each element is dropped
Training and evaluation:
--batch_train BATCH_TRAIN
number of training batch size
--batch_eval BATCH_EVAL
number of evaluation batch size
--base_lr BASE_LR default learning rate
--wd_rate WD_RATE weight decay rate
--total_steps TOTAL_STEPS
number of total training steps
--eval_steps EVAL_STEPS
period to evaluate model and record metrics
--save_steps SAVE_STEPS
period to save training state to checkpoint
--save_version_steps SAVE_VERSION_STEPS
period to save a versioned/branched model.
Saving and restoring:
--save_model_path SAVE_MODEL_PATH
save trained model weights to the file
--save_checkpoint_path SAVE_CHECKPOINT_PATH
save training state to the checkpoint file
--from_checkpoint FROM_CHECKPOINT
load last training state from checkpoint file
--from_pretrained FROM_PRETRAINED
initialize parameters from pretrained model
Extensions:
--use_amp use automatic mixed-precision in training
--use_grad_ckpt use gradient checkpointing in transformer layers
--gpus GPUS number of gpu devices to use in training
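The --use_grad_ckpt flag refers to gradient checkpointing, which saves memory by recomputing each transformer layer's activations during the backward pass instead of storing them. The sketch below illustrates the general technique with torch.utils.checkpoint; it is not this repository's implementation.
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    # Illustrative stack of layers wrapped with gradient checkpointing.
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Intermediate activations inside `layer` are discarded after the
            # forward pass and recomputed when gradients flow back through it.
            x = checkpoint(layer, x, use_reentrant=False)
        return x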
After training GPT-2, you can generate sentences with your trained model in interactive mode.
$ pipenv run python -m src.gpt2 generate --vocab_path build/vocab.txt \
--model_path model.pth \
--seq_len 64 \
--nucleus_prob 0.8
The details of the command-line usage are as follows:
usage: gpt2 generate [-h] --vocab_path VOCAB_PATH --model_path MODEL_PATH
[--seq_len SEQ_LEN] [--layers LAYERS] [--heads HEADS]
[--dims DIMS] [--rate RATE] [--nucleus_prob NUCLEUS_PROB] [--use_gpu]
optional arguments:
-h, --help show this help message and exit
--vocab_path VOCAB_PATH
vocabulary file path
--model_path MODEL_PATH
trained GPT-2 model file path
Model configurations:
--seq_len SEQ_LEN maximum sequence length
--layers LAYERS number of transformer layers
--heads HEADS number of multi-heads in attention layer
--dims DIMS dimension of representation in each layer
--rate RATE increase rate of dimensionality in bottleneck
Generating options:
--nucleus_prob NUCLEUS_PROB
probability threshold for nucleus sampling
--use_gpu use gpu device in inferencing
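--nucleus_prob controls nucleus (top-p) sampling: at each step only the smallest set of tokens whose cumulative probability exceeds the threshold is kept, and the next token is drawn from that set. A minimal illustrative sketch (not the repository's code) follows.
import torch

def nucleus_sample(logits, p=0.8):
    # Sample one token id from `logits`, keeping only the smallest set of
    # tokens whose cumulative probability exceeds p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop tokens once the threshold is already reached before including them;
    # the most probable token is always kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice].item()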
To estimate the performance of the trained model, calculate the objective metrics on the evaluation dataset.
$ pipenv run python -m src.gpt2 evaluate DATA_CORPUS_BUILD_DIR \
--eval_corpus corpus.test.txt \
--tokenizer_path tokenizer.json \
--model_path gpt2-pretrained.pth \
--batch_eval 96 \
--seq_len 256 \
--layers 12 \
--use_gpu
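If the reported evaluation metric is the mean cross-entropy loss in nats (an assumption about this fork's output), perplexity follows directly from it:
import math

eval_loss = 3.21                          # hypothetical loss reported by `gpt2 evaluate`
perplexity = math.exp(eval_loss)          # perplexity = e^loss for mean cross-entropy
print(f"perplexity = {perplexity:.2f}")   # ≈ 24.78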
You can also analyse the training loss curve by visualizing the recorded metrics.
$ pipenv run python -m src.gpt2 visualize \
--model_path path/to/gpt2-pretrained.pth \
--figure src/build/gpt2-training.png
An example figure is shown below:
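If you prefer to plot the metrics yourself with pandas and matplotlib (both listed in the dependencies), the sketch below shows one possible way. It assumes, hypothetically, that the saved file is a torch checkpoint containing the recorded losses under a 'metrics' key, which may not match this fork's actual format.
import torch
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical layout: a dict with step-indexed train/eval losses under 'metrics'.
state = torch.load("gpt2-pretrained.pth", map_location="cpu")
metrics = pd.DataFrame(state["metrics"])   # e.g. columns 'train_loss', 'eval_loss'
metrics.plot(xlabel="step", ylabel="loss")
plt.savefig("gpt2-training.png")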
Mixed precision training is possible with torch.amp, provided your GPU meets the requirements. Use the --use_amp flag in the training program.
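Under the hood, --use_amp enables a standard mixed-precision pattern; a rough illustration of what such a training step looks like with torch.cuda.amp is shown below (a sketch, not this repository's training loop).
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer):
    # Rough illustration of a mixed-precision step, not this repository's code.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in reduced precision
        loss = model(batch)           # hypothetical model returning a scalar loss
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()                   # adapt the loss scale for the next iteration
    return loss.item()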
This project is Apache-2.0 Licensed.