Language Models are Unsupervised Multitask Learners
This is a fork of affjljoo3581's excellent GPT2 implementation, with QA fine-tuning and rudimentary chat capabilities.
Dependencies:
- regex
- tqdm
- torch
- numpy
- matplotlib
- pandas
- Hugging Face Datasets
- Hugging Face Tokenizers
Before pre-training GPT-2, a corpus dataset should be prepared. We recommend building your own corpus with Expanda.
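The training command below also expects a tokenizer file via --tokenizer_path tokenizer.json. A minimal sketch of producing such a file with Hugging Face Tokenizers (listed in the dependencies) is shown here; the vocabulary size and special tokens are illustrative assumptions, not values required by this fork.
# Minimal sketch: train a byte-level BPE tokenizer and save it as tokenizer.json.
# The vocab size and special tokens are assumptions, not requirements of this fork.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.train.txt"],                       # plain-text training corpus
    vocab_size=32000,                                 # hypothetical vocabulary size
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"]  # hypothetical special tokens
)
tokenizer.save("tokenizer.json")                      # file passed via --tokenizer_path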
After preparing the dataset, you can pre-train GPT-2 as follows:
$ pipenv run python -m src.gpt2 train DATA_CORPUS_BUILD_DIR \
--train_corpus corpus.train.txt \
--eval_corpus corpus.test.txt \
--tokenizer_path tokenizer.json \
--save_checkpoint_path ckpt-gpt2.pth \
--save_model_path gpt2-pretrained.pth \
--batch_train 32 \
--batch_eval 32 \
--seq_len 256 \
--total_steps 10000 \
--eval_steps 1000 \
--save_steps 1000 \
--layers 12 \
--use_amp --use_grad_ckpt
To resume training from the last checkpoint file, use the --from_checkpoint [last checkpoint file] option.
If you want to train GPT-2 with multiple GPUs, use the --gpus [number of gpus] option.
The details of the command-line usage are as follows:
usage: gpt2 train [-h] --train_corpus TRAIN_CORPUS --eval_corpus EVAL_CORPUS --tokenizer_path TOKENIZER_PATH [--seq_len SEQ_LEN] [--layers LAYERS] [--heads HEADS] [--dims DIMS] [--rate RATE] [--dropout DROPOUT]
[--batch_train BATCH_TRAIN] [--batch_eval BATCH_EVAL] [--base_lr BASE_LR] [--wd_rate WD_RATE] [--total_steps TOTAL_STEPS] [--eval_steps EVAL_STEPS] [--save_steps SAVE_STEPS] [--save_version_steps SAVE_VERSION_STEPS]
[--save_model_path SAVE_MODEL_PATH] [--save_checkpoint_path SAVE_CHECKPOINT_PATH] [--from_checkpoint FROM_CHECKPOINT] [--from_pretrained FROM_PRETRAINED] [--use_amp] [--use_grad_ckpt] [--gpus GPUS]
corpus_dir
options:
-h, --help show this help message and exit
Corpus and vocabulary:
corpus_dir root directory of corpus files
--train_corpus TRAIN_CORPUS
training corpus file path
--eval_corpus EVAL_CORPUS
evaluation corpus file path
--tokenizer_path TOKENIZER_PATH
tokenizer file path
Model configurations:
--seq_len SEQ_LEN maximum sequence length
--layers LAYERS number of transformer layers
--heads HEADS number of multi-heads in attention layer
--dims DIMS dimension of representation in each layer
--rate RATE increase rate of dimensionality in bottleneck
--dropout DROPOUT probability that each element is dropped
Training and evaluation:
--batch_train BATCH_TRAIN
number of training batch size
--batch_eval BATCH_EVAL
number of evaluation batch size
--base_lr BASE_LR default learning rate
--wd_rate WD_RATE weight decay rate
--total_steps TOTAL_STEPS
number of total training steps
--eval_steps EVAL_STEPS
period to evaluate model and record metrics
--save_steps SAVE_STEPS
period to save training state to checkpoint
--save_version_steps SAVE_VERSION_STEPS
period to save a versioned/branched model.
Saving and restoring:
--save_model_path SAVE_MODEL_PATH
save trained model weights to the file
--save_checkpoint_path SAVE_CHECKPOINT_PATH
save training state to the checkpoint file
--from_checkpoint FROM_CHECKPOINT
load last training state from checkpoint file
--from_pretrained FROM_PRETRAINED
initialize parameters from pretrained model
Extensions:
--use_amp use automatic mixed-precision in training
--use_grad_ckpt use gradient checkpointing in transformer layers
--gpus GPUS number of gpu devices to use in training
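The --use_grad_ckpt flag refers to gradient checkpointing, which saves memory by recomputing each transformer layer's activations during the backward pass instead of storing them. The sketch below illustrates the general technique with torch.utils.checkpoint; it is not this repository's implementation.
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    # Illustrative stack of layers wrapped with gradient checkpointing.
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Intermediate activations inside `layer` are discarded after the
            # forward pass and recomputed when gradients flow back through it.
            x = checkpoint(layer, x, use_reentrant=False)
        return x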
After training GPT-2, you can generate sentences with your trained model in interactive mode.
$ pipenv run python -m src.gpt2 generate --vocab_path build/vocab.txt \
--model_path model.pth \
--seq_len 64 \
--nucleus_prob 0.8
The details of the command-line usage are as follows:
usage: gpt2 generate [-h] --vocab_path VOCAB_PATH --model_path MODEL_PATH
[--seq_len SEQ_LEN] [--layers LAYERS] [--heads HEADS]
[--dims DIMS] [--rate RATE] [--nucleus_prob NUCLEUS_PROB] [--use_gpu]
optional arguments:
-h, --help show this help message and exit
--vocab_path VOCAB_PATH
vocabulary file path
--model_path MODEL_PATH
trained GPT-2 model file path
Model configurations:
--seq_len SEQ_LEN maximum sequence length
--layers LAYERS number of transformer layers
--heads HEADS number of multi-heads in attention layer
--dims DIMS dimension of representation in each layer
--rate RATE increase rate of dimensionality in bottleneck
Generating options:
--nucleus_prob NUCLEUS_PROB
probability threshold for nucleus sampling
--use_gpu use gpu device in inferencing
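--nucleus_prob controls nucleus (top-p) sampling: at each step only the smallest set of tokens whose cumulative probability exceeds the threshold is kept, and the next token is drawn from that set. A minimal illustrative sketch (not the repository's code) follows.
import torch

def nucleus_sample(logits, p=0.8):
    # Sample one token id from `logits`, keeping only the smallest set of
    # tokens whose cumulative probability exceeds p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop tokens once the threshold is already reached before including them;
    # the most probable token is always kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice].item()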
To estimate the performance of the trained model, calculate the objective metrics on the evaluation dataset.
$ pipenv run python -m src.gpt2 evaluate DATA_CORPUS_BUILD_DIR \
--eval_corpus corpus.test.txt \
--tokenizer_path tokenizer.json \
--model_path gpt2-pretrained.pth \
--batch_eval 96 \
--seq_len 256 \
--layers 12 \
--use_gpu
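If the reported evaluation metric is the mean cross-entropy loss in nats (an assumption about this fork's output), perplexity follows directly from it:
import math

eval_loss = 3.21                          # hypothetical loss reported by `gpt2 evaluate`
perplexity = math.exp(eval_loss)          # perplexity = e^loss for mean cross-entropy
print(f"perplexity = {perplexity:.2f}")   # ≈ 24.78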
You can also analyse the training loss curve by visualizing the recorded metrics.
$ pipenv run python -m src.gpt2 visualize \
--model_path path/to/gpt2-pretrained.pth \
--figure src/build/gpt2-training.png
An example figure is shown below:
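If you prefer to plot the metrics yourself with pandas and matplotlib (both listed in the dependencies), the sketch below shows one possible way. It assumes, hypothetically, that the saved file is a torch checkpoint containing the recorded losses under a 'metrics' key, which may not match this fork's actual format.
import torch
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical layout: a dict with step-indexed train/eval losses under 'metrics'.
state = torch.load("gpt2-pretrained.pth", map_location="cpu")
metrics = pd.DataFrame(state["metrics"])   # e.g. columns 'train_loss', 'eval_loss'
metrics.plot(xlabel="step", ylabel="loss")
plt.savefig("gpt2-training.png")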
Mixed precision training is possible with torch.amp, provided your GPU meets the requirements. Use the --use_amp flag in the training program.
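Under the hood, --use_amp enables a standard mixed-precision pattern; a rough illustration of what such a training step looks like with torch.cuda.amp is shown below (a sketch, not this repository's training loop).
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer):
    # Rough illustration of a mixed-precision step, not this repository's code.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in reduced precision
        loss = model(batch)           # hypothetical model returning a scalar loss
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()                   # adapt the loss scale for the next iteration
    return loss.item()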
This project is Apache-2.0 Licensed.