
Training GPT-2 transformer language model with sentencepiece tokenizer


Training GPT-2 transformer language model on your own corpora with sentencepiece tokenization.

This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU training. It also contains a TensorFlow implementation in lm/gpt_2_tf, but it is no longer developed. Both share the same data preparation scripts. The TF training command is gpt-2-tf-train and needs TensorFlow 1.13. The documentation below is for the PyTorch version.

Python 3.6+ is required, with torch nightly or 1.6.0+. Working in a virtualenv is assumed below. Install the appropriate version of PyTorch first, and then:

pip install -r requirements.txt
python setup.py develop

Instructions are below. See also test/test_shakespeare.sh for a complete pipeline demo on a small corpus (takes a minute on a CPU).

Corpus format: a directory with top-level train, valid and test folders. Each top-level folder may contain sub-folders. Inside them, there must be utf-8 encoded text files with the .txt extension.
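For example, a single corpus might look like this (file and folder names are illustrative):

    data/corpora-wiki/
        train/
            part-1/
                article-1.txt
                article-2.txt
        valid/
            article-3.txt
        test/
            article-4.txt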

The commands to train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*. A quick Python sanity check of the trained sentencepiece model is sketched after the two steps below.

  1. Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the sentencepiece arguments as advised if needed (this is not supported in the sp-train command directly):

    sp-train data/corpora-* sp-text.txt sp-model
    
  2. Encode corpora, producing numpy files:

    sp-encode data/corpora-* sp-model.model data/encoded
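
Not required for the pipeline, but the trained sentencepiece model can be sanity-checked from Python with the sentencepiece library; a minimal sketch (the sample text is arbitrary):

    import sentencepiece as spm

    # Load the model produced by sp-train above.
    sp = spm.SentencePieceProcessor()
    sp.load("sp-model.model")

    text = "To be, or not to be"
    print(sp.encode_as_pieces(text))  # human-readable subword pieces
    ids = sp.encode_as_ids(text)      # integer token ids
    print(ids)
    print(sp.decode_ids(ids))         # round-trips back to the original text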
    

Example training command:

gpt-2 run-root data/encoded sp-model.model

run-root will contain model checkpoints and json-lines logs, which can be plotted in a jupyter notebook with json_log_plots.plot("run-root"), with the number of tokens seen on the X axis.
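For example, in a notebook cell (assuming the json_log_plots package is installed):

    import json_log_plots

    # Plot the curves from the json-lines logs in run-root;
    # the X axis is the number of tokens seen.
    json_log_plots.plot("run-root")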

Default hyperparameters correspond to the released "small" GPT-2 model.

When multiple GPUs are available, they will be used for training with the help of torch.distributed.

If the run-root path exists and the --clean key is NOT passed, training will be resumed. Note that all parameters still need to be specified, and model parameters need to match.
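For example, assuming the training command above, re-running the same command resumes from the last checkpoint, while adding --clean starts training from scratch:

    gpt-2 run-root data/encoded sp-model.model          # resumes if run-root already exists
    gpt-2 run-root data/encoded sp-model.model --clean  # does not resume, starts from scratch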

Notes on training parameters:

  • --batch-size is per GPU, so you don't need to re-tune it when changing the number of GPUs; just use the maximum that fits into memory.
  • --g-accum-gradients is the global number of gradient accumulation steps; it must be divisible by the number of GPUs. The effective global batch size is always batch_size * g_accum_gradients (see the worked example after this list).
  • --lr does not need to be changed when changing --batch-size, --g-accum-gradients, the number of GPUs, or --n-ctx: the loss is already scaled appropriately.
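
A worked example of the batch arithmetic (the numbers are illustrative, not defaults):

    # Illustrative numbers, not the defaults.
    batch_size = 2           # --batch-size, per GPU
    g_accum_gradients = 32   # --g-accum-gradients, global across all GPUs
    n_gpus = 4

    assert g_accum_gradients % n_gpus == 0
    accum_steps_per_gpu = g_accum_gradients // n_gpus   # 8 accumulation steps on each GPU
    effective_batch = batch_size * g_accum_gradients    # 64 sequences per optimizer step
    print(accum_steps_per_gpu, effective_batch)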

Example generation command:

gpt-2-gen run-root "Artificial intelligence"

run-root would contain model checkpoints "Artificial intelligence" is the text prefix used as a starting point for generating tokens

Notes on inference parameters:

  • --tokens-to-generate: the number of tokens to generate; the default is 42.
  • --top-k: the number of token candidates to generate for each position (beam width); the default is 8. An example combining both flags is shown after this list.
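
For example, to generate a longer continuation with a wider beam (the values are illustrative):

    gpt-2-gen run-root "Artificial intelligence" --tokens-to-generate 100 --top-k 16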

License is MIT.

The TensorFlow GPT-2 model is taken from https://github.com/openai/gpt-2/blob/master/src/model.py, and the TensorFlow GPT-2 training code is based on https://github.com/nshepperd/gpt-2/blob/finetuning/train.py.

The PyTorch port is based on the original OpenAI code.

The test Shakespeare corpus under tests/shakespeare is from http://shakespeare.mit.edu and is in the public domain.

See also OpenAI GPT-2 paper and blog.