A decoder-only generative model implemented from scratch, based on the GPT-2/GPT-Neo architecture and built from Transformer blocks with masked multi-head self-attention (local and global) for autoregressive text generation.
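For illustration, the sketch below shows a causal multi-head self-attention module of the kind described above: `window=None` gives global attention (each token attends to all earlier tokens), while an integer window gives GPT-Neo-style local attention (each token attends only to a sliding window of recent positions). The class and parameter names are illustrative, not the repository's actual API.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Masked multi-head self-attention; `window=None` -> global, integer -> local (sliding window)."""

    def __init__(self, d_model: int, n_heads: int, window: Optional[int] = None):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        # GPT-Neo-style projections: bias-free q/k/v, biased output projection
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5      # (B, H, T, T)

        # Causal mask: position i may only attend to positions j <= i
        i = torch.arange(T, device=x.device).unsqueeze(1)
        j = torch.arange(T, device=x.device).unsqueeze(0)
        mask = j <= i
        if self.window is not None:                                # local attention: restrict to a window
            mask &= (i - j) < self.window

        scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(y)
```

In the GPT-Neo configuration, transformer blocks typically alternate between the global and local variants of this layer.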
The configured 41M-parameter model was pre-trained on the TinyStories dataset, which comprises short children's stories synthetically generated with GPT-3.5 and GPT-4. The TinyStories paper by Microsoft Research showed that very small language models (SLMs) can learn to generate coherent, diverse text and even demonstrate limited reasoning capabilities.
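For reference, the dataset is publicly available on the Hugging Face Hub; a typical way to load it (assuming the commonly used `roneneldan/TinyStories` mirror, which may differ from how this repository ingests the data) is:

```python
from datasets import load_dataset

# Download the TinyStories splits from the Hugging Face Hub
dataset = load_dataset("roneneldan/TinyStories")

print(dataset)                              # DatasetDict with "train" and "validation" splits
print(dataset["train"][0]["text"][:200])    # peek at the first synthetic story
```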
Besides reducing the number of transformer blocks, the small parameter counts reported in the paper (≤33M) also stem from a reduced vocabulary: the default GPT-2/GPT-Neo tokeniser's ~50K tokens were cut down to just the top 10K tokens by frequency (a critical detail relegated to a footnote). The simpler corpus makes this drastic reduction possible, and it shrinks the embedding and output layers considerably, yielding a much smaller model overall (~60% fewer total parameters; see the table below).
| Component | Parameters (Full Vocabulary, 50,257 tokens) | Percentage | Parameters (Reduced Vocabulary, 8,000 tokens) | Percentage |
| --- | --- | --- | --- | --- |
| Embedding Layers | 38,990,592 | 36.8% | 6,537,984 | 15.9% |
| Transformer Blocks | 28,342,272 | 26.8% | 28,342,272 | 69.1% |
| Output Layers | 38,598,912 | 36.4% | 6,146,304 | 15.0% |
| Total Parameters | 105,931,776 | 100% | 41,026,560 | 100% |
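The figures in the table can be reproduced from first principles. The sketch below assumes a GPT-Neo-style decoder with a hidden size of 768, 4 transformer blocks, a feed-forward size of 3,072, a 512-token context window, and an untied output head; these hyperparameters are inferred from the parameter counts rather than stated elsewhere in this section.

```python
def parameter_breakdown(vocab_size: int, d_model: int = 768, n_layers: int = 4,
                        d_ff: int = 3072, n_ctx: int = 512) -> dict:
    """Back-of-the-envelope parameter count for a GPT-Neo-style decoder (assumed hyperparameters)."""
    # Token embedding + learned positional embedding
    embedding = vocab_size * d_model + n_ctx * d_model
    # Per block: bias-free q/k/v projections, biased output projection, 2-layer MLP, 2 LayerNorms
    attention = 3 * d_model * d_model + (d_model * d_model + d_model)
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)
    blocks = n_layers * (attention + mlp + layer_norms)
    # Final LayerNorm + untied, bias-free LM head
    output = 2 * d_model + vocab_size * d_model
    return {"embedding": embedding, "blocks": blocks, "output": output,
            "total": embedding + blocks + output}

print(parameter_breakdown(50_257))  # {'embedding': 38990592, 'blocks': 28342272, 'output': 38598912, 'total': 105931776}
print(parameter_breakdown(8_000))   # {'embedding': 6537984,  'blocks': 28342272, 'output': 6146304,  'total': 41026560}
```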
Despite replicating the paper's 33M model architecture and further reducing the vocabulary size from 10K to 8K tokens, the current model still has roughly 41M parameters. Notably, the top 10K and 8K tokens cover 99.9% and 99.8% of all words in the TinyStories dataset respectively.
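The coverage figures can be checked with a straightforward frequency count over the tokenised corpus. The sketch below is illustrative; `tokenised_stories` is a placeholder for the dataset after tokenisation with the full GPT-2/GPT-Neo tokeniser.

```python
from collections import Counter


def top_k_coverage(token_ids_per_story, k: int) -> float:
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    counts = Counter()
    for ids in token_ids_per_story:
        counts.update(ids)
    total = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / total


# Illustrative usage, assuming `tokenised_stories` is an iterable of token-id lists:
# print(f"top 10K: {top_k_coverage(tokenised_stories, 10_000):.1%}")  # ~99.9% per the text above
# print(f"top  8K: {top_k_coverage(tokenised_stories,  8_000):.1%}")  # ~99.8% per the text above
```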
The training cross-entropy loss for the current 41M model was ~1.4 after 1 epoch, higher than the ~1.0 (assumed to also be cross-entropy loss) reported for the paper's 33M model after 2,500 training steps; the corresponding validation losses were ~1.5 and ~1.2. The only apparent differences from the paper's setup are the smaller vocabulary (8K versus 10K tokens), the smaller effective batch size (16 batch size × 5 gradient accumulation steps = 80, versus 1,280 based on the model card), and the total number of training steps/epochs (the paper does not define what constitutes a training step). Despite the ~40% higher training loss, the current 41M model can still generate coherent, logical sentences.
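For context, a minimal sketch of the gradient-accumulation arithmetic mentioned above (16-sequence micro-batches × 5 accumulation steps = 80 sequences per optimiser update) is given below; the model, dataloader, and optimiser are placeholders rather than the repository's actual training code.

```python
import torch
import torch.nn.functional as F

ACCUM_STEPS = 5  # 16-sequence micro-batches x 5 accumulation steps = effective batch size of 80


def train_one_epoch(model, dataloader, optimiser, device="cuda"):
    model.train()
    optimiser.zero_grad()
    for step, batch in enumerate(dataloader):            # each batch: (16, seq_len) token ids
        inputs = batch[:, :-1].to(device)
        targets = batch[:, 1:].to(device)                 # next-token prediction targets
        logits = model(inputs)                            # (16, seq_len - 1, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        (loss / ACCUM_STEPS).backward()                   # scale so gradients average over 80 sequences
        if (step + 1) % ACCUM_STEPS == 0:
            optimiser.step()                              # one optimiser update per 5 micro-batches
            optimiser.zero_grad()
```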
Examples of the model-generated texts are given below, with the prompt highlighted.
Once upon a time in a land far, far away, there lived a little girl named Lily. Lily loved to play outside in the sun and feel the warm breeze on her face. One day, Lily's mommy told her they were going to visit her grandma. Lily was so excited! When they arrived at her grandma's house, Lily saw a big, scary dog. She was afraid and started to cry. Her mommy hugged her and said, "Don't worry, Lily. The dog is just playing. He won't hurt you."
There was once a beautiful princess who lived in a big castle. One day, the princess decided to go for a walk in the forest. As she was walking, she saw a little bird with a broken wing. The princess felt sad for the bird and wanted to help. She gently picked up the bird and brought it home. The princess took care of the bird and made sure it was safe. She gave it food and water and talked to it.
There once was a boy who lived in a small house. He was very curious and wanted to explore the world around him. One day, he decided to go outside and see what he could find. He walked around the garden, looking at all the plants and trees. He was so excited to explore. Suddenly, he heard a loud noise and saw a big, scary monster! The boy was so scared, he ran away as fast as he could.
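Passages like these come from standard autoregressive sampling: the prompt is tokenised, and the model repeatedly samples the next token until an end-of-text token or a length limit is reached. The sketch below is a minimal illustration; the temperature, helper names, and tokeniser interface are assumptions rather than the repository's exact generation code.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 200,
             temperature: float = 0.8, device: str = "cuda") -> str:
    model.eval()
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)   # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                    # logits for the next token only
        probs = F.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:     # stop at end-of-text, if the tokeniser defines one
            break
    return tokenizer.decode(ids[0].tolist())


# e.g. print(generate(model, tokenizer, "Once upon a time in a land far, far away"))
```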
