Baby GPT #6
k4ml started this conversation in AI / Machine Learning
This is a baby GPT with two tokens, 0 and 1, and a context length of 3, viewed as a finite-state Markov chain. It was trained on the sequence "111101111011110" for 50 iterations. The parameters and architecture of the Transformer modify the probabilities on the arrows.
E.g. we can see that a state like 101 transitions to 011 with high probability, since in the training sequence the token "1" always follows the context "101".
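The Markov-chain view can be reproduced from the training string alone: counting which token follows each 3-token context gives the empirical transition probabilities that the trained Transformer is approximating. Below is a minimal sketch using raw counts from the sequence, not the model's actual learned probabilities (those come from the Colab notebook):

```python
from collections import defaultdict

seq = "111101111011110"  # the training sequence from the post
ctx_len = 3              # context length: each 3-token window is a state

# Count which token follows each length-3 context.
counts = defaultdict(lambda: defaultdict(int))
for i in range(len(seq) - ctx_len):
    state = seq[i:i + ctx_len]
    nxt = seq[i + ctx_len]
    counts[state][nxt] += 1

# Emitting token t from state "abc" moves the chain to state "bct".
for state in sorted(counts):
    total = sum(counts[state].values())
    for nxt, c in sorted(counts[state].items()):
        print(f"{state} -> {state[1:] + nxt}  (emit {nxt}, p={c / total:.2f})")
```

Running this shows that states 011, 101, and 110 each transition deterministically (their next token is always 1 in the data), while 111 is followed by 0 and 1 equally often, so the trained model should put roughly 50/50 probability on those two arrows.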
Not really sure where I was going with this :D. I think it's interesting to train and study tiny GPTs because it becomes tractable to visualize the entire dynamical system and build an intuitive sense of it. Play with it here:
https://colab.research.google.com/drive/1SiF0KZJp75rUeetKOWqpsA8clmHP6jMg?usp=sharing
https://mobile.twitter.com/karpathy/status/1645115622517542913