The goal of this project is to understand Transformer architectures through practice, from implementing one from scratch to adapting pre-trained models (BERT, GPT).
- Tokenization
- Transformer Architecture
- Functions and Tools
- Transformer Training
- Improvements and Experiments
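As a concrete anchor for the Transformer Architecture topic, here is a minimal NumPy sketch of scaled dot-product attention, the core operation from "Attention Is All You Need". The function name, shapes, and masking convention are illustrative choices, not the project's actual implementation.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention (illustrative sketch).

    q, k: arrays of shape (..., seq_len, d_k); v: (..., seq_len, d_v).
    mask: optional boolean array; positions set to False are masked out.
    """
    d_k = q.shape[-1]
    # Similarity scores, scaled by sqrt(d_k) to keep softmax values well-behaved.
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: one sentence of 4 tokens, model width 8.
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```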
| Paper | Component/Method | Reference |
|---|---|---|
| Attention Is All You Need | Transformer Architecture | arXiv:1706.03762 |
| Neural Machine Translation of Rare Words with Subword Units | Byte-Pair Encoding (BPE) | arXiv:1508.07909 |
| SentencePiece: A Simple Language-Independent Subword Tokenizer | SentencePiece Tokenizer | arXiv:1808.06226 |
| Layer Normalization | Layer Normalization | arXiv:1607.06450 |
| Adam: A Method for Stochastic Optimization | Adam Optimizer | arXiv:1412.6980 |
| SGDR: Stochastic Gradient Descent with Warm Restarts | Cosine Learning Rate Scheduler | arXiv:1608.03983 |
| Decoupled Weight Decay Regularization | AdamW Optimizer | arXiv:1711.05101 |
| Understanding the Difficulty of Training Deep Feedforward Neural Networks | Xavier/Glorot Initialization | PMLR 9:249-256 |
| Rethinking the Inception Architecture for Computer Vision | Label Smoothing | arXiv:1512.00567 |
| Practical Bayesian Optimization of Machine Learning Algorithms | Hyperparameter Tuning | arXiv:1206.2944 |
| Beam Search Strategies for Neural Machine Translation | Beam Search Decoding | arXiv:1702.01806 |
| Efficient Transformers: A Survey | Memory Optimization Techniques | arXiv:2009.06732 |
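To show how several of the training components listed above typically fit together, here is a hedged PyTorch sketch combining Xavier initialization, AdamW, cosine learning-rate annealing, and label smoothing. It assumes a PyTorch-based implementation; the placeholder `model`, the hyperparameter values, and the step counts are illustrative, not project settings.

```python
import torch
from torch import nn

# Placeholder model standing in for the real Transformer.
model = nn.Linear(512, 32000)

# Xavier/Glorot initialization (PMLR 9:249-256).
nn.init.xavier_uniform_(model.weight)

# AdamW: Adam with decoupled weight decay (arXiv:1711.05101).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Cosine annealing of the learning rate (arXiv:1608.03983).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

# Cross-entropy with label smoothing (arXiv:1512.00567).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_step(x, targets):
    """One optimization step: forward, loss, backward, update, LR step."""
    optimizer.zero_grad()
    logits = model(x)
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```

Note that `CosineAnnealingLR` is the restart-free variant; `CosineAnnealingWarmRestarts` follows the SGDR warm-restart schedule more closely.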
Note: This is a living project—code and structure may evolve.