A from-scratch PyTorch implementation of the Transformer architecture for neural machine translation, based on the paper "Attention Is All You Need".
This project implements a complete Transformer model for translating between English and Swedish using the OPUS Books dataset. The implementation covers all core components (multi-head attention, positional encoding, the encoder-decoder architecture) along with a full training pipeline.
- Complete Transformer architecture implementation from scratch
- Multi-head self-attention and cross-attention mechanisms
- Sinusoidal positional encoding
- Layer normalization and residual connections
- Custom tokenizer training for source and target languages
- Comprehensive training loop with validation metrics
- TensorBoard logging for training visualization
- Greedy decoding for inference
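The multi-head attention listed above is built around scaled dot-product attention. As a minimal sketch of that core operation (illustrative only; the class and method names in `model.py` may differ):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)   # (batch, heads, seq_len, seq_len)
    if mask is not None:
        # Positions where mask == 0 are hidden from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    return weights @ v, weights
```

The project is laid out as follows: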
```
transformer/
├── model.py          # Transformer architecture implementation
├── train.py          # Training script and validation logic
├── dataset.py        # Dataset class and data preprocessing
├── config.py         # Configuration parameters
├── requirements.txt  # Python dependencies
├── tokenizer_en.json # English tokenizer (generated)
├── tokenizer_sv.json # Swedish tokenizer (generated)
├── weights/          # Model checkpoints directory
└── runs/             # TensorBoard logs directory
```
- Python 3.9
- PyTorch 2.0.1
- See `requirements.txt` for complete dependencies
- Clone the repository:
```bash
git clone <repository-url>
cd transformer
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
Run the training script:
```bash
python train.py
```
The script will:
- Download and prepare the OPUS Books English-Swedish dataset
- Build or load tokenizers for both languages
- Train the Transformer model
- Save checkpoints in the `weights/` directory
- Log training metrics to TensorBoard (see the training-step sketch after this list)
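As a rough sketch of what one epoch of that loop can look like, under the assumption that batches expose encoder/decoder inputs and masks and that the model takes a single forward call (the real train.py may split this into separate encode/decode/project steps; all identifiers and file names here are illustrative):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/transformer")  # experiment folder name is illustrative

def train_one_epoch(model, dataloader, optimizer, loss_fn, device, epoch, global_step):
    model.train()
    for batch in dataloader:
        encoder_input = batch["encoder_input"].to(device)   # (B, seq_len)
        decoder_input = batch["decoder_input"].to(device)   # (B, seq_len)
        encoder_mask = batch["encoder_mask"].to(device)
        decoder_mask = batch["decoder_mask"].to(device)
        label = batch["label"].to(device)                    # (B, seq_len)

        # Forward pass and token-level cross-entropy over the vocabulary
        logits = model(encoder_input, decoder_input, encoder_mask, decoder_mask)
        loss = loss_fn(logits.view(-1, logits.size(-1)), label.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        writer.add_scalar("train/loss", loss.item(), global_step)
        global_step += 1

    # Persist a checkpoint after each epoch (filename pattern is illustrative)
    torch.save({"epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "global_step": global_step},
               f"weights/tmodel_{epoch:02d}.pt")
    return global_step
```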
Modify `config.py` to adjust training parameters (a sample configuration is sketched after this list):
- `batch_size`: Training batch size (default: 2)
- `num_epochs`: Number of training epochs (default: 5)
- `lr`: Learning rate (default: 1e-4)
- `seq_len`: Maximum sequence length (default: 320)
- `d_model`: Model dimension (default: 128)
- `lang_src`: Source language (default: "en")
- `lang_tgt`: Target language (default: "sv")
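The exact contents of `config.py` are not reproduced here, but a configuration with these defaults can be sketched as a plain dictionary (the `get_config` name and the extra path-related keys are assumptions based on the project layout):

```python
def get_config():
    return {
        "batch_size": 2,
        "num_epochs": 5,
        "lr": 1e-4,
        "seq_len": 320,
        "d_model": 128,
        "lang_src": "en",
        "lang_tgt": "sv",
        # Illustrative path settings matching the project structure above
        "model_folder": "weights",
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/transformer",
    }
```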
View training progress with TensorBoard:
```bash
tensorboard --logdir=runs
```
The implementation includes:
- `InputEmbeddings`: Token embedding with scaling
- `PositionalEncoding`: Sinusoidal position embeddings (sketched after this list)
- `MultiHeadAttentionBlock`: Multi-head self- and cross-attention
- `FeedForwardBlock`: Position-wise feed-forward network
- `LayerNormalization`: Custom layer normalization
- `ResidualConnection`: Residual connections with dropout
- `Encoder`: 6 encoder layers with self-attention
- `Decoder`: 6 decoder layers with self- and cross-attention
- `ProjectionLayer`: Final linear layer for vocabulary prediction
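The sinusoidal positional encoding referenced above can be precomputed as a fixed table; a minimal sketch (the actual `PositionalEncoding` module may register this as a buffer and apply dropout):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (seq_len, d_model); added to the scaled token embeddings
```

The default model hyperparameters are: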
- Model dimension: 128
- Number of layers: 6
- Number of attention heads: 8
- Feed-forward dimension: 2048
- Dropout rate: 0.1
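With these defaults, the position-wise feed-forward block expands each 128-dimensional representation to 2048 units and projects it back, with dropout in between; a minimal sketch (not necessarily the exact `FeedForwardBlock` code):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 128, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # (B, seq_len, 128) -> (B, seq_len, 2048)
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),  # (B, seq_len, 2048) -> (B, seq_len, 128)
        )

    def forward(self, x):
        return self.net(x)
```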
The model tracks several metrics during validation:
- Word-level accuracy
- Character Error Rate (CER)
- Word Error Rate (WER)
- BLEU Score
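The CER, WER, and BLEU above can be computed with a metrics library; whether train.py uses torchmetrics is an assumption, but a sketch with it looks like this:

```python
from torchmetrics.text import BLEUScore, CharErrorRate, WordErrorRate

predicted = ["jag älskar att läsa böcker"]
expected = ["jag älskar att läsa romaner"]

cer = CharErrorRate()(predicted, expected)
wer = WordErrorRate()(predicted, expected)
bleu = BLEUScore()(predicted, [[ref] for ref in expected])  # one list of references per prediction

print(f"CER: {cer:.3f}  WER: {wer:.3f}  BLEU: {bleu:.3f}")
```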
The project uses the OPUS Books dataset for English-Swedish translation:
- Training: 90% of the dataset
- Validation: 10% of the dataset
- Tokenizers are trained on the dataset vocabulary
- Special tokens: [SOS], [EOS], [PAD], [UNK]
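A sketch of how the split and the word-level tokenizers can be produced with the Hugging Face datasets and tokenizers libraries; the dataset identifier, the use of these libraries, and the helper names are assumptions consistent with the description above:

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from torch.utils.data import random_split

# OPUS Books English-Swedish sentence pairs (dataset/config names are assumptions)
raw = load_dataset("opus_books", "en-sv", split="train")

def sentences(lang):
    for item in raw:
        yield item["translation"][lang]

def build_tokenizer(lang):
    tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
    tokenizer.train_from_iterator(sentences(lang), trainer=trainer)
    tokenizer.save(f"tokenizer_{lang}.json")
    return tokenizer

tok_src, tok_tgt = build_tokenizer("en"), build_tokenizer("sv")

# 90% training / 10% validation split
train_size = int(0.9 * len(raw))
train_raw, val_raw = random_split(raw, [train_size, len(raw) - train_size])
```

Additional implementation details: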
- Custom causal masking for decoder self-attention
- Xavier uniform parameter initialization
- Greedy decoding for inference
- Gradient clipping and learning rate scheduling available
- Automatic tokenizer generation and caching
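The causal mask mentioned above prevents each decoder position from attending to later positions; a minimal sketch of such a mask:

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Block the strictly upper-triangular part; True means "may attend".
    upper = torch.triu(torch.ones(1, size, size, dtype=torch.int), diagonal=1)
    return upper == 0

print(causal_mask(4).int())
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]], dtype=torch.int32)
```

The code is organized across these files: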
- `model.py`: Complete Transformer implementation with all building blocks
- `train.py`: Training loop, validation, and inference functions
- `dataset.py`: Bilingual dataset class with proper masking
- `config.py`: Centralized configuration management
This project is for educational purposes and implements the Transformer architecture as described in "Attention Is All You Need" by Vaswani et al.