Welcome! This repository offers a two-part guide designed to demystify the internal workings and training lifecycle of modern Large Language Models (LLMs), focusing on the Transformer architecture. We aim to bridge the gap between abstract concepts and concrete examples by visualizing a real model's parameters and explaining how such models learn.
- Part 1 explores and visualizes the architecture, parameters, and dynamic attention mechanisms of the `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` model to build intuition.
- Part 2 provides a conceptual overview of the LLM training lifecycle, including pre-training, fine-tuning strategies (SFT, alignment with RLHF/GRPO), knowledge distillation, and parameter-efficient techniques (PEFT/LoRA).
This guide is intended for:
- Students learning about AI, Machine Learning, and Natural Language Processing (NLP).
- Developers curious about the models they interact with.
- Researchers looking for practical ways to inspect model internals or understand training paradigms.
- Anyone seeking a deeper understanding of how LLMs function and learn.
Basic familiarity with Python is assumed. Key concepts are explained within the notebooks.
This guide covers the following key areas across two notebooks:
Part 1: Architecture & Visualization (`LLM_Architecture_Visualization.ipynb`)
- Foundations: Core ML/ANN concepts, parameters.
- Input: Tokenization, Token Embeddings (visualized).
- Transformer Blocks: Self-Attention (QKV, Multi-Head, GQA context), Position-wise Feed-Forward Networks (FFN using SwiGLU), Layer Normalization, Residual Connections (components visualized for Layer 0, Middle, Last).
- Output: Final Normalization, Language Modeling Head (visualized, weight tying checked).
- Analysis: Parameter statistics across layers, dynamic attention pattern heatmaps, aggregate weight visualizations (Q, K, V, O, FFN projections across all layers). A minimal loading-and-inspection sketch follows this list.
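For orientation, the block below sketches the kind of inspection Part 1 performs: loading the model with `transformers`, counting parameters, and pulling one layer's attention weights for a short prompt. It is a minimal sketch, not the notebook's actual cells; the prompt and variable names are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,     # half precision: the 1.5B weights fit in roughly 4 GB
    device_map="auto",             # place layers on the GPU if one is available
    attn_implementation="eager",   # eager attention so attention weights can be returned
)

# Parameter statistics (Part 1 breaks these down per layer and per component).
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e9:.2f}B")

# Run a short prompt and capture the attention maps for visualization.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each shaped (batch, num_heads, seq_len, seq_len).
layer0_attn = outputs.attentions[0][0]
print(f"Layer 0 attention shape: {tuple(layer0_attn.shape)}")
```

Each tensor in `outputs.attentions` can be moved to the CPU and passed to `seaborn.heatmap` to produce attention heatmaps of the kind shown in the notebook.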
Part 2: Training & Fine-tuning Concepts (`LLM_Training_Lifecycle.ipynb`)
- Pre-training: Building foundational knowledge (Next-Token Prediction).
- Knowledge Distillation: Context for the specific `DeepSeek-R1-Distill` model.
- Fine-tuning: Supervised Fine-tuning (SFT) / Instruction Tuning.
- Alignment Tuning: Concepts of RLHF/PPO, GRPO.
- Efficiency: Parameter-Efficient Fine-tuning (PEFT), focusing on LoRA (a brief LoRA sketch follows this list).
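Part 2 stays conceptual, but as a rough illustration of what the PEFT/LoRA section describes, a LoRA adapter could be attached to this model with the Hugging Face `peft` library (not installed or used by the notebooks). The rank, alpha, and target module names below are illustrative choices; `q_proj`/`v_proj` assume the Qwen2-style attention projections used by this model.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# Low-rank adapters on the attention projections; the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # Qwen2 attention projection names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()    # typically well under 1% of all weights
```

With this configuration only the adapter matrices receive gradients, which is why LoRA fine-tuning fits on far smaller GPUs than full fine-tuning.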
- Part 1: Architecture & Visualization
- Part 2: Training & Fine-tuning Concepts
- Model ID: `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
- Link: [Hugging Face Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
- Open in Colab: Click the "Open In Colab" badges in the Notebooks section above.
- Select Runtime (Part 1): Use a GPU accelerator in Colab (Runtime -> Change runtime type -> T4 GPU) for best performance with model loading and visualization. Part 2 is conceptual and does not need one.
- Run Cells Sequentially: Execute the notebook cells in order.
- Explore: Read the explanations and observe the generated outputs and visualizations in Part 1. Note: The aggregate weight plots near the end of Part 1 can be very resource-intensive (RAM/CPU) and may take significant time to render or cause slowdowns.
- Python libraries: `transformers`, `torch`, `accelerate`, `matplotlib`, `seaborn`, `numpy` (installed by the notebook).
- Internet connection (for model download).
- Sufficient RAM (>= 12GB recommended) and GPU VRAM (>= 8GB recommended); a quick environment check is sketched below.
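If you are unsure whether your runtime meets these recommendations, a quick check along the following lines (a hypothetical snippet, not part of the notebooks) can flag an undersized GPU before the model download starts:

```python
import torch

# Report the detected GPU and its memory before loading the 1.5B model.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 8:
        print("Warning: less than the recommended 8 GB of VRAM.")
else:
    print("No GPU detected; Part 1 will be slow and may exhaust RAM on CPU.")
```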
- Part 1 Focus: Primarily architecture, parameters, and attention visualization. It excludes runnable training code and activation analysis. The aggregate plots are resource-heavy.
- Part 2 Focus: Conceptual explanations only; no runnable training code.
- Model Specificity: While core concepts are general, some implementation details relate to the specific Qwen/DeepSeek model.