This workspace showcases what modern AI infrastructure looks like when it is written entirely on top of the Rust standard library. Every crate in the repository is dependency-free and built to be instructive: you can inspect, tweak, and extend the full stack without pulling code from crates.io. The primary purpose is educational: learning how the pieces of a transformer-based language model, autograd engine, tensor library, tokenizer, and support tooling fit together. The code is not optimized for performance or production use; it is noticeably slower than optimized libraries, but it is simple and easy to follow.
Many of the modules were written with the help of large language models (LLMs), which has a noticeable impact on the code. The overall architecture, however, was designed without LLM assistance.
Unsafe code is used liberally inside kernels (e.g., matrix multiplication) to gain some runtime performance, although that practice is better avoided in production code.
I built this project while reading the book *Build a Large Language Model from Scratch* by Sebastian Raschka, which uses PyTorch to replicate a simple LLM (GPT-2). To be a bit picky, that is not really "from scratch": PyTorch is a huge library with many dependencies, and it hides a lot of the important details of training, such as autograd and tensor operations.
So I decided to implement the same thing without any dependencies, using only the Rust standard library. It is ultimately a fun exercise; an LLM can help a lot along the way, but it is not magic: you still need to understand what is going on and guide it to produce the code you want.
- GPT-2 style transformer implemented from scratch in `neural-net`, including embeddings, multi-head attention, feed-forward blocks, layer norms, dropout, and a text generation API.
- Autograd and training utilities in `zero-grad`, supporting forward/backward passes, state dicts, and simple optimization loops.
- Tokenization, dataset, and generation tooling for loading GPT-2 BPE vocabularies (`texten`), turning raw text into training batches, and streaming generated tokens.
- Reusable building blocks like `numeric` (matrix ops), `par-iter` (Rayon-inspired parallel iterators), `small-vec`, `rand`, `json`, `json-derive`, and `zerr`, all written without external crates.
| Crate | Purpose |
| --- | --- |
| `neural-net` | End-to-end neural network stack with GPT-2 model, dataset loaders, text generation utilities, and SafeTensors/Hugging Face weight import helpers. |
| `zero-grad` | Reverse-mode autograd engine, tensor runtime, and training scaffolding used by `neural-net`. |
| `numeric` | Core tensor math and linear algebra primitives (views, matmul, reductions, etc.). |
| `par-iter` | Parallel iterator combinators (map, zip, reduce, cartesian product) built with std threads/atomics. |
| `small-vec` | Small vector optimized for stack storage with transparent heap promotion. |
| `rand` | Minimal random number utilities (e.g., `SimpleRng`) for sampling, dropout, and initialization. |
| `texten` | Zero-dependency BPE tokenizer compatible with GPT-2 vocabularies. |
| `json`, `json-derive` | JSON parser/serializer plus a derive macro crate, used for configuration and tooling. |
| `zerr` | Ergonomic error handling helpers (context chaining, threading support). |
| `json-derive-test` | Integration tests and examples for the derive macros. |
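To give a flavor of how such a building block can be written with the standard library alone, here is a conceptual sketch of the small-vector technique (inline stack storage with transparent heap promotion). It is not the actual `small-vec` implementation or API, just the underlying idea:

```rust
// Conceptual sketch only: elements stay inline on the stack until capacity N is
// exceeded, then the vector transparently promotes its contents to a heap Vec.
enum SmallVec<T, const N: usize> {
    Inline { buf: [Option<T>; N], len: usize },
    Heap(Vec<T>),
}

impl<T, const N: usize> SmallVec<T, N> {
    fn new() -> Self {
        Self::Inline { buf: std::array::from_fn(|_| None), len: 0 }
    }

    fn push(&mut self, value: T) {
        match self {
            // Fast path: there is still room in the inline buffer.
            Self::Inline { buf, len } if *len < N => {
                buf[*len] = Some(value);
                *len += 1;
            }
            // Inline buffer full: move the elements to the heap, then push.
            Self::Inline { buf, .. } => {
                let mut spilled: Vec<T> = buf.iter_mut().filter_map(Option::take).collect();
                spilled.push(value);
                *self = Self::Heap(spilled);
            }
            Self::Heap(v) => v.push(value),
        }
    }

    fn len(&self) -> usize {
        match self {
            Self::Inline { len, .. } => *len,
            Self::Heap(v) => v.len(),
        }
    }
}

fn main() {
    let mut v: SmallVec<u32, 4> = SmallVec::new();
    for i in 0u32..6 {
        v.push(i); // the fifth push promotes the inline buffer to the heap
    }
    assert_eq!(v.len(), 6);
}
```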
In addition, the `notebooks/` folder contains reference datasets (`notebooks/data`) and Jupyter notebooks that showcase training and inference workflows.
The `neural-net` crate assembles a complete GPT-2 style decoder-only transformer:
- `modules::gpt`: model builder with token/position embeddings, transformer stacks, and a language-model head.
- `modules::transformer_block` & `modules::multi_head_attention`: full attention pipeline with dropout, causal masking, and QKV projections.
- `generation::text_simple`: iterative text generation with greedy or top-k sampling, streaming support, and performance metrics.
- `generation::data_loader`: slice datasets, batching, and shuffling utilities for training.
- `state::{binary, safetensors, huggingface}`: load/save checkpoints, including conversion from Hugging Face naming.
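To make the causal-masking step concrete, here is a standalone sketch in plain Rust, deliberately independent of the crate's tensor types: each position may attend only to itself and earlier positions, so scores above the diagonal are set to negative infinity before the softmax.

```rust
// Standalone illustration of causal masking followed by a row-wise softmax.
// Operates on a plain (seq_len x seq_len) score matrix, not the crate's tensors.
fn causal_softmax(scores: &mut Vec<Vec<f32>>) {
    let seq_len = scores.len();
    for (i, row) in scores.iter_mut().enumerate() {
        // Mask out future positions (j > i) so token i cannot attend ahead.
        for j in (i + 1)..seq_len {
            row[j] = f32::NEG_INFINITY;
        }
        // Numerically stable softmax: subtract the row max of the unmasked prefix.
        let max = row[..=i].iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exp: Vec<f32> = row.iter().map(|&s| (s - max).exp()).collect();
        let sum: f32 = exp.iter().sum();
        for (w, e) in row.iter_mut().zip(exp) {
            *w = e / sum;
        }
    }
}

fn main() {
    // Raw attention scores for a 3-token sequence, before masking.
    let mut scores = vec![
        vec![0.2, 0.9, 0.4],
        vec![0.1, 0.3, 0.8],
        vec![0.5, 0.5, 0.5],
    ];
    causal_softmax(&mut scores);
    for row in &scores {
        println!("{row:?}"); // masked entries become 0.0 after the softmax
    }
}
```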
Because everything lives in this workspace, you can inspect the entire inference/training path: tensors come from `numeric`, gradients from `zero-grad`, scheduling from `par-iter`, tokenization from `texten`, and errors bubble through `zerr`.
```rust
use neural_net::generation::TextGenerator;
use neural_net::modules::GPTModelBuilder;
use neural_net::TensorResult;
use zero_grad::Tensor;

fn main() -> TensorResult<()> {
    // 1. Build a small GPT model (adjust hyperparameters as needed)
    let model = GPTModelBuilder::new()
        .vocab_size(50257)
        .context_length(128)
        .emb_dim(768)
        .n_heads(12)
        .n_layers(12)
        .drop_rate(0.0)
        .build::<f32>()?;

    // 2. Prepare an input prompt (batch size = 1)
    let prompt_ids = Tensor::from(vec![[50256_f32, 318., 257., 2211.]])?; // ~ "hello gpt2"

    // 3. Generate additional tokens with greedy decoding
    let generated = TextGenerator::new()
        .max_new_tokens(32)
        .temperature(1.0)
        .greedy()
        .generate(&model, &prompt_ids, 128)?;

    println!(
        "Generated token ids: {:?}",
        generated.elements().collect::<Vec<_>>()
    );
    Ok(())
}
```
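The hyperparameters above correspond to the GPT-2 small (124M parameter) configuration: a 50,257-token BPE vocabulary, 768-dimensional embeddings, 12 attention heads, and 12 transformer layers. Only the context length differs; GPT-2 uses 1024, while 128 is used here to keep the example small.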
See `neural-net/examples/step.rs` for a runnable training demo that:

- Loads the GPT-2 BPE vocabulary from `notebooks/data/gpt2` using the `texten` tokenizer.
- Creates a `SliceDataset` from `notebooks/data/dataset/verdict.txt` with stride-based sampling.
- Builds the GPT model and runs a single optimizer step with `zero_grad::train`.
Run the example:

```bash
cargo run -p neural-net --release --example step -- ./notebooks/data
```
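The stride-based sampling behind the `SliceDataset` step can be pictured as sliding a fixed-size window over the token stream, with the target being the input shifted right by one token. The sketch below illustrates only that windowing logic; the actual `SliceDataset` API may differ:

```rust
// Minimal sketch of stride-based (input, target) pair construction from a token
// stream for next-token prediction. Illustrative only; not the SliceDataset API.
fn make_pairs(tokens: &[u32], context_len: usize, stride: usize) -> Vec<(Vec<u32>, Vec<u32>)> {
    let mut pairs = Vec::new();
    let mut start = 0;
    // Each window of `context_len` tokens is an input; the target is the same
    // window shifted right by one token.
    while start + context_len + 1 <= tokens.len() {
        let input = tokens[start..start + context_len].to_vec();
        let target = tokens[start + 1..start + context_len + 1].to_vec();
        pairs.push((input, target));
        start += stride;
    }
    pairs
}

fn main() {
    let tokens: Vec<u32> = (0..10).collect();
    // context_len = 4, stride = 2 -> windows starting at positions 0, 2, 4
    for (x, y) in make_pairs(&tokens, 4, 2) {
        println!("input: {x:?} -> target: {y:?}");
    }
}
```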
- Install Rust (stable toolchain is sufficient).
- Clone the repository and navigate into it.
- Verify everything builds and the tests pass:
```bash
cargo test
```
Download a SafeTensors GPT-2 model (e.g., from Hugging Face) and place it at `.data/model.safetensors`. Then run text generation:

```bash
cargo run --release --example safetensors -- .data/model.safetensors
```
It will be pretty slow, but you should see generated tokens printed to the console as they are produced.
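For context, the SafeTensors format itself is easy to read with the standard library alone: the file starts with an 8-byte little-endian header length, followed by a JSON header describing each tensor's dtype, shape, and byte offsets, followed by the raw tensor data. The sketch below just dumps that header and is independent of the `state::safetensors` loader in this repository:

```rust
// Std-only sketch of reading a SafeTensors header. The file layout is:
// 8-byte little-endian u64 header length, then the JSON header, then raw data.
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: <model.safetensors>");
    let mut file = File::open(path)?;

    // First 8 bytes: header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    file.read_exact(&mut len_bytes)?;
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    // Next `header_len` bytes: UTF-8 JSON mapping tensor names to
    // {"dtype", "shape", "data_offsets"} entries (offsets are relative to the
    // data section that follows the header).
    let mut header = vec![0u8; header_len];
    file.read_exact(&mut header)?;
    println!("{}", String::from_utf8_lossy(&header));

    Ok(())
}
```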
The notebooks in `notebooks/` mirror some of these flows and include exploratory analysis, GPT fine-tuning experiments, and profiling notes.
- Transparency & auditability – every line is here, so you can review and modify the full stack.
- Educational value – learn how tensors, autograd, attention, tokenization, and training loops are implemented under the hood.
- Portability – minimal compile times and no external supply-chain risk.
- Experimentation playground – swap components (e.g., attention variants, optimizers) without fighting opaque upstream implementations.