ScratchGPT is a Python project that implements a small-scale transformer-based language model from scratch. It is designed for educational purposes, allowing developers to explore the internals of a transformer model without the complexity of large-scale frameworks. The project provides functionality for training the model on custom datasets and generating text from a prompt.
We want to allow people to experiment easily with any sequence-to-sequence problem. This package is simple to understand and simple to use; show us your projects using ScratchGPT.
- Custom transformer architecture implementation
- Training on user-provided text data
- Text generation using the trained model
- Command-line interfaces for training and inference
- Custom Transformer Architecture: A from-the-ground-up implementation of a decoder-only transformer, including Multi-Head Self-Attention, Feed-Forward layers, and Layer Normalization.
- Flexible Tokenization: Includes a simple character-level tokenizer and a wrapper for using any tokenizer from the Hugging Face Hub.
- Configurable Training: Easily configure model architecture (e.g., embedding_size, num_heads) and training parameters (e.g., learning_rate, batch_size) via a scratch_gpt.yaml file (an illustrative sketch follows this list).
- Command-Line Interfaces: Comes with user-friendly CLIs for both training the model and performing inference.
- Pre-tokenization Caching: Caches tokenized datasets to disk for significantly faster startup on subsequent training runs.
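The exact configuration schema is defined in scratchgpt/config.py. As a rough, hypothetical sketch (not the project's actual models), a Pydantic setup backing such a YAML file could look like the following; the class names, nesting, and defaults below are assumptions, and only the parameter names come from this README:

```python
# Illustrative sketch only: the real Pydantic models live in scratchgpt/config.py
# and may use different names, structure, and defaults.
import yaml
from pydantic import BaseModel


class ModelConfig(BaseModel):
    embedding_size: int = 384
    num_heads: int = 6


class TrainingConfig(BaseModel):
    learning_rate: float = 3e-4
    batch_size: int = 64


class ExperimentConfig(BaseModel):
    model: ModelConfig = ModelConfig()
    training: TrainingConfig = TrainingConfig()


# Load and validate a hypothetical scratch_gpt.yaml from the current directory.
with open("scratch_gpt.yaml") as fh:
    config = ExperimentConfig.model_validate(yaml.safe_load(fh) or {})

print(config.model.embedding_size, config.training.learning_rate)
```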
- Python 3.12+
- uv for dependency management
- Clone the repository:
  git clone https://github.com/LabStrangeLoop/scratchgpt.git
  cd scratchgpt
- Install dependencies using uv:
  uv sync --all-groups
- Install from pip:
  pip install scratchgpt
Please take a look at the simple example in the examples folder.
Note: Some examples require additional dependencies. To run all examples, install the optional dependencies:
uv sync --extra examples-dependencies

To train the model on your custom dataset, run the train command. This will create an experiment folder containing the model weights, tokenizer files, and configuration.
uv run train -d <path_to_training_data> -e <experiment_folder>
- -d, --data_source: Path to the training data file or folder
- -e, --experiment: Path to the folder where experiment checkpoints will be saved
- -t, --tokenizer: (Optional) The Hugging Face Hub tokenizer to use (default: "gpt2")
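For example, assuming a corpus at data/tiny_shakespeare.txt and an output folder experiments/shakespeare (both hypothetical paths), a run could look like: uv run train -d data/tiny_shakespeare.txt -e experiments/shakespeare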
To generate text using a trained model, use the infer command:
uv run infer -e <experiment_folder> [-dv <device>] [-m <max_tokens>]
- -e, --experiment: Path to the folder containing the trained model
- -dv, --device: Device to run the model on (default: "cuda")
- -m, --max_tokens: Maximum number of tokens to generate (default: 512)
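For example, to sample up to 256 tokens on the CPU from a model saved under experiments/shakespeare (a hypothetical folder), you could run: uv run infer -e experiments/shakespeare -dv cpu -m 256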
This project makes it easy to create your own tokenizers, or to bootstrap Hugging Face tokenizers for use with your models.
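As a rough illustration of the character-level approach (a generic sketch, not ScratchGPT's actual tokenizer classes or interfaces), a minimal tokenizer fits in a few lines:

```python
# Minimal character-level tokenizer, for illustration only; ScratchGPT's own
# tokenizer implementations live under scratchgpt/tokenizer/ and may differ.
class CharTokenizer:
    def __init__(self, text: str) -> None:
        vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for i, ch in enumerate(vocab)}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, text: str) -> list[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, token_ids: list[int]) -> str:
        return "".join(self.itos[i] for i in token_ids)


tokenizer = CharTokenizer("hello world")
ids = tokenizer.encode("hello")
print(ids, tokenizer.decode(ids))  # [3, 2, 4, 4, 5] hello
```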
The repository is organized to separate concerns, making it easy to navigate.
- scratchgpt/train.py: Main training script.
- scratchgpt/infer.py: Inference script for text generation.
- scratchgpt/config.py: Contains all Pydantic configuration models.
- scratchgpt/model/model.py: The core Transformer model implementation.
- scratchgpt/training/trainer.py: Orchestrates the training and validation loops.
- scratchgpt/tokenizer/: Tokenizer implementations, including wrappers for Hugging Face.
- scratchgpt/model_io.py: Utilities for saving and loading models and tokenizers.
- tests/: Unit tests for the project.
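To give a feel for what the core model involves, here is a generic pre-norm decoder block with the pieces named above (multi-head self-attention, a feed-forward layer, layer normalization). This is a sketch for orientation only; it uses PyTorch's built-in attention for brevity, whereas scratchgpt/model/model.py implements these parts from scratch and may be structured differently:

```python
# Generic illustration of a pre-norm, decoder-only transformer block.
# Not the code from scratchgpt/model/model.py.
import torch
from torch import nn


class DecoderBlock(nn.Module):
    def __init__(self, embedding_size: int, num_heads: int) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(embedding_size)
        self.attn = nn.MultiheadAttention(embedding_size, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embedding_size)
        self.ffwd = nn.Sequential(
            nn.Linear(embedding_size, 4 * embedding_size),
            nn.GELU(),
            nn.Linear(4 * embedding_size, embedding_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask, need_weights=False)
        x = x + attn_out                # residual connection around attention
        x = x + self.ffwd(self.ln2(x))  # residual connection around the feed-forward layer
        return x


# Quick shape check: batch of 2 sequences, 8 tokens each, 32-dimensional embeddings.
block = DecoderBlock(embedding_size=32, num_heads=4)
print(block(torch.randn(2, 8, 32)).shape)  # torch.Size([2, 8, 32])
```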
This project uses various development tools:
- mypy for static type checking
- ruff for formatting and standards adherence
- pytest for testing
Run the following commands to ensure code quality:
uv run ruff check --fix .
uv run mypy scratchgpt
uv run pytest ./tests/
- Apply SOTA optimizations
Contributions are welcome! Please feel free to submit a Pull Request.
To build and publish a release to PyPI:
export UV_PUBLISH_USERNAME=__token__
export UV_PUBLISH_PASSWORD=
uv build -vv --wheel
uv publish --publish-url https://upload.pypi.org/legacy/
- Aleksandr Yeganov
- Dario Cazzani
