Important
The original authors have moved on to other projects. While the code might still be functional for its original purpose, please be aware that the original team does not plan to develop new features, bug fixes, or updates. If you'd like to become a maintainer, please open an issue to discuss it.

Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your data.
Ownership of data for fine-tuning your own LLMs is not easy, but Distilabel can help you get started. We integrate AI feedback from any LLM provider out there using one unified API.
Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.
We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:
- Community Meetup: listen in or present during one of our bi-weekly events.
- Discord: get direct support from the community in #argilla-general and #argilla-help.
- Roadmap: plans change, but we love to discuss those with our community, so feel encouraged to participate.
The Argilla community uses distilabel to create amazing datasets and models.
- The 1M OpenHermesPreference is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to synthesize data on an immense scale.
- Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
- The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve the quality of the dataset.
```bash
pip install distilabel --upgrade
```

Requires Python 3.9+.
In addition, the following extras are available:
- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `groq`: for using models available in Groq using the `groq` Python client via the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the transformers package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the llama-cpp-python Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the vllm serving engine via the `vLLM` integration.
- `sentence-transformers`: for generating sentence embeddings using sentence-transformers.
- `mlx`: for using MLX models via the `MlxLLM` integration.
- `outlines`: for using structured generation of LLMs with outlines.
- `instructor`: for using structured generation of LLMs with Instructor.
- `ray`: for scaling and distributing a pipeline with Ray.
- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using faiss.
- `text-clustering`: for using text clustering with UMAP and Scikit-learn.
- `minhash`: for using minhash for duplicate detection with datasketch and nltk.
To run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:

```bash
pip install "distilabel[hf-inference-endpoints]" --upgrade
```
Then run:
```python
from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        ),
    )

if __name__ == "__main__":
    dataset = load_dataset("distilabel-internal-testing/instructions", split="test")
    distiset = pipeline.run(dataset=dataset)
    distiset.push_to_hub(repo_id="distilabel-example")
```
If you build something cool with `distilabel`, consider adding one of these badges to your dataset or model card.
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
To directly contribute with `distilabel`, check our good first issues or open a new one.
- Modular pipeline system that allows for any type of generation (text, image, images, any output format) and composable steps, whereas data_gen is really only for questions and answers (limited output format, no composability into more complex pipelines, not quite ready for multiple images).
- This also makes it way more extensible. You can't really build on top of data_gen, only modify its internals to do some simple generation; you can build on top of this with two files, a config and a pipeline.
- Better parallelism, handled with just a config and allowing pretty arbitrary GPU usage via tensor parallelism, replicas and available_gpus. data_gen only has the data parallelism wrapper I made, which has no tensor parallelism support and requires sharding the chunks JSON manually before and after.
- Input and output are Hugging Face datasets, rather than using the chunking library with its custom format and taking/outputting JSONs.
- Built-in and hidden caching for easy resuming.
- Inherits some cool things from distilabel such as the premade EvolInstructGenerator Task and others.
- Slightly improved prompt sampler by making it part of the config (easier to edit and have multiple of) and adding the ability to generate list fields in an API call (say, generate 4 questions instead of 1 and split these into separate rows).
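
  A minimal sketch of the list-field idea (the model and helper below are illustrative, not the fork's actual code):

  ```python
  from pydantic import BaseModel


  class PageQuestions(BaseModel):
      # Ask the LM for several questions in a single API call...
      questions: list[str]


  def split_list_field(row: dict, field: str = "questions") -> list[dict]:
      """...then split the generated list field into one row per element."""
      return [{**row, field: item} for item in row[field]]


  # One generated row becomes four dataset rows.
  row = {"source": "page_12.png", "questions": ["Q1?", "Q2?", "Q3?", "Q4?"]}
  print(split_list_field(row))
  ```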
- Run everything from outside the `distilabel` directory, e.g. `python distilabel/pipelines/single_page_qa.py`.
- In the modified distilabel package, here are some of the files I have added (you could also check the git commit history):
  - `pipelines/single_page_qa.py`: put new pipelines here. The single page one is a good reference for how to do everything; copy and modify it.
  - `src/distilabel/configs/single_pages.py`: the config for single page QA; check it out to understand how the pipeline runs and what you can modify.
  - `src/distilabel/pydantics.py`: put Pydantic models here (configs, output formats).
  - `src/distilabel/llms/openai_compatible.py`, `vllm_api.py`: the wrapper that handles structured generation with OpenAI-compatible endpoints for different providers, as well as vLLM servers.
  - `src/distilabel/utils/misc.py`, `prompt_sampler.py`, `pipe_utils.py`, `image.py`: check out the prompt sampler and how it works in the config. `pipe_utils.py` has useful/reusable code for pipelines in general.
  - `src/distilabel/steps/columns/pydantic_to_cols.py`, `.../steps/filtering/filter_rows.py`, `.../steps/list_to_rows.py`, `.../tasks/lm_generation.py`: you can see each of them imported in the single page QA pipeline. `lm_generation.py` is important to know of because I use it for the structured generation step using an LM. Kind of obvious, but this is where your custom steps go.
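
  Custom steps follow the usual distilabel pattern; a toy sketch (not one of the files above) looks like this:

  ```python
  from distilabel.steps import Step, StepInput


  class AddWordCount(Step):
      """Toy custom step: appends a word-count column to every row of each batch."""

      @property
      def inputs(self) -> list[str]:
          return ["response"]

      @property
      def outputs(self) -> list[str]:
          return ["word_count"]

      def process(self, *inputs: StepInput) -> "StepOutput":
          for batch in inputs:
              for row in batch:
                  row["word_count"] = len(str(row["response"]).split())
              yield batch
  ```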
- The only requirement for the dataset format is having a source column, which is expected to be a string (straight input to the LM) or a list of image paths (which can point straight to jpg/png files or to a page in a PDF with the format `path/to/pdf_page_x.pdf`). This is at the moment only an expectation in `VLM._format_input()` when it is passed to `LMGenerationTask.input_formatter`, so you can change/override the `input_formatter` if you need to, or just make `VLM._format_input()` more general.
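
  A sketch of the expected format, assuming the column is literally named `source` and using made-up paths:

  ```python
  from datasets import Dataset

  # Text-only input: `source` is the string fed straight to the LM.
  text_ds = Dataset.from_list(
      [{"source": "Summarize the attached report in two sentences."}]
  )

  # Vision input: `source` is a list of image paths; a PDF page can be referenced
  # with the `path/to/pdf_page_x.pdf` convention described above.
  vision_ds = Dataset.from_list(
      [{"source": ["figures/plot1.png", "reports/annual_report_page_3.pdf"]}]
  )
  ```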
- I handle scheduling GPUs by overriding the available GPUs seen by `CudaDevicePlacementMixin` and breaking the tasks into multiple load stages so that there are enough GPUs available during each.
- It will launch a vLLM server if the model name is not a recognized proprietary model.
- Stages in the config and load stages in distilabel are different concepts. A stage in the config may be broken into multiple load stages/groups in distilabel so that I can handle scheduling arbitrary numbers of models into different load stages in the pipeline.
- I added a timeout to the output batch loop (set in `constants.py`); if it doesn't receive an output batch within that timeframe, it will break and finalize the pipeline execution. Hanging issues are easy to run into, so this just closes things in a controlled manner.
- The Ray pipeline is out of date and not supported, though it would not be terribly difficult to update by comparing with the local pipeline.
- Convergence steps (those that come after steps that are routed to (route steps)) will act like global steps (waiting for all batches to be received before continuing). This is due to issues with properly maintaining the order of batches.
- I don't believe the order of branches in `def process(self, *inputs: StepInput) -> "StepOutput":` is guaranteed.
- Don't connect a global step to a routing batch function; a global step must take and output all current rows, so they will all get routed together. Connect it to a `NoOp` to break it up.
- I have implemented a batch level caching system.
  - The cache key depends only on the data in the batch (order included) and the name of the class of the step it is being sent to.
  - This works pretty well but is sensitive to the elements in the batch, which is influenced by batch size.
  - Different batch sizes are different batches; none of them will be loaded from cache. This is how it should be, in a sense, because a step receives a batch and may respond differently depending on the content of the whole batch.
  - The same goes for the offset of data given to a batch: say you have 100 rows at some point in your pipeline; if batches were cached for rows 33-38, 38-43, ..., and then you repeat and somehow end up with batches 32-37, 37-42, ..., none of the following batches will be retrieved from cache.
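
  A rough sketch of the idea (not the exact implementation): the key is derived from the ordered batch contents plus the class name of the receiving step, so any change in batch composition produces a different key.

  ```python
  import hashlib
  import json


  def batch_cache_key(batch_rows: list[dict], step_class_name: str) -> str:
      """Illustrative only: hash the ordered batch data with the receiving step's class name."""
      payload = json.dumps({"rows": batch_rows, "step": step_class_name}, default=str)
      return hashlib.sha256(payload.encode("utf-8")).hexdigest()
  ```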
- If you change a step's code or a routing batch function's code, caching can't detect this, but you can pass `invalidate_cache=True` to them to tell them to redo things. You can also pass `invalidate_distiset=True` to the pipeline to not use the distiset that is cached on completion of a pipeline.
- Caching at the step level is turned off by default except for `LMGenerationTask` (to save disk space for cheap, deterministic steps). Just set `use_cache=True` when creating anything inheriting from `_Step` to turn it on.
- As a backup to this sensitivity, there is LLM-level caching as well, so that LLM responses are not recomputed. The downside is that this won't stop the step from being loaded (i.e. the vLLM server startup) when all the LLM responses are cached. It is also sensitive to any randomness, such as prompt sampling in your pipeline, so be aware of that.
- An explanation of caching levels and setting `use_cache=True/False` and `invalidate_cache=True/False`. There are 3 levels of caching:
  - Batch level: controlled by `use_cache` and `invalidate_cache` when initializing the `Step`/`Task`.
  - LM request level: controlled by `use_cache` and `invalidate_cache` when initializing the `OpenAILM`.
  - Distiset level: controlled by `use_cache` and `invalidate_distiset` when calling `pipeline.run()`.
- You can enable a function-level timer with `DISTILABEL_ENABLE_TIMER=1`. You can decorate functions as shown in `distilabel/pipeline/base.py`.
- At the end of the pipeline, `write_buffer.py` writes all the final batches to disk as parquets (writing every `constants.WRITE_BUFFER_SIZE` rows of data). It then has to reload each of them to make sure they have a matching schema for the whole set. Then they are loaded into a distiset and returned. This can take e.g. 10 min for 60K batches with 16M rows; just note the time it takes.
- You can set the base URL for a running vLLM server (used when `Config(use_running_vllm=True)` is set in the config) with the environment variable `VLLM_API_BASE_URL`.
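
  For example (the address is hypothetical; point it at wherever your server is actually listening):

  ```python
  import os

  # Use an already-running vLLM server instead of launching one,
  # in combination with Config(use_running_vllm=True) as described above.
  os.environ["VLLM_API_BASE_URL"] = "http://localhost:8000/v1"
  ```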
- You can use `uv` to make the environment. Simply run `uv sync`, then `uv sync --extra flash`, from the root dir of the repository. This should run the pipelines I have built, though it may be missing dependencies for things like Ray (which aren't supported at the moment anyways).
- Short Version: distilabel is very particular about how things are done, so there's a reason why every line is the way it is, and I recommend starting from one of the existing pipelines. Also, reading my code for e.g. the single page pipeline will tell you how to build on top of distilabel. Use the rest of this list as an issue tracker so people know how to solve issues in the future.
- It took me a while to figure out how to handle different providers; it turns out their OpenAI-compatible endpoints accept varying basic parameters, and it works best to ignore most of the parameters and send basic messages.
- You can't output a pydantic object from a step since it isn't serializable with pyarrow.
- In as many places as possible, I think you want to use the `load()` method instead of `__init__`, since `__init__` is handled by Pydantic and you'll be able to see the inherited args if you don't override it. It also matches the logic of the library better (matching load groups better, for instance).
- I ran into some errors with the decorators that I tried to make for multiple generations and structured output, because distilabel inspects the signature of functions and somehow decided that `**kwargs` was a required runtime parameter that needed to be set at pipeline init. The solution I am using is to copy the function signature from the library, though this isn't ideal for maintenance.
- I ran into some errors with it not being able to make a pyarrow table after finishing the `LMGenerationTask`, which were due to the parameter `add_raw_input=True`. Since I overrode the `OpenAILLM` class to add support for more flexible vision (an arbitrary number/order of images in chat format) and to allow grouping all model providers under a single class, the formatted input was a list of messages, some text, some visual, all in one column (so you can vary the number of images). Pyarrow can't make a table out of this because the structure of a text message and an image message are different, so it can't make a type for the column. Thus, I have set `self.add_raw_input=False` in e.g. the `LMGenerationTask`.
  - This is no longer a current issue since I moved the prompt sampler into `format_input`, which is called before the LM and discarded after (no serialization).
- `StepResources` seems like it might handle scheduling tasks across GPUs for you, but I understand this only happens when using Ray, which has some internal scheduling that will respect those resources (there's a section in the documentation about how to use Ray for larger scale distributed generation).
  - What it actually does is respect `replicas`, which is basically just data parallelism for non-generator/global steps/tasks (it replicates models as well).
  - It will put LLMs on different GPUs (provided you use the mixin properly) until it runs out of GPUs (`cuda_device_placement.py`), but it won't reschedule them.
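
  A sketch of the `replicas` usage with a stock distilabel task (the model is just an example):

  ```python
  from distilabel.models import InferenceEndpointsLLM
  from distilabel.steps import StepResources
  from distilabel.steps.tasks import TextGeneration

  # Two replicas of the same task: plain data parallelism, so the model is replicated too.
  task = TextGeneration(
      llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
      resources=StepResources(replicas=2),
  )
  ```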
- To handle scheduling tasks (say your pipeline will use 10 different vLLM servers but you have 8 GPUs), you use load stages. See the docs.
- `Task.unload()` calls `self.llm.unload()` so you don't have to handle it yourself. If you wanted to keep it alive (say the vLLM server), you'd need to get around this.
- Distilabel can handle a list of tasks in the `>>` syntax: for each task in the previous stage, it sends the task's completed batches to all of the next stage (or, in the case of using a router, it will select some subset of the next stage per batch).
- Don't include a routing function step in the load groups; it isn't quite a step and will throw an error, but it runs even when left out of the load groups.
- I would like each LM to be able to have its own system prompt, which means they each need their own prompt sampler. I see two ways to do this: either make a step for each LM that has the prompt sampler and connect them properly, or put the prompt sampler with the LM. Making a bunch of steps and connecting them seems annoying and not as clean for writing new pipelines. Putting it with the LM means you don't see the system prompt since it isn't a step input or output, so I have sort of hacked distilabel by updating the input in place, which gets forwarded to `LMGenerationTask.format_output()`.
- Serialization
  - Initially, I ran into an error trying to hash my Config object (for the caching system), so I overrode the serialization to return an empty dict.
  - When I was trying to test the caching, I ran into another error where it couldn't resume from the YAML because the `LMGenerationTask` has an `input_formatter` callable attribute. It loads the YAML with `yaml.FullLoader`, which won't allow arbitrary Python execution (setting the `input_formatter`). I found `Field(exclude=True)` in Pydantic to solve this. Then it occurred to me that I should do the same for the configs I was using rather than erasing their signatures. After this, there was another error in resuming because it couldn't initialize e.g. the `LMGenerationTask` without providing the configs, so I gave these default initializations. This uncovered another error, which was a bug in distilabel; I had no choice but to modify the actual package to fix it. In `DAG.from_dict()`, they don't set the `routing_batch_function._step` which is set during `Step.connect()`, so I just added the line to do that.
  - The way its resuming works is that when you call `pipeline.run()`, one of the early steps is `self._refresh_pipeline_from_cache()`, which essentially creates an entirely new DAG from the cached information. Then, for excluded or secret fields, it sets them using the values of the current DAG. Now that I know this, their design seems reasonable, but it is important that you understand the effect of `Field(exclude=True)` to get resuming working properly. The need for serialization and deserialization also justifies the extensive use of Pydantic in distilabel.
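
  The `Field(exclude=True)` behaviour in isolation (a plain Pydantic example, not the fork's actual class):

  ```python
  from typing import Callable, Optional

  from pydantic import BaseModel, Field


  class ExampleTask(BaseModel):
      name: str
      # exclude=True keeps the callable out of model_dump()/the serialized YAML,
      # so resuming never needs yaml to reconstruct a Python callable.
      input_formatter: Optional[Callable] = Field(default=None, exclude=True)


  task = ExampleTask(name="single_page_qa", input_formatter=lambda row: row)
  print(task.model_dump())  # {'name': 'single_page_qa'} -- input_formatter is excluded
  ```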
- Had to set the `vllm_api` field to private so that it didn't try to serialize it in multiprocessing.
- There might be errors with changing load_groups for a pipeline that you are trying to resume.
- You can't connect a list of steps to a routing batch function, so I have a `NoOp` step that can serve as a 'junction' before the routing batch function.
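
  A hedged sketch of the 'junction' wiring (the `NoOp` is re-declared inline for illustration, and the routing function and model are just examples):

  ```python
  from distilabel.models import InferenceEndpointsLLM
  from distilabel.pipeline import Pipeline, sample_n_steps
  from distilabel.steps import Step, StepInput
  from distilabel.steps.tasks import TextGeneration


  class NoOp(Step):
      """Pass-through 'junction': forwards batches unchanged so that a single step,
      rather than a list of steps, feeds the routing batch function."""

      @property
      def inputs(self) -> list[str]:
          return []

      @property
      def outputs(self) -> list[str]:
          return []

      def process(self, *inputs: StepInput) -> "StepOutput":
          for batch in inputs:
              yield batch


  def make_task() -> TextGeneration:
      # Stand-in for whatever tasks sit around the router (e.g. an LMGenerationTask in the fork).
      return TextGeneration(
          llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
      )


  with Pipeline() as pipeline:
      upstream = [make_task(), make_task()]
      junction = NoOp()
      route_to_one = sample_n_steps(1)  # each batch goes to one downstream task
      downstream = [make_task(), make_task()]

      upstream >> junction >> route_to_one >> downstream
  ```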
- I made step resources an excluded parameter (from the signature and caching) so that you can change these and the pipeline will resume as normal.
- [IMPORTANT] I ran into a tough error with distilabel hanging when trying to resume. The root cause (or one of them) was probably that I had stopped execution in the VS Code debugger, which hard-stops the program, so distilabel didn't save the batch back to the pipeline's batch manager, making it so that my initial generator step didn't have its batch data and wasn't sending it up the pipeline. I am still not entirely sure how batches are routed, since this is a large and complex system, but anyways, be wary of the hanging issue. Keep in mind the functions `_manage_batch_flow`, `_BatchManagerStep._get_data()`, `get_batch()`, `add_batch()` and `_initialize_pipeline_execution()`, which are related to batches in distilabel. I am not sure how exactly to solve this if it happens on something expensive to re-run; maybe try manually editing the cache if you can find the right information.
  - You can run into an error with it not being able to allocate GPUs, making `os.environ['CUDA_VISIBLE_DEVICES']` fail, when it has been stopped in a weird way. Run it and let it crash, clear the cache and restart it. If you want to avoid clearing the cache, perhaps try finding the device_map file from `cuda_device_placement.py` and clearing that.
- Ran into another hanging issue during a normal run: the system previously sent `LAST_BATCH_SENT_FLAG` only to the predecessors of a step that had sent a last batch to output, i.e. the step should be done, but only the predecessors get told to quit. When a load stage ends with a step that has multiple replicas, it will wait for the next stage but not send a signal to tell the other replicas to quit working. I added sending a signal to the step that sent the last batch to output so that all the replicas are closed if needed. (This seems like it should be the way to do things in the first place, but I will settle for adding it on top of the existing logic in case there was another reason for it.)
- While debugging, you can set `DISTILABEL_LOG_LEVEL=DEBUG` to see a lot of helpful info.
- [IMPORTANT] Another tough deadlock/hanging error: if one of your steps gives a batch size of 0 as output (e.g. you give the wrong system prompt for a given pydantic output and all of them fail to be structured, so your filter step drops all rows), then it can hang. The responsible line, if the step is a normal step (as opposed to a convergence step (a step after a list of steps that are routed to with a routing batch function) or an accumulate step), is probably `if num_rows == 0 and step_name in self.last_batch_received:` in `_BatchManagerStep` (around L555). Edit: I have made an additional case for this to avoid hanging, but I am unsure if there are unintended consequences.
- If the vLLM server fails to start, the error will be something like 'cannot pickle thread'. Check the `vllm_logs/` for the appropriate rank.
- Route steps need to be 1-1 mappings (no dropping or adding rows to batches). Edit: I patched this, it should work.
- You should still drop None after any LMGenerationTask step because there are other ways than structured generation to end up with a None in the response.
- Say you generate multiple responses to some list of images, then split and later rejoin the images. If a response was a list with duplicates, then when split, the rows will be exact copies of each other and will trigger the warning in join parallel branches. Solution: use a smarter model.
- The pipeline can crash and throw an error about not being able to pickle `_thread.lock`. In my case, this was due to not having pynvml installed, and it crashed when trying to load the LMGenerationTask.
```bibtex
@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}
```