
Important

The original authors have moved on to other projects. While the code might still be functional for its original purpose, please be aware that the original team does not plan to develop new features, bug fixes, or updates. If you'd like to become a maintainer, please open an issue to discuss it.

Distilabel

Synthesize data for AI and add feedback on the fly!


Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!

Why use distilabel?

Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your data.

Take control of your data and models

Ownership of data for fine-tuning your own LLMs is not easy but Distilabel can help you to get started. We integrate AI feedback from any LLM provider out there using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.

Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

  • Community Meetup: listen in or present during one of our bi-weekly events.

  • Discord: get direct support from the community in #argilla-general and #argilla-help.

  • Roadmap: plans change but we love to discuss those with our community so feel encouraged to participate.

What do people build with Distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

  • The 1M OpenHermesPreference is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to synthesize data on an immense scale.
  • Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
  • The haiku DPO data outlines how anyone can create a dataset for a specific task, using the latest research papers to improve its quality.

Installation

pip install distilabel --upgrade

Requires Python 3.9+

In addition, the following extras are available:

LLMs

  • anthropic: for using models available in Anthropic API via the AnthropicLLM integration.
  • cohere: for using models available in Cohere via the CohereLLM integration.
  • argilla: for exporting the generated datasets to Argilla.
  • groq: for using models available in Groq with the groq Python client via the GroqLLM integration.
  • hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
  • hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
  • litellm: for using LiteLLM to call any LLM using OpenAI format via the LiteLLM integration.
  • llama-cpp: for using llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
  • mistralai: for using models available in Mistral AI API via the MistralAILLM integration.
  • ollama: for using Ollama and their available models via the OllamaLLM integration.
  • openai: for using OpenAI API models via the OpenAILLM integration, or other integrations that rely on the OpenAI client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
  • vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
  • vllm: for using the vllm serving engine via the vLLM integration.
  • sentence-transformers: for generating sentence embeddings using sentence-transformers.
  • mlx: for using MLX models via the MlxLLM integration.

Structured generation

  • outlines: for using structured generation of LLMs with outlines.
  • instructor: for using structured generation of LLMs with Instructor.

Data processing

  • ray: for scaling and distributing a pipeline with Ray.
  • faiss-cpu and faiss-gpu: for generating sentence embeddings using faiss.
  • text-clustering: for using text clustering with UMAP and Scikit-learn.
  • minhash: for using minhash for duplicate detection with datasketch and nltk.
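
Extras can be combined in a single install. For example, to pull in the OpenAI-based integrations together with the vLLM engine and Argilla export (adjust the extras to the providers you actually use):

pip install "distilabel[openai,vllm,argilla]" --upgrade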

Example

To run the following example you must install distilabel with the hf-inference-endpoints extra:

pip install "distilabel[hf-inference-endpoints]" --upgrade

Then run:

from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
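    # Steps created inside the `with Pipeline()` block are automatically added to the
    # pipeline, so the TextGeneration task below does not need to be assigned to a variable.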
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={"temperature": 0.7, "max_new_tokens": 512},
        ),
    )

if __name__ == "__main__":
    dataset = load_dataset("distilabel-internal-testing/instructions", split="test")
    distiset = pipeline.run(dataset=dataset)
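    # distiset is a dict-like Distiset holding one dataset per leaf step of the pipeline.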
    distiset.push_to_hub(repo_id="distilabel-example")
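
Note that the if __name__ == "__main__": guard is needed because pipeline.run() runs the steps in separate processes.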

Badges

If you build something cool with distilabel consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)


Contribute

To contribute directly to distilabel, check our good first issues or open a new one.

Improvements over Data Generation in RAG Tooling (data_gen)

  • Modular pipeline system that supports any type of generation (text, single or multiple images, any output format) and composable steps, whereas data_gen is limited to question-and-answer generation (limited output formats, no composability into more complex pipelines, not quite ready for multiple images).
    • This also makes it far more extensible. You can't really build on top of data_gen without modifying its internals, whereas here a new use case needs only two files: a config and a pipeline.
  • Better parallelism, handled entirely through the config, with fairly arbitrary GPU usage via tensor parallelism, replicas and available_gpus (see the sketch after this list). data_gen only has the data-parallelism wrapper I wrote, which has no tensor-parallelism support and requires manually sharding the chunks JSON before and after.
  • Input and output as Hugging Face datasets, rather than relying on the chunking library's custom format and reading/writing JSON files.
  • Built-in, transparent caching for easy resuming.
  • Inherits useful pieces from distilabel, such as the premade EvolInstructGenerator task.
  • A slightly improved prompt sampler: it is now part of the config (easier to edit and to keep multiple variants of) and can generate list fields in a single API call (e.g. generate 4 questions instead of 1 and split them into separate rows).
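
As a rough illustration of the config-driven parallelism and the list-field prompt sampling mentioned above, a stage entry could look something like the sketch below. The field names (available_gpus, tensor_parallel_size, replicas, n_questions) are hypothetical and only illustrate the idea; see src/distilabel/configs/single_pages.py for the real config fields.

from pydantic import BaseModel


class StageConfigSketch(BaseModel):
    # Hypothetical fields, for illustration only (not the actual config schema).
    model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    available_gpus: list[int] = [0, 1, 2, 3]  # GPUs this stage is allowed to use
    tensor_parallel_size: int = 2             # GPUs per vLLM server
    replicas: int = 2                         # data-parallel copies of the task
    n_questions: int = 4                      # list field: 4 questions per call, later split into rows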

Notes About My Additions

  • Run everything from outside the distilabel directory, e.g. python distilabel/pipelines/single_page_qa.py.
  • In the modified distilabel package, here are some of the files I have added (you could also check the git commit history)
    • pipelines/single_page_qa.py. Put new pipelines here. The single-page pipeline is a good reference for how to do everything; copy and modify it.
    • src/distilabel/configs/single_pages.py. The config for single-page QA; check it out to understand how the pipeline runs and what you can modify.
    • src/distilabel/pydantics.py. Put Pydantic models here (configs, output formats)
    • src/distilabel/llms/openai_compatible.py, vllm_api.py. The wrappers that handle structured generation against OpenAI-compatible endpoints for different providers, as well as vLLM servers.
    • src/distilabel/utils/misc.py, prompt_sampler.py, pipe_utils.py, image.py. Check out the prompt sampler and how it works in the config. pipe_utils.py has useful/reusable code for pipelines in general.
    • src/distilabel/steps/columns/pydantic_to_cols.py, .../steps/filtering/filter_rows.py, .../steps/list_to_rows.py, .../tasks/lm_generation.py. You can see each of them imported in the single-page QA pipeline. lm_generation.py is important to know about because I use it for the structured generation step with an LM. Kind of obvious, but this is where your custom steps go.
  • The only requirement on the dataset format is a source column, which is expected to be either a string (the direct input to the LM) or a list of image paths (pointing either to jpg/png files or to a page of a PDF in the format path/to/pdf_page_x.pdf); see the sketch after this list. At the moment this expectation only lives in VLM._format_input() when it is passed to LMGenerationTask.input_formatter, so you can change or override the input_formatter if you need to, or just make VLM._format_input() more general.
  • I handle scheduling gpus by overriding the available gpus seen by CudaDevicePlacementMixin and breaking the tasks into multiple load stages so that there are enough gpus available during each.
  • It will launch a vllm server if the model name is not a recognized proprietary model.
  • Stages in the config and load stages in distilabel are different concepts. A stage in the config may be broken into multiple load stages/groups in distilabel so that arbitrary numbers of models can be scheduled across different load stages in the pipeline.
  • I added a timeout to the output batch loop (set in constants.py): if no output batch is received within that window, the loop breaks and the pipeline execution is finalized. Hanging issues are easy to run into, so this just closes things in a controlled manner.
  • The Ray pipeline is out of date and not supported, though it would not be terribly difficult to update by comparing with the local pipeline.
  • Convergence steps (those that come after steps that are routed to (route steps)) will act like global steps (waiting for all batches to be received before continuing). This is due to issues with properly maintaining order of batches.
  • I don't believe the order of branches in def process(self, *inputs: StepInput) -> "StepOutput": is guaranteed.
  • Don't connect a global step to a routing batch function: a global step must take and output all current rows, so they would all get routed together. Connect it to a NoOp to break it up.
  • I have implemented a batch level caching system.
    • The cache key depends only on the data in the batch (order included) and the class name of the step it is being sent to.
    • This works pretty well but is sensitive to the exact elements in the batch, which are influenced by batch size.
      • Different batch sizes produce different batches, and none of them will be loaded from cache. This is how it should be in a sense, because a step receives a batch and may respond differently depending on the content of the whole batch.
      • The same goes for the offset of data given to each batch: say you have 100 rows at some point in your pipeline and batches were cached for rows 33-38, 38-43, ...; if you repeat and somehow end up with batches 32-37, 37-42, ..., none of the following batches will be retrieved from cache.
      • If you change a step's code or a routing batch function's code, caching can't detect this, but you can pass invalidate_cache=True to them to tell them to redo things. You can also pass invalidate_distiset=True to the pipeline to not use the distiset, which is cached on completion of a pipeline.
      • Caching at the step level is turned off by default except for LMGenerationTask (to save disk space for cheap, deterministic steps). Just set use_cache=True when creating anything inheriting from _Step to turn it on.
      • As a backup to this sensitivity, there is also LLM-level caching, so that LLM responses are not recomputed. The downside is that this won't stop the step from being loaded (i.e. the vLLM server still starts up) when all the LLM responses are cached. It is also sensitive to any randomness such as prompt sampling in your pipeline, so be aware of that.
    • An explanation of caching levels and setting use_cache=True/False and invalidate_cache=True/False.
      • There are 3 levels of caching:
        • Batch level: controlled by use_cache and invalidate_cache when initializing the Step/Task
        • LM Request Level: controlled by use_cache and invalidate_cache when initializing the OpenAILM.
        • Distiset level: controlled by use_cache and invalidate_distiset when calling pipeline.run().
  • You can enable a function level timer with DISTILABEL_ENABLE_TIMER=1. You can decorate functions as shown in distilabel/pipeline/base.py.
  • At the end of the pipeline, write_buffer.py writes all the final batches to disk in parquets (writing every constants.WRITE_BUFFER_SIZE rows of data). Then it has to reload each of them to make sure they have a matching schema for the whole set. Then, they are loaded into a distiset and returned. This can take e.g. 10 min for 60K batches with 16M rows. Just note the time it takes.
  • You can set the base url for running vllm (used when Config(use_running_vllm=True) in the config) with the environment variable VLLM_API_BASE_URL.
  • You can use uv to create the environment: run uv sync and then uv sync --extra flash from the root dir of the repository. This should be enough to run the pipelines I have built, though it may be missing dependencies for things like ray (which isn't supported at the moment anyway).
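
As referenced in the note on dataset format above, the only required column is source. A minimal sketch of building such an input with Hugging Face datasets (the paths are placeholders):

from datasets import Dataset

# Text variant: each "source" is a string passed straight to the LM.
text_ds = Dataset.from_dict(
    {"source": ["Summarize the attached report in two sentences."]}
)

# Image variant: each "source" is a list of image paths, either jpg/png files
# or a PDF page following the path/to/pdf_page_x.pdf convention described above.
image_ds = Dataset.from_dict(
    {"source": [["docs/report_page_1.pdf", "docs/report_page_2.pdf"]]}
)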

Notes on Distilabel (Issues and Helpful Knowledge)

  • Short version: distilabel is very particular about how things are done, so there's a reason why every line is the way it is, and I recommend starting from one of the existing pipelines. Also, reading my code for e.g. the single-page pipeline will show you how to build on top of distilabel. Use the rest of this list as an issue tracker so people know how to solve issues in the future.

  • It took me a while to figure out how to handle different providers; it turns out their OpenAI-compatible endpoints accept varying basic parameters, and it works best to ignore most of the parameters and send basic messages.
  • You can't output a Pydantic object from a step, since it isn't serializable with pyarrow (see the sketch after this list).
  • In as many places as possible, I think you want to use the load() method instead of __init__, since __init__ is handled by pydantic and you'll be able to see the inherited args if you don't override it. It also matches the logic of the library better (matching load groups better for instance).
  • I ran into some errors with the decorators that I tried to make for multiple generations and structured output, because distilabel inspects the signature of functions and somehow concluded that **kwargs was a required runtime parameter that needed to be set at pipeline init. The solution I am using is to copy the function signature from the library, though this isn't ideal for maintenance.
  • I ran into some errors with it not being able to make a pyarrow table after finishing the LMGenerationTask, which were due to the parameter add_raw_input=True. Since I overrode the OpenAILLM class to add support for more flexible vision (an arbitrary number/order of images in chat format) and to allow grouping all model providers under a single class, the formatted input was a list of messages, some text, some visual, all in one column (so you can vary the number of images). Pyarrow can't make a table out of this because the structure of a text message and an image message are different, so it can't derive a type for the column. Thus, I have set self.add_raw_input=False in e.g. the LMGenerationTask.
    • This is no longer a current issue since I moved the prompt sampler into format input, which is called before the lm and discarded after (no serialization).
  • StepResources seems like it might handle scheduling tasks across gpus for you, but I understand this only happens when using Ray, which has some internal scheduling that will respect those resources (there's a section in the documentation about how to use Ray for larger scale distributed generation).
    • What it actually does is respect replicas, which is basically just data parallelism for non-generator/global steps/tasks (it replicates models as well).
    • It will put LLMs on different gpus (provided you use the mixin properly) until it runs out of gpus (cuda_device_placement.py), but it won't reschedule them
  • To handle scheduling tasks (say your pipeline will use 10 different vllm servers but you have 8 gpus), you use load stages. See the docs
  • Task.unload() calls self.llm.unload() so you don't have to handle it yourself. If you wanted to keep it alive (say the vllm server), you'd need to get around this
  • Distilabel can handle a list of tasks in the >> syntax: for each task in the previous stage, it sends that task's completed batches to all tasks in the next stage (or, when using a router, to some subset of the next stage per batch).
  • Don't include a routing batch function in the load groups; it isn't quite a step and will throw an error, but it runs even when left out of the load groups.
  • I would like each LM to be able to have its own system prompt, which means they each need their own prompt sampler. I see two ways to do this: either make a step for each LM that holds the prompt sampler and connect them properly, or put the prompt sampler with the LM. Making a bunch of steps and connecting them seems annoying and not as clean for writing new pipelines. Putting it with the LM means you don't see the system prompt, since it isn't a step input or output, so I have sort of hacked distilabel by updating input in place, which gets forwarded to LMGenerationTask.format_output().
  • Serialization
    • Initially, I ran into an error trying to hash my Config object (for the caching system) so I overrode the serialization to return an empty dict
    • When I was trying to test the caching, I ran into another error where it couldn't resume from the yaml because the LMGenerationTask has an input_formatter callable attribute. It loads the yaml with yaml.FullLoader which won't allow arbitrary python execution (setting the input_formatter). I found Field(exclude=True) in Pydantic to solve this. Then it occurred to me that I should do the same for the configs I was using rather than erasing their signatures. After this, there was another error in resuming because it couldn't initialize the e.g. LMGenerationTask without providing the configs. So, I gave these default initializations. This uncovered another error which was a bug in distilabel, I had no choice but to modify the actual package to fix it. In DAG.from_dict(), they don't set the routing_batch_function._step which is set during Step.connect(), so I just added the line to do that.
      • The way resuming works is that when you call pipeline.run(), one of the early steps is self._refresh_pipeline_from_cache(), which essentially creates an entirely new DAG from the cached information. Then, for excluded or secret fields, it sets them using the values of the current DAG. Now that I know this, their design seems reasonable, but it is important that you understand the effect of Field(exclude=True) to get resuming working properly. The need for serialization and deserialization also justifies the extensive use of Pydantic in distilabel.
  • Had to set vllm_api field to private so that it didn't try to serialize it in multiprocessing.
  • There might be errors when changing load_groups for a pipeline that you are trying to resume.
  • You can't connect a list of steps to a routing batch function, so I have a NoOp step that can serve as a 'junction' before the routing batch function.
  • I made step resources an excluded parameter (from the signature and caching) so that you can change these and the pipeline will resume as normal
  • [IMPORTANT] I ran into a tough error with distilabel hanging when trying to resume. The root cause (or one of them) was probably that I had stopped execution in the vscode debugger, which hard-stops the program, so distilabel didn't save the batch back to the pipeline's batch manager; as a result, my initial generator step didn't have its batch data and wasn't sending it up the pipeline. I am still not entirely sure how batches are routed, since this is a large and complex system, but in any case, be wary of the hanging issue. Keep in mind the functions _manage_batch_flow, _BatchManagerStep._get_data(), get_batch(), add_batch() and _initialize_pipeline_execution(), which are related to batches in distilabel. I am not sure exactly how to solve this if it happens on something expensive to re-run. Maybe try manually editing the cache if you can find the right information.
    • You can run into an error with it not being able to allocate gpus, making os.environ['CUDA_VISIBLE_DEVICES'] fail, when it has been stopped in a weird way. Run it and let it crash, clear the cache and restart it. If you want to avoid clearing the cache, perhaps try finding the device_map file from cuda_device_placement.py and clearing that.
  • I ran into another hanging issue during a normal run: the system previously sent LAST_BATCH_SENT_FLAG only to the predecessors of a step that had sent a last batch to output, i.e. the step should be done, but only its predecessors get told to quit. When a load stage ends with a step that has multiple replicas, it will wait for the next stage but not send a signal to tell the other replicas to quit working. I added sending a signal to the step that sent the last batch to output, so that all the replicas are closed if needed. (This seems like it should have been the behaviour in the first place, but I will settle for adding it on top of the existing logic in case there was another reason for it.)
  • While debugging, you can set DISTILABEL_LOG_LEVEL=DEBUG to see a lot of helpful info.
  • [IMPORTANT] Another tough deadlock/hanging error: if one of your steps gives a batch size of 0 as output (e.g. you give the wrong system prompt for a given pydantic output and all of them fail to be structured so your filter step drops all rows), then it can hang. The responsible line, if the step is a normal step (as opposed to a convergence step (a step after a list of steps that are routed to with a routing batch function) or an accumulate step), is (probably) if num_rows == 0 and step_name in self.last_batch_received: in _BatchManagerStep (around L:555). Edit: I have made an additional case for this to avoid hanging, but I am unsure if there are unintended consequences.
  • If the vLLM server fails to start, the error will be something like: 'cannot pickle thread'. Check the vllm_logs/ for the appropriate rank.
  • You can run into a deadlock if steps you're trying to route to are split across stages and no batch is sent to a certain stage. e.g. step_0 routes to [step_1, step_2] which route to step_3, load stages are like so [[step_0, step_1], [step_2], [step_3]]. If no batches are sent to step_2, the stage won't load and run properly. Edit: I have patched this.
  • Route steps need to be 1-1 mappings (no dropping or adding rows to batches). Edit: I patched this, it should work.
  • You should still drop None after any LMGenerationTask step because there are other ways than structured generation to end up with a None in the response.
  • Say you generate multiple responses to some list of images, then split and later rejoin the images. If a response was a list with duplicates, then when split, the rows will be exact copies of each other and will trigger the warning in join parallel branches. Solution: use a smarter model.
  • The pipeline can crash and throw an error about not being able to pickle _thread.lock. In my case, this was due to not having pynvml installed and it crashed when trying to load the LMGenerationTask.
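
To expand on the Pydantic serialization note above, here is a minimal sketch of a step that validates a JSON generation into a Pydantic model and then dumps it back to plain columns before yielding, so that pyarrow can serialize the batch. It assumes distilabel's standard Step API (Step, StepInput, and a process generator, as used elsewhere in this repo); the QA model and column names are made up for illustration.

from pydantic import BaseModel

from distilabel.steps import Step, StepInput


class QA(BaseModel):
    # Hypothetical structured output, for illustration only.
    question: str
    answer: str


class PydanticToColumns(Step):
    """Sketch: turn a structured JSON generation into plain, pyarrow-friendly columns."""

    @property
    def inputs(self) -> list[str]:
        return ["generation"]

    @property
    def outputs(self) -> list[str]:
        return ["question", "answer"]

    def process(self, *inputs: StepInput) -> "StepOutput":  # noqa: F821
        for batch in inputs:
            for row in batch:
                qa = QA.model_validate_json(row["generation"])
                # Yield plain Python types; never put the Pydantic object itself in a row.
                row.update(qa.model_dump())
            yield batch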

Citation

@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}
