Conversation

nitsanluke
Contributor
@nitsanluke nitsanluke commented May 8, 2025

✨ Description

Migrated from #248; this PR allows a dataset with prompt and completion columns (and, in general, any pair of text columns, e.g. question and answer) to be combined and tokenized. Input is limited to two columns, which covers the majority of use cases. An optional delimiter can be specified; it is prepended to the completion text. By default the two inputs are joined with "" (no delimiter).

Note: Merge after #255

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Create a new data source schema for prompt & completion style datasets
  2. Include a concatenation function and loss-masking span creation

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Sample config:

loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./debug_test/gsm8k
dataset:
  path: openai/gsm8k
  config_name: main
  split: train
  trust_remote_code: true
  source_schema:
    type: prompt_completion
    prompt_column: question
    completion_column: answer
    delimiter: " "

tokenizer:
  path: /mnt/slam_checkpoints/upstream/Mistral-Nemo-Base-2407/

(0, len(str(example[source_schema.prompt_column])) - 1)
]  # spans are inclusive
},
batched=False,
Member
same comment as above about num_proc

Collaborator

@sohamparikh @nitsanluke, I was thinking about the tradeoff we're making here between these two options:

  1. (right now) merge the two columns into one, add spans, then split it up again to be tokenized individually. the benefit is that we can reuse the processing for the default schema, so it's a form of input normalization. the downside is that this seems to be not very straightforward or explicit.
  2. (could be) tokenize the two columns individually, then merge.

I feel like with more input schemas being added, some refactor of what we introduce here is unavoidable. Can we be more forward-looking?
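Option 2 from the tradeoff above (tokenize each column separately, then merge the token ids) can be sketched as below. This is a hypothetical illustration, not the PR's code; the toy CharTokenizer is a stand-in for a real tokenizer.

```python
class CharTokenizer:
    """Toy tokenizer: one token per character (stand-in for a real one)."""
    def encode(self, text):
        return [ord(c) for c in text]

def tokenize_pair(tokenizer, prompt, completion, delimiter=""):
    """Tokenize columns individually, then concatenate the token ids."""
    prompt_ids = tokenizer.encode(prompt)
    completion_ids = tokenizer.encode(delimiter + completion)
    input_ids = prompt_ids + completion_ids
    # The loss-masking span falls out directly in token space; no need to
    # merge the text first and re-derive character offsets afterwards.
    span = (0, len(prompt_ids) - 1)  # inclusive token span covering the prompt
    return {"input_ids": input_ids, "loss_masking_spans": [span]}

out = tokenize_pair(CharTokenizer(), "2+2?", "4", delimiter=" ")
```

The upside of this shape is that the span is explicit at the point where tokens are produced, rather than being reconstructed from a merged string.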

Member

I agree with tokenizing columns individually, can do the same when we support chat templates
#262

Collaborator

@nitsanluke, did you see this:

I agree with tokenizing columns individually, can do the same when we support chat templates #262

we prefer this to be done differently

Contributor Author

@nitsanluke nitsanluke Jul 23, 2025

@tscholak Changed to tokenize separately and then combine.

Base automatically changed from restructure_dataset_config_for_multi_source to main June 16, 2025 14:47
@tscholak tscholak marked this pull request as ready for review July 11, 2025 16:00
@tscholak tscholak requested review from tscholak and sohamparikh July 11, 2025 16:00
@nitsanluke nitsanluke requested a review from jlamypoirier August 6, 2025 18:15
@oleksost
Contributor

oleksost commented Aug 6, 2025

@nitsanluke @tscholak @jlamypoirier merging this one would be useful I think.

Little correction to the example config:

source_schema:
    prompt_column: question
    completion_column: answer
    delimiter: " "

it is missing the line type: prompt_completion

@nitsanluke nitsanluke merged commit 62c0040 into main Aug 8, 2025
4 checks passed
@nitsanluke nitsanluke deleted the concat_inputs branch August 8, 2025 22:22