
Conversation

mikex86 (Member) commented Apr 5, 2025

Draft, not ready to merge yet.

Comment on lines -2 to -5

set -e

# Colors for output
Member

Let's keep this file.

param_group['lr'] = lr


OptimT = TypeVar("OptimT", bound=torch.optim.Optimizer)
Member

Suggested change
OptimT = TypeVar("OptimT", bound=torch.optim.Optimizer)
OptimT: TypeAlias = TypeVar("OptimT", bound=torch.optim.Optimizer)
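
For reference, a minimal sketch of how a bound TypeVar like this is typically consumed; the with_lr helper below is hypothetical and not from this PR, it only illustrates why the bound matters:

from typing import TypeVar

import torch

OptimT = TypeVar("OptimT", bound=torch.optim.Optimizer)

# Hypothetical helper: the bound lets callers pass any optimizer subclass
# and get the same concrete type back from the annotation.
def with_lr(optimizer: OptimT, lr: float) -> OptimT:
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return optimizer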

:param optimizer_type the type of optimizer used.
"""

def _validate_exists(to_check: List[Tuple[str, Optional[torch.Tensor]]]):
Member

Suggested change
def _validate_exists(to_check: List[Tuple[str, Optional[torch.Tensor]]]):
def _validate_exists(to_check: list[tuple[str, torch.Tensor | None]]):
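
For context, a minimal sketch of what such a check might do with the newer builtin-generic syntax; the ValueError and message wording are assumptions, not taken from this PR:

import torch

def _validate_exists(to_check: list[tuple[str, torch.Tensor | None]]):
    # Assumed behavior: collect the names whose tensors are missing and fail loudly.
    missing = [name for name, tensor in to_check if tensor is None]
    if missing:
        raise ValueError(f"missing tensors: {', '.join(missing)}")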

Comment on lines +23 to +30
hf_name="mistralai/Mistral-7B-v0.1",
# print(len(AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=True)))
vocab_size=32000,
# print(AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=True).bos_token_id)
bot_token=1,
# print(AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=True).eos_token_id)
eot_token=2,
)
Member

remove print

Comment on lines +32 to +40
return TokenizerInfo(
hf_name="meta-llama/Meta-Llama-3-8B",
# print(len(AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)))
vocab_size=128256,
# print(AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=True).bos_token_id)
bot_token=128000,
# print(AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True).eos_token_id)
eot_token=128001,
)
Member

remove print

Member Author

Do we not want to tell people how to re-obtain these numbers easily?
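
For example, they can be re-derived with a few lines (assuming transformers is installed and the gated models are accessible):

from transformers import AutoTokenizer

# Re-derive vocab_size, bot_token and eot_token for a given HF tokenizer.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=True)
print(len(tok))           # vocab_size
print(tok.bos_token_id)   # bot_token
print(tok.eos_token_id)   # eot_token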

Comment on lines -1 to -16
import copy
import torch
from zeroband.data import InterleaveDataset, ParquetDataset, SequencePackingDataSet, collate_fn
from torch.utils.data import DataLoader
from zeroband.data import load_all_datasets, DataConfig
from zeroband.utils.logger import get_logger
from collections import Counter
from itertools import chain
import pytest
import logging
import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker
from typing import List
import string
from torchdata.stateful_dataloader import StatefulDataLoader
Member

Why remove the sequence packing tests?

Member Author

Need to re-add them; they were incompatible after the port.

Comment on lines +231 to +233

if __name__ == '__main__':
pytest.main()
Member

remove
