Fix various small problems #367

janEbert · 2023-02-28T16:38:53Z

Problems:

- The BERT/T5 dataset handles already corrupted indices incorrectly.
- The GPT tokenizer's vocab size does not include special tokens.
- In Eval Harness evaluation, the last token of the target is unconditionally truncated. This produces wrong evaluation metrics, so it is actually not such a small problem.

This PR fixes these issues. Since the changes are so small, I didn't bother creating 3 PRs.
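
For readers unfamiliar with the first problem, here is a minimal sketch of what skipping covered indices can look like when choosing spans to corrupt. It is not the code from this repository: `pick_masked_spans`, `candidate_spans`, and `max_to_mask` are hypothetical names chosen for illustration, and the selection budget is an assumed detail.

```python
import random


def pick_masked_spans(candidate_spans, max_to_mask, seed=0):
    """Pick spans to corrupt, skipping any span that touches a covered index.

    Hypothetical helper for illustration only; not the repository's function.
    """
    rng = random.Random(seed)
    spans = [list(span) for span in candidate_spans]
    rng.shuffle(spans)

    covered = set()  # indices that are already corrupted
    chosen = []
    for span in spans:
        # Respect the (assumed) budget of indices that may be corrupted.
        if len(covered) + len(span) > max_to_mask:
            continue
        # Skip the whole span if any of its indices is already covered.
        if any(idx in covered for idx in span):
            continue
        covered.update(span)
        chosen.append(span)
    return chosen, covered
```

With this check in place, no index is corrupted twice, so the number of covered indices always equals the total length of the chosen spans.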

janEbert added 3 commits February 28, 2023 17:36

Fix covered index skipping

ce3f6c0

Fix GPT tokenizer vocab size query

f7c583f

The queried vocab size did not include additional special tokens.
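
A toy sketch of why this matters, assuming a tokenizer that keeps its base vocabulary and later-added special tokens in separate tables. `ToyTokenizer` and its attributes are hypothetical and do not mirror the actual tokenizer class in this repository.

```python
class ToyTokenizer:
    """Hypothetical tokenizer with a base vocab plus added special tokens."""

    def __init__(self, base_vocab):
        self.encoder = {tok: i for i, tok in enumerate(base_vocab)}
        self.added_tokens = {}

    def add_special_tokens(self, tokens):
        # New tokens get IDs directly after the existing vocabulary.
        for tok in tokens:
            if tok not in self.encoder and tok not in self.added_tokens:
                self.added_tokens[tok] = len(self.encoder) + len(self.added_tokens)

    @property
    def base_vocab_size(self):
        # Counts only the base vocabulary; too small once tokens are added.
        return len(self.encoder)

    @property
    def vocab_size(self):
        # Counts the base vocabulary and the added special tokens.
        return len(self.encoder) + len(self.added_tokens)
```

In this sketch, an embedding table sized with `base_vocab_size` would have no row for the IDs of the added tokens, whereas `vocab_size` covers every ID the tokenizer can emit.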

Do not remove last token

cfd6374

Removing the last token corrupts the targets. There is no good reason for this.
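
The following sketch shows one way such a truncation can distort metrics. It assumes the usual next-token scoring setup, in which the last token is dropped from the model input but not from the targets. `build_scoring_example` and `exact_match` are hypothetical helpers, not the Eval Harness adapter code touched by this commit.

```python
def build_scoring_example(context_ids, continuation_ids):
    """Build inputs and targets for scoring a continuation given a context.

    Hypothetical helper; assumes a non-empty context and continuation.
    """
    if not context_ids or not continuation_ids:
        raise ValueError("context and continuation must be non-empty")

    full = list(context_ids) + list(continuation_ids)
    # The final token is never fed to the model: nothing follows it.
    inputs = full[:-1]
    # The targets are the complete continuation, last token included.
    targets = list(continuation_ids)
    # Input positions whose next-token predictions are compared to targets.
    first = len(inputs) - len(targets)
    scored_positions = list(range(first, len(inputs)))
    return inputs, targets, scored_positions


def exact_match(predicted_ids, targets):
    """Return True if every predicted token equals its target token."""
    return list(predicted_ids) == list(targets)
```

If the targets were additionally cut by one token, a prediction that is wrong only at the final position would still count as an exact match, which is the kind of metric error described above.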
