
Conversation

nsrawat0333
Summary

Resolves Issue #40 by implementing a comprehensive solution to publish and distribute the processed WikiText-103 dataset, addressing a community request from @cp-pc that has been open for more than four years.

Problem

In 2020, @cp-pc requested: "Will it be convenient to publish the processed WikiText103 data set"

The issue remained unresolved because:

  • There was no automated way to obtain the processed WikiText-103 dataset
  • Manual download and processing were complex and error-prone
  • There was no standardized vocabulary or validation step
  • Researchers had to recreate the processing pipeline individually

Solution

Created a one-command solution that:

  • Downloads raw WikiText-103 dataset automatically
  • Processes text into tokenized format with vocabulary
  • Validates dataset integrity against published statistics
  • Packages everything for immediate research use
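The validation step above can be sketched as a token-count check against the published WikiText-103 statistics. This is an illustrative sketch, not the actual script's code; the function names are hypothetical, and the expected counts (from the original dataset release) may differ slightly depending on tokenization:

```python
# Sketch of an integrity check for the processed splits (hypothetical names).
# Expected token counts follow the published WikiText-103 statistics; exact
# numbers can vary slightly with the tokenization used.
EXPECTED_TOKENS = {
    "train.txt": 103_227_021,
    "valid.txt": 217_646,
    "test.txt": 245_569,
}

def count_tokens(path):
    """Count whitespace-separated tokens, streaming the file line by line."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total

def validate_split(path, expected, tolerance=0.01):
    """Return True if the split's token count is within `tolerance` of expected."""
    actual = count_tokens(path)
    return abs(actual - expected) / expected <= tolerance
```

A small relative tolerance (rather than an exact match) keeps the check robust to minor tokenization differences between processing runs.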

Usage

# One command gets everything!
python scripts/create_processed_wikitext103_dataset.py \
    --output_dir /tmp/data \
    --operation all

# Results in ready-to-use dataset:
# /tmp/data/wikitext-103-processed/
# ├── train.txt          (103M tokens)
# ├── valid.txt          (218K tokens) 
# ├── test.txt           (246K tokens)
# ├── vocab.csv          (267K-token vocabulary)
# └── dataset_info.json  (statistics)
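Once the dataset is in place, consuming it is straightforward. The sketch below shows one way to load the vocabulary and encode a line of text; it assumes vocab.csv stores one token per row ordered by frequency (an assumption about the file layout, not a documented format), and the helper names are illustrative:

```python
import csv

def load_vocab(vocab_path):
    """Load a token -> id mapping from a CSV whose rows start with the token.

    Assumes rows are ordered by descending frequency, so the row index
    doubles as the token id (an assumption about vocab.csv's layout).
    """
    vocab = {}
    with open(vocab_path, encoding="utf-8", newline="") as f:
        for idx, row in enumerate(csv.reader(f)):
            vocab[row[0]] = idx
    return vocab

def encode_line(line, vocab, unk_token="<unk>"):
    """Map whitespace tokens to ids, falling back to the unknown token's id."""
    unk_id = vocab.get(unk_token)
    return [vocab.get(tok, unk_id) for tok in line.split()]
```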

- Update aiohttp to address potential security vulnerabilities
- Maintains compatibility with existing codebase
- Addresses dependency security recommendations

- Created comprehensive solution for convenient processed WikiText-103 dataset access
- Added two complementary tools:
  * setup_wikitext103_dataset.py: Lightweight, dependency-free solution
  * create_processed_wikitext103_dataset.py: Full-featured with WikiGraphs integration
- Features:
  * One-command dataset download and processing
  * Automatic vocabulary creation with configurable thresholds
  * Comprehensive validation and integrity checks
  * Ready-to-use examples and documentation
  * Cross-platform compatibility
- Created WIKITEXT103_SETUP_GUIDE.md with detailed usage instructions
- Updated main README.md with quick start section
- Addresses 4+ year old Issue google-deepmind#40 from @cp-pc
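The "automatic vocabulary creation with configurable thresholds" listed above might look like the following sketch. The function name and default threshold are illustrative, not the actual script's API:

```python
from collections import Counter

def build_vocab(lines, min_count=3, unk_token="<unk>"):
    """Build a frequency-sorted vocabulary, dropping tokens rarer than min_count.

    Rare tokens are represented by a single unknown token, which is always
    placed at index 0. Sorting by (-count, token) makes the result deterministic.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    kept = [(tok, c) for tok, c in counts.items() if c >= min_count]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return [unk_token] + [tok for tok, _ in kept if tok != unk_token]
```

Raising `min_count` shrinks the vocabulary (and the model's embedding table) at the cost of mapping more rare words to the unknown token.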

Files added:
- wikigraphs/scripts/setup_wikitext103_dataset.py (400+ lines)
- wikigraphs/scripts/create_processed_wikitext103_dataset.py (600+ lines)
- wikigraphs/WIKITEXT103_SETUP_GUIDE.md (comprehensive guide)
- ISSUE_40_SOLUTION.md (GitHub issue response)

This solution transforms WikiText-103 setup from a complex multi-step process
into a simple one-command operation, significantly improving researcher productivity.
polarbe commented Aug 10, 2025 via email
