
Conversation

nsrawat0333
Summary

Resolves Issue #40 by implementing a comprehensive solution to publish and distribute the processed WikiText-103 dataset, addressing a community request from @cp-pc that has been open for more than four years.

Problem

In 2020, @cp-pc requested: "Will it be convenient to publish the processed WikiText103 data set"

The issue remained unresolved because:

  • There was no automated way to obtain the processed WikiText-103 dataset
  • Manual download and processing were complex and error-prone
  • There was no standardized vocabulary or validation step
  • Researchers had to recreate the processing pipeline individually

Solution

Created a one-command solution that:

  • Downloads raw WikiText-103 dataset automatically
  • Processes text into tokenized format with vocabulary
  • Validates dataset integrity against published statistics
  • Packages everything for immediate research use
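The validation step above can be sketched as a token-count check against the published WikiText-103 statistics. This is an illustrative sketch, not the actual script's code; the function names are hypothetical, and the expected counts (from the original dataset release) may differ slightly depending on tokenization:

```python
# Sketch of an integrity check for the processed splits (hypothetical names).
# Expected token counts follow the published WikiText-103 statistics; exact
# numbers can vary slightly with the tokenization used.
EXPECTED_TOKENS = {
    "train.txt": 103_227_021,
    "valid.txt": 217_646,
    "test.txt": 245_569,
}

def count_tokens(path):
    """Count whitespace-separated tokens, streaming the file line by line."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line.split())
    return total

def validate_split(path, expected, tolerance=0.01):
    """Return True if the split's token count is within `tolerance` of expected."""
    actual = count_tokens(path)
    return abs(actual - expected) / expected <= tolerance
```

A small relative tolerance (rather than an exact match) keeps the check robust to minor tokenization differences between processing runs.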

Usage

# One command gets everything!
python scripts/create_processed_wikitext103_dataset.py \
    --output_dir /tmp/data \
    --operation all

# Results in ready-to-use dataset:
# /tmp/data/wikitext-103-processed/
# ├── train.txt          (103M tokens)
# ├── valid.txt          (218K tokens) 
# ├── test.txt           (246K tokens)
# ├── vocab.csv          (267K-token vocabulary)
# └── dataset_info.json  (statistics)
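Once the dataset is in place, consuming it is straightforward. The sketch below shows one way to load the vocabulary and encode a line of text; it assumes vocab.csv stores one token per row ordered by frequency (an assumption about the file layout, not a documented format), and the helper names are illustrative:

```python
import csv

def load_vocab(vocab_path):
    """Load a token -> id mapping from a CSV whose rows start with the token.

    Assumes rows are ordered by descending frequency, so the row index
    doubles as the token id (an assumption about vocab.csv's layout).
    """
    vocab = {}
    with open(vocab_path, encoding="utf-8", newline="") as f:
        for idx, row in enumerate(csv.reader(f)):
            vocab[row[0]] = idx
    return vocab

def encode_line(line, vocab, unk_token="<unk>"):
    """Map whitespace tokens to ids, falling back to the unknown token's id."""
    unk_id = vocab.get(unk_token)
    return [vocab.get(tok, unk_id) for tok in line.split()]
```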

- Update aiohttp to address potential security vulnerabilities
- Maintains compatibility with existing codebase
- Addresses dependency security recommendations

- Created comprehensive solution for convenient processed WikiText-103 dataset access
- Added two complementary tools:
  * setup_wikitext103_dataset.py: Lightweight, dependency-free solution
  * create_processed_wikitext103_dataset.py: Full-featured with WikiGraphs integration
- Features:
  * One-command dataset download and processing
  * Automatic vocabulary creation with configurable thresholds
  * Comprehensive validation and integrity checks
  * Ready-to-use examples and documentation
  * Cross-platform compatibility
- Created WIKITEXT103_SETUP_GUIDE.md with detailed usage instructions
- Updated main README.md with quick start section
- Addresses 4+ year old Issue google-deepmind#40 from @cp-pc
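The "automatic vocabulary creation with configurable thresholds" listed above might look like the following sketch. The function name and default threshold are illustrative, not the actual script's API:

```python
from collections import Counter

def build_vocab(lines, min_count=3, unk_token="<unk>"):
    """Build a frequency-sorted vocabulary, dropping tokens rarer than min_count.

    Rare tokens are represented by a single unknown token, which is always
    placed at index 0. Sorting by (-count, token) makes the result deterministic.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    kept = [(tok, c) for tok, c in counts.items() if c >= min_count]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return [unk_token] + [tok for tok, _ in kept if tok != unk_token]
```

Raising `min_count` shrinks the vocabulary (and the model's embedding table) at the cost of mapping more rare words to the unknown token.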

Files added:
- wikigraphs/scripts/setup_wikitext103_dataset.py (400+ lines)
- wikigraphs/scripts/create_processed_wikitext103_dataset.py (600+ lines)
- wikigraphs/WIKITEXT103_SETUP_GUIDE.md (comprehensive guide)
- ISSUE_40_SOLUTION.md (GitHub issue response)

This solution transforms WikiText-103 setup from a complex multi-step process
into a simple one-command operation, significantly improving researcher productivity.
polarbe commented Aug 10, 2025 via email
