liveview-native/specs

WHATWG Specification Manager

Single script to download and manage all 22 WHATWG specifications, optimized for LLM token consumption.

Quick Start

# Install dependencies
npm install

# Set your Anthropic API key
export ANTHROPIC_API_KEY="your-api-key-here"

# Download and optimize all 22 specs
node specs.js download

# Or use npm scripts
npm run download

# Check what's downloaded
npm run status

# Remove all specs
npm run clean

Commands

Command                   Description
node specs.js download    Download and optimize all 22 WHATWG specs
node specs.js status      Show which specs are downloaded with sizes and token counts
node specs.js list        List all 22 available specs with URLs
node specs.js clean       Remove all downloaded specs from working directory
node specs.js help        Show help message

Or use npm scripts: npm run download, npm run status, npm run list, npm run clean

What It Does

When you run node specs.js download:

  1. ✅ Creates temporary working directory in /tmp
  2. ✅ Downloads each spec HTML to /tmp (not your working directory)
  3. ✅ Converts HTML to markdown in /tmp using pandoc
  4. ✅ Uses opencode AI to intelligently optimize while preserving ALL technical content:
    • Removes table of contents
    • Removes references sections
    • Removes acknowledgments / acknowledgements
    • Removes intellectual property sections
    • Removes licensing / copyright sections
    • Removes index sections
    • Removes all {#anchor-id} and {.class} metadata
    • Removes section numbering [1.2.3]
    • Removes base64 images and SVG diagrams
    • Removes decorative elements
    • Preserves 100% of technical specifications, algorithms, examples, and definitions
  5. ✅ Saves optimized <spec>.md to working directory
  6. ✅ Automatically cleans up entire /tmp directory

Result: Only clean, optimized .md files with complete technical content in your working directory!

All 22 Specifications (Alphabetical Order)

#   Spec           URL                                     Description
1   compat         https://compat.spec.whatwg.org/         Compatibility Standard
2   compression    https://compression.spec.whatwg.org/    Compression Standard
3   console        https://console.spec.whatwg.org/        Console Standard
4   cookiestore    https://cookiestore.spec.whatwg.org/    Cookie Store API
5   dom            https://dom.spec.whatwg.org/            DOM Standard
6   encoding       https://encoding.spec.whatwg.org/       Encoding Standard
7   fetch          https://fetch.spec.whatwg.org/          Fetch Standard
8   fs             https://fs.spec.whatwg.org/             File System Standard
9   fullscreen     https://fullscreen.spec.whatwg.org/     Fullscreen API
10  html           https://html.spec.whatwg.org/           HTML Living Standard
11  infra          https://infra.spec.whatwg.org/          Infra Standard
12  mimesniff      https://mimesniff.spec.whatwg.org/      MIME Sniffing Standard
13  notifications  https://notifications.spec.whatwg.org/  Notifications API
14  quirks         https://quirks.spec.whatwg.org/         Quirks Mode Standard
15  storage        https://storage.spec.whatwg.org/        Storage Standard
16  streams        https://streams.spec.whatwg.org/        Streams Standard
17  testutils      https://testutils.spec.whatwg.org/      Test Utils Standard
18  url            https://url.spec.whatwg.org/            URL Standard
19  urlpattern     https://urlpattern.spec.whatwg.org/     URL Pattern Standard
20  webidl         https://webidl.spec.whatwg.org/         Web IDL Standard
21  websockets     https://websockets.spec.whatwg.org/     WebSockets Standard
22  xhr            https://xhr.spec.whatwg.org/            XMLHttpRequest Standard
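
Every URL in the table follows the same https://<name>.spec.whatwg.org/ pattern, so the list can be derived from the short names alone. A minimal sketch, assuming the hypothetical names SPEC_NAMES and specUrl (the actual layout of the array in specs.js may differ):

```javascript
// Hypothetical helper: the real specs.js may store its list differently.
const SPEC_NAMES = [
  "compat", "compression", "console", "cookiestore", "dom", "encoding",
  "fetch", "fs", "fullscreen", "html", "infra", "mimesniff",
  "notifications", "quirks", "storage", "streams", "testutils", "url",
  "urlpattern", "webidl", "websockets", "xhr",
];

// Build the spec URL from its short name, following the table's pattern.
function specUrl(name) {
  if (!SPEC_NAMES.includes(name)) {
    throw new Error(`Unknown spec: ${name}`);
  }
  return `https://${name}.spec.whatwg.org/`;
}

console.log(specUrl("dom")); // https://dom.spec.whatwg.org/
```

Rejecting unknown names early keeps a typo from turning into a failed download later on.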

Token Optimization Results

Individual Specifications (Estimated)

Spec           Original HTML  Optimized MD  Reduction  Tokens (est)
html           ~14.0 MB       ~5.4 MB       ~61%       ~1,800,000
webidl         ~2.5 MB        ~400 KB       ~84%       ~133,000
streams        ~1.8 MB        ~375 KB       ~79%       ~125,000
dom            ~2.9 MB        ~340 KB       ~88%       ~113,000
fetch          ~1.9 MB        ~250 KB       ~87%       ~83,000
encoding       ~450 KB        ~105 KB       ~77%       ~35,000
url            ~710 KB        ~105 KB       ~85%       ~35,000
urlpattern     ~350 KB        ~80 KB        ~77%       ~27,000
infra          ~280 KB        ~72 KB        ~74%       ~24,000
xhr            ~340 KB        ~70 KB        ~79%       ~23,000
mimesniff      ~260 KB        ~53 KB        ~80%       ~18,000
cookiestore    ~240 KB        ~50 KB        ~79%       ~17,000
websockets     ~180 KB        ~33 KB        ~82%       ~11,000
storage        ~110 KB        ~27 KB        ~75%       ~9,000
quirks         ~90 KB         ~22 KB        ~76%       ~7,000
fullscreen     ~95 KB         ~22 KB        ~77%       ~7,000
notifications  ~90 KB         ~21 KB        ~77%       ~7,000
console        ~90 KB         ~12 KB        ~87%       ~4,000
compat         ~50 KB         ~10 KB        ~80%       ~3,000
compression    ~40 KB         ~8 KB         ~80%       ~3,000
fs             ~150 KB        ~5 KB         ~97%       ~2,000
testutils      ~10 KB         ~1 KB         ~90%       ~300

Combined Totals

Metric        Before    After    Saved
Total Size    ~27.4 MB  ~8.5 MB  ~18.9 MB (69%)
Total Tokens  ~9.1M     ~2.8M    ~6.3M tokens (69%)

Overall reduction: ~69% across all specifications (most individual specs shrink by 74-90%).
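
The reduction and token columns can be reproduced from the file sizes. The sketch below assumes roughly 3 bytes per token, a ratio inferred from the table (for example ~400 KB to ~133,000 tokens) rather than taken from specs.js; the function names are illustrative.

```javascript
// Assumption: ~3 bytes per token, inferred from the table above.
const BYTES_PER_TOKEN = 3;

// Estimated token count for a file of the given size in bytes.
function estimateTokens(bytes) {
  return Math.round(bytes / BYTES_PER_TOKEN);
}

// Percentage saved when going from `before` bytes to `after` bytes.
function reductionPercent(before, after) {
  return Math.round((1 - after / before) * 100);
}

// webidl row: ~2.5 MB of HTML down to ~400 KB of markdown.
console.log(reductionPercent(2_500_000, 400_000)); // 84
console.log(estimateTokens(400_000));              // 133333
```

Real tokenizers vary by model and content, so figures like these are best treated as rough planning numbers.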

Output Files

After running node specs.js download, you'll have these 22 files:

compat.md          ~10 KB    ~3K tokens
compression.md     ~8 KB     ~3K tokens
console.md         ~12 KB    ~4K tokens
cookiestore.md     ~50 KB    ~17K tokens
dom.md             ~340 KB   ~113K tokens
encoding.md        ~105 KB   ~35K tokens
fetch.md           ~250 KB   ~83K tokens
fs.md              ~5 KB     ~2K tokens
fullscreen.md      ~22 KB    ~7K tokens
html.md            ~5.4 MB   ~1.8M tokens
infra.md           ~72 KB    ~24K tokens
mimesniff.md       ~53 KB    ~18K tokens
notifications.md   ~21 KB    ~7K tokens
quirks.md          ~22 KB    ~7K tokens
storage.md         ~27 KB    ~9K tokens
streams.md         ~375 KB   ~125K tokens
testutils.md       ~1 KB     ~300 tokens
url.md             ~105 KB   ~35K tokens
urlpattern.md      ~80 KB    ~27K tokens
webidl.md          ~400 KB   ~133K tokens
websockets.md      ~33 KB    ~11K tokens
xhr.md             ~70 KB    ~23K tokens

Total: ~8.5 MB, ~2.8M tokens (down from ~27.4 MB, ~9.1M tokens)

What's Preserved

All specification content:

  • All technical prose and definitions
  • All algorithms and processing models
  • All normative requirements
  • All code examples and IDL interfaces
  • Complete section hierarchy
  • External reference links

100% specification quality - zero loss of technical content

What's Removed

Non-specification sections:

  • Table of contents
  • References sections
  • Acknowledgments / Acknowledgements
  • Intellectual property sections
  • Licensing / Copyright sections
  • Index sections

Metadata and formatting:

  • All {#anchor-id} patterns
  • All {.css-class} attributes
  • All {x-internal="..."} metadata
  • Section numbering [1.2.3]
  • Base64-encoded images
  • SVG diagrams
  • Decorative separator lines
  • Excessive whitespace
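
The tool delegates this cleanup to an AI model rather than to fixed patterns, but the metadata items above are regular enough to illustrate with regular expressions. The following is a rough sketch of what those patterns look like in pandoc output; stripMetadata is a hypothetical name and the expressions are assumptions, not the tool's implementation.

```javascript
// Illustrative only: specs.js uses AI-based optimization, not these regexes.
function stripMetadata(markdown) {
  return markdown
    // Pandoc attribute blocks such as {#anchor-id}, {.css-class}, {x-internal="..."}
    .replace(/[ \t]*\{(?:[#.][^{}\s][^{}]*|[\w-]+="[^"]*"[^{}]*)\}/g, "")
    // Inline base64 images: ![alt](data:image/...;base64,...)
    .replace(/!\[[^\]]*\]\(data:image\/[^)]*\)/g, "")
    // Section numbers at the start of a heading, e.g. "## [1.2.3] Title"
    .replace(/^(#+[ \t]*)\[\d+(?:\.\d+)*\][ \t]*/gm, "$1")
    // Collapse runs of three or more newlines into one blank line
    .replace(/\n{3,}/g, "\n\n");
}

console.log(stripMetadata("## [1.2] Streams {#streams}")); // ## Streams
```

The section-number pattern is anchored to heading lines on purpose, so that bracketed indices inside code examples (such as arr[0]) are left alone.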

Use Cases

For LLM Processing

  • 70% token reduction - fit more specs in context
  • Lower API costs - pay for fewer tokens
  • Faster processing - less data to parse
  • Better context utilization - pure technical content

For Development

  • Clean references - no metadata clutter
  • Easy searching - pure specification prose
  • Version control - efficient diffs
  • Fast loading - smaller file sizes

For Documentation

  • Readable markdown - clean formatting
  • Complete content - all technical details
  • Portable - standard markdown format
  • Focused - specification content only

Requirements

  • curl - for downloading specs
  • pandoc - for HTML to markdown conversion
  • node - for running the script
  • Anthropic API key - for Claude Sonnet 4.5 (1M context window)

Install Dependencies

macOS:

brew install pandoc node
npm install

Ubuntu/Debian:

sudo apt install curl pandoc nodejs npm
npm install

Example Workflow

# List all 22 available specs
node specs.js list

# Download and optimize all specs (takes 5-15 minutes)
node specs.js download

# Your working directory stays clean during the process!
# All temporary work happens in /tmp

# Check what was downloaded with sizes and token counts
node specs.js status

# Use the optimized specs
cat dom.md | head -n 50

# Clean up when done
node specs.js clean

Technical Details

Temporary File Handling

  • Creates unique temp directory: /tmp/tmp.XXXXXX
  • All HTML downloads go to /tmp (never your working directory)
  • All intermediate markdown files stay in /tmp
  • Automatic cleanup when the script exits
  • Your working directory only receives final optimized .md files
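
In Node.js, this kind of handling is commonly built from fs.mkdtempSync and a finally block. The sketch below is an assumption about how it could look, not the code in specs.js; withTempDir and the whatwg-specs- prefix are hypothetical.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Run `work(tmpDir)` inside a fresh temp directory and always delete it afterwards.
function withTempDir(work) {
  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "whatwg-specs-"));
  try {
    return work(tmpDir);
  } finally {
    // Remove the directory and everything in it, even if `work` threw.
    fs.rmSync(tmpDir, { recursive: true, force: true });
  }
}

// Usage: intermediate files live only inside tmpDir.
const seen = withTempDir((tmpDir) => {
  fs.writeFileSync(path.join(tmpDir, "dom.html"), "<!doctype html>");
  return tmpDir;
});
console.log(fs.existsSync(seen)); // false
```

This version is synchronous; with asynchronous work the callback would need to be awaited before the cleanup runs.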

Optimization Pipeline

  1. Download HTML to /tmp
  2. Convert with pandoc in /tmp
  3. Optimize with opencode SDK:
    • Uses Claude AI to intelligently analyze the specification
    • Removes boilerplate (TOC, references, acknowledgments, metadata)
    • Preserves ALL technical content (specs, algorithms, examples, definitions)
    • Smart detection of what's essential vs. removable
    • Context-aware optimization (not just pattern matching)
  4. Save optimized .md to working directory
  5. Cleanup entire /tmp directory automatically
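
Steps 1 and 2 map onto one curl call and one pandoc call per spec. A hedged sketch of how those commands might be assembled; buildCommands, runCommands, and the exact flags are assumptions rather than what specs.js runs, and the AI optimization step is deliberately left out because its API is not documented here.

```javascript
const { execFileSync } = require("node:child_process");
const path = require("node:path");

// Build the argument lists for downloading and converting one spec.
function buildCommands(name, tmpDir) {
  const htmlPath = path.join(tmpDir, `${name}.html`);
  const mdPath = path.join(tmpDir, `${name}.md`);
  return [
    // -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    ["curl", ["-fsSL", `https://${name}.spec.whatwg.org/`, "-o", htmlPath]],
    // Convert the downloaded HTML to pandoc markdown
    ["pandoc", ["-f", "html", "-t", "markdown", htmlPath, "-o", mdPath]],
  ];
}

// Execute the commands in order; throws if any command exits non-zero.
function runCommands(commands) {
  for (const [cmd, args] of commands) {
    execFileSync(cmd, args, { stdio: "inherit" });
  }
}

// Printing instead of running keeps this example free of network access.
console.log(buildCommands("dom", "/tmp/example"));
```

Passing arguments as arrays to execFileSync avoids shell quoting problems with paths and URLs.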

Context Window Capacity

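With 200K token context:

  • Any single spec except html (the next largest, webidl, is ~133K tokens)
  • OR: DOM + Fetch together (~196K), or a group of smaller specs such as URL + Encoding + Infra + XHR (~117K)

With 1M token context:

  • All 21 specs other than html (roughly 0.7M tokens by the per-spec estimates above)
  • html.md (~1.8M tokens) does not fit and has to be split into sections

With 2M+ token context:

  • html.md on its own (~1.8M tokens)
  • All 22 specs together (~2.8M total) need about 3M tokens of context, or processing in batches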
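
Whether a set of specs fits a given context window comes down to summing the per-spec token estimates from the table above. A small sketch, with TOKENS holding a few of those estimates and fitsInContext as a hypothetical helper; it ignores the tokens needed for the prompt and the response.

```javascript
// Estimated tokens per optimized spec, copied from the table above (subset).
const TOKENS = {
  html: 1_800_000, webidl: 133_000, streams: 125_000,
  dom: 113_000, fetch: 83_000, url: 35_000, infra: 24_000,
};

// Sum the estimates for the named specs.
function totalTokens(names) {
  return names.reduce((sum, name) => {
    if (!(name in TOKENS)) throw new Error(`No estimate for ${name}`);
    return sum + TOKENS[name];
  }, 0);
}

// True when the combined estimate is within the context budget.
function fitsInContext(names, budget) {
  return totalTokens(names) <= budget;
}

console.log(fitsInContext(["dom", "fetch"], 200_000)); // true  (196,000)
console.log(fitsInContext(["html"], 1_000_000));       // false (1,800,000)
```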

Notes

  • Processing all 22 specs takes 5-15 minutes (depending on connection)
  • Requires active internet connection
  • All temporary files automatically deleted from /tmp
  • Safe to re-run download anytime to update specs
  • clean command removes specs from working directory only
  • No temporary files ever appear in your working directory
  • Version: 2.0.0 (pure Node.js using opencode SDK for intelligent optimization)

Performance Tips

Download Individual Specs

The script downloads all 22 specs, but you can modify the SPECS array in specs.js to download only specific ones:

// Edit specs.js and modify the SPECS array
const SPECS = ["dom", "fetch", "url"]; // Only these 3

Check Before/After

Use status command to see exactly what you have:

node specs.js status

Shows each spec with file size and estimated token count.
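
A report of this kind can be produced by scanning for .md files and dividing each file's size by an assumed bytes-per-token ratio. The sketch below is illustrative only: specStatus is a hypothetical name, and the 3 bytes per token figure is inferred from the tables in this README, not read from specs.js.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// List .md files in `dir` with their size and a rough token estimate.
function specStatus(dir, bytesPerToken = 3) {
  return fs.readdirSync(dir)
    .filter((file) => file.endsWith(".md"))
    .sort()
    .map((file) => {
      const { size } = fs.statSync(path.join(dir, file));
      return { file, bytes: size, tokens: Math.round(size / bytesPerToken) };
    });
}

// Usage: print a table for the current directory.
console.table(specStatus(process.cwd()));
```

A plain scan like this would also pick up README.md, so a real implementation would likely filter against the known spec names.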

Troubleshooting

"pandoc: command not found"

brew install pandoc  # macOS
sudo apt install pandoc  # Ubuntu/Debian

"curl: command not found"

# curl is pre-installed on most systems
# If missing, install it via your package manager
sudo apt install curl  # Ubuntu/Debian
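
Both errors can be caught before a download starts by probing for the tools up front. A minimal sketch, assuming a hypothetical hasTool helper; it relies on spawnSync reporting an error when the executable cannot be found.

```javascript
const { spawnSync } = require("node:child_process");

// True if `command --version` can be started and exits with status 0.
function hasTool(command) {
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  return !result.error && result.status === 0;
}

// Report any missing external tools before doing real work.
for (const tool of ["curl", "pandoc"]) {
  if (!hasTool(tool)) {
    console.error(`${tool}: command not found, install it before running download`);
  }
}
```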

Failed download

  • Check internet connection
  • Verify spec.whatwg.org is accessible
  • Check firewall settings
  • Try again (network issues are often transient)
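
Because such failures are often temporary, a download step can be wrapped in a retry loop. A sketch with exponential backoff; retry and its parameters are hypothetical and not part of specs.js.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `task` up to `attempts` times, doubling the wait after each failure.
async function retry(task, attempts = 3, baseDelayMs = 1000) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError;
}

// Usage: a task that fails twice, then succeeds on the third call.
let calls = 0;
retry(async () => {
  calls += 1;
  if (calls < 3) throw new Error("temporary failure");
  return "ok";
}, 3, 10).then((value) => console.log(value, calls)); // ok 3
```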

Not enough disk space in /tmp

  • The script needs ~300MB free in /tmp
  • Clean up /tmp manually if needed
  • Downloads happen one at a time to minimize space usage

License

This tool processes publicly available WHATWG specifications.

Processed specifications retain their original WHATWG licenses:

  • Most specs: Creative Commons Attribution 4.0 International License
  • Code portions: BSD 3-Clause License

This tool itself:

  • Use freely for any purpose
  • No warranty provided
  • Provided as-is

Ready to optimize? Run npm run download to get started!

Download all 22 WHATWG specifications, optimized and ready for LLM processing with 70% fewer tokens.

Complete list from https://spec.whatwg.org/ in alphabetical order.

About

WHATWG specs converted to markdown and optimized for tokenization