World's most comprehensive Rust compilation pipeline analysis toolkit - Extract semantic analysis, project structure, and LLVM IR generation data from Rust codebases for machine learning and compiler research.
This toolkit creates unprecedented datasets by analyzing Rust compilation at every level:
Rust Source → rustc → LLVM IR → Optimizations → Machine Code

Each stage of the pipeline is captured as its own HuggingFace dataset:
- Rust Source → Semantic Analysis dataset
- rustc → Project Analysis dataset
- LLVM IR → IR Generation dataset
- Optimizations → Optimization Passes dataset
- Machine Code → Assembly dataset
- 1.4+ Million Records: Largest Rust analysis dataset ever created
- Self-Referential Analysis: Tools analyzing their own codebases
- Complete Pipeline: Source code → LLVM IR generation
- Production Ready: Used to analyze rust-analyzer, rustc, and llvm-sys.rs
- HuggingFace Compatible: Ready for ML training and research
To build the toolkit you need:

- Rust 1.70+ with Cargo
- Git LFS (for large dataset files)

git clone https://github.com/solfunmeme/hf-dataset-validator-rust.git
cd hf-dataset-validator-rust
cargo build --release
# Test on a simple Rust file
echo 'fn main() { println!("Hello, world!"); }' > test.rs
cargo run --bin hf-validator -- analyze-rust-to-ir test.rs output-dataset
Analyze a Rust project with all three extractors:
# Complete analysis: semantic + project + LLVM IR
cargo run --bin hf-validator -- analyze-rust-to-ir /path/to/rust/project output-dataset
# This creates:
# output-dataset/semantic/ - Rust semantic analysis
# output-dataset/cargo/ - Project structure analysis
# output-dataset/llvm-ir/ - LLVM IR generation analysis
# Extract parsing, name resolution, and type inference data
cargo run --bin hf-validator -- generate-hf-dataset /path/to/rust/project semantic-output
# Extract project structure and dependency information
cargo run --bin hf-validator -- analyze-cargo-project /path/to/rust/project cargo-output
# Extract LLVM IR generation across optimization levels
cargo run --bin hf-validator -- analyze-llvm-ir /path/to/rust/project llvm-output
| Command | Description | Output |
|---|---|---|
| `analyze-rust-to-ir <source> [output]` | Complete pipeline analysis | Semantic + Project + LLVM IR |
| `generate-hf-dataset <source> [output]` | Rust semantic analysis | Parsing, name resolution, type inference |
| `analyze-cargo-project <source> [output]` | Project structure analysis | Cargo metadata and dependencies |
| `analyze-llvm-ir <source> [output] [opt_levels]` | LLVM IR generation analysis | IR across O0, O1, O2, O3 |
| Command | Description |
|---|---|
| `validate-hf-dataset [dataset_dir]` | Validate semantic analysis dataset |
| `validate-cargo-dataset [dataset_dir]` | Validate cargo analysis dataset |
| `validate-llvm-dataset [dataset_dir]` | Validate LLVM IR analysis dataset |
| Command | Description |
|---|---|
| `validate-solfunmeme <base_path>` | Validate solfunmeme dataset structure |
| `convert-to-parquet <input> <output>` | Convert datasets to Parquet format |
The semantic analysis extractor uses rust-analyzer to capture deep semantic information:
- Parsing Phase: Syntax trees, tokenization, parse errors
- Name Resolution Phase: Symbol binding, scope analysis, imports
- Type Inference Phase: Type checking, inference decisions, errors
Schema: 20+ fields including source snippets, AST data, symbol information
The Cargo project extractor analyzes project structure and metadata:
- Project Metadata: Cargo.toml analysis, workspace support
- Dependencies: Dependency graphs and version constraints
- Build Configuration: Features, targets, build scripts
Schema: 44+ fields including project info, dependency data, build metadata
The LLVM IR extractor captures the Rust → LLVM IR compilation step:
- IR Generation: How Rust constructs become LLVM IR
- Optimization Passes: LLVM optimization analysis (planned)
- Code Generation: Target-specific code generation (planned)
- Performance Analysis: Optimization impact measurement (planned)
Schema: 50+ fields including source code, LLVM IR, optimization data
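The exact field sets evolve with the toolkit, so the most reliable way to see what a run produced is to print the Parquet schemas of the three extractor outputs directly. A minimal sketch using pyarrow, assuming the `analyze-rust-to-ir` output layout shown later in this README; adjust the paths to your own output directory:

```python
import glob
import pyarrow.parquet as pq

# Print field names and types for each analysis phase.
# Paths assume the `analyze-rust-to-ir` output layout; adjust as needed.
for pattern in [
    "output-dataset/semantic/parsing-phase/data-*.parquet",
    "output-dataset/cargo/project_metadata-phase/*.parquet",
    "output-dataset/llvm-ir/ir_generation-O0-phase/*.parquet",
]:
    for path in glob.glob(pattern):
        print(f"== {path} ==")
        for field in pq.read_schema(path):
            print(f"  {field.name}: {field.type}")
```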
Example analyses of real-world codebases:

git clone https://github.com/rust-lang/rust-analyzer.git
cargo run --bin hf-validator -- analyze-rust-to-ir rust-analyzer rust-analyzer-dataset
git clone https://github.com/rust-lang/rust.git
cargo run --bin hf-validator -- generate-hf-dataset rust/compiler rustc-dataset
git clone https://gitlab.com/taricorp/llvm-sys.rs.git
cargo run --bin hf-validator -- analyze-rust-to-ir llvm-sys.rs llvm-sys-dataset
- Code Understanding Models: Train on semantic analysis data (see the loading sketch after these lists)
- Performance Prediction: Learn from optimization patterns
- Code Generation: Understand compilation patterns
- Bug Detection: Identify problematic code patterns
- Optimization Studies: Analyze real-world optimization impact
- Type System Research: Understand type compilation patterns
- Performance Engineering: Correlate source patterns with performance
- Tool Development: Build better development tools
- Compiler Education: Show real compilation processes
- Rust Learning: Understand professional code patterns
- Research Methods: Example of comprehensive analysis
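As a concrete starting point for the machine-learning use cases above, the generated Parquet files can be wrapped in a Hugging Face `datasets.Dataset`. A minimal sketch, assuming a local `output-dataset/` produced by `analyze-rust-to-ir` and the `datasets` library installed; the available columns depend on the schema of your run:

```python
from datasets import Dataset

# Wrap a locally generated Parquet file in a Hugging Face Dataset.
# The path assumes an `analyze-rust-to-ir` run; adjust to your output directory.
ds = Dataset.from_parquet("output-dataset/semantic/parsing-phase/data.parquet")
print(ds)  # row count and column names

# A simple split before feeding records to a model-training pipeline.
splits = ds.train_test_split(test_size=0.1, seed=42)
print(splits["train"][0])
```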
All tools generate Apache Parquet files optimized for ML workflows:
output-dataset/
├── semantic/
│   ├── parsing-phase/data-*.parquet
│   ├── name_resolution-phase/data.parquet
│   └── type_inference-phase/data.parquet
├── cargo/
│   └── project_metadata-phase/data.parquet
├── llvm-ir/
│   ├── ir_generation-O0-phase/data.parquet
│   ├── ir_generation-O1-phase/data.parquet
│   ├── ir_generation-O2-phase/data.parquet
│   └── ir_generation-O3-phase/data.parquet
└── README.md (comprehensive documentation)
import pandas as pd
# Load semantic analysis data
parsing_df = pd.read_parquet('output-dataset/semantic/parsing-phase/data.parquet')
print(f"Loaded {len(parsing_df)} parsing records")
# Load LLVM IR data
ir_df = pd.read_parquet('output-dataset/llvm-ir/ir_generation-O2-phase/data.parquet')
print(f"Loaded {len(ir_df)} LLVM IR records")
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Load data in Rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open("output-dataset/semantic/parsing-phase/data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let reader = builder.build()?;
    for batch_result in reader {
        let batch: RecordBatch = batch_result?;
        println!("Loaded batch with {} records", batch.num_rows());
    }
    Ok(())
}
# Analyze specific optimization levels
cargo run --bin hf-validator -- analyze-llvm-ir project.rs output O0,O2,O3
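To compare what the different optimization levels produced, the per-level Parquet files can be loaded side by side. A minimal sketch with pandas, assuming the default output layout shown above; the columns available depend on the schema of your run:

```python
import pandas as pd

# Compare record counts and columns across optimization levels.
# Paths assume the default `analyze-llvm-ir` output layout; adjust as needed.
for level in ["O0", "O1", "O2", "O3"]:
    path = f"output-dataset/llvm-ir/ir_generation-{level}-phase/data.parquet"
    df = pd.read_parquet(path)
    print(f"{level}: {len(df)} records, {len(df.columns)} columns")
```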
# For projects with 1000+ files, use semantic analysis only first
cargo run --bin hf-validator -- generate-hf-dataset large-project semantic-only
# Then add project analysis
cargo run --bin hf-validator -- analyze-cargo-project large-project cargo-analysis
# Validate generated datasets
cargo run --bin hf-validator -- validate-hf-dataset output-dataset/semantic
cargo run --bin hf-validator -- validate-cargo-dataset output-dataset/cargo
cargo run --bin hf-validator -- validate-llvm-dataset output-dataset/llvm-ir
The toolkit is built from four core components:

- `rust_analyzer_extractor`: Semantic analysis using rust-analyzer
- `cargo2hf_extractor`: Project structure analysis with workspace support
- `llvm_ir_extractor`: LLVM IR generation and optimization analysis
- `validator`: Dataset validation and quality assurance
Each analysis run goes through five stages:

- Source Analysis: Parse and analyze Rust source files
- Data Extraction: Extract relevant information for each phase
- Schema Validation: Ensure data consistency and quality
- Parquet Generation: Create ML-optimized output files
- Documentation: Generate comprehensive README files
We welcome contributions! Areas for improvement:
- New Analysis Phases: Add more compilation stages
- Performance Optimization: Handle larger codebases
- Schema Enhancement: Add more semantic information
- Documentation: Improve usage examples and tutorials
If you use this toolkit in research, please cite:
@software{rust_compilation_analyzer,
title={Comprehensive Rust Compilation Analysis Toolkit},
author={HF Dataset Validator Team},
year={2025},
url={https://github.com/solfunmeme/hf-dataset-validator-rust},
note={World's first comprehensive Rust compilation pipeline analysis}
}
Highlights of the datasets produced so far:

- World's Largest Rust Dataset: 1.4+ million semantic analysis records
- Self-Referential Analysis: rust-analyzer analyzing itself (533K records)
- Compiler Analysis: Complete rustc analysis (835K records)
- LLVM Bridge: llvm-sys.rs pipeline analysis (9K records)
- HuggingFace Ready: Available at huggingface.co/datasets/introspector/rust (see the loading sketch below)
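A minimal sketch for loading the published dataset directly from the Hugging Face Hub with the `datasets` library; the available configurations and splits depend on how the Parquet files are arranged in the repository, so treat this as a starting point:

```python
from datasets import load_dataset

# Stream the published dataset from the Hub instead of downloading everything.
# Configuration/split names depend on the repository layout; adjust as needed.
ds = load_dataset("introspector/rust", split="train", streaming=True)
for record in ds.take(3):
    print(record)
```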
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Dataset: HuggingFace Hub
AGPL-3.0 - See LICENSE for details.
Ready to revolutionize Rust analysis and ML-powered development tools!