A high-performance data processing pipeline for large-scale text datasets.
DataMap is a Rust-based toolkit designed for efficient processing, filtering, and resharding of large text datasets, primarily in JSONL format. It provides a flexible pipeline architecture for text data transformations with various filters and modifiers.
Key features:
- Multi-threaded processing with Rayon
- Configurable processing pipeline via JSON/YAML configuration
- Comprehensive set of text filters and modifiers
- Data resharding capabilities
- Utilities for S3/GCP/WEKA integration
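To make the processing model above concrete, here is a rough sketch of the flow the toolkit implements: read JSONL shards, parse and transform each line, and write the surviving documents back out, with Rayon spreading shards across threads. The paths and the inline filter are placeholders for illustration, not the crate's actual code:

```rust
use std::fs;
use std::path::PathBuf;

use rayon::prelude::*;
use serde_json::Value;

fn main() -> anyhow::Result<()> {
    // Collect the input shards (plain .jsonl for simplicity).
    let shards: Vec<PathBuf> = fs::read_dir("data/input")?
        .filter_map(|e| e.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "jsonl"))
        .collect();

    // One shard per Rayon task; each line is parsed, filtered, and rewritten.
    // Assumes data/output already exists.
    shards.par_iter().try_for_each(|path| -> anyhow::Result<()> {
        let kept: Vec<String> = fs::read_to_string(path)?
            .lines()
            .filter_map(|line| serde_json::from_str::<Value>(line).ok())
            // Stand-in for a configured filter: drop docs with no text.
            .filter(|doc| doc["text"].as_str().map_or(false, |t| !t.is_empty()))
            .map(|doc| doc.to_string())
            .collect();

        let out = PathBuf::from("data/output").join(path.file_name().unwrap());
        fs::write(out, kept.join("\n"))?;
        Ok(())
    })?;

    Ok(())
}
```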
The core functionality is implemented in Rust for high performance:
- Main Module (`src/main.rs`):
  - Command-line interface with subcommands
  - Pipeline execution logic
  - I/O and file operations
- Data Processors (`src/map_fxn.rs`):
  - Pipeline processor architecture
  - Text filters (length, language, URL, etc.)
  - Content modifiers (newline removal, ID generation, etc.)
  - Analytics processors (FastText annotation, etc.)
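As an illustration of what "pipeline processor architecture" means in practice, the sketch below shows a trait-style interface in which each processor either returns a (possibly modified) document or drops it, and a pipeline is an ordered list of such processors. The trait and struct names are hypothetical; the real interface in `src/map_fxn.rs` may differ:

```rust
use serde_json::Value;

/// Hypothetical processor interface: return Some(doc) to keep the
/// document (possibly modified), or None to drop it from the stream.
trait PipelineProcessor: Send + Sync {
    fn process(&self, doc: Value) -> Option<Value>;
}

/// A filter written against that interface: drop documents whose text
/// length falls outside [lower_bound, upper_bound].
struct TextLenFilter {
    text_field: String,
    lower_bound: usize,
    upper_bound: usize,
}

impl PipelineProcessor for TextLenFilter {
    fn process(&self, doc: Value) -> Option<Value> {
        let len = doc.get(&self.text_field)?.as_str()?.len();
        if (self.lower_bound..=self.upper_bound).contains(&len) {
            Some(doc)
        } else {
            None
        }
    }
}

/// Running a pipeline folds a document through the processors in order,
/// short-circuiting as soon as one of them drops it.
fn run_pipeline(pipeline: &[Box<dyn PipelineProcessor>], doc: Value) -> Option<Value> {
    pipeline.iter().try_fold(doc, |d, p| p.process(d))
}
```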
Python utilities for cloud storage operations:
- S3/GCP/WEKA integration via s5cmd
- Parallel file download/upload capabilities
- Progress tracking
Process data through a filtering/modification pipeline:
```bash
datamap map --input_dir ./data/input --output_dir ./data/output --config pipeline_config.yaml [--err_dir ./data/errors] [--threads 16]
```

Reshard files into specific size or line count chunks:

```bash
datamap reshard --input_dir ./data/input --output_dir ./data/output --max_lines 10000 --max_size 100000000 [--subsample 0.1] [--threads 16]
```

Upload/download files from cloud storage:

```bash
python utils/s5cmd_wrapper.py download --src s3://bucket/path --dst ./local/path [--part 0 --num-parts 4]
python utils/s5cmd_wrapper.py upload --src ./local/path --dst s3://bucket/path
```

Pipelines are defined using YAML or JSON configuration files. Example config:
```yaml
text_field: "text"
pipeline:
  - name: "text_len_filter"
    kwargs:
      lower_bound: 100
      upper_bound: 100000
  - name: "subsample"
    kwargs:
      subsample_rate: 0.8
  - name: "stop_word_filter"
    kwargs:
      min_stop_word: 3
  - name: "word_count_adder"
    kwargs:
      word_count_field: "word_count"
```

The toolkit includes many processors for various text transformation and filtering needs:
Filters:
- `text_len_filter`: Filter by text length
- `page_len_filter`: Filter by length of words, sentences, etc.
- `word_len_filter`: Filter by average word length
- `subsample`: Randomly subsample documents
- `url_substring_filter`: Filter URLs by domain, subdomain, etc.
- `float_filter`: Filter by float field values
- `symbol_ratio_filter`: Filter by symbol density
- `bullet_filter`: Filter by bullet point density
- `ellipsis_line_ratio_filter`: Filter by ellipsis usage
- `alphabetic_word_ratio_filter`: Filter by non-alphabetic word ratio
- `stop_word_filter`: Filter by presence of stop words
- `massive_web_repetition_filter`: Filter by content repetition patterns
- `word_removal_ratio_filter`: Filter by word removal ratio
- `madlad400_sentence_filter`: Multi-criteria sentence filter from Madlad400
Modifiers:
- `add_id`: Add UUID to documents
- `newline_removal_modifier`: Control consecutive newlines
- `ratio_line_modifier`: Filter lines by uppercase or digit ratio
- `regex_line_modifier`: Filter lines using regex
- `line_len_modifier`: Filter lines by word count
- `substring_line_modifier`: Filter or modify lines with banned substrings
- `word_count_adder`: Add word count field
Annotators:
- `fasttext_annotator`: Add language classification with FastText
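The `kwargs` blocks in the pipeline config correspond to processor parameters. The sketch below (assuming serde's derive feature) shows one plausible way a `word_count_adder` entry could be deserialized into a config struct and applied to a document; the field names match the example config above, but the constructor pattern itself is an assumption rather than the crate's actual code:

```rust
use serde::Deserialize;
use serde_json::{json, Value};

/// Parameters for word_count_adder, deserialized straight from the
/// `kwargs` mapping in the pipeline config.
#[derive(Deserialize)]
struct WordCountAdderConfig {
    #[serde(default = "default_text_field")]
    text_field: String,
    word_count_field: String,
}

fn default_text_field() -> String {
    "text".to_string()
}

/// Add a word-count field to the document and otherwise pass it through
/// unchanged (a "modifier" rather than a "filter").
fn word_count_adder(cfg: &WordCountAdderConfig, mut doc: Value) -> Value {
    let count = doc
        .get(&cfg.text_field)
        .and_then(Value::as_str)
        .map(|t| t.split_whitespace().count())
        .unwrap_or(0);
    doc[cfg.word_count_field.as_str()] = Value::from(count);
    doc
}

fn main() {
    // kwargs as they would appear under the config's `kwargs:` key.
    let kwargs = json!({ "word_count_field": "word_count" });
    let cfg: WordCountAdderConfig = serde_json::from_value(kwargs).unwrap();

    let doc = json!({ "text": "a small example document" });
    println!("{}", word_count_adder(&cfg, doc)); // ... "word_count": 4 ...
}
```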
Rust dependencies:
- rayon (parallel processing)
- clap (command-line parsing)
- serde_json/serde_yaml (config parsing)
- anyhow (error handling)
- dashmap (concurrent hashmap)
- zstd (compression)
Python dependencies:
- boto3
- click
- tqdm
Installation:
- Install Rust: https://www.rust-lang.org/tools/install
- Clone the repository
- Build the project:

  ```bash
  cargo build --release
  ```

- Install Python dependencies:

  ```bash
  pip install boto3 click tqdm
  ```

- Install s5cmd if using the cloud storage utilities:

  ```bash
  # Instructions vary by platform
  ```
[Insert your license information here]