hplt-project/HPLT-textpipes

HPLT TextPipes

This is a schematic, step-by-step description of the data processing pipeline (text extraction and cleaning) used to create the HPLT v2 datasets.

Each step is accompanied by a link to the corresponding code base.

See more details in the Deliverable Report 7.2.

Data ingestion

The output of this stage consists of WARC files.

Text extraction

The intermediate output of this stage consists mostly of HTML documents.

The final output of this stage is plain text and metadata, stored separately, in JSONL format.
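As an illustration of what "plain text and metadata (separately) in JSONL" can look like, here is a minimal sketch that writes each document as one line in a text stream and one line, at the same position, in a metadata stream. The field names (`text`, `url`, `lang`) and the helper `split_to_jsonl` are hypothetical and chosen for the example; they are not taken from the HPLT code base.

```python
import json


def split_to_jsonl(docs):
    """Serialize documents into two line-aligned JSONL strings.

    Line i of the first string holds the text of document i;
    line i of the second string holds its metadata.
    """
    text_lines = []
    meta_lines = []
    for doc in docs:
        # "text" is a hypothetical field name; all other keys count as metadata.
        text_lines.append(json.dumps({"text": doc["text"]}, ensure_ascii=False))
        meta = {key: value for key, value in doc.items() if key != "text"}
        meta_lines.append(json.dumps(meta, ensure_ascii=False))
    text_jsonl = "".join(line + "\n" for line in text_lines)
    meta_jsonl = "".join(line + "\n" for line in meta_lines)
    return text_jsonl, meta_jsonl


# Made-up example documents.
docs = [
    {"text": "Hello world.", "url": "https://example.org/a", "lang": "en"},
    {"text": "Hei maailma.", "url": "https://example.org/b", "lang": "fi"},
]
text_jsonl, meta_jsonl = split_to_jsonl(docs)
```

Keeping the two streams line-aligned means a later step can pair them again by position, without needing a shared document identifier.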

Deduplication, cleaning and filtering

The output of this stage is plain text merged with metadata in JSONL format. It comes in two varieties: deduplicated and cleaned.
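The merge into a single JSONL record per document can be sketched as follows. This is a simplified illustration, not the HPLT implementation: the field name `text` and the function `merge_and_dedup` are hypothetical, the two inputs are assumed to be line-aligned, and exact-match hashing stands in for whatever deduplication method the real pipeline uses.

```python
import hashlib
import json


def merge_and_dedup(text_lines, meta_lines):
    """Merge line-aligned text and metadata JSONL lines, dropping exact duplicates.

    Returns a list of JSONL strings, one merged record per unique text.
    """
    if len(text_lines) != len(meta_lines):
        raise ValueError("text and metadata must have the same number of lines")

    merged = []
    seen = set()
    for text_line, meta_line in zip(text_lines, meta_lines):
        # Metadata keys first, then the text field (hypothetical name "text").
        record = {**json.loads(meta_line), **json.loads(text_line)}
        # Stand-in for real deduplication: hash the exact text.
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        merged.append(json.dumps(record, ensure_ascii=False))
    return merged


# Made-up example input: the third document repeats the first one's text.
text_lines = [
    '{"text": "Hello world."}',
    '{"text": "Hei maailma."}',
    '{"text": "Hello world."}',
]
meta_lines = [
    '{"url": "https://example.org/a"}',
    '{"url": "https://example.org/b"}',
    '{"url": "https://example.org/c"}',
]
merged_lines = merge_and_dedup(text_lines, meta_lines)
```

Exact hashing only catches byte-identical texts; near-duplicate detection would need a different technique.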
