This is a schematic step-by-step description of the data processing (text extraction and cleaning) pipeline used to create the HPLT v2 datasets.
Each step is accompanied by a link to the corresponding code base.
See more details in the Deliverable Report 7.2.
The output of this stage consists of WARC files.
The output of this stage consists mostly of HTML documents.
- Stage 2: Extracting raw text (html2text)
  - Trafilatura (running text extraction and boilerplate removal)
  - Document language identification with OpenLid
The output of this stage is plain text and metadata (separately) in JSONL format.
The output of this stage is plain text merged with metadata in JSONL format.
It comes in deduplicated and cleaned varieties.
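The step from separate text and metadata files to a single merged JSONL stream, as described above, could be sketched roughly as follows. This is only an illustration and not the HPLT code: the function name `merge_jsonl` and the field names `text`, `url` and `lang` are hypothetical, and the sketch assumes the two inputs are line-aligned, so that line i of each describes the same document.

```python
import json
from itertools import zip_longest
from typing import Iterable, Iterator

# Sentinel used to detect streams of unequal length.
_MISSING = object()


def merge_jsonl(text_lines: Iterable[str], meta_lines: Iterable[str]) -> Iterator[str]:
    """Yield one merged JSON line per (text, metadata) pair.

    Assumption: both inputs are line-aligned JSONL streams.
    """
    for text_line, meta_line in zip_longest(text_lines, meta_lines, fillvalue=_MISSING):
        if text_line is _MISSING or meta_line is _MISSING:
            raise ValueError("text and metadata streams have different lengths")
        # Metadata keys come first; keys from the text record win on conflict.
        merged = {**json.loads(meta_line), **json.loads(text_line)}
        yield json.dumps(merged, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical records; real field names may differ.
    texts = ['{"text": "Hello world"}', '{"text": "Hei maailma"}']
    metas = [
        '{"url": "https://example.org/a", "lang": "eng"}',
        '{"url": "https://example.org/b", "lang": "fin"}',
    ]
    for line in merge_jsonl(texts, metas):
        print(line)
```

Raising an error on a length mismatch, rather than silently truncating, is a deliberate choice here: a misaligned pair would attach metadata to the wrong document.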