HPLT TextPipes

This is a schematic, step-by-step description of the data processing pipeline (text extraction and cleaning) used to create the HPLT v2 datasets.

Each step is accompanied by a link to the corresponding code base.

See more details in the Deliverable Report 7.2.

Data ingestion

The output of this stage consists of WARC files.

Text extraction

Stage 1: Extracting HTML and metadata from WARC files (warc2html)

The output of this stage consists mostly of HTML documents.

Stage 2: Extracting raw text (html2text)
- Trafilatura (running text extraction and boilerplate removal)
- Document language identification with OpenLID

The output of this stage is plain text and metadata (separately) in JSONL format.
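
To illustrate what "separately" could mean for a consumer, here is a minimal sketch that reads a text stream and a metadata stream in parallel. It assumes the two JSONL files are line-aligned (record N in one file describes record N in the other), which this document does not state. The function name `iter_pairs` and the field names `text`, `url` and `lang` are hypothetical.

```python
import json
from itertools import zip_longest

_MISSING = object()  # sentinel marking the shorter of the two streams


def iter_pairs(text_lines, meta_lines):
    """Yield (text_record, metadata_record) pairs from two JSONL streams.

    Assumption: the streams are line-aligned. A length mismatch raises
    ValueError rather than silently dropping records.
    """
    pairs = zip_longest(text_lines, meta_lines, fillvalue=_MISSING)
    for line_no, (text_line, meta_line) in enumerate(pairs, start=1):
        if text_line is _MISSING or meta_line is _MISSING:
            raise ValueError(f"streams differ in length at line {line_no}")
        yield json.loads(text_line), json.loads(meta_line)


# In-memory stand-ins for the two files; the field names are hypothetical.
text_stream = ['{"text": "First document."}', '{"text": "Second document."}']
meta_stream = ['{"url": "https://example.org/a", "lang": "eng"}',
               '{"url": "https://example.org/b", "lang": "eng"}']

for text_rec, meta_rec in iter_pairs(text_stream, meta_stream):
    print(meta_rec["url"], len(text_rec["text"]))
```

With real files, the two lists would be replaced by open file handles, since iterating over a file yields one line at a time.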

Deduplication, cleaning and filtering

Monotextor

The output of this stage is plain text merged with metadata in JSONL format. It comes in the deduplicated and cleaned varieties.
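
The following sketch shows one simple way to produce merged, deduplicated records. It is not the method used by Monotextor, whose deduplication strategy is not described here: it only drops exact duplicates after whitespace and case normalization, a deliberately simpler stand-in technique. The function names `merge` and `dedup_exact` and the field names `text` and `url` are hypothetical.

```python
import hashlib


def merge(text_rec, meta_rec):
    """Combine a text record and its metadata record into one dict."""
    return {**meta_rec, **text_rec}


def dedup_exact(records, text_key="text"):
    """Yield records whose normalized text has not been seen before.

    Normalization collapses whitespace and lowercases the text, so only
    (near-)verbatim copies are removed. The first occurrence is kept.
    """
    seen = set()
    for rec in records:
        normalized = " ".join(rec[text_key].split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield rec


# Hypothetical records: the second is a copy of the first with different
# spacing and casing, so it is dropped.
merged = [
    merge({"text": "Hello  world"}, {"url": "https://example.org/a"}),
    merge({"text": "hello world"}, {"url": "https://example.org/b"}),
    merge({"text": "Something else"}, {"url": "https://example.org/c"}),
]

for rec in dedup_exact(merged):
    print(rec["url"])
```

Hashing the normalized text keeps memory use proportional to the number of distinct documents rather than to their total size.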
