Skip to content

qurator-spk/page2tsv

Repository files navigation

TSV - Processing Tools

Create .tsv files that can be viewed and edited with neat.

Installation:

Required python version is 3.11. Consider use of pyenv if that python version is not available on your system.

Activate virtual environment (virtualenv):

source venv/bin/activate

or (pyenv):

pyenv activate my-python-3.11-virtualenv

Update pip:

pip install -U pip

Install tsvtools:

pip install git+https://github.com/qurator-spk/page2tsv.git

PAGE-XML to TSV Transformation:

Create a TSV file from OCR in PAGE-XML format (with word segmentation):

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1

In order to create a TSV file for multiple PAGE XML files just perform successive calls of the tool using the same TSV file:

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...

For instance, for the file example.xml:

page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg

Processing of already existing TSV files:

Create a URL-annotated TSV file from an existing TSV file:

annotate-tsv enp_DE.tsv enp_DE-annotated.tsv

Command-line interface:

page2tsv --help
Usage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE

  Converts a page-XML file into a TSV file that can be edited with neat.
  Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as
  parameters and performs NER and EL and the document if these are provided.

  PAGE_XML_FILE: The source page-XML file. TSV_OUT_FILE: Resulting TSV file.

Options:
  --purpose [NERD|OCR]       Purpose of output tsv file.
                             
                             NERD: NER/NED application/ground-truth creation.
                             
                             OCR: OCR application/ground-truth creation.
                             
                             default: NERD.
  --image-url TEXT           An image retrieval link that enables neat to show
                             the scan images corresponding to the text tokens.
                             Example: https://content.staatsbibliothek-berlin.
                             de/zefys/SNP26824620-18371109-0-1-0-0/left,top,wi
                             dth,height/full/0/default.jpg
  --ner-rest-endpoint TEXT   REST endpoint of sbb_ner service. See
                             https://github.com/qurator-spk/sbb_ner for
                             details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT   REST endpoint of sbb_ned service. See
                             https://github.com/qurator-spk/sbb_ned for
                             details. Only applicable in case of NERD.
  --noproxy                  disable proxy. default: enabled.
  --scale-factor FLOAT       default: 1.0
  --ned-threshold FLOAT
  --min-confidence FLOAT
  --max-confidence FLOAT
  --ned-priority INTEGER
  --normalization-file PATH
  --help                     Show this message and exit.
tsv2tsv --help
Usage: tsv2tsv [OPTIONS] TSV_IN_FILE

Options:
  --tsv-out-file PATH          Write modified TSV to this file.
  --ner-rest-endpoint TEXT     REST endpoint of sbb_ner service. See
                               https://github.com/qurator-spk/sbb_ner for
                               details.
  --noproxy                    disable proxy. default: enabled.
  --num-tokens                 Print number of tokens in input/output file.
  --sentence-count             Print sentence count in input/output file.
  --max-sentence-len           Print maximum sentence len for input/output
                               file.
  --keep-tokenization          Keep the word tokenization exactly as it is.
  --sentence-split-only        Do only sentence splitting.
  --show-urls                  Print contained visualization URLs.
  --just-zero                  Process only files that have max sentence
                               length zero,i.e., that do not have sentence
                               splitting.
  --sanitize-sentence-numbers  Sanitize sentence numbering.
  --show-columns               Show TSV columns.
  --drop-column TEXT           Drop column
  --help                       Show this message and exit.
alto2tsv --help
Usage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE

  Converts a ALTO-XML file into a TSV file that can be edited with neat.
  Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as
  parameters and performs NER and EL and the document if these are provided.

  ALTO_XML_FILE: The source ALTO-XML file. 
  TSV_OUT_FILE: Resulting TSV file.

Options:
  --purpose [NERD|OCR]      Purpose of output tsv file.
                            
                            NERD: NER/NED application/ground-truth creation.
                            
                            OCR: OCR application/ground-truth creation.
                            
                            default: NERD.
  --image-url TEXT          An image retrieval link that enables neat to show
                            the scan images corresponding to the text tokens.
                            Example: https://content.staatsbibliothek-berlin.d
                            e/zefys/SNP26824620-18371109-0-1-0-0/left,top,widt
                            h,height/full/0/default.jpg
  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See
                            https://github.com/qurator-spk/sbb_ner for
                            details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See
                            https://github.com/qurator-spk/sbb_ned for
                            details. Only applicable in case of NERD.
  --noproxy                 disable proxy. default: enabled.
  --scale-factor FLOAT      default: 1.0
  --ned-threshold FLOAT
  --ned-priority INTEGER
  --help                    Show this message and exit.