Create .tsv files that can be viewed and edited with neat.
Python 3.11 is required. Consider using pyenv if that Python version is not available on your system.
Activate virtual environment (virtualenv):
source venv/bin/activate
or (pyenv):
pyenv activate my-python-3.11-virtualenv
Update pip:
pip install -U pip
Install tsvtools:
pip install git+https://github.com/qurator-spk/page2tsv.git
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
To create a single TSV file from multiple PAGE-XML files, call the tool successively with the same TSV file:
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
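The successive calls above can also be scripted. The following is a minimal sketch and not part of tsvtools: the helper names `build_commands` and `convert_all` and the pairing of each XML file with an image URL are assumptions made for illustration. Only the `page2tsv` argument order and the `--image-url` option are taken from the calls shown above.

```python
import subprocess
from collections.abc import Iterable


def build_commands(pages: Iterable[tuple[str, str]], tsv_out: str) -> list[list[str]]:
    """Build one page2tsv argument list per (PAGE-XML file, image URL) pair.

    Every command targets the same TSV file, mirroring the successive
    calls shown above. Hypothetical helper, not part of tsvtools.
    """
    return [
        ["page2tsv", xml_file, tsv_out, f"--image-url={image_url}"]
        for xml_file, image_url in pages
    ]


def convert_all(pages: Iterable[tuple[str, str]], tsv_out: str) -> None:
    """Run the commands in order and stop at the first failing call."""
    for command in build_commands(pages, tsv_out):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder file names and URLs, matching the example calls above.
    pages = [
        (f"PAGE{i}.xml", f"http://link-to-corresponding-image-{i}")
        for i in range(1, 6)
    ]
    # Print the commands instead of running them; call convert_all to execute.
    for command in build_commands(pages, "PAGE.tsv"):
        print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed conversion raise an exception rather than silently leaving the TSV file incomplete.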
For instance, for the file example.xml:
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
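The `left,top,width,height` segment in the URL above appears to stand in for the pixel region of a token, and the URL as a whole follows the path layout of the IIIF Image API (region/size/rotation/quality.format). As an illustration only, the sketch below fills that placeholder with concrete coordinates. How neat performs the substitution internally is not described here, so the function `region_url` and the sample bounding box are assumptions.

```python
PLACEHOLDER = "left,top,width,height"


def region_url(template: str, left: int, top: int, width: int, height: int) -> str:
    """Return the template with the region placeholder replaced by pixel values."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template lacks the {PLACEHOLDER!r} placeholder")
    return template.replace(PLACEHOLDER, f"{left},{top},{width},{height}")


template = (
    "http://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg"
)
# Hypothetical token bounding box, in pixels.
print(region_url(template, left=100, top=200, width=300, height=50))
```

A URL without the placeholder is rejected early, which is a design choice of this sketch: it surfaces a mistyped `--image-url` value before any file is converted.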
Create a URL-annotated TSV file from an existing TSV file:
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
page2tsv --help
Usage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE
Converts a page-XML file into a TSV file that can be edited with neat.
Optionally the tool also accepts NER and Entity Linking API endpoints as
parameters and performs NER and EL on the document if these are provided.
PAGE_XML_FILE: The source page-XML file.
TSV_OUT_FILE: Resulting TSV file.
Options:
--purpose [NERD|OCR] Purpose of output tsv file.
NERD: NER/NED application/ground-truth creation.
OCR: OCR application/ground-truth creation.
default: NERD.
--image-url TEXT An image retrieval link that enables neat to show
the scan images corresponding to the text tokens.
Example: https://content.staatsbibliothek-berlin.de/zefys/SNP26824620-18371109-0-1-0-0/left,top,width,height/full/0/default.jpg
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
https://github.com/qurator-spk/sbb_ner for
details. Only applicable in case of NERD.
--ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See
https://github.com/qurator-spk/sbb_ned for
details. Only applicable in case of NERD.
--noproxy disable proxy. default: enabled.
--scale-factor FLOAT default: 1.0
--ned-threshold FLOAT
--min-confidence FLOAT
--max-confidence FLOAT
--ned-priority INTEGER
--normalization-file PATH
--help Show this message and exit.
tsv2tsv --help
Usage: tsv2tsv [OPTIONS] TSV_IN_FILE
Options:
--tsv-out-file PATH Write modified TSV to this file.
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
https://github.com/qurator-spk/sbb_ner for
details.
--noproxy disable proxy. default: enabled.
--num-tokens Print number of tokens in input/output file.
--sentence-count Print sentence count in input/output file.
--max-sentence-len Print maximum sentence len for input/output
file.
--keep-tokenization Keep the word tokenization exactly as it is.
--sentence-split-only Do only sentence splitting.
--show-urls Print contained visualization URLs.
--just-zero Process only files that have max sentence
length zero, i.e., that do not have sentence
splitting.
--sanitize-sentence-numbers Sanitize sentence numbering.
--show-columns Show TSV columns.
--drop-column TEXT Drop column
--help Show this message and exit.
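No usage example for tsv2tsv is given above, so here is a hedged sketch of how a call could be assembled from the options listed in the help text. The helper `tsv2tsv_command` and the file names are hypothetical. The switch names are copied from the help output, and what each switch does in practice should be checked against the tool itself.

```python
import shlex

# Boolean switches copied from the `tsv2tsv --help` output above.
SWITCHES = {
    "--noproxy", "--num-tokens", "--sentence-count", "--max-sentence-len",
    "--keep-tokenization", "--sentence-split-only", "--show-urls",
    "--just-zero", "--sanitize-sentence-numbers", "--show-columns",
}


def tsv2tsv_command(tsv_in, tsv_out=None, switches=()):
    """Assemble a tsv2tsv argument list, rejecting switches not in the help text."""
    unknown = [s for s in switches if s not in SWITCHES]
    if unknown:
        raise ValueError(f"unknown switch(es): {unknown}")
    command = ["tsv2tsv", tsv_in]
    if tsv_out is not None:
        command.append(f"--tsv-out-file={tsv_out}")
    command.extend(switches)
    return command


# Hypothetical file names; prints the shell command instead of running it.
print(shlex.join(tsv2tsv_command("PAGE.tsv", "PAGE-split.tsv", ["--sentence-split-only"])))
```

The returned list can be passed to `subprocess.run` to execute the call. Checking switches against a fixed set catches typos before the tool is started.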
alto2tsv --help
Usage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE
Converts an ALTO-XML file into a TSV file that can be edited with neat.
Optionally the tool also accepts NER and Entity Linking API endpoints as
parameters and performs NER and EL on the document if these are provided.
ALTO_XML_FILE: The source ALTO-XML file.
TSV_OUT_FILE: Resulting TSV file.
Options:
--purpose [NERD|OCR] Purpose of output tsv file.
NERD: NER/NED application/ground-truth creation.
OCR: OCR application/ground-truth creation.
default: NERD.
--image-url TEXT An image retrieval link that enables neat to show
the scan images corresponding to the text tokens.
Example: https://content.staatsbibliothek-berlin.de/zefys/SNP26824620-18371109-0-1-0-0/left,top,width,height/full/0/default.jpg
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
https://github.com/qurator-spk/sbb_ner for
details. Only applicable in case of NERD.
--ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See
https://github.com/qurator-spk/sbb_ned for
details. Only applicable in case of NERD.
--noproxy disable proxy. default: enabled.
--scale-factor FLOAT default: 1.0
--ned-threshold FLOAT
--ned-priority INTEGER
--help Show this message and exit.