Metadata Processing

Zach Guo edited this page Apr 8, 2014 · 29 revisions

You don't need to run the commands below; they only worked for an old task and dataset. This documentation is kept for historical reasons.

  • A script, metadata_processing/parseMETS.sh, is provided to explore METS XML metadata. You can run it against the METS.xml files to extract parsed data, including the document id, publication date, and text language.

    • document id example: <PREMIS:objectIdentifierValue>loc.ark:/13960/t0000758s</PREMIS:objectIdentifierValue>
    • publication date and language occupy fixed positions in the 008 control field (character positions 7-14 and 35-37, respectively): <controlfield tag="008">881111s1898 nyu 000 0 eng </controlfield>
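The fixed-position slicing described above can be sketched in Python. This is a minimal illustration, not part of parseMETS.sh (which is a shell script); the function name and dictionary keys are my own.

```python
def parse_008(field):
    """Extract publication dates (positions 7-14) and language
    (positions 35-37) from a fixed-length 008 control field string.

    Positions 7-10 hold the first date, 11-14 a second date (often
    blank), and 35-37 the language code, per the MARC 21 008 layout.
    """
    return {
        "date1": field[7:11].strip(),
        "date2": field[11:15].strip(),
        "language": field[35:38].strip(),
    }
```

For the example record above, parse_008 would return "1898" as the publication date and "eng" as the language.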
  • getLangFreq.py and getYearFreq.py:

    • These two files were only used to get basic descriptive statistics from the metadata, which were used in our first presentation.
    • If you want to try them out, copy them to the same directory as the XML or Hathifile metadata, change the file paths (METADATA_PATH_XML and METADATA_PATH_HATHIFILE), then run python getLangFreq.py X or python getYearFreq.py X. The X argument tells the program to run on sandbox XML metadata; use H instead of X to run on production-stack Hathifile metadata.
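The X/H dispatch described above presumably maps the command-line flag to one of the two configured paths. A minimal sketch of that logic, assuming the scripts read sys.argv and that the placeholder paths below are replaced with your local copies:

```python
import sys

# Hypothetical placeholder paths -- edit these, as the docs instruct.
METADATA_PATH_XML = "sandbox_metadata.xml"
METADATA_PATH_HATHIFILE = "hathifile.txt"

def pick_metadata_path(flag):
    """Map the command-line flag to a metadata source:
    'X' -> sandbox XML metadata, 'H' -> production-stack Hathifile."""
    if flag == "X":
        return METADATA_PATH_XML
    if flag == "H":
        return METADATA_PATH_HATHIFILE
    raise SystemExit("usage: python getLangFreq.py X|H")

if __name__ == "__main__":
    print(pick_metadata_path(sys.argv[1]))
```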
  • Get the dependent variable from the HathiTrust XML metadata (non_google_pd_pdus.xml), then import it into MongoDB:

    • Make sure that pymongo is installed.

    • Run the following command:

       python getDV_HT.py path/to/metadata.xml

      Note that the database name is assumed to be HTRC and the metadata collection name is assumed to be metadata; a dv collection will be created in HTRC after running the command.

    • If you encounter a signal 11 error (Segmentation fault: 11), upgrade your MongoDB to the latest version.
