Metadata Processing

Zach Guo edited this page Apr 8, 2014 · 29 revisions

You don't need to run the commands below; they only worked for an old task and dataset. This documentation is kept for historical reasons.

  • A script, metadata_processing/parseMETS.sh, is provided to explore METS XML metadata. You can run it against the METS.xml files to extract parsed data, including the document id, publication date, and text language.

    • document id example: <PREMIS:objectIdentifierValue>loc.ark:/13960/t0000758s</PREMIS:objectIdentifierValue>
    • publication date and language occupy fixed positions in the 008 control field (character positions 7-14 and 35-37, respectively): <controlfield tag="008">881111s1898 nyu 000 0 eng </controlfield>
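The fixed-position slicing described above can be sketched in Python. This is a minimal illustration, not part of parseMETS.sh (which is a shell script); the function name and dictionary keys are my own.

```python
def parse_008(field):
    """Extract publication dates (positions 7-14) and language
    (positions 35-37) from a fixed-length 008 control field string.

    Positions 7-10 hold the first date, 11-14 a second date (often
    blank), and 35-37 the language code, per the MARC 21 008 layout.
    """
    return {
        "date1": field[7:11].strip(),
        "date2": field[11:15].strip(),
        "language": field[35:38].strip(),
    }
```

For the example record above, parse_008 would return "1898" as the publication date and "eng" as the language.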
  • getLangFreq.py and getYearFreq.py:

    • These two files were only used to get basic descriptive statistics from the metadata, which were used in our first presentation.
    • If you want to try them out, copy them to the same directory as the XML or Hathifile metadata, change the file paths (METADATA_PATH_XML and METADATA_PATH_HATHIFILE), then run python getLangFreq.py X or python getYearFreq.py X. The X argument tells the program to run on sandbox XML metadata; use H instead of X to run on production-stack Hathifile metadata.
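The X/H dispatch described above presumably maps the command-line flag to one of the two configured paths. A minimal sketch of that logic, assuming the scripts read sys.argv and that the placeholder paths below are replaced with your local copies:

```python
import sys

# Hypothetical placeholder paths -- edit these, as the docs instruct.
METADATA_PATH_XML = "sandbox_metadata.xml"
METADATA_PATH_HATHIFILE = "hathifile.txt"

def pick_metadata_path(flag):
    """Map the command-line flag to a metadata source:
    'X' -> sandbox XML metadata, 'H' -> production-stack Hathifile."""
    if flag == "X":
        return METADATA_PATH_XML
    if flag == "H":
        return METADATA_PATH_HATHIFILE
    raise SystemExit("usage: python getLangFreq.py X|H")

if __name__ == "__main__":
    print(pick_metadata_path(sys.argv[1]))
```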
  • Get the dependent variable from the HathiTrust XML metadata (non_google_pd_pdus.xml), then import it into MongoDB:

    • Make sure that pymongo is installed.

    • Run the following command:

       python getDV_HT.py path/to/metadata.xml

      Note that the database name is assumed to be HTRC and the metadata collection name is assumed to be metadata; a dv collection will be created in HTRC after running the command.

    • If you encounter a signal 11 error (Segmentation fault: 11), upgrade your MongoDB to the latest version.
