Metadata Processing
Zach Guo edited this page Apr 8, 2014 · 29 revisions
You don't need to run the commands below. They only work for the old task or dataset. This documentation is kept for historical reasons.
- `metadata_processing/parseMETS.sh` is provided to explore METS XML metadata. You can run this against the METS.xml files to extract parsed data that includes document id, publication date, and text language.
  - document id example:
    <PREMIS:objectIdentifierValue>loc.ark:/13960/t0000758s</PREMIS:objectIdentifierValue>
  - publication date and language are part of a fixed-length field, 7-14, and 35-37 respectively:
    <controlfield tag="008">881111s1898 nyu 000 0 eng </controlfield>
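The fixed-length slicing described above can be sketched in Python. This is only an illustration, not the logic of `parseMETS.sh`: it assumes the positions are counted from zero (the usual MARC convention) and that the first four date characters are the publication year. The example field as displayed on this page is shorter than 38 characters, so the sketch uses a hypothetical 40-character value with the known pieces placed at their offsets; `SAMPLE_008` and `parse_008` are made-up names.

```python
import xml.etree.ElementTree as ET

# Hypothetical 40-character 008 value. Known pieces sit at their fixed
# offsets; positions not discussed on this page are left blank.
SAMPLE_008 = (
    "881111" + "s" + "1898" + " " * 4 + "nyu" + " " * 17 + "eng" + " " * 2
)
SAMPLE_XML = '<controlfield tag="008">' + SAMPLE_008 + "</controlfield>"


def parse_008(value):
    """Slice date and language out of an 008 value (0-based positions)."""
    if len(value) < 38:
        raise ValueError("008 field too short: %d characters" % len(value))
    return {
        "dates": value[7:15],         # positions 7-14
        "year": value[7:11].strip(),  # assumption: first four date characters
        "language": value[35:38],     # positions 35-37
    }


field = ET.fromstring(SAMPLE_XML)
if field.get("tag") == "008":
    print(parse_008(field.text))
```

Real METS files declare XML namespaces, so locating the control field in a full document would need namespace-aware lookups that are omitted here.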
- `getLangFreq.py` and `getYearFreq.py`:
  - These two files were only used to get basic descriptive statistics from the metadata, which were used in our first presentation.
  - If you want to try them out, copy them to the same directory as the XML or Hathifile file, change the file paths (`METADATA_PATH_XML` and `METADATA_PATH_HATHIFILE`), then run the command `python getLangFreq.py X` or `python getYearFreq.py X`. `X` tells the program to run on sandbox XML metadata. If you want the program to run on production stack Hathifile metadata, use `H` instead of `X`.
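The source of these two scripts is not reproduced on this page, so the following is only a guess at their general shape: a flag selects which metadata path to read, and a counter tallies the values. The path strings are placeholders, and `select_source` and `frequency` are hypothetical helper names.

```python
from collections import Counter

# Placeholder paths; the real scripts expect you to edit these constants.
METADATA_PATH_XML = "path/to/metadata.xml"
METADATA_PATH_HATHIFILE = "path/to/hathifile.txt"


def select_source(mode):
    """Map the command-line flag to a path: X = sandbox XML, H = Hathifile."""
    paths = {"X": METADATA_PATH_XML, "H": METADATA_PATH_HATHIFILE}
    if mode not in paths:
        raise ValueError("expected X or H, got %r" % (mode,))
    return paths[mode]


def frequency(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


print(select_source("X"))
print(frequency(["eng", "ger", "eng"]))
```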
- Get the dependent variable from the HathiTrust XML metadata (`non_google_pd_pdus.xml`), then import it into MongoDB:
  - Make sure that `pymongo` is installed.
  - Run the following command:
    python getDV_HT.py path/to/metadata.xml
    Note that the database name is assumed to be `HTRC` and the metadata collection name is assumed to be `metadata`. A `dv` collection will be created in `HTRC` after running the command.
  - If you encounter a `signal 11 (Segmentation fault: 11)` error, upgrade your MongoDB to the latest version.
-