
Datasets

Zach Guo edited this page Apr 8, 2014 · 13 revisions

HathiTrust

Metadata

  • Data sources:
    • MARC XML:
      • HathiTrust provides an incomplete version, which can be found here.
      • HTRC provides complete metadata via Solr API:
        • A script (downloadMetadata.py) for downloading the metadata is provided in the metadata_processing folder.
        • To use the script:
          1. Create an empty folder to store the downloaded files. I recommend using a location outside the repo. If you prefer to keep it inside the repo, make sure the folder name is added to .gitignore.
          2. Run python metadata_processing/downloadMetadata.py <id-filename> <zip filepath>, where <zip filepath> is the empty folder you just created and <id-filename> is the volume ID file (e.g. vid_splitaa).
    • METS XML: Modify DownloadVolumes.py according to [this document](http://wiki.htrc.illinois.edu/download/attachments/15040514/Help-HTRC-OpenOpen-corpus.pdf?api=v2) to retrieve the METS XML data along with the volumes.
    • Hathifile can be found here.
  • Specifications:
    • Refer to this page for additional detail on MARC21 mini data elements. This is the data dictionary that will help you identify key elements in the METS.xml descriptive files.
    • The Library of Congress also provides full and concise versions of the MARC 21 Format for Bibliographic Data.
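
If you have several volume ID files, the download step above can be repeated for each of them. The following is a minimal sketch, not part of the repo: the helper names `build_commands` and `download_all` are hypothetical, and the `vid_split*` glob pattern is only an assumption based on the example file name vid_splitaa. It calls the script exactly as described above, once per ID file, and is meant to be run from the repo root.

```python
import subprocess
from pathlib import Path

# Script path as given on this page (relative to the repo root).
SCRIPT = "metadata_processing/downloadMetadata.py"


def build_commands(id_files, output_dir):
    """Return one argument list per volume ID file (hypothetical helper)."""
    return [["python", SCRIPT, str(f), str(output_dir)] for f in id_files]


def download_all(id_dir, output_dir, pattern="vid_split*"):
    """Run the download script once per ID file found in id_dir.

    The default pattern is an assumption; adjust it to match your ID files.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the download folder if needed
    id_files = sorted(Path(id_dir).glob(pattern))
    for cmd in build_commands(id_files, out):
        # check=True stops at the first failing download
        subprocess.run(cmd, check=True)
```

For example, `download_all("ids", "../htrc_metadata")` would process every matching file in a folder named ids and write the results to a folder outside the repo. Both folder names are placeholders.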

Textfiles

  • The concatenated version is used rather than the paged version. To download the concatenated version, set VOLUME_PARAMETERS = {'concat':'true'} in DownloadVolumes.py.
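
How DownloadVolumes.py consumes VOLUME_PARAMETERS is not shown on this page. As a rough illustration only, assuming the dictionary ends up as query parameters on the download request, the setting would translate as sketched below. The `with_parameters` helper and the example.org URL are hypothetical placeholders, not the script's actual code or the real endpoint.

```python
from urllib.parse import urlencode

# Setting described on this page: request the concatenated version.
VOLUME_PARAMETERS = {'concat': 'true'}


def with_parameters(base_url, parameters):
    """Append parameters to base_url as a query string (hypothetical helper)."""
    query = urlencode(parameters)
    return f"{base_url}?{query}" if query else base_url


# Placeholder URL, used only to show the resulting query string.
url = with_parameters("https://example.org/volumes", VOLUME_PARAMETERS)
```

With an empty dictionary the URL is returned unchanged, which under this assumption would correspond to the default (paged) download.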

Supplementary
