Datasets
Zach Guo edited this page Apr 8, 2014 · 13 revisions
- Data sources:
  - MARC XML:
    - HathiTrust provides an incomplete version, which can be found here.
    - HTRC provides complete metadata via its Solr API:
      - A script (`downloadMetadata.py`) for downloading the metadata is provided in the `metadata_processing` folder.
      - To use the script:
        - Create an empty folder to store your downloaded files. I recommend using a location outside the repo. If you prefer to put it in the repo, make sure that the folder name is added to `.gitignore`.
        - Run `python metadata_processing/downloadMetadata.py <id-filename> <zip filepath>`, where `<zip filepath>` is the empty folder you just created and `<id-filename>` is the volume id file (e.g. `vid_splitaa`).
  - METS XML: Modify the `DownloadVolumes.py` file according to [this document](http://wiki.htrc.illinois.edu/download/attachments/15040514/Help-HTRC-OpenOpen-corpus.pdf?api=v2) to retrieve METS XML data along with the volumes.
  - Hathifile can be found here.
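The two steps for using `downloadMetadata.py` can be sketched in Python as below. This is only an illustration: the helper names `build_download_command` and `run_download` and the example folder `../htrc_metadata` are hypothetical and not part of the repo. Only the script path and the argument order are taken from the instructions above.

```python
import os
import subprocess

# Script path as given in the instructions above, relative to the repo root.
SCRIPT = "metadata_processing/downloadMetadata.py"


def build_download_command(id_filename, out_dir, script=SCRIPT):
    """Create the download folder if it is missing and return the command
    as an argument list: python <script> <id-filename> <zip filepath>."""
    os.makedirs(out_dir, exist_ok=True)
    return ["python", script, id_filename, out_dir]


def run_download(id_filename, out_dir):
    """Run the download script from the repo root; raises if it exits non-zero."""
    subprocess.run(build_download_command(id_filename, out_dir), check=True)


# Example call (hypothetical folder outside the repo):
# run_download("vid_splitaa", "../htrc_metadata")
```

Keeping the command construction separate from the call that runs it makes it easy to print the command and check it before starting a long download.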
- Specifications:
  - Refer to this page for additional detail on MARC21 mini data elements. It is the data dictionary that will help you identify key elements in the METS.xml descriptive files.
  - The Library of Congress also provides full and concise versions of the MARC 21 Format for Bibliographic Data.
  - The concatenated version of each volume is used rather than the paged version. To download the concatenated version, set `VOLUME_PARAMETERS = {'concat':'true'}` in `DownloadVolumes.py`.
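As a rough illustration of what the `VOLUME_PARAMETERS` setting amounts to, the sketch below encodes the dictionary as a URL query string. It assumes that `DownloadVolumes.py` forwards the dictionary as request parameters, which this page does not confirm. The helper `encode_parameters` is hypothetical, and the sketch uses Python 3's `urllib.parse`, which may differ from what the script itself uses.

```python
from urllib.parse import urlencode

# Setting from this page: request the concatenated (not paged) volume text.
VOLUME_PARAMETERS = {'concat': 'true'}


def encode_parameters(params):
    """Encode a parameter dictionary as a URL query string (hypothetical helper)."""
    return urlencode(params)


query = encode_parameters(VOLUME_PARAMETERS)
print(query)  # concat=true
```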
- LibraryThing API
- WorldCat API
- Open Library Data Dump
- Gutenberg?
- WikiBio, WikiEvents?