# Metadata Dumps

For users who need to download large subsets of Zenodo records, making individual API calls for each record can be inefficient.
To address this, we provide bulk metadata dumps that contain exports of Zenodo record metadata in DataCite XML and JSON formats, as well as a CSV list of deleted records.

These dumps are generated **monthly**, at the beginning of each month, and are available via the `/api/exporter` endpoint.
We keep the three most recent snapshots of each dump variant; for each snapshot we list its creation timestamp, version ID, file size, checksum, and download links.

## List available dumps

List all available metadata dumps with their version history.

```python
import requests
resp = requests.get('https://zenodo.org/api/exporter')
```

```shell
curl https://zenodo.org/api/exporter
```

```json
{
  "records-xml.tar.gz": [
    {
      "version_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "created": "2025-09-13T06:25:32.054451+00:00",
      "is_head": true,
      "size": 3970234567,
      "checksum": "md5:1c1fd4ab805d52729cdee94d199f7729",
      "links": {
        "self": "https://zenodo.org/api/exporter/records-xml.tar.gz/a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "self_head": "https://zenodo.org/api/exporter/records-xml.tar.gz"
      }
    }
  ],
  "records-deleted.csv.gz": [
    {
      "version_id": "e2d16608-5d00-41ae-905e-11aa44643228",
      "created": "2025-09-01T23:24:03.867786+00:00",
      "is_head": true,
      "size": 26820485,
      "checksum": "md5:4605fbea12cab96f3a79d91a9f69f286",
      "links": {
        "self": "https://zenodo.org/api/exporter/records-deleted.csv.gz/e2d16608-5d00-41ae-905e-11aa44643228",
        "self_head": "https://zenodo.org/api/exporter/records-deleted.csv.gz"
      }
    }
  ]
}
```

#### HTTP Request

`GET /api/exporter`

#### Success Response

* **Code:** `200 OK`
* **Body**: JSON object where each key represents a dump variant, and the value is an array of the 3 most recent versions.

#### Response format

Each dump variant includes an array of versions with the following fields (a short usage sketch follows the table):

| Field | Description |
|:------|:------------|
| `version_id` | Unique identifier for this version |
| `created` | ISO 8601 timestamp of when the dump was created |
| `is_head` | Boolean indicating if this is the latest version |
| `size` | File size in bytes |
| `checksum` | File checksum for verification |
| `links.self` | URL to download this specific version |
| `links.self_head` | URL to download the latest version (only present for `is_head: true`) |

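As a minimal sketch of how the listing can be consumed, the snippet below fetches `/api/exporter`, picks the `is_head` entry for each variant, and prints its timestamp, size, and checksum. It relies only on the fields documented above; error handling is kept to a minimum.

```python
import requests

# Fetch the listing of all dump variants and their recent versions
resp = requests.get('https://zenodo.org/api/exporter')
resp.raise_for_status()

for key, versions in resp.json().items():
    # The entry flagged with is_head is the most recent snapshot of this variant
    head = next((v for v in versions if v.get('is_head')), None)
    if head is None:
        continue
    size_gb = head['size'] / 1e9
    print(f"{key}: created {head['created']}, {size_gb:.2f} GB, {head['checksum']}")
    print(f"  latest download: {head['links']['self_head']}")
```
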
## Download a dump

Download the latest version of a specific metadata dump.

```python
import requests

# Download latest version
resp = requests.get('https://zenodo.org/api/exporter/records-xml.tar.gz')

# Download specific version
resp = requests.get('https://zenodo.org/api/exporter/records-xml.tar.gz/a1b2c3d4-e5f6-7890-abcd-ef1234567890')
```

```shell
# Download latest version
curl -O https://zenodo.org/api/exporter/records-xml.tar.gz

# Download specific version (use -o to name the output file, since the URL ends in a UUID)
curl -o records-xml.tar.gz https://zenodo.org/api/exporter/records-xml.tar.gz/a1b2c3d4-e5f6-7890-abcd-ef1234567890
```

#### HTTP Request

`GET /api/exporter/:key`

`GET /api/exporter/:key/:version_id`

#### URL Parameters

| Parameter | Required | Description |
|:----------|:---------|:------------|
| `key` | required | The dump variant key (e.g., `records-xml.tar.gz`) |
| `version_id` | optional | Specific version UUID. If omitted, returns the latest version. |

#### Success Response

* **Code:** `200 OK`
* **Body**: Binary file content (typically `.tar.gz` archive)

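Because dump archives can be several gigabytes, it is usually worth streaming the download to disk and verifying the advertised checksum afterwards. Below is a minimal sketch; it assumes the checksum string from the listing endpoint carries an algorithm prefix, as in the `md5:...` values shown above.

```python
import hashlib
import requests

url = 'https://zenodo.org/api/exporter/records-xml.tar.gz'
expected = 'md5:1c1fd4ab805d52729cdee94d199f7729'  # checksum reported by /api/exporter

md5 = hashlib.md5()
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open('records-xml.tar.gz', 'wb') as fp:
        # Write the archive to disk in 1 MiB chunks instead of loading it into memory
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            fp.write(chunk)
            md5.update(chunk)

# Compare against the checksum advertised by the listing endpoint
assert md5.hexdigest() == expected.split(':', 1)[1], 'checksum mismatch'
```
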
#### Archive structure

**Metadata dumps** (`.tar.gz` archives) contain one file per record:

- Filename format: `<record_id>.<extension>` (e.g., `8435696.xml`)
- Each file contains the complete metadata for that record in the specified format

**Deleted records dump** (`records-deleted.csv.gz`) is a gzip-compressed CSV file with the following columns (a short parsing sketch follows the table):

| Column | Description |
|:-------|:------------|
| `record_id` | Zenodo record ID |
| `doi` | Record DOI |
| `parent_id` | Parent record ID (concept record) |
| `parent_doi` | Parent record DOI |
| `removal_note` | Reason for removal |
| `removal_reason` | Category (e.g., "spam") |
| `removal_date` | Date the record was removed |
| `citation_text` | Citation text (if available) |

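As a rough sketch of reading the deleted-records dump, the snippet below assumes you have already downloaded `records-deleted.csv.gz` (as shown above) and that the CSV begins with a header row naming the columns listed in the table; if it does not, pass the column names explicitly via `fieldnames`.

```python
import csv
import gzip
import itertools

# Read the gzip-compressed CSV without decompressing it to disk first
with gzip.open('records-deleted.csv.gz', 'rt', newline='') as fp:
    reader = csv.DictReader(fp)  # assumes a header row; otherwise pass fieldnames=[...]
    for row in itertools.islice(reader, 5):
        print(row['record_id'], row['doi'], row['removal_reason'])
```
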
## Streaming processing example

```python
import requests
import tarfile
import itertools
import xml.etree.ElementTree as ET

url = "https://zenodo.org/api/exporter/records-xml.tar.gz"
resp = requests.get(url, stream=True)
resp.raw.decode_content = True

namespaces = {
    'datacite': 'http://datacite.org/schema/kernel-4',
    'oai_datacite': 'http://schema.datacite.org/oai/oai-1.1/'
}

with tarfile.open(fileobj=resp.raw, mode="r|gz") as tar:
    for member in itertools.islice(tar, 10):
        if member.isfile():
            content = tar.extractfile(member).read()
            root = ET.fromstring(content)

            # Extract DOI and title from DataCite XML
            resource = root.find('.//datacite:resource', namespaces)
            doi = resource.find('datacite:identifier', namespaces).text
            title = resource.find('datacite:titles/datacite:title', namespaces).text

            print(f"{member.name}: {doi} - {title}")

# Outputs:
# 12345.xml: 10.5281/zenodo.12345 - Dataset for XYZ
# ...
```

```shell
# List files without extracting
curl -s https://zenodo.org/api/exporter/records-xml.tar.gz | tar -tzf - | head -10

# Outputs:
# 12345.xml
# 12346.xml
# ...
```

For large dumps, use streaming to avoid loading the entire file into memory.
This example shows how to download and process a metadata dump by streaming the `tar.gz` archive and extracting individual record files on the fly.