Commit ecfa228

global: document metadata dumps /api/exporter endpoint
1 parent 71acaf1 commit ecfa228

2 files changed, +183 -0 lines changed
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# Metadata Dumps

For users who need to download large subsets of Zenodo records, making an individual API call for each record can be inefficient.
To address this, we provide bulk metadata dumps that contain exports of Zenodo record metadata in DataCite XML and JSON formats, as well as a CSV list of deleted records.

These dumps are generated **monthly** at the beginning of each month and are available via the `/api/exporter` endpoint.
We maintain the latest 3 snapshots of each dump variant; for each snapshot the endpoint reports the creation timestamp, version ID, file size, checksum, and download links.

## List available dumps

List all available metadata dumps with their version history.

```python
import requests
resp = requests.get('https://zenodo.org/api/exporter')
resp.raise_for_status()
dumps = resp.json()  # dict: dump variant key -> list of versions
```

```shell
curl https://zenodo.org/api/exporter
```

```json
{
  "records-xml.tar.gz": [
    {
      "version_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "created": "2025-09-13T06:25:32.054451+00:00",
      "is_head": true,
      "size": 3970234567,
      "checksum": "md5:1c1fd4ab805d52729cdee94d199f7729",
      "links": {
        "self": "https://zenodo.org/api/exporter/records-xml.tar.gz/a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "self_head": "https://zenodo.org/api/exporter/records-xml.tar.gz"
      }
    }
  ],
  "records-deleted.csv.gz": [
    {
      "version_id": "e2d16608-5d00-41ae-905e-11aa44643228",
      "created": "2025-09-01T23:24:03.867786+00:00",
      "is_head": true,
      "size": 26820485,
      "checksum": "md5:4605fbea12cab96f3a79d91a9f69f286",
      "links": {
        "self": "https://zenodo.org/api/exporter/records-deleted.csv.gz/e2d16608-5d00-41ae-905e-11aa44643228",
        "self_head": "https://zenodo.org/api/exporter/records-deleted.csv.gz"
      }
    }
  ]
}
```

#### HTTP Request

`GET /api/exporter`

#### Success Response

* **Code:** `200 OK`
* **Body:** JSON object in which each key is a dump variant and each value is an array of that variant's most recent versions (up to 3).

#### Response format

Each entry in a dump variant's version array has the following fields:

| Field | Description |
|:------|:------------|
| `version_id` | Unique identifier for this version |
| `created` | ISO 8601 timestamp of when the dump was created |
| `is_head` | Boolean indicating if this is the latest version |
| `size` | File size in bytes |
| `checksum` | File checksum for verification, prefixed with the algorithm (e.g., `md5:...`) |
| `links.self` | URL to download this specific version |
| `links.self_head` | URL to download the latest version (only present for `is_head: true`) |

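As an illustration of how the fields above fit together, the following minimal sketch picks the entry flagged `is_head` for each dump variant from an already-parsed listing. The helper name `head_versions` and the `sample_listing` variable are hypothetical; the sample values are copied from the example response, and the code assumes the response has exactly the documented shape.

```python
def head_versions(listing):
    """Return {dump variant key: version entry flagged is_head}."""
    heads = {}
    for key, versions in listing.items():
        for version in versions:
            if version.get("is_head"):
                heads[key] = version
                break  # at most one head entry is expected per variant
    return heads


# Trimmed sample shaped like the documented response (values copied from it).
sample_listing = {
    "records-deleted.csv.gz": [
        {
            "version_id": "e2d16608-5d00-41ae-905e-11aa44643228",
            "created": "2025-09-01T23:24:03.867786+00:00",
            "is_head": True,
            "size": 26820485,
            "checksum": "md5:4605fbea12cab96f3a79d91a9f69f286",
            "links": {
                "self": "https://zenodo.org/api/exporter/records-deleted.csv.gz/e2d16608-5d00-41ae-905e-11aa44643228",
                "self_head": "https://zenodo.org/api/exporter/records-deleted.csv.gz",
            },
        }
    ]
}

for key, version in head_versions(sample_listing).items():
    size_mib = version["size"] / (1024 * 1024)
    print(f"{key}: created {version['created']}, {size_mib:.1f} MiB")
    print(f"  download: {version['links']['self']}")
```

In practice the `listing` argument would be the parsed JSON body of `GET /api/exporter`, for example `requests.get(...).json()`.
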
## Download a dump

Download the latest version of a metadata dump, or a specific version by its `version_id`.

```python
import requests

# Download latest version
resp = requests.get('https://zenodo.org/api/exporter/records-xml.tar.gz')

# Download specific version
resp = requests.get('https://zenodo.org/api/exporter/records-xml.tar.gz/a1b2c3d4-e5f6-7890-abcd-ef1234567890')
```

```shell
# Download latest version
curl -O https://zenodo.org/api/exporter/records-xml.tar.gz

# Download specific version
curl -O https://zenodo.org/api/exporter/records-xml.tar.gz/a1b2c3d4-e5f6-7890-abcd-ef1234567890
```

#### HTTP Request

`GET /api/exporter/:key`

`GET /api/exporter/:key/:version_id`

#### URL Parameters

| Parameter | Required | Description |
|:----------|:---------|:------------|
| `key` | required | The dump variant key (e.g., `records-xml.tar.gz`) |
| `version_id` | optional | Specific version UUID. If omitted, returns the latest version. |

#### Success Response

* **Code:** `200 OK`
* **Body:** Binary file content (a `.tar.gz` archive, or a `.csv.gz` file for the deleted records dump)

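Because the listing reports a `checksum` for every version, a download can be verified while it is written to disk. The sketch below is one possible approach, not an official client: the helper name `write_and_verify` is hypothetical, and it assumes the checksum always has the `<algorithm>:<hexdigest>` form seen in the sample response (`md5:...`). It takes any iterable of byte chunks, so it runs here against in-memory data.

```python
import hashlib
import io


def write_and_verify(chunks, out_file, checksum):
    """Write byte chunks to out_file and compare their digest to checksum.

    Assumes checksum looks like "md5:<hexdigest>", as in the sample listing.
    Returns True when the computed digest matches.
    """
    algorithm, _, expected = checksum.partition(":")
    digest = hashlib.new(algorithm)
    for chunk in chunks:
        digest.update(chunk)
        out_file.write(chunk)
    return digest.hexdigest() == expected


# Self-contained demonstration with in-memory data instead of a real download.
payload_chunks = [b"example ", b"dump ", b"bytes"]
expected_checksum = "md5:" + hashlib.md5(b"".join(payload_chunks)).hexdigest()

target = io.BytesIO()
ok = write_and_verify(payload_chunks, target, expected_checksum)
print("checksum matches:", ok)
```

With `requests`, the chunks could come from `resp.iter_content(chunk_size=...)` on a request made with `stream=True`, and `out_file` would be a file opened in `"wb"` mode, so the archive is never held in memory as a whole.
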
#### Archive structure

**Metadata dumps** (`.tar.gz` archives) contain one file per record:

- Filename format: `<record_id>.<extension>` (e.g., `8435696.xml`)
- Each file contains the complete metadata for that record in the specified format

**Deleted records dump** (`records-deleted.csv.gz`) is a gzip-compressed CSV file with the following columns:

| Column | Description |
|:-------|:------------|
| `record_id` | Zenodo record ID |
| `doi` | Record DOI |
| `parent_id` | Parent record ID (concept record) |
| `parent_doi` | Parent record DOI |
| `removal_note` | Reason for removal |
| `removal_reason` | Category (e.g., "spam") |
| `removal_date` | Date the record was removed |
| `citation_text` | Citation text (if available) |

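The deleted records dump can be read row by row with the standard library. The following is a minimal sketch under one loudly stated assumption: that the CSV starts with a header row naming the columns listed above (this is not confirmed here). The helper name `iter_deleted_records` is hypothetical, and the sample row is made-up illustration data, not a real Zenodo record.

```python
import csv
import gzip
import io

COLUMNS = [
    "record_id", "doi", "parent_id", "parent_doi",
    "removal_note", "removal_reason", "removal_date", "citation_text",
]


def iter_deleted_records(fileobj):
    """Yield one dict per row of a gzip-compressed CSV.

    Assumption: the first row of the CSV is a header with the column names.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.DictReader(text)


# Build a tiny in-memory sample (made-up values) so the sketch runs offline.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerow(["12345", "10.5281/zenodo.12345", "12344", "10.5281/zenodo.12344",
                 "Example note", "spam", "2025-09-01", ""])
sample_gz = gzip.compress(buffer.getvalue().encode("utf-8"))

for row in iter_deleted_records(io.BytesIO(sample_gz)):
    print(row["record_id"], row["doi"], row["removal_reason"])
```

For the real dump, `fileobj` could be a local file opened in `"rb"` mode after downloading `records-deleted.csv.gz`.
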
## Streaming processing example

```python
import requests
import tarfile
import itertools
import xml.etree.ElementTree as ET

url = "https://zenodo.org/api/exporter/records-xml.tar.gz"
resp = requests.get(url, stream=True)
resp.raise_for_status()
resp.raw.decode_content = True

namespaces = {
    'datacite': 'http://datacite.org/schema/kernel-4',
    'oai_datacite': 'http://schema.datacite.org/oai/oai-1.1/'
}

# "r|gz" reads the archive as a forward-only stream, so members are processed in order
with tarfile.open(fileobj=resp.raw, mode="r|gz") as tar:
    # Only the first 10 archive members are read in this example
    for member in itertools.islice(tar, 10):
        if member.isfile():
            content = tar.extractfile(member).read()
            root = ET.fromstring(content)

            # Extract DOI and title from DataCite XML.
            # Defensive: <resource> may be the root itself or nested in a wrapper.
            if root.tag == '{http://datacite.org/schema/kernel-4}resource':
                resource = root
            else:
                resource = root.find('.//datacite:resource', namespaces)
            doi = resource.find('datacite:identifier', namespaces).text
            title = resource.find('datacite:titles/datacite:title', namespaces).text

            print(f"{member.name}: {doi} - {title}")

# Outputs:
# 12345.xml: 10.5281/zenodo.12345 - Dataset for XYZ
# ...
```

```shell
# List files without extracting
curl -s https://zenodo.org/api/exporter/records-xml.tar.gz | tar -tzf - | head -10

# Outputs:
# 12345.xml
# 12346.xml
# ...
```

For large dumps, use streaming to avoid loading the entire file into memory.
This example shows how to download and process a metadata dump by streaming the `tar.gz` archive and extracting individual record files on the fly.

source/index.html.md

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ includes:
   - resources/licenses/retrieve
   - resources/changes
   - oai-pmh/root
+  - metadata-dumps/root
   - github/root
   - rate-limit/root
 search: true
