From 1baa09247fa3b177eedd07655a30423f4c91b84f Mon Sep 17 00:00:00 2001 From: "John T. Wodder II" Date: Mon, 18 Mar 2024 15:46:43 -0400 Subject: [PATCH 01/11] Design doc for generating Zarr Manifest Files Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 150 +++++++++++++++++++++++++++++++++++ 1 file changed, 150 insertions(+) create mode 100644 doc/design/zarr-manifests.md diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md new file mode 100644 index 000000000..07374d0a4 --- /dev/null +++ b/doc/design/zarr-manifests.md @@ -0,0 +1,150 @@ +Zarr Manifest Files +=================== + +This document specifies *Zarr manifest files*, each of which describes a Zarr +in the Dandi Archive, including the Zarr's internal directory structure and +details on all of the Zarr's *entries* (regular, non-directory files). The +Dandi Archive is to automatically generate these files and serve them via S3. + +@yarikoptic has already produced proof-of-concept manifest files for all Zarrs +in the Dandi Archive at . Except +where noted, the manifest file format defined herein matches the format used by +the proof of concept. + + +Archive Behavior +---------------- + +Whenever Dandi Archive calculates the checksum for a Zarr in the Archive, it +shall additionally produce a *manifest file* listing various information about +the Zarr and its entries in the format described in the next section. This +manifest file shall be stored in the Archive's S3 bucket at the path +`zarr-manifest/{zarr_id}.json`, where `{zarr_id}` is replaced by the ID of the +Zarr. The manifest file shall be world-readable, unless the Zarr is embargoed +or belongs to an embargoed Dandiset, in which case appropriate steps shall be +taken to limit read access to the file. + +If a manifest file is generated for a Zarr for which an earlier manifest file +was already generated, the newer file shall replace the older. 
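As a concrete illustration of the storage rule above (a sketch only; the helper names are ours, and an actual upload would additionally use an S3 client such as boto3):

```python
import json


def manifest_key(zarr_id: str) -> str:
    """S3 key under which the manifest for the given Zarr is stored."""
    return f"zarr-manifest/{zarr_id}.json"


def serialize_manifest(manifest: dict) -> bytes:
    """Serialize a manifest object for use as the S3 object body."""
    return json.dumps(manifest).encode("utf-8")


print(manifest_key("057f84d5-a88b-490a-bedf-06f3f50e9e62"))
# → zarr-manifest/057f84d5-a88b-490a-bedf-06f3f50e9e62.json
```

Because the key depends only on the Zarr ID, regenerating a manifest writes to the same key and so replaces the older file, matching the rule above.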
+ +Manifest files shall also be generated for all Zarrs already in the Archive +when this feature is first implemented. + + +Manifest File Format +-------------------- + +A Zarr manifest file is a JSON document consisting of a JSON object with the +following fields: + +- `fields` (array of strings) — A list of the names of the fields provided for + each entry in the `entries` tree. The possible field names, along with + descriptions of the entry fields, are as follows: + + - `"versionId"` — The S3 version ID (as a string) of the current version of + the S3 object in which the entry is stored in the Archive's S3 bucket + + - `"lastModified"` — The `LastModified` timestamp of the entry's S3 object + as a string of the form `"YYYY-MM-DDTHH:MM:SS±HH:MM"` + + - `"size"` — The size in bytes of the entry as an integer + + - `"ETag"` — The `ETag` of the entry's S3 object as a string with leading & + trailing double quotation marks (U+0022) removed (not counting the double + quotation marks used by the JSON serialization) + + - This value is the same as the lowercase hexadecimal encoding of the + entry's MD5 digest. + + It is **highly recommended** that `fields` always has a value of + `["versionId", "lastModified", "size", "ETag"]`, in that order. + +- `statistics` (object) — An object containing the following fields describing + the Zarr as a whole: + + - `entries` — The total number of entries in the Zarr as an integer + + - `depth` — The maximum number of directory levels deep at which an entry + can be found in the Zarr, as an integer + + - A Zarr containing only entries, no directories, has a depth of 0. + + - A Zarr that contains one or more top-level directories, all which + contain only entries, has a depth of 1. 
+ + - `totalSize` — The sum of the sizes of all entries in the Zarr + + - `lastModified` — The date & time at which any change was made to the + Zarr's contents as a string of the form `"YYYY-MM-DDTHH:MM:SS±HH:MM"` + + - `zarrChecksum` — The Zarr's Dandi Zarr checksum + +- `entries` (object) — A tree of values mirroring the directory & entry + structure of the Zarr. + + - Each entry in the Zarr is represented as an array of the same length as + the top-level `fields` field in which each element gives the Zarr entry's + value for the field whose name is at the same location in `fields`. + + For example, if `fields` had a value of `["versionId", "lastModified", + "size", "ETag"]`, then a possible entry array could be: + + ```json + [ + "VI067uTlzPTTyL750Ibkx3hAUm67A_sI", + "2022-03-16T02:39:36+00:00", + 27935, + "fc3d1270cd950f1e5430226db4c38c0e" + ] + ``` + + Here, the first element of the array is the entry's `versionId`, the + second element is the entry's `lastModified` timestamp, the third + element is the entry's size, and the fourth entry is the entry's ETag. + + - Each directory in the Zarr is represented as an object in which each key + is the name of an entry or subdirectory inside the directory and the + corresponding value is either an entry array or a directory object. + + - The `entries` object itself represents the top level directory of the + Zarr. + + For example, a Zarr with the following structure: + + ```text + . + ├── .zgroup + ├── arr_0/ + │   ├── .zarray + │   └── 0 + └── arr_1/ + ├── .zarray + └── 0 + ``` + + would have an `entries` field as follows (with elements of the entry arrays + omitted): + + ```json + { + ".zgroup": [ ... ], + "arr_0": { + ".zarray": [ ... ], + "0": [ ... ] + }, + "arr_1": { + ".zarray": [ ... ], + "0": [ ... 
] + } + } + ``` + +> [!NOTE] +> The manifest files created by @yarikoptic contain the following fields which +> are not present in the format described above: +> +> - A top-level `schemaVersion` key with a constant value of `2` +> +> - A `zarrChecksumMismatch` field inside the `statistics` object, used to +> store the checksum that the API reports for a Zarr when it disagrees with +> the checksum calculated by the manifest-generation code From a676d35942998b65c6c4e9059a2517ac04412bfa Mon Sep 17 00:00:00 2001 From: "John T. Wodder II" Date: Tue, 19 Mar 2024 07:42:01 -0400 Subject: [PATCH 02/11] Add versioning Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index 07374d0a4..785044803 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -19,13 +19,11 @@ Whenever Dandi Archive calculates the checksum for a Zarr in the Archive, it shall additionally produce a *manifest file* listing various information about the Zarr and its entries in the format described in the next section. This manifest file shall be stored in the Archive's S3 bucket at the path -`zarr-manifest/{zarr_id}.json`, where `{zarr_id}` is replaced by the ID of the -Zarr. The manifest file shall be world-readable, unless the Zarr is embargoed -or belongs to an embargoed Dandiset, in which case appropriate steps shall be -taken to limit read access to the file. - -If a manifest file is generated for a Zarr for which an earlier manifest file -was already generated, the newer file shall replace the older. +`zarr-manifest/{zarr_id}/{checksum}.json`, where `{zarr_id}` is replaced by the +ID of the Zarr and `{checksum}` is replaced by the Dandi Zarr checksum of the +Zarr at that point in time. 
The manifest file shall be world-readable, unless +the Zarr is embargoed or belongs to an embargoed Dandiset, in which case +appropriate steps shall be taken to limit read access to the file. Manifest files shall also be generated for all Zarrs already in the Archive when this feature is first implemented. From d1836b417a3fd2eeff83938ab320941d94c61b27 Mon Sep 17 00:00:00 2001 From: "John T. Wodder II" Date: Tue, 19 Mar 2024 10:29:38 -0400 Subject: [PATCH 03/11] Mention S3 API calls for getting object version IDs Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index 785044803..aec4c570b 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -42,6 +42,12 @@ following fields: - `"versionId"` — The S3 version ID (as a string) of the current version of the S3 object in which the entry is stored in the Archive's S3 bucket + - **Implementation Note:** Obtaining an S3 object's current version ID + requires using either (a) the `GetObject` S3 API call (for a single + object) or (b) the `ListObjectVersions` S3 API call, including + client-side filtering out of all non-latest entries (for all objects + under a given common S3 prefix). + - `"lastModified"` — The `LastModified` timestamp of the entry's S3 object as a string of the form `"YYYY-MM-DDTHH:MM:SS±HH:MM"` From 705d0290d407e5bfb41c5517753eca0ce7ffa1df Mon Sep 17 00:00:00 2001 From: "John T. 
Wodder II" Date: Fri, 22 Mar 2024 09:00:50 -0400 Subject: [PATCH 04/11] Outline Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 59 +++++++++++++++++++++++++++++++----- 1 file changed, 52 insertions(+), 7 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index aec4c570b..c8ec541a5 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -12,18 +12,30 @@ where noted, the manifest file format defined herein matches the format used by the proof of concept. -Archive Behavior ----------------- +Creating & Storing Manifest Files +--------------------------------- Whenever Dandi Archive calculates the checksum for a Zarr in the Archive, it shall additionally produce a *manifest file* listing various information about the Zarr and its entries in the format described in the next section. This manifest file shall be stored in the Archive's S3 bucket at the path -`zarr-manifest/{zarr_id}/{checksum}.json`, where `{zarr_id}` is replaced by the -ID of the Zarr and `{checksum}` is replaced by the Dandi Zarr checksum of the -Zarr at that point in time. The manifest file shall be world-readable, unless -the Zarr is embargoed or belongs to an embargoed Dandiset, in which case -appropriate steps shall be taken to limit read access to the file. 
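The client-side filtering mentioned in the implementation note above might look like the following sketch; it operates on records shaped like the `Versions` entries of a `ListObjectVersions` response (actually fetching them would use e.g. boto3's `list_object_versions` paginator):

```python
def latest_version_ids(version_records) -> dict:
    """Map each key to the version ID of its latest version, dropping
    all non-latest entries returned by ListObjectVersions."""
    return {
        rec["Key"]: rec["VersionId"]
        for rec in version_records
        if rec.get("IsLatest")
    }


records = [
    {"Key": "zarr/abc/.zgroup", "VersionId": "v2", "IsLatest": True},
    {"Key": "zarr/abc/.zgroup", "VersionId": "v1", "IsLatest": False},
    {"Key": "zarr/abc/0/0", "VersionId": "v7", "IsLatest": True},
]
print(latest_version_ids(records))
# → {'zarr/abc/.zgroup': 'v2', 'zarr/abc/0/0': 'v7'}
```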
+`zarr-manifest/{dir1}/{dir2}/{zarr_id}/{checksum}.json`, where: + +- `{dir1}` is replaced by the first three characters of the Zarr ID +- `{dir2}` is replaced by the next three characters of the Zarr ID +- `{zarr_id}` is replaced by the ID of the Zarr +- `{checksum}` is replaced by the Dandi Zarr checksum of the Zarr at that point + in time + +This directory structure (a) will allow `dandidav` to change the data source +for its `/zarr/` hierarchy from the proof-of-concept to the S3 bucket with +minimal code changes and (b) ensures that the number of entries within each +directory in the bucket under `zarr-manifest/` is not colossal, thereby +avoiding tremendous resource usage by `dandidav`. + +The manifest file shall be world-readable, unless the Zarr is embargoed or +belongs to an embargoed Dandiset, in which case appropriate steps shall be +taken to limit read access to the file. Manifest files shall also be generated for all Zarrs already in the Archive when this feature is first implemented. @@ -152,3 +164,36 @@ following fields: > - A `zarrChecksumMismatch` field inside the `statistics` object, used to > store the checksum that the API reports for a Zarr when it disagrees with > the checksum calculated by the manifest-generation code + + +Archive API Changes +------------------- + +***WIP*** + +* Zarr version IDs equal the Zarr checksum + +* Asset properties gain `zarr_version: str | null` field (absent or null if Zarr is not yet ingested or asset is not a Zarr) + - Not settable by client + - Mint new asset when version changes? + +* Add `zarr_version` field to …/assets/path/ results + +* Zarr `contentUrl`s: + - Make API download URLs for Zarrs redirect to dandidav + - Replace S3 URLs with webdav.{archive_domain}/zarr/ URLs? + - Document needed changes to dandidav? 
+ - The bucket for the Archive instance will now be given on the command line (only required if a custom/non-default API URL is given) + - The bucket's region will have to be looked up & stored before starting the webserver + - Zarrs under `/dandisets/` will no longer determine their S3 location via `contentUrl`; instead, they will combine the Archive's bucket & region with the Zarr ID in the asset properties (templated into "zarr/{zarr_id}/") + +* Getting specific Zarr versions & their files from API endpoints + - `GET /zarr/versions/` (paginated) + - `GET /zarr/versions/{version_id}/` ? + - `GET /zarr/versions/{version_id}/files/[?prefix=...]` (paginated) + - The Zarr entry objects returned in `…/files/` responses (with & without `versions/{version_id}/`) will need to gain a `VersionId` field containing the S3 object version ID + - Nothing under /zarr/versions/ is writable over the API + +* Publishing Zarrs: Just ensure that the `zarr_version` in Zarr assets is frozen and that no entries/S3 object versions from the referenced version are ever deleted ? + +* Does garbage collection of old Zarr versions need to be discussed? From 52368df3df357d6b71fc9becdbac317bb367e34c Mon Sep 17 00:00:00 2001 From: "John T. 
Wodder II" Date: Fri, 22 Mar 2024 16:11:02 -0400 Subject: [PATCH 05/11] /zarr/{zarr_id}/ Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index c8ec541a5..8d08a45b5 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -188,9 +188,10 @@ Archive API Changes - Zarrs under `/dandisets/` will no longer determine their S3 location via `contentUrl`; instead, they will combine the Archive's bucket & region with the Zarr ID in the asset properties (templated into "zarr/{zarr_id}/") * Getting specific Zarr versions & their files from API endpoints - - `GET /zarr/versions/` (paginated) - - `GET /zarr/versions/{version_id}/` ? - - `GET /zarr/versions/{version_id}/files/[?prefix=...]` (paginated) + - The current `/zarr/{zarr_id}/…` endpoints operate on the most recent version of the Zarr + - `GET /zarr/{zarr_id}/versions/` (paginated) + - `GET /zarr/{zarr_id}/versions/{version_id}/` ? 
+ - `GET /zarr/{zarr_id}/versions/{version_id}/files/[?prefix=...]` (paginated) - The Zarr entry objects returned in `…/files/` responses (with & without `versions/{version_id}/`) will need to gain a `VersionId` field containing the S3 object version ID - Nothing under /zarr/versions/ is writable over the API From df500f10f57cb428d43a28597600908776d7e732 Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Tue, 13 Aug 2024 15:27:35 -0400 Subject: [PATCH 06/11] Extend Zarr design doc with pointers/examples on current implementation Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 38 +++++++++++++++++++++++++++--------- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index 8d08a45b5..8c001a39e 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -1,21 +1,41 @@ -Zarr Manifest Files -=================== +# Zarr Manifest Files This document specifies *Zarr manifest files*, each of which describes a Zarr in the Dandi Archive, including the Zarr's internal directory structure and details on all of the Zarr's *entries* (regular, non-directory files). The Dandi Archive is to automatically generate these files and serve them via S3. -@yarikoptic has already produced proof-of-concept manifest files for all Zarrs -in the Dandi Archive at . Except -where noted, the manifest file format defined herein matches the format used by -the proof of concept. 
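For reference, the sharded manifest key described in the storage section (`zarr-manifest/{dir1}/{dir2}/{zarr_id}/{checksum}.json`) can be derived as follows (a sketch; the function name is ours):

```python
def sharded_manifest_key(zarr_id: str, checksum: str) -> str:
    """Build the manifest key, sharding by the first two 3-character
    prefixes of the Zarr ID to keep per-directory fan-out small."""
    dir1, dir2 = zarr_id[:3], zarr_id[3:6]
    return f"zarr-manifest/{dir1}/{dir2}/{zarr_id}/{checksum}.json"


print(sharded_manifest_key(
    "057f84d5-a88b-490a-bedf-06f3f50e9e62",
    "6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836",
))
# → zarr-manifest/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.json
```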
+## Current prototype -Creating & Storing Manifest Files --------------------------------- +### Creating manifest files -Whenever Dandi Archive calculates the checksum for a Zarr in the Archive, it +A proof-of-concept implementation that produces manifest files for all Zarrs +in the Dandi Archive, together with the manifest files it has produced, is available from https://datasets.datalad.org/?dir=/dandi/zarr-manifests, which is a [DataLad dataset](https://handbook.datalad.org/en/latest/glossary.html#term-DataLad-dataset) in which the individual manifest files are annexed. + +**Note:** https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/ and its subfolders provide an ad-hoc JSON record listing folders/files, to avoid having to parse the stock Apache index. + +CRON job runs daily on typhon (server at Dartmouth). +Except where noted, the manifest file format defined herein matches the format used by the proof of concept. + +### Data access using manifest files + +[dandidav](https://github.com/dandi/dandidav)---a WebDAV server for the DANDI Archive---serves Zarrs from the Archive using the manifest files. +Actual data is served from the Archive's S3 bucket, but the WebDAV server uses the manifest files to determine the structure of the Zarrs and the versions of the Zarrs' entries. +Two "end-points" within that namespace are provided: + +- [webdav.dandiarchive.org/zarrs](https://webdav.dandiarchive.org/zarrs) -- all Zarrs across all dandisets, possibly with multiple versions. E.g. see [zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62) which ATM has 3 versions. +- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g.
for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)). + +Tools which support following redirections for individual files within Zarr can be pointed to those locations to "consume" zarrs of specific versions. +ATM dandisets do not support publishing (versioning) of Zarrs, so there would be only `/draft/` versions of dandisets with Zarrs. +If this design is supported/implemented, particular versions of Zarrs would be made available from within particular versions of the `/dandisets/{dandiset_id}/`s. + +## Design details + +### Creating & Storing Manifest Files + +Whenever DANDI Archive calculates the checksum for a Zarr in the Archive, it shall additionally produce a *manifest file* listing various information about the Zarr and its entries in the format described in the next section. This manifest file shall be stored in the Archive's S3 bucket at the path From 573e72b5c95ca6f98c0101ed1c3e306337e5bc37 Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Tue, 20 Aug 2024 16:38:59 -0400 Subject: [PATCH 07/11] Some more rewording and expansion in the design doc Signed-off-by: Yaroslav Halchenko --- doc/design/zarr-manifests.md | 48 +++++++++++++++++++++++++----------- 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index 8c001a39e..7ad3fa5c6 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -1,10 +1,14 @@ -# Zarr Manifest Files +# Zarr Versioning/Publishing support via Manifest Files -This document specifies *Zarr manifest files*, each of which describes a Zarr +This document specifies + +1. 
[x] *Zarr manifest files*, each of which describes a Zarr in the Dandi Archive, including the Zarr's internal directory structure and details on all of the Zarr's *entries* (regular, non-directory files). The Dandi Archive is to automatically generate these files and serve them via S3. - +2. [ ] Changes needed to the DANDI Archive's API, DB Data Model, and internal logic. +3. [ ] Changes needed to AWS (S3 in particular; likely TerraForm) configuration. +4. [ ] Changes needed (if any) to dandischema. ## Current prototype @@ -15,23 +19,24 @@ in the Dandi Archive, and actual produced manifest files are provided from https **Note:** https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/ and subfolders provides ad-hoc json record listing folders/files to avoid parsing stock apache2 index. -CRON job runs daily on typhon (server at Dartmouth). +[CRON job](https://github.com/dandi/zarr-manifests/blob/master/cronjob) runs daily on typhon (server at Dartmouth) to create manifest files (only) for new/updated zarrs in the archive. Except where noted, the manifest file format defined herein matches the format used by the proof of concept. +As embargoed access to Zarrs is not implemented yet, embargo-related designs here might be incomplete. ### Data access using manifest files [dandidav](https://github.com/dandi/dandidav)---a WebDAV server for the DANDI---serves Zarrs from the Archive using the manifest files. Actual data is served from the Archive's S3 bucket, but the WebDAV server uses the manifest files to determine the structure of the Zarrs and the versions of the Zarrs' entries. -Two "end-points" within that namespace are provided: +Two "end-points" to access Zarrs within that namespace are provided, but only one of them uses Zarr manifests: -- [webdav.dandiarchive.org/zarrs](https://webdav.dandiarchive.org/zarrs) -- all Zarrs across all dandisets, possibly with multiple versions. E.g. 
see [zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62) which ATM has 3 versions. -- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)). +- [webdav.dandiarchive.org/zarrs](https://webdav.dandiarchive.org/zarrs) -- **uses manifests** for all Zarrs across all dandisets, possibly with multiple versions. E.g. see [zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62) which ATM has 3 versions. +- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/` -- **does not** use manifest files but gets listing directly from S3, so provides access only to the current version (possibly not even finalized yet during upload) of the zarr at that path. -Tools which support following redirections for individual files within Zarr can be pointed to those locations to "consume" zarrs of specific versions. +Tools which support following redirections for individual files within Zarr can be pointed to the locations under the former end-point to "consume" zarr of a specific version. ATM dandisets do not support publishing (versioning) of Zarrs, so there would be only `/draft/` versions of dandisets with Zarrs. If this design is supported/implemented, particular versions of Zarrs would be made available from within particular versions of the `/dandisets/{dandiset_id}/`s. 
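A versioned `dandidav` URL of the kind linked above can be assembled with the same 3-character sharding (illustrative only; the helper is ours, and the `.zarr/` suffix follows the example links above):

```python
def dandidav_zarr_url(zarr_id: str, checksum: str,
                      base: str = "https://webdav.dandiarchive.org") -> str:
    """URL under the /zarrs/ hierarchy for one specific Zarr version."""
    return (f"{base}/zarrs/{zarr_id[:3]}/{zarr_id[3:6]}/"
            f"{zarr_id}/{checksum}.zarr/")


print(dandidav_zarr_url(
    "057f84d5-a88b-490a-bedf-06f3f50e9e62",
    "6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836",
))
# → https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/
```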
-## Design details +## Proposed design ### Creating & Storing Manifest Files @@ -53,7 +58,7 @@ minimal code changes and (b) ensures that the number of entries within each directory in the bucket under `zarr-manifest/` is not colossal, thereby avoiding tremendous resource usage by `dandidav`. -The manifest file shall be world-readable, unless the Zarr is embargoed or +**Embargo.** The manifest file shall be world-readable, unless the Zarr is embargoed or belongs to an embargoed Dandiset, in which case appropriate steps shall be taken to limit read access to the file. @@ -61,8 +66,7 @@ Manifest files shall also be generated for all Zarrs already in the Archive when this feature is first implemented. -Manifest File Format --------------------- +### Manifest File Format A Zarr manifest file is a JSON document consisting of a JSON object with the following fields: @@ -186,8 +190,9 @@ following fields: > the checksum calculated by the manifest-generation code -Archive API Changes -------------------- +### Archive Changes + +#### API Changes ***WIP*** @@ -217,4 +222,17 @@ Archive API Changes * Publishing Zarrs: Just ensure that the `zarr_version` in Zarr assets is frozen and that no entries/S3 object versions from the referenced version are ever deleted ? -* Does garbage collection of old Zarr versions need to be discussed? +#### Garbage collection + +* GC of Manifests: manifests older than X days (e.g. 30) can be deleted if not referenced by any Zarr asset (draft or published). +* GC of Manifests should trigger analysis/deletion of S3 objects based on their content: + * if it is the last manifest(s) to be removed for a zarr, the zarr asset and `/zarr/{zarr_id}/` "folder" should be removed as well (including all versions of all keys); + * upon deletion of a set of manifests for a `zarr_id`, collect key and versionId's referenced in those manifests but not in any other manifest for that Zarr, and delete those particular versions of those Keys from S3. 
If a key has no other versions, delete that key fully (do not keep a lonely `DeleteMarker`) + +### AWS Configuration Changes + +`zarr/` prefix must be excluded from "trailing delete". +This is necessary because a file within a Zarr could be deleted in a subsequent version while still being accessed via its VersionId in the previous one. +ATM there is no filter in [terraform/modules/dandiset_bucket/main.tf (expire_deleted_objects)](https://github.com/dandi/dandi-infrastructure/blob/master/terraform/modules/dandiset_bucket/main.tf#L310). + +### dandi-schema From 1e15c75a048c3d503778cc1a4223d676a3bfa29b Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Tue, 27 Aug 2024 15:17:56 -0400 Subject: [PATCH 08/11] More to Zarr versioning design: need some "ZarrVersion" or "Upload" id; can reuse metadata --- doc/design/zarr-manifests.md | 47 +++++++++++++++++++++++++++------- 1 file changed, 38 insertions(+), 9 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index 7ad3fa5c6..216844741 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -198,15 +198,43 @@ following fields: * Zarr version IDs equal the Zarr checksum -* Asset properties gain `zarr_version: str | null` field (absent or null if Zarr is not yet ingested or asset is not a Zarr) - - Not settable by client - - Mint new asset when version changes? - -* Add `zarr_version` field to …/assets/path/ results - +- `Zarr` model has `.checksum` + - (?) Not settable by client + - (?) When changes to a Zarr asset are initiated, `Zarr.checksum` is reset to None and stays None until the Zarr is finalized + - (?) A Zarr should be denied new changes while `Zarr.checksum` is already None, until it is finalized + - Make `/finalize` return the new Zarr checksum: + - might take a while, so we might want to return some upload ID to be able to re-request the checksum for a specific upload + - at this point we have not yet minted a new asset!
+ - **Alternative**: do establish ZarrVersion + - `many-to-many` between `zarr_id` and `zarr_version`. + - `/finalize` would return the new `zarr_version_id` + - **Alternatives**: + - PUT/PATCH/POST calls in the API expecting `zarr_id` should be changed to provide `zarr_version_id` instead + - We just add a `/zarr/{zarr_id}/{zarr_version_id}/` call which would return the `checksum` for that version. + +* Side discussion: computing a new Zarr version/checksum is relatively expensive. + It could be "cheap" if we rely on the prior manifest + changes (new files with checksums) or DELETEs. But it would require an 'fsck'-style re-check + and possibly "fixing" the version. Fragile, since there would be no state describing some prior state of the Zarr to "checksum" it. + +* To not change DB model, to not breed zarr specific DB model fields, rely on `metadata.digest.dandi:dandi-zarr-checksum` for Zarr checksum. + - Add `zarr_checksum` to `Zarr` model, but it must be just a convenience duplicate of the checksum in the metadata. But then some API responses would need to be adjusted to return this dedicated `zarr_checksum` in addition to the value in `metadata` + - We mint a new asset when metadata changes, so a new asset is produced when a metadata record with a new version of the Zarr (new checksum) is provided + - we verify that the checksum is consistent with the `checksum` of the zarr_id provided + - NOTE: this means we would not be able to re-use a versioned zarr from a released version! + +* …/assets/ results gain `zarr_checksum` + - they only optionally contain `metadata`; hence, we want to have `zarr_checksum` in the response + - Q: What is the "Version" int returned now for each asset? + Likely the internal DB Version.id; unclear why it is in the API response in such a form.
+* …/assets/paths/ -- no change since point to `asset_id` + +* …/assets/{asset_id}/download/ -- point to versioned version based on checksum in metadata + * `webdav.{archive_domain}/zarrs/{dir1}/{dir2}/{zarr_id}/{checksum}/` URLs + ([...redirect /download/ for zarrs to webdav](https://github.com/dandi/dandi-archive/issues/1993)) * Zarr `contentUrl`s: - Make API download URLs for Zarrs redirect to dandidav - - Replace S3 URLs with webdav.{archive_domain}/zarr/ URLs? + - Replace S3 URLs with `webdav.{archive_domain}/zarrs/{dir1}/{dir2}/{zarr_id}/{checksum}/` URLs + ([...redirect /download/ for zarrs to webdav](https://github.com/dandi/dandi-archive/issues/1993)) ? - Document needed changes to dandidav? - The bucket for the Archive instance will now be given on the command line (only required if a custom/non-default API URL is given) - The bucket's region will have to be looked up & stored before starting the webserver @@ -220,9 +248,10 @@ following fields: - The Zarr entry objects returned in `…/files/` responses (with & without `versions/{version_id}/`) will need to gain a `VersionId` field containing the S3 object version ID - Nothing under /zarr/versions/ is writable over the API -* Publishing Zarrs: Just ensure that the `zarr_version` in Zarr assets is frozen and that no entries/S3 object versions from the referenced version are ever deleted ? +* Publishing Dandisets with Zarrs: Just ensure that no entries/S3 object versions from the referenced version are ever deleted (see GC section below) + -#### Garbage collection +#### Garbage collection (GC) * GC of Manifests: manifests older than X days (e.g. 30) can be deleted if not referenced by any Zarr asset (draft or published). 
* GC of Manifests should trigger analysis/deletion of S3 objects based on their content: From d2658a4d6c6a06fb0e4a2659ff195bcfa833d6cd Mon Sep 17 00:00:00 2001 From: Yaroslav Halchenko Date: Wed, 28 Aug 2024 15:45:49 -0400 Subject: [PATCH 09/11] Aim to remove ZarrArchive.dandiset and related --- doc/design/zarr-manifests.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md index 216844741..fe606dae8 100644 --- a/doc/design/zarr-manifests.md +++ b/doc/design/zarr-manifests.md @@ -60,7 +60,8 @@ avoiding tremendous resource usage by `dandidav`. **Embargo.** The manifest file shall be world-readable, unless the Zarr is embargoed or belongs to an embargoed Dandiset, in which case appropriate steps shall be -taken to limit read access to the file. +taken to limit read access to the file. Related issues/aspects on zarrbargo: +- [? avoid dedicated EmbargoedZarrArchive](https://github.com/dandi/dandi-archive/issues/2003#issuecomment-2315718976) Manifest files shall also be generated for all Zarrs already in the Archive when this feature is first implemented. @@ -192,7 +193,7 @@ following fields: ### Archive Changes -#### API Changes +#### Model/API Changes ***WIP*** @@ -216,7 +217,7 @@ following fields: It could be "cheap" if we rely on prior manifest + changes (new files with checksums) or DELETEs. But it would require 'fsck' style re-check and possibly "fixing" the version. Fragile since there would be no state to describe some prior state of Zarr to "checksum" it. -* To not change DB model, to not breed zarr specific DB model fields, rely on `metadata.digest.dandi:dandi-zarr-checksum` for Zarr checksum. +* To not change DB model too much, to not breed zarr specific DB model fields, rely on `metadata.digest.dandi:dandi-zarr-checksum` for Zarr checksum. - Add `zarr_checksum` to `Zarr` model, but it must be just a convenience duplicate of the checksum in the metadata. 
   But then some return of the API would need to be adjusted to return this dedicated `zarr_checksum` in addition to the value in `metadata`
  - We mint a new asset when metadata changes, so a new asset is produced when a metadata record with a new version of Zarr (new checksum) is provided
    - we verify that the checksum is consistent with the `checksum` of the zarr_id provided
@@ -250,6 +251,10 @@ following fields:
 
 * Publishing Dandisets with Zarrs: Just ensure that no entries/S3 object versions from the referenced version are ever deleted (see GC section below)
 
+* Remove `.dandiset` attribute from [*ZarrArchive](https://github.com/dandi/dandi-archive/blob/HEAD/dandiapi/zarr/models.py#L101):
+  - It should be possible to associate a Zarr with multiple dandisets
+  - GC should take care of picking up stale Zarrs as it does Blobs
+  - Would remove `ingest_dandiset_zarrs` (seems to be just a service helper ATM anyways)
 
 #### Garbage collection (GC)

From a60a6a76d74fe8c0790694cb6fa193bc19931808 Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko
Date: Tue, 3 Sep 2024 14:51:01 -0400
Subject: [PATCH 10/11] Some more out-loud thinking etc. which was not committed

---
 doc/design/zarr-manifests.md | 40 ++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md
index fe606dae8..2ffb0b255 100644
--- a/doc/design/zarr-manifests.md
+++ b/doc/design/zarr-manifests.md
@@ -256,6 +256,46 @@ following fields:
   - GC should take care of picking up stale Zarrs as it does Blobs
   - Would remove `ingest_dandiset_zarrs` (seems to be just a service helper ATM anyways)
 
+* Remove `.name` attribute from `BaseZarrArchive`. zarr_id is the unique identifier for the mutable Zarr.
+
+#### Some out-loud thinking
+
+* `Asset` -- (largely) a CoW entry binding together *content* and metadata.
+* ATM *content* can be immutable `AssetBlob` (in `.blob`) or mutable `ZarrArchive` (in `.zarr`).
+* `blob_id` is a UUID (not a checksum), just a unique identifier for the **immutable** blob which is later assigned a computed `checksum`:
+  * storage on S3 is not "content-addressable" but after `blob_id`
+  * changes to the blob are not possible, but new blobs can be created
+  * Upload of a blob involves
+    * producing `upload_id` (and urls to use for upload)
+    * `/blobs/{upload_id}/complete/` endpoint to complete which returns `complete_url`
+    * also there is `/blobs/{upload_id}/validate/` to finally get `blob_id` and `etag` and trigger compute of sha256 checksum to be filled out later
+  * `blob_id` (thus pointing to immutable content) is provided to create a new `Asset`
+* `zarr_id` is a UUID for **mutable** content, with `.checksum` also being computed "async" by `/zarr/{zarr_id}/finalize`
+  * changes to a Zarr could be done, resulting in its `.checksum` being updated
+  * **there is no notion of `upload_id`** for Zarrs: multiple PUT/DELETE requests could be submitted in parallel (?).
+  * `/zarr/{zarr_id}/finalize` does not return anything
+* Although upload procedures differ significantly between blobs and zarrs, they could be "uniformized": upon completion, the **new** `_id` which identifies that particular (immutable) **content** is returned.
+* We use UUIDs for all the API-accessible `_id`s, so there **already** should be no overlaps between `blob_id` and `zarr_id`.
+  * In the model and API for interactions with Assets, we could use a generic **`content_id`** which would be some UUID resolvable to a `blob_id` or `zarr_id`.
+  * That would later allow extending into other types of content, possibly requiring different upload or download procedures, such as the hypothetical:
+    * `RemoteAssets` - blobs or Zarrs on other DANDI instances for which we provide interfaces to get "registered"
+    * ...
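A rough sketch of the blob-upload flow from the bullets above. The `/blobs/{upload_id}/complete/` and `/blobs/{upload_id}/validate/` endpoints come from the text; the initialization endpoint name, all payload shapes, and the stub values are assumptions for illustration only:

```python
# Hedged sketch: StubArchiveAPI stands in for the real Archive REST API;
# "/uploads/initialize/" and all response payloads are hypothetical.
from dataclasses import dataclass, field

@dataclass
class StubArchiveAPI:
    calls: list = field(default_factory=list)

    def post(self, path: str):
        self.calls.append(path)
        if path == "/uploads/initialize/":           # assumed endpoint name
            return {"upload_id": "u-123", "part_urls": ["https://s3.example/..."]}
        if path == "/blobs/u-123/complete/":
            return {"complete_url": "https://s3.example/complete"}
        if path == "/blobs/u-123/validate/":
            return {"blob_id": "b-456", "etag": "d41d8cd98f00b204e9800998ecf8427e"}
        raise ValueError(path)

def upload_blob(api) -> str:
    up = api.post("/uploads/initialize/")            # 1. obtain upload_id + part URLs
    api.post(f"/blobs/{up['upload_id']}/complete/")  # 2. finish the multipart upload
    done = api.post(f"/blobs/{up['upload_id']}/validate/")  # 3. mint blob_id, get etag
    return done["blob_id"]                           # 4. blob_id is then used to create an Asset

print(upload_blob(StubArchiveAPI()))  # b-456
```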
+* We could have an `AssetContent` model/table with `content_id` and `content_type` (blob, zarr, remote, …) and then `Asset` pointing to `content_id` (instead of separate `blob` and `zarr`), and maybe duplicate `content_type` for convenience (or just make the DBMS do the needed join).
+  * Yarik does not know an efficient DBMS way to orchestrate such linkage into multiple external tables, but there must be some design pattern.
+* **content** needs `size` and `etag` (or `checksum`)
+
+#### Some inconsistencies
+
+which we can either resolve and/or take advantage of (to avoid breaking the interface "in-place")
+
+- There is `Asset`
+- API has all endpoints in plural `/blobs/`, `/dandisets/`, `/assets/` but a singular `/zarr/`.
+  - We could add/use `/zarrs/` in parallel to the (being deprecated) `/zarr/`, e.g. for support of versioned Zarr operations
+- We have no `Blob` model -- `blob_id` for an `AssetBlob` (not just `Blob`)
+- We have no `Zarr` model -- `zarr_id` for a `ZarrArchive` (not just `Zarr`)
+  - We could come up with `AssetZarrArchive` for an **immutable** (version specific) `ZarrArchive`
+    - **note** we need a new dedicated `azarr_id` (for "Asset" zarr_id) or `vzarr_id` (for "Versioned" zarr_id) to distinguish from the mutable `zarr_id`.
 
 #### Garbage collection (GC)
 
 * GC of Manifests: manifests older than X days (e.g. 30) can be deleted if not referenced by any Zarr asset (draft or published).

From 0ea7dc19eaddda4238a05131a7e0b909bea1540d Mon Sep 17 00:00:00 2001
From: Yaroslav Halchenko
Date: Tue, 3 Sep 2024 15:23:32 -0400
Subject: [PATCH 11/11] moved updated out-loud thinking up before "changes needed"

---
 doc/design/zarr-manifests.md | 95 ++++++++++++++++++------------------
 1 file changed, 48 insertions(+), 47 deletions(-)

diff --git a/doc/design/zarr-manifests.md b/doc/design/zarr-manifests.md
index 2ffb0b255..bd42790ee 100644
--- a/doc/design/zarr-manifests.md
+++ b/doc/design/zarr-manifests.md
@@ -3,14 +3,14 @@
 This document specifies
 1.
[x] *Zarr manifest files*, each of which describes a Zarr
-in the Dandi Archive, including the Zarr's internal directory structure and
+in the DANDI Archive, including the Zarr's internal directory structure and
 details on all of the Zarr's *entries* (regular, non-directory files). The
 Dandi Archive is to automatically generate these files and serve them via S3.
 2. [ ] Changes needed to the DANDI Archive's API, DB Data Model, and internal logic.
 3. [ ] Changes needed to AWS (S3 in particular; likely TerraForm) configuration.
 4. [ ] Changes needed (if any) to dandischema.
 
-## Current prototype
+## Current prototype elements
 
 ### Creating manifest files
 
@@ -193,6 +193,43 @@ following fields:
 
 ### Archive Changes
 
+#### Some out-loud thinking
+
+* `Asset` -- (largely) a CoW entry binding together *content* and metadata.
+* ATM *content* can be immutable `AssetBlob` (in `.blob`) or mutable `ZarrArchive` (in `.zarr`).
+* `blob_id` is a UUID (not a checksum), just a unique identifier for the **immutable** blob which is later assigned a computed `checksum`:
+  * storage on S3 is not "content-addressable" but location is based on `blob_id`
+  * changes to the blob are not possible, but new blobs can be created
+  * Upload of a blob involves
+    * producing `upload_id` (and urls to use for upload; Q: could it have been `blob_id`?)
+    * `/blobs/{upload_id}/complete/` endpoint to complete which returns `complete_url`
+    * also there is `/blobs/{upload_id}/validate/` to finally get `blob_id` and `etag` and trigger compute of sha256 checksum to be filled out later
+  * `blob_id` (thus pointing to immutable content) is provided to create a new `Asset`
+* `zarr_id` is a UUID for **mutable** content, with `.checksum` also being computed "async" by `/zarr/{zarr_id}/finalize`
+  * changes to a Zarr could be done, resulting in its `.checksum` being updated
+  * **there is no notion of `upload_id`** for Zarrs: multiple PUT/DELETE requests could be submitted in parallel (?).
+  * `/zarr/{zarr_id}/finalize` does not return anything (could have returned some `vzarr_id`, see below)
+* Although upload procedures differ significantly between blobs and zarrs, they could be "uniformized": upon completion, the **new** `_id` which identifies that particular (immutable) **content** is returned.
+* We use UUIDs for all the API-accessible `_id`s, so there **already** should be no overlaps between `blob_id` and `zarr_id`.
+  * In the model and API for interactions with Assets, we could use a generic **`content_id`** which would be some UUID resolvable to a `blob_id` or `zarr_id`.
+  * That would later allow extending into other types of content, possibly requiring different upload or download procedures, such as the hypothetical:
+    * `RemoteAssets` - blobs or Zarrs on other DANDI instances for which we provide interfaces to get "registered". The "upload" procedure and underlying model would differ
+    * ...
+* We could have a `Content` model/table with `content_id` and `content_type` (blob, zarr, remote, …) and then have `Asset` point to `Content` (via `content_id`, instead of separate `blob` and `zarr`), and maybe duplicate `content_type` for convenience (or just make the DBMS do the needed join).
+  * Yarik does not know an efficient DBMS way to orchestrate such linkage into multiple external tables, but there must be some design pattern.
+* **content** (`blob` or `zarr`) should uniformly have `size` and some `etag` (or `checksum`)
+
+#### Some inconsistencies
+
+which we can either resolve and/or take advantage of (to avoid breaking the interface "in-place")
+
+- API has all endpoints in plural `/blobs/`, `/dandisets/`, `/assets/` but a singular `/zarr/`.
+  - We could add/use `/zarrs/` in parallel to the (being deprecated) `/zarr/`, e.g.
for support of versioned Zarr operations
+- We have no `Blob` model -- `blob_id` for an `AssetBlob` (not just `Blob`)
+- We have no `Zarr` model -- `zarr_id` for a `ZarrArchive` (not just `Zarr`)
+  - We could come up with `AssetZarrArchive` for an **immutable** (version specific) `ZarrArchive`
+    - **note** we need a new dedicated `azarr_id` (for "Asset" zarr_id) or `vzarr_id` (for "Versioned" zarr_id) to distinguish from the mutable `zarr_id`.
+
 #### Model/API Changes
 
 ***WIP***
 
@@ -201,17 +238,19 @@ following fields:
 
 - `Zarr` model has `.checksum`
   - (?) Not settable by client
+  - Zarr `.checksum` should not identify the zarr (we could have multiple zarrs which would "arrive" at the same checksum)
+    - We cannot/should not deduplicate based on Zarr checksum similarly to how we do for the blobs
+    - Zarrs are mutable, so even if we deduplicate, the user might not be able to update the Zarr etc.
   - (?) Upon changes to a zarr asset being initiated, `Zarr.checksum` is reset to None, which stays such until the Zarr is finalized
   - (?) Zarr should be denied new changes if `Zarr.checksum` is already None, and until it is finalized
-  - Make `/finalize` to return new Zarr checksum:
-    - might take awhile, so we might want to return some upload ID to be able to re-request checksum for specific upload
+  - Make `/finalize` return some `upload_id` or even a `vzarr_id` to be able to re-request the checksum for a specific upload
    - at this point we have not yet minted a new asset!
-  - **Alternative**: do establish ZarrVersion
-    - `many-to-many` between `zarr_id` and `zarr_version`.
-    - `/finalize` would return new `zarr_version_id`
+  - **Alternative**: do establish VersionedZarr (or ZarrVersion, `vzarr_id`)
+    - `many-to-many` between `zarr_id` and `vzarr_id`.
+    - `/finalize` would return the new `vzarr_id`
   - **Alternatives**:
-    - PUT/PATCH/POST calls in API expecting `zarr_id` should be changed to provide `zarr_version_id` instead
-    - We just add `/zarr/{zarr_id}/{zarr_version_id}/` call which would return `checksum` for that version.
+    - PUT/PATCH/POST calls in the API expecting `zarr_id` should be changed to provide `vzarr_id` instead
+    - We just add a `/zarr/{zarr_id}/{vzarr_id}/` call which would return the `checksum` for that version. (note: it could have been `/zarr/{vzarr_id}` since there is no overlap among ids, so maybe `/zarrs/{vzarr_id}` or `/vzarrs/{vzarr_id}`?)
 * Side discussion: new Zarr version/checksum compute is relatively expensive.
   It could be "cheap" if we rely on prior manifest + changes (new files with checksums) or DELETEs. But it would require 'fsck' style re-check
@@ -258,44 +297,6 @@ following fields:
 
 * Remove `.name` attribute from `BaseZarrArchive`. zarr_id is the unique identifier for the mutable Zarr.
 
-#### Some out-loud thinking
-
-* `Asset` -- (largely) a CoW entry binding together *content* and metadata.
-* ATM *content* can be immutable `AssetBlob` (in `.blob`) or mutable `ZarrArchive` (in `.zarr`).
-* `blob_id` is a UUID (not a checksum), just a unique identifier for the **immutable** blob which is later assigned a computed `checksum`:
-  * storage on S3 is not "content-addressable" but after `blob_id`
-  * changes to the blob are not possible, but new blobs can be created
-  * Upload of a blob involves
-    * producing `upload_id` (and urls to use for upload)
-    * `/blobs/{upload_id}/complete/` endpoint to complete which returns `complete_url`
-    * also there is `/blobs/{upload_id}/validate/` to finally get `blob_id` and `etag` and trigger compute of sha256 checksum to be filled out later
-  * `blob_id` (thus pointing to immutable content) is provided to create a new `Asset`
-* `zarr_id` is a UUID for **mutable** content, with `.checksum` also being computed "async" by `/zarr/{zarr_id}/finalize`
-  * changes to a Zarr could be done, resulting in its `.checksum` being updated
-  * **there is no notion of `upload_id`** for Zarrs: multiple PUT/DELETE requests could be submitted in parallel (?).
-  * `/zarr/{zarr_id}/finalize` does not return anything
-* Although upload procedures differ significantly between blobs and zarrs, they could be "uniformized": upon completion, the **new** `_id` which identifies that particular (immutable) **content** is returned.
-* We use UUIDs for all the API-accessible `_id`s, so there **already** should be no overlaps between `blob_id` and `zarr_id`.
-  * In the model and API for interactions with Assets, we could use a generic **`content_id`** which would be some UUID resolvable to a `blob_id` or `zarr_id`.
-  * That would later allow extending into other types of content, possibly requiring different upload or download procedures, such as the hypothetical:
-    * `RemoteAssets` - blobs or Zarrs on other DANDI instances for which we provide interfaces to get "registered"
-    * ...
-* We could have an `AssetContent` model/table with `content_id` and `content_type` (blob, zarr, remote, …) and then `Asset` pointing to `content_id` (instead of separate `blob` and `zarr`), and maybe duplicate `content_type` for convenience (or just make the DBMS do the needed join).
-  * Yarik does not know an efficient DBMS way to orchestrate such linkage into multiple external tables, but there must be some design pattern.
-* **content** needs `size` and `etag` (or `checksum`)
-
-#### Some inconsistencies
-
-which we can either resolve and/or take advantage of (to avoid breaking the interface "in-place")
-
-- There is `Asset`
-- API has all endpoints in plural `/blobs/`, `/dandisets/`, `/assets/` but a singular `/zarr/`.
-  - We could add/use `/zarrs/` in parallel to the (being deprecated) `/zarr/`, e.g. for support of versioned Zarr operations
-- We have no `Blob` model -- `blob_id` for an `AssetBlob` (not just `Blob`)
-- We have no `Zarr` model -- `zarr_id` for a `ZarrArchive` (not just `Zarr`)
-  - We could come up with `AssetZarrArchive` for an **immutable** (version specific) `ZarrArchive`
-    - **note** we need a new dedicated `azarr_id` (for "Asset" zarr_id) or `vzarr_id` (for "Versioned" zarr_id) to distinguish from the mutable `zarr_id`.
 
 #### Garbage collection (GC)
 
 * GC of Manifests: manifests older than X days (e.g. 30) can be deleted if not referenced by any Zarr asset (draft or published).
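For illustration, a small sketch of consuming a manifest file per the format specified earlier in this document (`fields` names the positions of each entry's array; `entries` is a tree). The sample data and the encoding of directories as nested JSON objects are assumptions, not normative:

```python
import json

# Illustrative sample only, not a real manifest; directories are assumed to be
# JSON objects and entries JSON arrays aligned positionally with `fields`.
sample = json.loads("""
{
  "fields": ["versionId", "lastModified", "size", "ETag"],
  "statistics": {"entries": 2, "depth": 1},
  "entries": {
    ".zgroup": ["VER1", "2024-03-18T15:46:43+00:00", 24,
                "e20297935e73dd0154104d4ea53040ab"],
    "0": {"0": ["VER2", "2024-03-18T15:46:44+00:00", 48,
                "5e2b2c9b1f3a0d9c8b7a6f5e4d3c2b1a"]}
  }
}
""")

def entry_records(tree, fields, prefix=""):
    """Yield (path, {field: value}) pairs for every entry in the tree."""
    for name, node in sorted(tree.items()):
        if isinstance(node, dict):                        # a directory: recurse
            yield from entry_records(node, fields, prefix=f"{prefix}{name}/")
        else:                                             # an entry row
            yield f"{prefix}{name}", dict(zip(fields, node))

records = dict(entry_records(sample["entries"], sample["fields"]))
assert len(records) == sample["statistics"]["entries"]
print(records["0/0"]["ETag"])  # 5e2b2c9b1f3a0d9c8b7a6f5e4d3c2b1a
```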