
Conversation

mvandenburgh
Member

Outlines a design for Asset garbage collection. Implementing this will also unblock garbage collection from being run in general, so the idea is that we will begin running the collect_garbage script for assets/uploads/asset blobs after this feature lands.

@mvandenburgh
Member Author

@satra @yarikoptic PTAL and let me know if you have any questions or objections to this. I've also included the current counts of garbage-collectible assets and their total file sizes; it looks like we will be able to clean up a significant amount of stale data (almost 50%!).

@yarikoptic yarikoptic added the design-doc Involves creating or discussing a design document label Jun 5, 2025
@mvandenburgh mvandenburgh added the documentation Changes only affect the documentation label Jun 19, 2025
@mvandenburgh mvandenburgh requested a review from satra July 29, 2025 17:33
@mvandenburgh
Member Author

@satra @yarikoptic do you have any other questions or objections to this design?

@satra
Member

satra commented Aug 5, 2025

overall this looks good to me. perhaps the one thing i would add to the manual process, to start with, is a check after it's run to ensure that every asset/blob of a published dandiset cannot be deleted. this would be equivalent to termination protection on an aws ec2 instance.

@yarikoptic
Member

yarikoptic commented Aug 5, 2025

> ensure that every asset/blob of a published dandiset cannot be deleted.

do you mean, @satra, that similarly to our tagging of embargoed data/keys to prevent any non-authenticated access, we would add another tag, e.g. read-only, and adjust the S3 rules to Deny all Delete operations on keys with such a tag in https://github.com/dandi/dandi-infrastructure/blob/6287b22cb7d13ff12c6772b41ac41991eb9c9f7a/terraform/modules/dandiset_bucket/main.tf#L175 ? FWIW if so -- sounds good to me. Besides that, it might be yet another check to add to an eventual fsck-style set of commands to verify congruence between the metadata in the DB and the state of S3 (e.g. now it should raise errors for all blobs associated with released assets, and potentially fix that if we add a fix mode).

edit1: this might be really nice for paranoids like me, so we could at least have some assurance that a bug would not wipe out all data -- only unreleased ;)

edit2: I guess the above is "out of scope" for this particular design PR since it deals with blobs, not assets, and for blobs we already have https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-uploads-asset-blobs-2.md , so we would need to work out something separate on top of that if we decide it is worth doing.

edit3: actually, it is very much "in scope", since GC of Blobs is virtually ineffective (besides some abandoned uploads) until we implement this design for GC of assets. So if we decide to be paranoid, implementing tagging/protection of blobs before we implement/deploy GC of assets would be the correct sequence of actions.
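
A minimal sketch of the tag-based protection discussed above, assuming a hypothetical `read-only` object tag (the production rule would live in the dandi-infrastructure Terraform linked above, and whether S3's `ExistingObjectTag` condition key covers delete operations would need to be verified):

```python
import json

import boto3

# Hypothetical bucket name; shown with boto3 only for illustration --
# the real policy would be managed via Terraform, not applied ad hoc.
BUCKET = "dandiarchive"

# Deny statement: refuse deletion of any object carrying the read-only tag,
# analogous to EC2 termination protection for published assets/blobs.
deny_delete = {
    "Sid": "DenyDeleteOfReadOnlyObjects",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": f"arn:aws:s3:::{BUCKET}/*",
    "Condition": {"StringEquals": {"s3:ExistingObjectTag/read-only": "true"}},
}

s3 = boto3.client("s3")
policy = json.loads(s3.get_bucket_policy(Bucket=BUCKET)["Policy"])
policy["Statement"].append(deny_delete)
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```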


@yarikoptic yarikoptic left a comment


Mostly comments/questions with a single "typo" to fix

It’s important to precisely define what it means for an asset to be eligible for garbage collection (an “orphaned asset”).
An orphaned asset has the following properties:
- The asset is not currently associated with any Versions
- The asset is older than 30 days

Suggested change:
old: - The asset is older than 30 days
new: - The asset was not modified for more than 30 days

or

Suggested change:
old: - The asset is older than 30 days
new: - The asset is older than 30 days based on `modified` datetime-stamp

as opposed to `created`, since we do have operations which modify assets in place, and `modified` does differ from `created`


Actually, I think that is also wrong. I believe age or last-modified time has nothing to do with orphaned status: orphaned assets are precisely those that are not part of any Version; the 30-day delay in permanently deleting the S3 object is already covered by the bucket lifecycle policy.

Suggested change (delete this line):
- The asset is older than 30 days
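
For concreteness, a minimal sketch of the resulting eligibility check ("not part of any Version") as a Django ORM query, assuming the reverse many-to-many relation from Asset to Version is named `versions` (an assumption mirroring dandi-archive's models; a sketch, not the proposed implementation):

```python
from dandiapi.api.models import Asset

def orphaned_assets():
    """Assets not referenced by any Version and hence eligible for GC."""
    # An asset with no related versions is orphaned; the 30-day window
    # before the underlying S3 object is permanently gone is handled by
    # the bucket lifecycle policy, not by this query.
    return Asset.objects.filter(versions__isnull=True)
```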


## Integration with existing garbage collection routines

Garbage collection routines for Asset Blobs and Uploads are already implemented and deployed (though not run yet). Currently, the number of Asset Blobs and Uploads that need to be garbage collected is not substantial. However, the act of running the Asset garbage collection routine will result in a significant amount of Asset Blobs being orphaned, which can then in turn be collected by the Asset Blob garbage collection routine.

interesting... so we do not even know on staging if blobs/uploads GC works?

why not deploy them in stages and enable the first one already, so we could decouple the two changes and have more time to observe the first-stage GC in action?


This is pretty much how we plan to roll this out. First, run the blob GC on sandbox and make sure it works as expected. Next, after asset GC is implemented and merged (as described in this design doc), we can run asset GC on sandbox, verify that it orphans the expected asset blobs, then run the blob GC again to make sure this works "end to end".

We can add a section to the design doc that describes this approach.


We will introduce the feature initially by making it manually runnable via `manage.py collect_garbage`. This script [already exists](https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/management/commands/collect_garbage.py) and currently supports asset blob and upload garbage collection, so it’s just a matter of integrating the `garbage_collection.asset` module into it once it’s implemented. Additionally, the script supports a dry-run mode in which it logs the number of items to be garbage collected without actually performing the delete operations. Once we’ve run it in production a few times and we’re ready, we’ll automate it with a cron via a Celery Beat task that calls the relevant service functions.
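
A rough sketch of what that integration could look like; the module path, `get_queryset()`, and `garbage_collect()` below are assumed names that simply mirror the existing asset blob and upload services, not a confirmed API:

```python
# Hypothetical addition to collect_garbage.py; all names for the asset GC
# service are assumptions mirroring the existing asset_blob/upload services.
from dandiapi.api.services.garbage_collection import asset

def collect_asset_garbage(*, dry_run: bool) -> None:
    if dry_run:
        # Dry-run mode: report what would be deleted, perform no deletes.
        print(f'Would delete {asset.get_queryset().count()} orphaned assets')
    else:
        asset.garbage_collect()
```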

As is the case with asset blobs and uploads, an orphaned asset will remain recoverable by DANDI Archive admins for 30 days after it is garbage collected.

will there be some interface and test implemented to verify that such "recovery" is actually possible? or otherwise - details of how recovery would be done?


The GarbageCollectionEventRecord table contains a JSON-serialized snapshot of the records being deleted, so we're keeping the maximum amount of information possible, which means it should be possible to recreate the deleted record. Currently, we're thinking of this as a manual intervention in case someone wants to "undelete" something, so we don't have immediate plans to create such an interface.

We will add a plan to carry out such a manual test to the design doc.
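
A sketch of what such a manual recovery could look like, assuming the snapshot is stored in Django's standard serialization format in a field named `metadata` (both the field name and the format are assumptions here):

```python
import json

from django.core import serializers

# Model name taken from the discussion above; field layout is assumed.
from dandiapi.api.models import GarbageCollectionEventRecord

def undelete(record_id: int) -> None:
    record = GarbageCollectionEventRecord.objects.get(id=record_id)
    # Assumed: `metadata` holds one Django-serialized snapshot of the
    # deleted row ({"model": ..., "pk": ..., "fields": {...}}).
    for obj in serializers.deserialize('json', json.dumps([record.metadata])):
        obj.save()  # re-inserts the deleted row under its original primary key
```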

@satra
Member

satra commented Aug 5, 2025

> FWIW if so -- sounds good to me.

@yarikoptic - yes, that would be the plan to put a tag "published" and deny removing any object that has such a tag.

@mvandenburgh
Member Author

> FWIW if so -- sounds good to me.
>
> @yarikoptic - yes, that would be the plan to put a tag "published" and deny removing any object that has such a tag.

There's an even simpler solution, which would be to use S3's built-in "legal holds" to prevent deletion of those blobs - see #2501
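
For reference, a minimal sketch of applying such a legal hold with boto3; note that this requires S3 Object Lock to be enabled on the bucket, which is an additional assumption here (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/key; Object Lock must be enabled on the bucket.
s3.put_object_legal_hold(
    Bucket="dandiarchive",
    Key="blobs/abc/def/abcdef12345",
    LegalHold={"Status": "ON"},  # "OFF" would release the hold again
)
```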
