Add Asset garbage collection design doc #2367
base: master
Conversation
@satra @yarikoptic PTAL and let me know if you have any questions or objections to this. I've also included the current counts of garbage-collectible assets and their total file sizes; it looks like we will be able to clean up a significant amount of stale data (almost 50%!).
Force-pushed from 12fa1de to 15465e3
@satra @yarikoptic do you have any other questions or objections about this design?
overall this looks good to me. perhaps the one thing i would add to the manual process, to start with, is a check after it's run to ensure that every asset/blob of a published dandiset cannot be deleted. this would be equivalent to termination protection on an aws ec2 instance.
@satra do you mean that, similarly to our tagging of embargoed data/keys that prevents any non-authenticated access, we would add another tag, e.g. `published`?

edit1: this might be really nice for paranoids like me, so we could at least have some assurance that a bug would not wipe out all data -- only unreleased ;)

edit2: I guess the above is "out of scope" for this particular design PR, since it deals with blobs, not assets, and for blobs we already have https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-uploads-asset-blobs-2.md , so we would need to work out something separate on top of that if we decide it is worth doing.

edit3: actually, it is very much "in scope", since GC of blobs is virtually ineffective (besides collecting some abandoned uploads) until we implement this design for GC of assets. So if we decide to be paranoid, implementing tagging/protection of blobs before we implement/deploy GC of assets would be the correct sequence of actions.
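For concreteness, a minimal sketch of the tagging half of this idea, assuming boto3; the bucket and key names are illustrative, and the actual protection would come from a bucket or IAM policy denying deletion of anything carrying the tag:

```python
import boto3

# Hedged sketch: tag a blob belonging to a published dandiset so that a
# bucket/IAM policy can deny deletion of anything carrying this tag.
# Bucket and key names are illustrative, not the actual layout.
s3 = boto3.client("s3")
s3.put_object_tagging(
    Bucket="dandi-example-bucket",
    Key="blobs/abc/def/abcdef1234567890",
    Tagging={"TagSet": [{"Key": "published", "Value": "true"}]},
)
```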
Mostly comments/questions with a single "typo" to fix
It’s important to precisely define what it means for an asset to be eligible for garbage collection (an “orphaned asset”).
An orphaned asset has the following properties:
- The asset is not currently associated with any Versions
- The asset is older than 30 days
Suggested change: replace "- The asset is older than 30 days" with "- The asset was not modified for more than 30 days"
or
Suggested change: replace "- The asset is older than 30 days" with "- The asset is older than 30 days based on `modified` datetime-stamp"
as opposed to `created`, since we do have operations which modify assets in place, and `modified` does differ from `created`
Actually, I think that is also wrong. I believe age or last-modified time has nothing to do with orphaned status: orphaned assets are precisely those that are not part of any Version; the 30-day delay in permanently deleting the S3 object is already covered by the bucket lifecycle policy.
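A minimal sketch of that corrected definition, assuming Django ORM access to the `Asset` model with a reverse `versions` relation (the relation name is an assumption):

```python
from dandiapi.api.models import Asset

# Orphaned assets are exactly those attached to no Version; no age or
# modified-time cutoff is applied here, since the 30-day grace period
# before permanent S3 deletion is handled by the bucket lifecycle policy.
orphaned_assets = Asset.objects.filter(versions__isnull=True)
```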
## Integration with existing garbage collection routines

Garbage collection routines for Asset Blobs and Uploads are already implemented and deployed (though not yet run). Currently, the number of Asset Blobs and Uploads that need to be garbage collected is not substantial. However, running the Asset garbage collection routine will orphan a significant number of Asset Blobs, which can then in turn be collected by the Asset Blob garbage collection routine.
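A hypothetical sketch of that interplay (module and function names are illustrative, not the actual dandi-archive API): asset GC runs first and orphans blobs, which a subsequent blob GC pass can then collect.

```python
# Hypothetical names; the real service layer may be organized differently.
from dandiapi.api.services.garbage_collection import asset, asset_blob, upload


def collect_all_garbage() -> None:
    asset.garbage_collect()       # orphans the blobs of deleted assets...
    asset_blob.garbage_collect()  # ...which this pass can then collect
    upload.garbage_collect()
```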
interesting... so we do not even know on staging whether the blobs/uploads GC works?

why not deploy them in stages, enabling the first one already, so that we could decouple the two changes and observe the first-stage GC in action for longer?
This is pretty much how we plan to roll this out. First, run the blob GC on sandbox, and make sure it works as expected. Next, after asset GC is implemented and merged (as described in this design doc), we can run asset GC on sandbox, watch that it exposes the expected asset blobs, then run the blob GC again to make sure this works "end to end".
We can add a section to the design doc that describes this approach.
We will introduce the feature initially by making it manually runnable via `manage.py collect_garbage`. This script [already exists](https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/management/commands/collect_garbage.py) and currently supports asset blob and upload garbage collection, so it’s just a matter of integrating the `garbage_collection.asset` module into it once it’s implemented. Additionally, the script supports a dry-run mode where it will log the number of items to be garbage collected but not actually perform the delete operations. Once we’ve run it in production a few times and we’re ready, we’ll automate it with a cron via a Celery Beat task that calls the relevant service functions.
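A hedged sketch of what that integration might look like inside the management command, assuming djclick (flag and service-function names are assumptions, mirroring how the existing blob/upload collection could be wired up):

```python
import djclick as click

# Hypothetical module from this design doc; `get_queryset`/`garbage_collect`
# are assumed helpers, not confirmed dandi-archive APIs.
from dandiapi.api.services.garbage_collection import asset


@click.command()
@click.option("--delete", is_flag=True, help="Actually delete; default is a dry run")
def collect_garbage(delete: bool) -> None:
    """Garbage collect orphaned assets (sketch; blob/upload handling elided)."""
    count = asset.get_queryset().count()
    if delete:
        asset.garbage_collect()
        click.echo(f"Deleted {count} orphaned assets")
    else:
        click.echo(f"Dry run: would delete {count} orphaned assets")
```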
As is the case with asset blobs and uploads, an orphaned asset will remain recoverable by DANDI Archive admins for 30 days after it is garbage collected.
will there be some interface and test implemented to verify that such "recovery" is actually possible? or, otherwise, details of how the recovery would be done?
The `GarbageCollectionEventRecord` table contains a JSON-serialized snapshot of the records being deleted, so we're keeping the maximum amount of information possible, which means it should be possible to recreate the deleted record. Currently, we're thinking of this as a manual intervention in case someone wants to "undelete" something, so we don't have immediate plans to create such an interface.

We will add a plan to carry out such a manual test to the design doc.
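For illustration, a hedged sketch of what that manual recovery might look like, assuming the snapshot is stored in a `metadata` JSON field in Django's serializer format (the field name and snapshot shape are guesses, not the actual schema):

```python
import json

from django.core import serializers

from dandiapi.api.models import GarbageCollectionEventRecord


def undelete(record: GarbageCollectionEventRecord) -> None:
    # `metadata` is assumed to hold one Django-serializer-style snapshot,
    # i.e. {"model": "...", "pk": ..., "fields": {...}}; deserializing and
    # saving recreates the row with its original primary key.
    for deserialized in serializers.deserialize("json", json.dumps([record.metadata])):
        deserialized.save()
```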
@yarikoptic - yes, that would be the plan: put a "published" tag on those objects and deny removing any object that has such a tag.
There's an even simpler solution, which would be to use S3's built-in "legal holds" to prevent deletion of those blobs - see #2501
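A minimal sketch of that alternative, assuming S3 Object Lock is enabled on the bucket (a prerequisite for legal holds) and illustrative bucket/key names:

```python
import boto3

s3 = boto3.client("s3")
# Place a legal hold on a published blob; S3 then refuses to delete the
# object until the hold is released with LegalHold={"Status": "OFF"}.
s3.put_object_legal_hold(
    Bucket="dandi-example-bucket",
    Key="blobs/abc/def/abcdef1234567890",
    LegalHold={"Status": "ON"},
)
```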
Co-authored-by: Yaroslav Halchenko <[email protected]>
Co-authored-by: Roni Choudhury <[email protected]>
Outlines a design for Asset garbage collection. Implementing this will also unblock garbage collection from being run in general, so the idea is that we will begin running the `collect_garbage` script for assets/uploads/asset blobs after this feature lands.