
Conversation

rescrv
Contributor

@rescrv rescrv commented Aug 28, 2025

Description of changes

Amazon periodically tells us to slow down and that manifests as a
gruesome error trace. Make a specific error type so callers can handle
the error.
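
For callers, the new variant makes throttling actionable. Below is a minimal sketch of caller-side handling, assuming a storage.get(key) method returning Result<Vec<u8>, StorageError> and tokio for async sleeps; both are assumptions for illustration, not part of this PR.

use std::time::Duration;

// Hypothetical caller-side retry: back off only when S3 explicitly asks us to.
async fn get_with_backoff(storage: &Storage, key: &str) -> Result<Vec<u8>, StorageError> {
    let mut delay = Duration::from_millis(100);
    for _ in 0..5 {
        match storage.get(key).await {
            Err(StorageError::Backoff) => {
                // S3 returned SlowDown; wait and retry with exponential backoff.
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            other => return other,
        }
    }
    Err(StorageError::Backoff)
}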

Test plan

CI

Migration plan

N/A

Observability plan

N/A

Documentation Changes

N/A


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Contributor

propel-code-bot bot commented Aug 28, 2025

Introduce StorageError::Backoff for S3 SlowDown Errors

This pull request introduces a new StorageError::Backoff enum variant to explicitly represent Amazon S3 'SlowDown' (HTTP 429) responses encountered during storage operations. The 'SlowDown' error returned by S3 is now consistently converted into the Backoff error, allowing callers to identify rate limiting scenarios and handle them appropriately, rather than treating these as generic unexpected errors. The PR modifies relevant S3 client methods to implement the new error mapping and updates the error code logic accordingly.

Key Changes

• Added a new StorageError::Backoff variant with error mapping to ErrorCodes::ResourceExhausted (a sketch of this mapping follows this list).
• Updated s3.rs to detect SlowDown codes in multiple S3 operations (get_object, fetch_range, delete_object, delete_many, oneshot_upload) and return StorageError::Backoff instead of a generic error.
• Made minor logic fixes in error branching related to PreconditionFailed and SlowDown.
• Expanded the error mapping and code comments to clarify the semantics of the new error.
• Updated lib.rs to declare the new StorageError variant and code mapping.
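
For reference, a minimal sketch of what that mapping might look like in rust/storage/src/lib.rs, assuming the existing ChromaError trait with a code() method; the surrounding impl and other match arms are illustrative, not copied from the diff:

impl ChromaError for StorageError {
    fn code(&self) -> ErrorCodes {
        match self {
            // ...existing variants elided...
            StorageError::Backoff => ErrorCodes::ResourceExhausted,
            _ => ErrorCodes::Internal,
        }
    }
}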

Affected Areas

• rust/storage/src/s3.rs (S3 storage error handling, especially for request throttling/rate limiting)
• rust/storage/src/lib.rs (StorageError enum definition and error code mapping)

This summary was automatically generated by @propel-code-bot

Comment on lines +149 to +151
// Back off and retry---usually indicates an explicit 429/SlowDown.
#[error("Back off and retry---usually indicates an explicit 429/SlowDown.")]
Backoff,
Contributor


[Documentation]

The new StorageError::Backoff variant lacks documentation. Consider adding a doc comment to clarify when this error should be used and how callers should handle it:

Suggested change
// Back off and retry---usually indicates an explicit 429/SlowDown.
#[error("Back off and retry---usually indicates an explicit 429/SlowDown.")]
Backoff,
/// Amazon S3 requests that we back off and retry, typically due to rate limiting (429/SlowDown)
#[error("Back off and retry---usually indicates an explicit 429/SlowDown.")]
Backoff,


Comment on lines 613 to 614
} else if err.meta().code() == Some("PreconditionFailed") {
StorageError::Backoff
Contributor


[CriticalError]

There's a logical error in the oneshot_upload method. Line 613 checks for "PreconditionFailed" twice, but the second check (line 613-614) should be checking for "SlowDown" to match the intended behavior:

Suggested change
} else if err.meta().code() == Some("PreconditionFailed") {
StorageError::Backoff
} else if err.meta().code() == Some("SlowDown") {
StorageError::Backoff


})
}
_ => {
Err(e) => match e {
Contributor


[BestPractice]

Great job adding the Backoff error for S3's SlowDown responses. It looks like this check has been added to get, put, delete, and delete_many.

However, to make the change complete, the same check should be applied to other S3 operations where this error could occur. I've identified a few places where it seems to be missing:

  • copy()
  • prepare_multipart_upload()
  • upload_part()
  • finish_multipart_upload()

Since rename() uses copy(), it would also benefit from this change. Applying the SlowDown check consistently across all S3 client calls will make the error handling more robust and fully achieve the goal of this PR.
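
For illustration, applying the same mapping inside copy()'s error handling might look roughly like this; the match structure and variable names are assumptions modeled on the hunks quoted elsewhere in this review, not code from the diff:

Err(e) => match e {
    SdkError::ServiceError(err) => {
        let inner = err.into_err();
        if inner.code() == Some("SlowDown") {
            // S3 asked us to back off; surface it as the dedicated variant.
            Err(StorageError::Backoff)
        } else {
            tracing::error!("error: {}", inner.to_string());
            Err(StorageError::Generic {
                source: Arc::new(inner),
            })
        }
    }
    _ => Err(StorageError::Generic { source: Arc::new(e) }),
},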

Comment on lines +259 to 268
_ => {
if inner.code() == Some("SlowDown") {
Err(StorageError::Backoff)
} else {
tracing::error!("error: {}", inner.to_string());
Err(StorageError::Generic {
source: Arc::new(inner),
})
}
}
Contributor


[BestPractice]

The logic to check for SlowDown errors is duplicated across multiple functions (get_stream_and_e_tag, fetch_range, oneshot_upload, etc.). To keep the code DRY and ensure consistency, consider extracting this into a private helper function.

For example, you could add this helper to s3.rs:

fn check_common_s3_errors(err: &dyn ProvideErrorMetadata) -> Option<StorageError> {
    if err.code() == Some("SlowDown") {
        return Some(StorageError::Backoff);
    }
    // Add other common checks here in the future
    None
}

Then you could simplify the error handling blocks like this one:

Suggested change
_ => {
if inner.code() == Some("SlowDown") {
Err(StorageError::Backoff)
} else {
tracing::error!("error: {}", inner.to_string());
Err(StorageError::Generic {
source: Arc::new(inner),
})
}
}
_ => {
if let Some(storage_err) = check_common_s3_errors(&inner) {
Err(storage_err)
} else {
tracing::error!("error: {}", inner.to_string());
Err(StorageError::Generic {
source: Arc::new(inner),
})
}
}

This would make the code cleaner and easier to maintain.


@rescrv rescrv requested a review from jasonvigil August 29, 2025 22:21
Comment on lines +357 to +363
if inner.code() == Some("SlowDown") {
Err(StorageError::Backoff)
} else {
Err(StorageError::Generic {
source: Arc::new(inner),
})
}
Contributor


[BestPractice]

To improve observability and for consistency with the error handling in get_stream_and_e_tag, consider adding an error log in this generic fallback case. While the outer SdkError is logged, logging the specific inner error here provides more direct insight into unexpected S3 issues from this path.
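
Concretely, the suggestion would make this fallback mirror the earlier hunk: check for SlowDown, and log the inner error before wrapping it. A sketch combining the two quoted snippets (not committed code):

if inner.code() == Some("SlowDown") {
    Err(StorageError::Backoff)
} else {
    // Log the inner service error so unexpected S3 failures from this path are visible.
    tracing::error!("error: {}", inner.to_string());
    Err(StorageError::Generic {
        source: Arc::new(inner),
    })
}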

@blacksmith-sh blacksmith-sh bot deleted a comment from rescrv Sep 4, 2025
@rescrv rescrv merged commit bb62e31 into main Sep 4, 2025
114 of 117 checks passed
@rescrv rescrv deleted the rescrv/slow-down branch September 4, 2025 20:34