-
Notifications
You must be signed in to change notification settings - Fork 316
feat: add File.to_tempfile method and optimize range requests #5226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR adds a new to_tempfile()
method to the Daft File class that creates a temporary file with the contents of the original file. This enables better integration with external libraries (like Docling for document processing and Whisper for audio transcription) that require actual file paths rather than file-like objects.
The implementation includes smart optimization logic - it uses shutil.copyfileobj()
for sources that support range requests and falls back to file.read()
for sources that don't. The method consumes the original file by closing it after copying, preventing resource leaks.
To support this functionality, the PR adds a new source_type()
method to the ObjectSource
trait across all storage backend implementations (HTTP, S3, GCS, local files, Unity, Hugging Face). This allows the system to identify different source types and handle them appropriately.
The most significant internal change is in the HTTP file handling logic, which now includes sophisticated range request optimization with fallback caching. When HTTP servers don't support range requests (returning 416 errors), the implementation gracefully falls back to downloading and caching the entire file, then serves subsequent reads from the cache.
Confidence score: 3/5
- This PR introduces complex HTTP caching logic that could cause issues with concurrent access or memory usage for large files
- Score reflects concerns about the range request detection method potentially having side effects and the significant behavioral changes in ObjectSourceReader
- Pay close attention to
src/daft-file/src/python.rs
for the HTTP caching implementation and range request handling
10 files reviewed, 2 comments
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5226 +/- ##
==========================================
+ Coverage 72.82% 74.29% +1.46%
==========================================
Files 972 973 +1
Lines 125913 125875 -38
==========================================
+ Hits 91701 93513 +1812
+ Misses 34212 32362 -1850
🚀 New features to boost your workflow:
|
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…/file-tempfile
…Daft into cory/file-tempfile
// If we already know range requests aren't supported, read full content | ||
if self.supports_range == Some(false) { | ||
// Read entire file and cache it | ||
let content = self.read_full_content()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
admittedly, I'm a bit on the fence about this. I think it could cause some unexpected memory usage. But it does make the api easier to use.
I've been thinking about if this should be configurable, or opt-in/out.
something like
daft.set_execution_config(file_on_unsupported_range_request="download" | "error")
daft.set_execution_config(file_on_unsupported_range_request_max_download_size=1024 * 50) # 50MB max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just one small thing
|
||
rt.block_within_async_context(async move { | ||
source.supports_range(&uri).await.map_err(DaftError::from) | ||
})?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a little hacky to hardcode it for these two source types. Is there a way to move this logic to the individual ObjectSource
implementations themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was initially thinking that as well, but that also felt kinda hacky as then we need to do downcast_ref()
and also add a source_type() -> SourceType
method. I actually had that solution coded out in an earlier revision and this one felt slightly less hacky to me.
FWIW, I think technically some "s3like" apis could also return false here, depending on the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this could serve as an alternative to #5188. My pr focuses solely on usage within daft.file, but we could expand the usage of this for gracefully handling elsewhere when those pesky 416's pop up.
Changes Made
Adds a new
.to_tempfile()
on daft.file.Since many apis don't work with readable objects, but expect literal file paths, This allows us better integrations with these tools.
such as docling
or whisper
Notes for reviewers.
I also had to add some internal buffering for http backed files. Previously it was erroring if you attempted to do a range request and that server didnt support them (
416
). So instead, we now try to do a range request, if we get the416
then we instead buffer the entire data.Related Issues
Checklist
docs/mkdocs.yml
navigation