universalmind303
Contributor

Changes Made

Adds a new .to_tempfile() method on daft.File.

Many APIs don't accept readable file-like objects and instead expect literal file paths. This method allows better integration with such tools, for example Docling:

import daft
from daft import functions as F  # assumed import for the F alias used below
from docling.document_converter import DocumentConverter

@daft.func
def process_document(doc: daft.File) -> str:
    with doc.to_tempfile() as temp_file:
        converter = DocumentConverter()
        result = converter.convert(temp_file.name)
    return result.document.export_to_text()

df.select(process_document(F.file(df["url"]))).collect()

or Whisper:

import daft
import whisper
from daft import DataType as dt  # assumed alias for the dt name used below

@daft.func(return_dtype=dt.list(dt.struct({
    "text": dt.string(),
    "start": dt.float64(),
    "end": dt.float64(),
    "id": dt.int64()
})))
def extract_dialogue_segments(file: daft.File):
    """
    Transcribes audio using whisper.
    """
    with file.to_tempfile() as tmpfile:
        model = whisper.load_model("turbo")

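        # Whisper expects a file path, so pass the temp file's name
        # (as in the Docling example above).
        result = model.transcribe(tmpfile.name)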

        segments = []
        for segment in result["segments"]:
            segment_obj = {
                "text": segment["text"],
                "start": segment["start"],
                "end": segment["end"],
                "id": segment["id"]
            }
            segments.append(segment_obj)

        return segments

Notes for reviewers.

I also had to add some internal buffering for HTTP-backed files. Previously, a range request against a server that didn't support range requests failed with a 416 error. Now we attempt the range request first and, if we get a 416, buffer the entire file contents instead.
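
The fallback described above can be sketched roughly as follows. This is a minimal Python illustration, not the Rust implementation in this PR: FallbackReader, RangeNotSatisfiable, fetch_range, and fetch_all are hypothetical names, and the two callables stand in for whatever actually issues the HTTP requests.

```python
from typing import Callable, Optional


class RangeNotSatisfiable(Exception):
    """Hypothetical error standing in for an HTTP 416 response."""


class FallbackReader:
    """Tries ranged reads first; buffers the whole body after the first 416."""

    def __init__(
        self,
        fetch_range: Callable[[int, int], bytes],
        fetch_all: Callable[[], bytes],
    ) -> None:
        self._fetch_range = fetch_range  # (start, length) -> bytes
        self._fetch_all = fetch_all      # () -> full body
        self._buffer: Optional[bytes] = None

    def read(self, start: int, length: int) -> bytes:
        if self._buffer is None:
            try:
                return self._fetch_range(start, length)
            except RangeNotSatisfiable:
                # The server rejected the range request: download everything
                # once and serve this and all later reads from memory.
                self._buffer = self._fetch_all()
        return self._buffer[start:start + length]
```

Once the buffer is filled, no further range requests are attempted, which is also where the memory cost of this approach comes from.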

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Sep 17, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR adds a new to_tempfile() method to the Daft File class that creates a temporary file with the contents of the original file. This enables better integration with external libraries (like Docling for document processing and Whisper for audio transcription) that require actual file paths rather than file-like objects.

The implementation includes smart optimization logic - it uses shutil.copyfileobj() for sources that support range requests and falls back to file.read() for sources that don't. The method consumes the original file by closing it after copying, preventing resource leaks.
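
As a rough illustration of the copy strategy summarized above, here is a standalone Python sketch. It is not the code from daft/file.py: the free function to_tempfile, its supports_range flag, and the "daft_" prefix are assumptions made for the example.

```python
import shutil
import tempfile
from typing import BinaryIO


def to_tempfile(src: BinaryIO, supports_range: bool = True):
    """Copy `src` into a named temporary file, then close `src`."""
    tmp = tempfile.NamedTemporaryFile(prefix="daft_")
    try:
        if supports_range:
            # Stream in chunks so the whole file never sits in memory.
            shutil.copyfileobj(src, tmp)
        else:
            # Fall back to a single full read.
            tmp.write(src.read())
        tmp.flush()  # make the bytes visible to code that opens tmp.name
        tmp.seek(0)
    except BaseException:
        tmp.close()  # closing also deletes the temporary file
        raise
    finally:
        src.close()  # the source is consumed either way
    return tmp
```

The returned object is a context manager, so `with to_tempfile(f) as tmp:` deletes the file on exit. On Windows, reopening tmp.name while the handle is still open may fail.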

To support this functionality, the PR adds a new source_type() method to the ObjectSource trait across all storage backend implementations (HTTP, S3, GCS, local files, Unity, Hugging Face). This allows the system to identify different source types and handle them appropriately.

The most significant internal change is in the HTTP file handling logic, which now includes sophisticated range request optimization with fallback caching. When HTTP servers don't support range requests (returning 416 errors), the implementation gracefully falls back to downloading and caching the entire file, then serves subsequent reads from the cache.

Confidence score: 3/5

  • This PR introduces complex HTTP caching logic that could cause issues with concurrent access or memory usage for large files
  • Score reflects concerns about the range request detection method potentially having side effects and the significant behavioral changes in ObjectSourceReader
  • Pay close attention to src/daft-file/src/python.rs for the HTTP caching implementation and range request handling

10 files reviewed, 2 comments

codecov bot commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 82.53968% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.29%. Comparing base (bb9454d) to head (ada415c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines           Patch %   Lines
src/daft-file/src/python.rs        87.14%    18 Missing ⚠️
src/daft-io/src/azure_blob.rs      0.00%     3 Missing ⚠️
src/daft-io/src/google_cloud.rs    0.00%     3 Missing ⚠️
src/daft-io/src/huggingface.rs     0.00%     3 Missing ⚠️
src/daft-io/src/s3_like.rs         0.00%     3 Missing ⚠️
src/daft-io/src/unity.rs           0.00%     3 Missing ⚠️

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5226      +/-   ##
==========================================
+ Coverage   72.82%   74.29%   +1.46%     
==========================================
  Files         972      973       +1     
  Lines      125913   125875      -38     
==========================================
+ Hits        91701    93513    +1812     
+ Misses      34212    32362    -1850     

Files with missing lines           Coverage Δ
daft/file.py                       72.97% <100.00%> (+9.63%) ⬆️
src/daft-io/src/http.rs            50.45% <100.00%> (+4.10%) ⬆️
src/daft-io/src/local.rs           82.12% <100.00%> (+0.20%) ⬆️
src/daft-io/src/object_io.rs       70.58% <ø> (ø)
src/daft-io/src/azure_blob.rs      0.00% <0.00%> (-6.34%) ⬇️
src/daft-io/src/google_cloud.rs    36.53% <0.00%> (-0.27%) ⬇️
src/daft-io/src/huggingface.rs     44.65% <0.00%> (-0.33%) ⬇️
src/daft-io/src/s3_like.rs         62.82% <0.00%> (-0.19%) ⬇️
src/daft-io/src/unity.rs           0.00% <0.00%> (ø)
src/daft-file/src/python.rs        78.31% <87.14%> (+5.29%) ⬆️

... and 46 files with indirect coverage changes

universalmind303 and others added 4 commits September 17, 2025 16:14
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
// If we already know range requests aren't supported, read full content
if self.supports_range == Some(false) {
    // Read entire file and cache it
    let content = self.read_full_content()?;
Contributor Author

@universalmind303 universalmind303 Sep 19, 2025

Admittedly, I'm a bit on the fence about this. I think it could cause some unexpected memory usage, but it does make the API easier to use.

I've been thinking about whether this should be configurable, or opt-in/opt-out.

Something like:

daft.set_execution_config(file_on_unsupported_range_request="download")  # or "error"
daft.set_execution_config(file_on_unsupported_range_request_max_download_size=50 * 1024 * 1024)  # 50 MB max, assuming the size is in bytes
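
To make the proposal concrete, here is one way such a policy could be enforced. This is only a sketch: the two options above are a suggestion and are not known to exist in Daft, and check_full_download_allowed, its parameters, and the assumption that sizes are in bytes are all hypothetical.

```python
from typing import Optional


def check_full_download_allowed(
    on_unsupported: str,
    content_length: Optional[int],
    max_download_size: Optional[int] = None,
) -> None:
    """Raise if a full download is not permitted after a rejected range request."""
    if on_unsupported not in ("download", "error"):
        raise ValueError(f"unknown policy: {on_unsupported!r}")
    if on_unsupported == "error":
        raise OSError("server does not support range requests")
    if (
        max_download_size is not None
        and content_length is not None
        and content_length > max_download_size
    ):
        raise OSError(
            f"file is {content_length} bytes, above the "
            f"{max_download_size}-byte download limit"
        )
    # Otherwise the caller may go ahead and buffer the whole file.
```

When the server does not report a content length, this sketch lets the download proceed; a stricter variant could refuse instead.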

Member

@kevinzwang kevinzwang left a comment

Looks good! Just one small thing


rt.block_within_async_context(async move {
    source.supports_range(&uri).await.map_err(DaftError::from)
})??
Member

It seems a little hacky to hardcode it for these two source types. Is there a way to move this logic to the individual ObjectSource implementations themselves?

Contributor Author

So I was initially thinking that as well, but that also felt kinda hacky as then we need to do downcast_ref() and also add a source_type() -> SourceType method. I actually had that solution coded out in an earlier revision and this one felt slightly less hacky to me.

FWIW, I think technically some "s3like" APIs could also return false here, depending on the implementation.
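
For readers following the design question, one shape the reviewer's suggestion could take is sketched below in Python rather than Rust. It makes no claim about what this PR merged: the class names only mirror the Rust ObjectSource trait loosely, and the probe callable (returning the HTTP status of a `Range: bytes=0-0` request) is a hypothetical stand-in.

```python
from abc import ABC
from typing import Callable


class ObjectSource(ABC):
    """Python stand-in for a storage backend (illustrative only)."""

    def supports_range(self, uri: str) -> bool:
        # Default: assume the backend honors ranged reads.
        return True


class S3LikeSource(ObjectSource):
    """Inherits the default; a non-conforming S3-like API could override it."""


class HttpSource(ObjectSource):
    def __init__(self, probe: Callable[[str], int]) -> None:
        # probe(uri) -> status code of a small ranged request (assumption).
        self._probe = probe

    def supports_range(self, uri: str) -> bool:
        # 206 Partial Content means the server honored the Range header;
        # 200 (range ignored) or 416 means it did not.
        return self._probe(uri) == 206
```

With this layout the caller asks the source directly and needs neither a downcast nor a source-type check.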

Contributor Author

I also think this could serve as an alternative to #5188. My PR focuses solely on usage within daft.File, but we could extend this approach to gracefully handle those pesky 416s wherever else they pop up.

@universalmind303 universalmind303 merged commit 10086b2 into main Sep 19, 2025
45 checks passed
@universalmind303 universalmind303 deleted the cory/file-tempfile branch September 19, 2025 19:46

2 participants