Commit c2424bb
Rewrite ParquetRecordBatchStream in terms of the PushDecoder (#8159)
# Which issue does this PR close?
- Part of #7983
- Part of #8000
- Closes #8677

I am also working on a blog post about this:
- #8035
# TODOs
- [x] Rewrite `test_cache_projection_excludes_nested_columns` in terms
of higher level APIs (#8754)
- [x] Benchmarks
- [x] Benchmarks with DataFusion:
apache/datafusion#18385
# Rationale for this change
A new `ParquetPushDecoder` was implemented in:
- #7997
I need to refactor the async and sync readers to use the new push
decoder in order to:
1. Avoid the [xkcd standards effect](https://xkcd.com/927/) (that is,
there are now three control loops)
2. Prove that the push decoder works (by passing all the tests of the
other two)
3. Set the stage for further improving filter pushdown with a single
control loop
<img width="400" alt="image"
src="https://github.com/user-attachments/assets/e6886ee9-58b3-4a1e-8e88-9d2d03132b19"
/>
# What changes are included in this PR?
1. Refactor the `ParquetRecordBatchStream` to use `ParquetPushDecoder`
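
To illustrate the idea behind this refactor, here is a minimal, self-contained sketch of how a reader can drive a push-style decoder. It is not the `parquet` crate's API: `ToyPushDecoder`, `Step`, `push_range`, and `read_all` are hypothetical names invented for this example, and a "batch" is just the raw bytes of one chunk. The point is only that the decoder owns the control loop state and performs no I/O, while the caller fetches whatever byte ranges the decoder asks for.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a push decoder's result type (not the real API).
#[derive(Debug, PartialEq)]
enum Step {
    /// The decoder cannot make progress until these byte ranges are pushed.
    NeedsData(Vec<Range<usize>>),
    /// One decoded "batch" (here simply the bytes of one chunk).
    Data(Vec<u8>),
    /// All chunks have been decoded.
    Finished,
}

/// Toy decoder: tracks decoding state but never performs I/O itself.
struct ToyPushDecoder {
    chunks: Vec<Range<usize>>, // byte ranges to decode, in order
    next: usize,               // index of the chunk currently being decoded
    buffered: Option<Vec<u8>>, // bytes pushed for the current chunk, if any
}

impl ToyPushDecoder {
    fn new(chunks: Vec<Range<usize>>) -> Self {
        Self { chunks, next: 0, buffered: None }
    }

    /// Accept bytes from the caller; only the currently requested range is kept.
    fn push_range(&mut self, range: Range<usize>, bytes: Vec<u8>) {
        if self.chunks.get(self.next) == Some(&range) {
            self.buffered = Some(bytes);
        }
    }

    /// Advance as far as possible with the data pushed so far.
    fn try_decode(&mut self) -> Step {
        let Some(range) = self.chunks.get(self.next).cloned() else {
            return Step::Finished;
        };
        match self.buffered.take() {
            Some(bytes) => {
                self.next += 1;
                Step::Data(bytes)
            }
            None => Step::NeedsData(vec![range]),
        }
    }
}

/// A synchronous driver: the only part that knows how to fetch bytes.
/// An async driver would have the same shape, awaiting the fetch instead.
fn read_all(file: &[u8], chunks: Vec<Range<usize>>) -> Vec<Vec<u8>> {
    let mut decoder = ToyPushDecoder::new(chunks);
    let mut batches = Vec::new();
    loop {
        match decoder.try_decode() {
            Step::NeedsData(ranges) => {
                for r in ranges {
                    let bytes = file[r.clone()].to_vec();
                    decoder.push_range(r, bytes);
                }
            }
            Step::Data(batch) => batches.push(batch),
            Step::Finished => return batches,
        }
    }
}

fn main() {
    let batches = read_all(b"abcdefgh", vec![0..3, 3..8]);
    assert_eq!(batches, vec![b"abc".to_vec(), b"defgh".to_vec()]);
}
```

With this split, sync and async readers differ only in how they satisfy `NeedsData`, so the decoding logic lives in a single control loop.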
# Are these changes tested?
Yes, by the existing CI tests.
I also ran several benchmarks, both in arrow-rs and in DataFusion, and I
do not see any substantial performance difference (as expected):
- apache/datafusion#18385
# Are there any user-facing changes?
No
---------
Co-authored-by: Vukasin Stefanovic <[email protected]>

1 parent 43c7637 commit c2424bb
File tree
6 files changed, +428 -666 lines changed
- parquet
- src
- arrow
- arrow_reader
- async_reader
- push_decoder
- reader_builder
- util
- tests/encryption