Commit c2424bb

alamb and vustef authored
Rewrite ParquetRecordBatchStream in terms of the PushDecoder (#8159)
# Which issue does this PR close?

- Part of #7983
- Part of #8000
- closes #8677

I am also working on a blog post about this
- #8035

# TODOs

- [x] Rewrite `test_cache_projection_excludes_nested_columns` in terms of higher level APIs (#8754)
- [x] Benchmarks
- [x] Benchmarks with DataFusion: apache/datafusion#18385

# Rationale for this change

A new ParquetPushDecoder was implemented here
- #7997

I need to refactor the async and sync readers to use the new push decoder in order to:
1. avoid the [xkcd standards effect](https://xkcd.com/927/) (aka there are now three control loops)
2. Prove that the push decoder works (by passing all the tests of the other two)
3. Set the stage for improving filter pushdown more with a single control loop

<img width="400" alt="image" src="https://github.com/user-attachments/assets/e6886ee9-58b3-4a1e-8e88-9d2d03132b19" />

# What changes are included in this PR?

1. Refactor the `ParquetRecordBatchStream` to use `ParquetPushDecoder`

# Are these changes tested?

Yes, by the existing CI tests

I also ran several benchmarks, both in arrow-rs and in DataFusion and I do not see any substantial performance difference (as expected):
- apache/datafusion#18385

# Are there any user-facing changes?

No

---------

Co-authored-by: Vukasin Stefanovic <[email protected]>
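To make the "single control loop" idea in the rationale concrete, here is a minimal, self-contained sketch of a push-style decoder and one driver loop around it. It is not the `ParquetPushDecoder` API: `ToyPushDecoder`, `Step`, and `read_all` are hypothetical names invented for this illustration, and byte ranges stand in for row groups. The point it shows is that the decoder never performs IO itself; it only reports what it needs, so sync and async readers can share the same decoding logic and differ only in how they fetch bytes.

```rust
use std::collections::VecDeque;
use std::ops::Range;

/// What the toy decoder reports after each attempt to make progress.
/// (Hypothetical type, not part of the parquet crate.)
#[derive(Debug, PartialEq)]
enum Step {
    /// The decoder cannot continue until these bytes are pushed to it.
    NeedsData(Range<usize>),
    /// One decoded unit of output (a stand-in for a record batch).
    Data(Vec<u8>),
    /// Every requested range has been decoded.
    Finished,
}

/// Toy push decoder: it never performs IO itself.
struct ToyPushDecoder {
    /// Ranges still to decode, in order (stand-ins for row groups).
    pending: VecDeque<Range<usize>>,
    /// Bytes pushed by the caller for the range at the front of `pending`.
    buffered: Option<Vec<u8>>,
}

impl ToyPushDecoder {
    fn new(ranges: Vec<Range<usize>>) -> Self {
        Self { pending: ranges.into(), buffered: None }
    }

    /// The caller hands over the bytes that were requested.
    fn push(&mut self, bytes: Vec<u8>) {
        self.buffered = Some(bytes);
    }

    /// Make as much progress as possible without doing any IO.
    fn try_decode(&mut self) -> Step {
        let Some(next) = self.pending.front().cloned() else {
            return Step::Finished;
        };
        match self.buffered.take() {
            Some(bytes) => {
                self.pending.pop_front();
                Step::Data(bytes)
            }
            None => Step::NeedsData(next),
        }
    }
}

/// The single control loop, here driven synchronously over an in-memory "file".
/// An async driver would have the same shape, awaiting the fetch instead.
fn read_all(file: &[u8], ranges: Vec<Range<usize>>) -> Vec<Vec<u8>> {
    let mut decoder = ToyPushDecoder::new(ranges);
    let mut out = Vec::new();
    loop {
        match decoder.try_decode() {
            Step::NeedsData(range) => decoder.push(file[range].to_vec()),
            Step::Data(batch) => out.push(batch),
            Step::Finished => return out,
        }
    }
}

fn main() {
    let file: Vec<u8> = (0..10).collect();
    let batches = read_all(&file, vec![0..3, 7..10]);
    assert_eq!(batches, vec![vec![0, 1, 2], vec![7, 8, 9]]);
}
```

Because the loop in `read_all` is the only place that touches the "file", swapping the in-memory slice for another data source would not require changes to the decoder in this sketch.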
1 parent 43c7637 commit c2424bb

6 files changed: +428 -666 lines changed

parquet/src/arrow/arrow_reader/mod.rs

Lines changed: 7 additions & 0 deletions
@@ -99,6 +99,13 @@ pub mod statistics;
 /// [`StatisticsConverter`]: statistics::StatisticsConverter
 /// [Querying Parquet with Millisecond Latency]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
 pub struct ArrowReaderBuilder<T> {
+    /// The "input" to read parquet data from.
+    ///
+    /// Note in the case of the [`ParquetPushDecoderBuilder`], there
+    /// is no underlying input, which is indicated by a type parameter of [`NoInput`]
+    ///
+    /// [`ParquetPushDecoderBuilder`]: crate::arrow::push_decoder::ParquetPushDecoderBuilder
+    /// [`NoInput`]: crate::arrow::push_decoder::NoInput
     pub(crate) input: T,
 
     pub(crate) metadata: Arc<ParquetMetaData>,
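
The new doc comment describes a builder that is generic over its input, with a marker type standing in when there is no input at all. The sketch below illustrates that pattern in isolation. It is a simplified stand-in, not the arrow-rs code: `ToyBuilder`, its `batch_size` field, and all of its methods are hypothetical, and the local `NoInput` only mirrors the name used in the doc comment.

```rust
use std::io::Read;

/// Marker type meaning "this builder has no underlying input".
/// (A local stand-in that mirrors the `NoInput` named in the doc comment.)
#[derive(Debug)]
struct NoInput;

/// Hypothetical builder that is generic over where its data comes from.
struct ToyBuilder<T> {
    input: T,
    batch_size: usize,
}

/// Options shared by every kind of builder live in the fully generic impl.
impl<T> ToyBuilder<T> {
    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }
}

/// Methods that only make sense when there is no input to read from.
impl ToyBuilder<NoInput> {
    fn new_push() -> Self {
        Self { input: NoInput, batch_size: 1024 }
    }

    fn describe(&self) -> String {
        format!("push decoder, batch_size={}", self.batch_size)
    }
}

/// Methods that only make sense when the input can actually be read.
impl<R: Read> ToyBuilder<R> {
    fn new_reader(input: R) -> Self {
        Self { input, batch_size: 1024 }
    }

    /// Consume the builder and report how many bytes the input held.
    fn read_len(mut self) -> std::io::Result<usize> {
        let mut buf = Vec::new();
        self.input.read_to_end(&mut buf)
    }
}

fn main() {
    let push = ToyBuilder::<NoInput>::new_push().with_batch_size(8);
    println!("{}", push.describe());

    let len = ToyBuilder::new_reader(&b"abc"[..])
        .with_batch_size(8)
        .read_len()
        .unwrap();
    println!("reader input held {len} bytes");
}
```

In this sketch the type parameter lets the compiler reject `read_len` on a `ToyBuilder<NoInput>`, while shared options such as `with_batch_size` are written once for every input type.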
