-
Notifications
You must be signed in to change notification settings - Fork 947
Ensure page encoding statistics are written to Parquet file #7643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @JigaoLuo |
Thanks I will try it once merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl
I think the code looks good but I don't think the test covers the code
Specifically, when I reverted the code change
diff --git a/parquet/src/file/writer.rs b/parquet/src/file/writer.rs
index 5961f10ec5..c05bd2d5c8 100644
--- a/parquet/src/file/writer.rs
+++ b/parquet/src/file/writer.rs
@@ -689,9 +689,6 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> {
if let Some(statistics) = metadata.statistics() {
builder = builder.set_statistics(statistics.clone())
}
- if let Some(page_encoding_stats) = metadata.page_encoding_stats() {
- builder = builder.set_page_encoding_stats(page_encoding_stats.clone())
- }
builder = self.set_column_crypto_metadata(builder, &metadata);
close.metadata = builder.build()?;
The test still passes 🤔
$ cargo test -p parquet --features=arrow -- test_page_encoding_statistics_roundtrip
Compiling parquet v55.1.0 (/Users/andrewlamb/Software/arrow-rs/parquet)
Finished `test` profile [unoptimized + debuginfo] target(s) in 5.74s
Running unittests src/lib.rs (target/debug/deps/parquet-1f71ccbaab8d67d7)
running 1 test
test file::writer::tests::test_page_encoding_statistics_roundtrip ... ok
...
😱 Let me see if I can get something to trigger the bug then. Odd. |
Ugh. I need to figure out how |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- looks great to me and thank you for the fix
@@ -3835,4 +3836,48 @@ mod tests { | |||
assert_eq!(stats.max_value.unwrap(), "Bm".as_bytes()); | |||
assert_eq!(stats.min_value.unwrap(), "Bl".as_bytes()); | |||
} | |||
|
|||
#[test] | |||
fn test_page_encoding_statistics_roundtrip() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now when I run this test without the code change it fails like this (as expected)
assertion failed: chunk_meta.encoding_stats.is_some()
thread 'arrow::arrow_writer::tests::test_page_encoding_statistics_roundtrip' panicked at parquet/src/arrow/arrow_writer/mod.rs:3865:9:
assertion failed: chunk_meta.encoding_stats.is_some()
stack backtrace:
0: __rustc::rust_begin_unwind
at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:697:5
1: core::panicking::panic_fmt
at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:75:14
2: core::panicking::panic
at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:145:5
3: parquet::arrow::arrow_writer::tests::test_page_encoding_statistics_roundtrip
at ./src/arrow/arrow_writer/mod.rs:3865:9
4: parquet::arrow::arrow_writer::tests::test_page_encoding_statistics_roundtrip::{{closure}}
at ./src/arrow/arrow_writer/mod.rs:3841:49
5: core::ops::function::FnOnce::call_once
at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
6: core::ops::function::FnOnce::call_once
at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
👍
Which issue does this PR close?
encoding_stats
not present in Parquet generated byparquet-rewrite
#7616.Rationale for this change
Page encoding statistics are not copied from the column chunk result to the final output metadata.
What changes are included in this PR?
Make sure stats are copied and add a round trip test to make sure they end up in the Parquet file.
Are there any user-facing changes?
No