Skip to content

Ensure page encoding statistics are written to Parquet file #7643

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 12, 2025

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jun 11, 2025

Which issue does this PR close?

Rationale for this change

Page encoding statistics are not copied from the column chunk result to the final output metadata.

What changes are included in this PR?

Make sure stats are copied and add a round trip test to make sure they end up in the Parquet file.

Are there any user-facing changes?

No

@etseidl etseidl added parquet Changes to the parquet crate bug labels Jun 11, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Jun 11, 2025

cc @JigaoLuo

@JigaoLuo
Copy link
Contributor

Thanks I will try it once merged

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl

I think the code looks good but I don't think the test covers the code

Specifically, when I reverted the code change

diff --git a/parquet/src/file/writer.rs b/parquet/src/file/writer.rs
index 5961f10ec5..c05bd2d5c8 100644
--- a/parquet/src/file/writer.rs
+++ b/parquet/src/file/writer.rs
@@ -689,9 +689,6 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> {
         if let Some(statistics) = metadata.statistics() {
             builder = builder.set_statistics(statistics.clone())
         }
-        if let Some(page_encoding_stats) = metadata.page_encoding_stats() {
-            builder = builder.set_page_encoding_stats(page_encoding_stats.clone())
-        }
         builder = self.set_column_crypto_metadata(builder, &metadata);
         close.metadata = builder.build()?;

The test still passes 🤔

$ cargo test -p parquet --features=arrow -- test_page_encoding_statistics_roundtrip
   Compiling parquet v55.1.0 (/Users/andrewlamb/Software/arrow-rs/parquet)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 5.74s
     Running unittests src/lib.rs (target/debug/deps/parquet-1f71ccbaab8d67d7)

running 1 test
test file::writer::tests::test_page_encoding_statistics_roundtrip ... ok
...

@etseidl
Copy link
Contributor Author

etseidl commented Jun 11, 2025

The test still passes 🤔

😱 Let me see if I can get something to trigger the bug then. Odd.

@etseidl
Copy link
Contributor Author

etseidl commented Jun 11, 2025

Ugh. I need to figure out how SerializedFileWriter handles metadata. I've changed the test to use the arrow writer, which is what parquet-rewrite does anyway, and confirmed the test now fails if the encoding stats aren't copied. Please try again at your earliest convenience @alamb.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- looks great to me and thank you for the fix

@@ -3835,4 +3836,48 @@ mod tests {
assert_eq!(stats.max_value.unwrap(), "Bm".as_bytes());
assert_eq!(stats.min_value.unwrap(), "Bl".as_bytes());
}

#[test]
fn test_page_encoding_statistics_roundtrip() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now when I run this test without the code change it fails like this (as expected)


assertion failed: chunk_meta.encoding_stats.is_some()
thread 'arrow::arrow_writer::tests::test_page_encoding_statistics_roundtrip' panicked at parquet/src/arrow/arrow_writer/mod.rs:3865:9:
assertion failed: chunk_meta.encoding_stats.is_some()
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:697:5
   1: core::panicking::panic_fmt
             at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:75:14
   2: core::panicking::panic
             at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:145:5
   3: parquet::arrow::arrow_writer::tests::test_page_encoding_statistics_roundtrip
             at ./src/arrow/arrow_writer/mod.rs:3865:9
   4: parquet::arrow::arrow_writer::tests::test_page_encoding_statistics_roundtrip::{{closure}}
             at ./src/arrow/arrow_writer/mod.rs:3841:49
   5: core::ops::function::FnOnce::call_once
             at /Users/andrewlamb/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

👍

@etseidl
Copy link
Contributor Author

etseidl commented Jun 12, 2025

Thanks for the reviews @mbrobbel and @alamb!

@etseidl etseidl merged commit 8d6cd76 into apache:main Jun 12, 2025
16 checks passed
@etseidl etseidl deleted the encoding_stats branch June 12, 2025 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

encoding_stats not present in Parquet generated by parquet-rewrite
4 participants