forked from apache/arrow-rs
chore: rebase on upstream #4
Open
maksym-iv-ef wants to merge 503 commits into main from chore-rebase-upstream
Conversation
* arrow-select: Implement concat for `RunArray`s It's unclear whether the added work and complexity of merging the last run of one array with the first run of the next array being concatenated is worth the potential benefit. For now this implementation focuses on the ability to concatenate `RunArray`s in the first place. * Address comments
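A minimal usage sketch of what this enables, assuming the `arrow` crate's re-exported `RunArray` and `concat` kernel; per the commit message above, the trailing run of one input and the leading run of the next are kept as separate runs rather than merged:

```rust
use arrow::array::{Array, Int32Array, RunArray, StringArray};
use arrow::compute::concat;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Two small run-end encoded arrays: logically ["a","a","b","b"] and ["b","c","c"].
    let a = RunArray::try_new(
        &Int32Array::from(vec![2, 4]),      // cumulative run ends
        &StringArray::from(vec!["a", "b"]), // run values
    )?;
    let b = RunArray::try_new(
        &Int32Array::from(vec![1, 3]),
        &StringArray::from(vec!["b", "c"]),
    )?;

    // The trailing "b" run of `a` and the leading "b" run of `b` stay as
    // separate runs; merging across the boundary is intentionally not done.
    let combined = concat(&[&a, &b])?;
    assert_eq!(combined.len(), 7);
    Ok(())
}
```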
* Speedup arith * Cleanup * Fmt * Remove more unsafe * Remove more unsafe
* Add arrow_reader_clickbench * update comments, fix Q1 bug * Polish comments * fix typo
* Improve documentation for Parquet `WriterProperties` * typo * Apply suggestions from code review Co-authored-by: Ed Seidl <[email protected]> --------- Co-authored-by: Ed Seidl <[email protected]>
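For context, a typical builder chain of the kind these docs describe (a sketch using the standard parquet builder APIs; the specific settings are just examples):

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() {
    // A typical WriterProperties builder chain covered by the documentation.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_max_row_group_size(64 * 1024)
        .build();

    assert_eq!(props.max_row_group_size(), 64 * 1024);
}
```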
* Improve Field docs, add missing `Field::{set_name, set_data_type, set_nullable}`
* Apply suggestions from code review
Co-authored-by: Ed Seidl <[email protected]>
---------
Co-authored-by: Ed Seidl <[email protected]>
* Fix Clippy in CI for Rust 1.87 release * fmt * Allow * Add another allow
…ows to decode (apache#7502) * Introduce `ReadPlan` to encapsulate the calculation of what rows to decode * Update parquet/src/arrow/arrow_reader/read_plan.rs Co-authored-by: Ed Seidl <[email protected]> --------- Co-authored-by: Ed Seidl <[email protected]>
…` property is enabled (apache#7509) * RecordBatchDecoder: respect skip_validation property * Update arrow-ipc/src/reader.rs Co-authored-by: Andrew Lamb <[email protected]> * Update arrow-ipc/src/reader.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]>
…r of columns (~ -15%) (apache#7530) * Unroll sort up to 5 columns * Unroll sort up to 5 columns * Unroll sort up to 5 columns * Unroll sort up to 5 columns * Unroll sort up to 5 columns * Unroll sort up to 5 columns * Unroll sort up to 5 columns * Update arrow-ord/src/sort.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]>
…7511) * Add support for records with int32 encoding of TIME_MILLIS * Add test --------- Co-authored-by: Andrew Lamb <[email protected]>
…pache#7522) * Add single row filter test * DRI * Update parquet/src/arrow/async_reader/mod.rs * Improve comments for other tests
* Factor ArrayDataBuilder into a new struct * fmt
* Support Utf8View for Avro * code review improvements
apache#7517) Struct array concatenation previously fell back to concat_fallback, which uses MutableArrayData. This is an issue because any nested types within the struct will not benefit from optimized concatenation. Dictionaries within structs, for example, would not be merged and could result in excessive memory usage. This commit adds no tests because there are already existing tests for struct concatenation.
* unit test: filter empty struct * When filtering struct arrays, always build with a length since one exists from the predicate. This prevents panics in situations where a struct array has empty columns.
* Added support for reading Avro Map types * Fixed lint errors, improved readability of `read_blockwise_items`, added `Map` comments and improved `Map` nullability handling in `data_type` in codec.rs
apache#7534) * fix: Panic in pretty_format function when displaying DurationSecondsArray with i64::MIN / i64::MAX * Address comments * Avoid changes for infallible functions * Fix compilation * fix docs --------- Co-authored-by: Andrew Lamb <[email protected]>
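A rough illustration of the scenario fixed here, assuming the `arrow` crate's `pretty_format_columns` helper (part of the default `prettyprint` feature): formatting a duration array holding `i64::MIN` / `i64::MAX` should now produce output instead of panicking.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DurationSecondArray};
use arrow::error::ArrowError;
use arrow::util::pretty::pretty_format_columns;

fn main() -> Result<(), ArrowError> {
    // i64::MIN / i64::MAX previously panicked inside the pretty printer.
    let col: ArrayRef = Arc::new(DurationSecondArray::from(vec![
        Some(i64::MIN),
        Some(i64::MAX),
        None,
    ]));
    println!("{}", pretty_format_columns("d", &[col])?);
    Ok(())
}
```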
…1st stream element is an Err (apache#7492) * Fix bug and add tests to verify * Remove unnecessary let assignment. * Add error callback for `do_put` * cargo fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
…che#7427) * Add support for creating random Decimal128 and Decimal256 arrays * chore * chore: clippy
…pache#7481) * Improve error messages if schema hint mismatches with parquet schema * clippy * Add vestigial metadata check
* chore: qualified results in parquet derive as std::result * testing: added test for hygienic macro in parquet derive * suggestion: harden test on parquet derive
…pache#7922) # Which issue does this PR close? - Part of apache#7896 # Rationale for this change In apache#7896, we saw that inserting a large amount of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into. `with_field_names` is an optimization to declare the field names upfront which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name -- 1 to declare field names upfront and 1 to insert the actual field name during object building. This PR adds a new method `with_field_name_capacity` which allows you to reserve space to the metadata builder, without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building Before: <img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08 43 PM" src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c" /> After: <img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08 55 PM" src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3" />
…che#7914) # Which issue does this PR close? - Fixes apache#7907 # Rationale for this change When trying to append `VariantObject` or `VariantList`s directly on the `VariantBuilder`, it will panic. # Changes to the public API `VariantBuilder` now has these additional methods: - `append_object`, will panic if shallow validation fails or the object has duplicate field names - `try_append_object`, will perform full validation on the object before appending - `append_list`, will panic if shallow validation fails - `try_append_list`, will perform full validation on the list before appending --------- Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close? - Closes apache#7893 # What changes are included in this PR? In parquet-variant: - Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it). - Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply. In parquet-variant-compute: - Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type. - Includes some basic unit tests. Not comprehensive. - Includes a simple micro-benchmark for reference. Current limitations: - It can only return another VariantArray. Casts are not implemented yet. - Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this. - Perf is a TODO. # Are these changes tested? Some basic unit tests are added. # Are there any user-facing changes? Yes --------- Co-authored-by: Andrew Lamb <[email protected]>
…he#7774) # Which issue does this PR close? - Part of apache#7762 # Rationale for this change As part of apache#7762 I want to optimize applying filters by adding a new code path. To ensure that works well, let's ensure the filtered code path is well covered with tests # What changes are included in this PR? 1. Add tests for filtering batches with 0.01%, 1%, 10% and 90% and varying data types # Are these changes tested? Only tests, no functional changes # Are there any user-facing changes?
…ath (apache#7942) # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to apache#7919 # Rationale for this change While reviewing apache#7919 from @Samyak2 I found I wanted to write some additional tests directly for `Variant::get_path`. When I started doing that I found it was somewhat awkward to write examples, so I added some new conversion routines to make it easier. # What changes are included in this PR? 1. Add doc examples (and thus tests) of `VariantGet` and `VariantPath` 2. Add more documentation # Are these changes tested? Yes, by doc examples which run in CI # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out.
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Closes apache#7901 . # Rationale for this change Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. # What changes are included in this PR? There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. # Are these changes tested? We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --------- Signed-off-by: codephage2020 <[email protected]>
…pache#7931) # Which issue does this PR close? - Closes #NNN. This is minor but I can create an issue if needed. # Rationale for this change `make_builder` currently errors with `Data type Utf8View is not currently supported`. # What changes are included in this PR? Support `DataType::Utf8View` and `DataType::BinaryView` in `make_builder`. # Are these changes tested? Only via the exhaustive enum match. It doesn't look like there are any tests for `make_builder` in that file? # Are there any user-facing changes? No
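A small sketch of the newly supported path, assuming `make_builder` now hands back a `StringViewBuilder` for `DataType::Utf8View` (which the description implies but the exact concrete builder type is an assumption here):

```rust
use arrow::array::builder::{make_builder, ArrayBuilder, StringViewBuilder};
use arrow::array::Array;
use arrow::datatypes::DataType;

fn main() {
    // Previously this failed with "Data type Utf8View is not currently supported".
    let mut builder = make_builder(&DataType::Utf8View, 16);

    // Assumption: Utf8View maps to a StringViewBuilder behind the type-erased box.
    let sb = builder
        .as_any_mut()
        .downcast_mut::<StringViewBuilder>()
        .expect("Utf8View should produce a StringViewBuilder");
    sb.append_value("hello");
    sb.append_null();

    let array = builder.finish();
    assert_eq!(array.len(), 2);
}
```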
# Rationale for this change - Closes apache#7948 This PR introduces a custom implementation of `PartialEq` for variant objects. According to the spec, field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values. Instead of comparing the raw bytes of 2 variant objects, this implementation recursively checks whether the field values are equal -- regardless of their order
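The comparison semantics can be illustrated with a toy model in plain Rust (this is an illustration of the idea, not the parquet-variant implementation): equality is defined on the field-name-to-value mapping, so two objects whose fields are stored in different orders still compare equal.

```rust
use std::collections::BTreeMap;

// Toy value model used only to illustrate the comparison semantics.
#[derive(Debug)]
enum Value {
    Int(i64),
    Str(String),
    Object(Vec<(String, Value)>), // fields in serialization order
}

impl PartialEq for Value {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Value::Int(a), Value::Int(b)) => a == b,
            (Value::Str(a), Value::Str(b)) => a == b,
            // Compare objects by their name -> value mapping, not by the
            // order in which fields happen to be stored.
            (Value::Object(a), Value::Object(b)) => {
                let a: BTreeMap<_, _> = a.iter().map(|(k, v)| (k, v)).collect();
                let b: BTreeMap<_, _> = b.iter().map(|(k, v)| (k, v)).collect();
                a == b
            }
            _ => false,
        }
    }
}

fn main() {
    let x = Value::Object(vec![
        ("a".into(), Value::Int(1)),
        ("b".into(), Value::Str("hi".into())),
    ]);
    let y = Value::Object(vec![
        ("b".into(), Value::Str("hi".into())),
        ("a".into(), Value::Int(1)),
    ]);
    assert_eq!(x, y); // equal despite different field order
}
```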
# Which issue does this PR close?
Optimize `partition_validity` function used in sort kernels
- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of
`Array::is_valid`
- Avoid capacity checks inside the loop by writing to `spare_capacity_mut`
instead of using `push` (a simplified sketch of the overall approach follows this list)
- Closes apache#7936.
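A simplified sketch of the approach described above (not the exact kernel code; it keeps plain `push` into preallocated vectors rather than `spare_capacity_mut`, but shows the preallocation from the null count and the direct `NullBuffer::is_valid` access):

```rust
use arrow::array::Array;

/// Partition the row indices of `array` into (valid, null) index lists.
/// Both vectors are preallocated from the known null count so no
/// reallocation happens while pushing.
fn partition_validity(array: &dyn Array) -> (Vec<u32>, Vec<u32>) {
    let len = array.len();
    let null_count = array.null_count();
    let mut valid = Vec::with_capacity(len - null_count);
    let mut nulls = Vec::with_capacity(null_count);

    match array.nulls() {
        // No null buffer: every row is valid.
        None => valid.extend(0..len as u32),
        // Read validity straight from the NullBuffer, avoiding the dynamic
        // dispatch of Array::is_valid on every row.
        Some(buf) => {
            for i in 0..len as u32 {
                if buf.is_valid(i as usize) {
                    valid.push(i);
                } else {
                    nulls.push(i);
                }
            }
        }
    }
    (valid, nulls)
}
```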
# Rationale for this change
Microbenchmark results for `sort_kernels` compared to `main`, only
looking at benchmarks matching "nulls to indices":
```
sort i32 nulls to indices 2^10
time: [4.9325 µs 4.9370 µs 4.9422 µs]
change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
Performance has improved.
sort i32 nulls to indices 2^12
time: [20.096 µs 20.209 µs 20.327 µs]
change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
Performance has improved.
sort f32 nulls to indices 2^12
time: [26.329 µs 26.366 µs 26.406 µs]
change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
Performance has improved.
sort string[0-10] nulls to indices 2^12
time: [70.667 µs 70.762 µs 70.886 µs]
change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
Performance has improved.
sort string[0-100] nulls to indices 2^12
time: [101.98 µs 102.44 µs 102.99 µs]
change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
No change in performance detected.
sort string[0-400] nulls to indices 2^12
time: [84.952 µs 85.024 µs 85.102 µs]
change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
Performance has improved.
sort string[10] nulls to indices 2^12
time: [72.486 µs 72.664 µs 72.893 µs]
change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
Performance has improved.
sort string[100] nulls to indices 2^12
time: [71.354 µs 71.606 µs 71.902 µs]
change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
Performance has improved.
sort string[1000] nulls to indices 2^12
time: [73.088 µs 73.195 µs 73.311 µs]
change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
Performance has improved.
sort string_view[10] nulls to indices 2^12
time: [32.592 µs 32.654 µs 32.731 µs]
change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
Performance has improved.
sort string_view[0-400] nulls to indices 2^12
time: [32.981 µs 33.074 µs 33.189 µs]
change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
Performance has improved.
sort string_view_inlined[0-12] nulls to indices 2^12
time: [28.467 µs 28.496 µs 28.529 µs]
change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
Performance has improved.
sort string[10] dict nulls to indices 2^12
time: [94.463 µs 94.503 µs 94.542 µs]
change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
Performance has improved.
```
# Are these changes tested?
Covered by existing tests
# Are there any user-facing changes?
No, the method is internal to the sort kernels.
# Which issue does this PR close? - Closes apache#7949 # Rationale for this change I would like it to be easier / more ergonomic to make objects # What changes are included in this PR? 1. Add `ObjectBuilder::with_field` 2. Add documentation w/ examples 3. Rewrite some tests # Are these changes tested? Yes, by doc tests # Are there any user-facing changes? Yes a new API
…ntArray (apache#7945) # Which issue does this PR close? - Closes apache#7920. # Are these changes tested? Tests were already implemented # Are there any user-facing changes? None
…che#7956) # Which issue does this PR close? - Follow-up to apache#7901 # Rationale for this change - apache#7934 Introduced a minor regression, in (accidentally?) forbidding the empty string as a dictionary key. Fix the bug and simplify the code a bit further while we're at it. # What changes are included in this PR? Revert the unsorted dictionary check back to what it had been (it just uses `Iterator::is_sorted_by` now, instead of `primitive.slice::is_sorted_by`). Remove the redundant offset monotonicity check from the ordered dictionary path, relying on the fact that string slice extraction will anyway fail if the offsets are not monotonic. Improve the error message now that it does double duty. # Are these changes tested? New unit tests for dictionaries containing the empty string. As a side effect, we now have at least a little coverage for sorted dictionaries -- somehow, I couldn't find any existing unit test that creates a sorted dictionary?? # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <[email protected]>
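For reference, the ordering check described above boils down to something like the following std-only sketch (not the crate's exact code): a strict `<` over adjacent keys enforces both sortedness and uniqueness, while still accepting the empty string as a key.

```rust
/// A sorted variant metadata dictionary must have strictly increasing keys:
/// strictly increasing implies both sorted and unique, and "" is allowed as
/// the (necessarily first) key.
fn is_sorted_and_unique(field_names: &[&str]) -> bool {
    field_names.iter().is_sorted_by(|a, b| a < b)
}

fn main() {
    assert!(is_sorted_and_unique(&["", "a", "b"]));   // empty string is fine
    assert!(!is_sorted_and_unique(&["a", "a", "b"])); // duplicate key
    assert!(!is_sorted_and_unique(&["b", "a"]));      // out of order
}
```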
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Part of apache#7911 - Part of apache#6736 - Follow on to apache#7905 # Rationale for this change I wrote benchmarks for some changes to the json decoder in apache#7911, but they are non-trivial. To keep apache#7911 easier to review I have pulled the benchmarks out to their own PR # What changes are included in this PR? 1. Add new json benchmark 2. Include the `variant_get` benchmark added in apache#7919 by @Samyak2 # Are these changes tested? I tested them manually and clippy CI coverage ensures they compile # Are there any user-facing changes? No, these are only benchmarks
# Which issue does this PR close? - Closes apache#7951 . # Rationale for this change # What changes are included in this PR? # Are these changes tested? Yes # Are there any user-facing changes? New API Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close? - Follow on to apache#7943 - Part of apache#7948 # Rationale for this change I found a few more tests I would like to have seen while reviewing apache#7943 # What changes are included in this PR? Add some list equality tests # Are these changes tested? It is only tests, no functionality changes # Are there any user-facing changes? No
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Closes apache#7947 . # Rationale for this change Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. # What changes are included in this PR? There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. # Are these changes tested? We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close? - Related to apache#6736 # Rationale for this change I noticed in apache#7956 that some Clippy errors were introduced but not caught by CI. # What changes are included in this PR? Add `parquet-variant-compute` to the CI for parquet-variant related PRs # Are these changes tested? It is only tests # Are there any user-facing changes? No
# Which issue does this PR close? Does not yet close, but contributes towards: - apache#6356 - apache#5981 - apache#1206 # Rationale for this change See the above issues. And this is a follow up to * apache#6711 * apache#6873 This was also split out from: apache#7929 # What changes are included in this PR? This removes the API to allow preserving `dict_id` set in the `Schema`'s `Field` within arrow-ipc and arrow-flight. This is in an effort to remove the `dict_id` field entirely and make it an IPC/flight-only concern. # Are these changes tested? Yes, all existing tests continue to pass. # Are there any user-facing changes? Yes, these previously (in 54.0.0) deprecated functions/fields are removed: * `arrow_ipc::DictionaryTracker.set_dict_id` * `arrow_ipc::DictionaryTracker::new_with_preserve_dict_id` * `arrow_ipc::IpcWriteOptions.with_preserve_dict_id` * `arrow_ipc::IpcWriteOptions.preserve_dict_id` (function and field) * `arrow_ipc::schema_to_fb` * `arrow_ipc::schema_to_bytes`
# Which issue does this PR close? - Part of apache#4886 - Related to apache#6965 # Rationale for this change This change introduces support for Avro files generated by systems like Impala, which have a specific convention for representing nullable fields. In Avro, nullability is typically represented by a union of a type and `null`. This PR updates the Avro reader to correctly interpret these schemas, ensuring proper handling of nullable data and improving interoperability with Impala-generated data. # What changes are included in this PR? This pull request introduces several changes to support Impala-style nullability in the Avro reader: - The Avro schema parser has been updated to recognize unions where `null` is the second type (e.g., `['type', 'null']`) as a nullable field. - Logic has been added to handle this nullability convention during Avro decoding. - New tests are included to verify that Avro files using this nullability format are read correctly while ensuring that strict mode properly identifies them. # Are these changes tested? Yes, I added new test cases covering these changes to the tests named: `test_nonnullable_impala`, `test_nonnullable_impala_strict`, `test_nullable_impala` and `test_nullable_impala_strict`. # Are there any user-facing changes? N/A --------- Co-authored-by: Connor Sanders <[email protected]>
# Which issue does this PR close? Part of apache#4886 Completes the breaking down/porting of the changes in apache#6965. That PR will be closed upon the merge of this one. # Rationale for this change This change brings over the remaining integration tests present in the original PR, which validate the reader logic against the files from `testing/data/avro`. PRs containing this logic have already been merged (but are not yet released), which these tests now validate. # What changes are included in this PR? The following files are now read in: - alltypes_dictionary.avro - alltypes_nulls_plain.avro - binary.avro - dict-page-offset-zero.avro - avro/list_columns.avro - nested_lists.snappy.avro - single_nan.avro - datapage_v2.snappy.avro - nested_records.avro - repeated_no_annotation.avro # Are these changes tested? This PR consists of integration tests validating code merged recently into this crate. No changes in functionality are included. # Are there any user-facing changes? N/A
…pache#7968) # Which issue does this PR close? Closes apache#6356 # Rationale for this change Now that apache#7940 is merged, nothing useful can be done with the `dict_id` field; therefore, it is now safe to remove this requirement. This was also split out from: apache#7467 # What changes are included in this PR? No longer require the `dict_id` fields of two `Field`s of schemas being merged to be equal, as at this point the `dict_id` is only an IPC concern, and the fact that it is still in the struct definition is just legacy, marked for removal; we're just going through the proper procedure of deprecating and replacing the APIs that use it. # Are these changes tested? Tests passing. # Are there any user-facing changes? No API changes, just a behavior change that was to be expected and desired due to the deprecations around the `dict_id` field. @alamb @adriangb @tustvold
…e#7966) # Which issue does this PR close? - Part of apache#4886 - Follow up to apache#7834 # Rationale for this change The initial Avro reader implementation contained an under-developed and temporary safeguard to prevent infinite loops when processing records that consumed zero bytes from the input buffer. When the `Decoder` reported that zero bytes were consumed, the `Reader` would advance its cursor to the end of the current data block. While this successfully prevented an infinite loop, it had the critical side effect of silently discarding any remaining data in that block, leading to potential data loss. This change enhances the decoding logic to handle these zero-byte values correctly, ensuring that the `Reader` makes proper progress without dropping data and without risking an infinite loop. # What changes are included in this PR? - **Refined Decoder Logic**: The `Decoder` has been updated to accurately track and report the number of bytes consumed for all values, including valid zero-length records like `null` or empty `bytes`. This ensures the decoder always makes forward progress. - **Removal of Data-Skipping Safeguard**: The logic in the `Reader` that previously advanced to the end of a block on a zero-byte read has been removed. The reader now relies on the decoder to report accurate consumption and advances its cursor incrementally and safely. - New integration test using a temporary `zero_byte.avro` file created via this Python script: https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395 # Are these changes tested? Yes, a new `test_read_zero_byte_avro_file` test was added that reads the new `zero_byte.avro` file and confirms the update. # Are there any user-facing changes? N/A # Follow-Up PRs 1. PR to update `test_read_zero_byte_avro_file` once apache/arrow-testing#109 is merged in.
# Which issue does this PR close? - Closes apache#7899 . This PR avoids the extra allocation for the object builder and the later buffer copy. # Rationale for this change Avoid the extra allocation in the object builder, as described in the issue. # What changes are included in this PR? This removes the internal `buffer` in `ObjectBuilder`. All data insertion is done directly into the parent buffer wrapped in `parent_state`. The corresponding new fields are added to `ObjectBuilder`. - Add `object_start_offset` in `ObjectBuilder`, which describes the start offset in the parent buffer for the current object - Add `has_been_finished` in `ObjectBuilder`, which describes whether the current object has been finished; it will be used in the `Drop` function. This patch modifies the logic of the `new`, `finish`, `parent_state`, and `drop` functions accordingly. In particular, it writes data into the parent buffer directly when adding a field to the object (i.e., when `insert`/`try_insert` is called). When finalizing the object (`finish` is called), as the header and field ids must be placed in front of the data in the buffer, the builder shifts the written data bytes to make the necessary space for the header and field ids. Then it writes the header and field ids. In `drop`, if the builder is not finalized before being dropped, it will truncate the written bytes to roll back the parent buffer status. # Are these changes tested? The logic is covered by the existing tests. # Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <[email protected]>
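The core trick can be shown with plain `Vec` operations (an illustration of the approach, not the actual `ObjectBuilder` code): field data is appended directly to the shared buffer, and the header is only shifted into place when the object is finished.

```rust
/// Append field data for an object directly into `buf`, then retroactively
/// insert `header` in front of the region that was written. Returns the
/// offset where the finished object starts.
fn finish_object(buf: &mut Vec<u8>, object_start: usize, header: &[u8]) -> usize {
    // Make room at `object_start` by shifting the already-written field data
    // to the right; `splice` with an empty removal range does exactly that.
    buf.splice(object_start..object_start, header.iter().copied());
    object_start
}

fn main() {
    let mut buf = b"prefix".to_vec();
    let start = buf.len();

    // Field data is written straight into the shared buffer...
    buf.extend_from_slice(b"field-data");

    // ...and the header is only known (and inserted) at finish time.
    let at = finish_object(&mut buf, start, b"HDR:");
    assert_eq!(&buf[at..], b"HDR:field-data".as_slice());
    assert_eq!(&buf[..at], b"prefix".as_slice());
}
```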
# Which issue does this PR close? - Closes apache#7686 # Rationale for this change int96 min/max statistics emitted by arrow-rs are incorrect. # What changes are included in this PR? 1. Fix the int96 stats 2. Add round-trip test to verify the behavior # Not included in this PR: 1. Read stats only from known good writers. This will be implemented after a new arrow-rs release. # Are there any user-facing changes? The int96 min/max statistics will be different and correct. --------- Co-authored-by: Rahul Sharma <[email protected]> Co-authored-by: Ed Seidl <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Alkis Evlogimenos <[email protected]>
# Rationale for this change If a variant has an unsorted dictionary, you can't assume fields are unique nor ordered by name. This PR updates the logical equality check among `VariantMetadata` to properly handle this case. - Closes apache#7952 It also fixes a bug in apache#7934 where we do a uniqueness check when probing an unsorted dictionary --------- Co-authored-by: Andrew Lamb <[email protected]>
…and writers (apache#7841) # Which issue does this PR close? - Finishes remaining work and closes apache#6661. # What changes are included in this PR? This change adds `decimal32` and `decimal64` support to Parquet, JSON and CSV readers and writers. It does not change the current default behavior of the Parquet reader which (in the absence of a specification that says otherwise) will still translate the INT32 physical type with a logical DECIMAL type into a `decimal128` instead of a `decimal32`. # Are these changes tested? Yes. # Are there any user-facing changes? The `decimal32` and `decimal64` types are now supported in Parquet, JSON and CSV readers and writers. --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Matthijs Brobbel <[email protected]>
…#7911) # Which issue does this PR close? - part of apache#6736 - Closes apache#7964 - Follow on to apache#7905 # Rationale for this change In a quest to have the fastest and most efficient Variant implementation I would like to avoid copies if at all possible Right now, to make a VariantArray first requires completing an individual buffer and appending it to the array. Let's make that faster by having the VariantBuilder append directly into the buffer # What changes are included in this PR? 1. Add `VariantBuilder::new_from_existing` 2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers # Are these changes tested? 1. New unit tests 1. Yes by existing tests # Are there any user-facing changes? Hopefully faster performance --------- Co-authored-by: Congxian Qiu <[email protected]> Co-authored-by: Liang-Chi Hsieh <[email protected]>
…overflows (apache#7887) # Which issue does this PR close? Closes apache#7886. # Rationale for this change Casting large `Decimal256` values to `Float64` can exceed the representable range of floating point numbers. Previously, this could result in a panic due to unwrapping a failed conversion. This PR introduces a safe conversion that saturates overflowing values to `INFINITY` or `-INFINITY`, following standard floating point semantics. This ensures stable, predictable behavior without runtime crashes. # What changes are included in this PR? - Introduced a helper function `decimal256_to_f64` that converts `i256` to `f64`, returning `INFINITY` or `-INFINITY` when the value is out of range. - Updated the casting logic for `Decimal256` → `Float64` to use the new safe conversion. - Improved inline and module-level documentation to reflect that this conversion is lossy and saturating. - Added a unit test `test_cast_decimal256_to_f64_overflow` to validate overflow behavior. # Are there any user-facing changes? Yes. - **Behavior Change:** When casting `Decimal256` values that exceed the `f64` range, users now receive `INFINITY` or `-INFINITY` instead of a panic. - **Improved Docs:** Updated documentation clarifies the lossy and saturating behavior of decimal-to-float casting. - **Not a Breaking Change:** There are no API changes, but users relying on panics for overflow detection may observe different behavior.
…r coalesce (apache#7967) # Which issue does this PR close? This is a very interesting idea that we only calculate the data buffer size when we choose to gc, because we almost only care about the gc for data buffers, not for other field views/nulls. GC is only for databuffers, so the *2 calculation should also compare the databuffer size? # Rationale for this change optimize actual_buffer_size to use only data buffer capacity # What changes are included in this PR? optimize actual_buffer_size to use only data buffer capacity # Are these changes tested? The performance improvement for some high select benchmark with low null ratio is very good about 2X fast: ```rust cargo bench --bench coalesce_kernels "single_utf8view" Compiling arrow-select v55.2.0 (/Users/zhuqi/arrow-rs/arrow-select) Compiling arrow-cast v55.2.0 (/Users/zhuqi/arrow-rs/arrow-cast) Compiling arrow-string v55.2.0 (/Users/zhuqi/arrow-rs/arrow-string) Compiling arrow-ord v55.2.0 (/Users/zhuqi/arrow-rs/arrow-ord) Compiling arrow-csv v55.2.0 (/Users/zhuqi/arrow-rs/arrow-csv) Compiling arrow-json v55.2.0 (/Users/zhuqi/arrow-rs/arrow-json) Compiling arrow v55.2.0 (/Users/zhuqi/arrow-rs/arrow) Finished `bench` profile [optimized] target(s) in 13.26s Running benches/coalesce_kernels.rs (target/release/deps/coalesce_kernels-bb9750abedb10ad6) filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001 time: [30.946 ms 31.071 ms 31.193 ms] change: [−1.7086% −1.1581% −0.6036%] (p = 0.00 < 0.05) Change within noise threshold. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) low mild 1 (1.00%) high mild filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01 time: [3.8178 ms 3.8311 ms 3.8444 ms] change: [−4.0521% −3.5467% −3.0345%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild Benchmarking filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 40. filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1 time: [1.9337 ms 1.9406 ms 1.9478 ms] change: [+0.3699% +0.9557% +1.5666%] (p = 0.00 < 0.05) Change within noise threshold. Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) low mild 3 (3.00%) high severe filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8 time: [797.60 µs 805.31 µs 813.85 µs] change: [−59.177% −58.412% −57.639%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001 time: [43.742 ms 43.924 ms 44.108 ms] change: [−1.2146% −0.5778% +0.0828%] (p = 0.08 > 0.05) No change in performance detected. filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01 time: [5.5736 ms 5.5987 ms 5.6247 ms] change: [−0.2381% +0.4740% +1.1711%] (p = 0.18 > 0.05) No change in performance detected. filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1 time: [2.2963 ms 2.3035 ms 2.3109 ms] change: [−0.9314% −0.5125% −0.0931%] (p = 0.02 < 0.05) Change within noise threshold. Benchmarking filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50. 
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8 time: [1.5482 ms 1.5697 ms 1.5903 ms] change: [−45.794% −44.386% −43.000%] (p = 0.00 < 0.05) Performance has improved. ``` If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? # Are there any user-facing changes? If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --------- Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close? - Follow on to apache#7687 # Rationale for this change I merged apache#7687 without addressing one of @emkornfield's suggestions: https://github.com/apache/arrow-rs/pull/7687/files#r2205393903 # What changes are included in this PR? Implement the suggestion (restore a comment) # Are these changes tested? By CI # Are there any user-facing changes? No
Which issue does this PR close?
Rebase on upstream main