Conversation

@maksym-iv-ef
Which issue does this PR close?

Rebase on upstream main

brancz and others added 30 commits May 13, 2025 09:25
* arrow-select: Implement concat for `RunArray`s

It's unclear whether the added work and complexity of merging the last
run of one array with the first run of the next array being
concatenated is worth the potential benefit. For now this
implementation focuses on the ability to concatenate `RunArray`s in the
first place.
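
For reference, a minimal sketch of what this enables via the `concat` kernel (crate paths assume the split `arrow-array` / `arrow-select` crates; through the `arrow` facade this is `arrow::compute::concat`). As noted above, runs are not merged across the array boundary:

```rust
use arrow_array::builder::StringRunBuilder;
use arrow_array::types::Int32Type;
use arrow_array::Array;
use arrow_select::concat::concat;

fn main() {
    let mut b = StringRunBuilder::<Int32Type>::new();
    b.append_value("a");
    b.append_value("a");
    b.append_value("b");
    let left = b.finish();

    let mut b = StringRunBuilder::<Int32Type>::new();
    b.append_value("b");
    b.append_value("c");
    let right = b.finish();

    // The trailing "b" run of `left` and the leading "b" run of `right`
    // remain separate runs in the result; only the logical values are
    // concatenated.
    let combined = concat(&[&left, &right]).unwrap();
    assert_eq!(combined.len(), 5);
}
```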

* Address comments
* Speedup arith

* Cleanup

* Fmt

* Remove more unsafe

* Remove more unsafe
* Add arrow_reader_clickbench

* update comments, fix Q1 bug

* Polish comments

* fix typo
* Improve documentation for Parquet `WriterProperties`

* typo

* Apply suggestions from code review

Co-authored-by: Ed Seidl <[email protected]>

---------

Co-authored-by: Ed Seidl <[email protected]>
* Improve Field docs, add missing `Field::{set_name, set_data_type, set_nullable}`

* Apply suggestions from code review

Co-authored-by: Ed Seidl <[email protected]>

---------

Co-authored-by: Ed Seidl <[email protected]>
* Fix Clippy in CI for Rust 1.87 release

* fmt

* Allow

* Add another allow
…ows to decode (apache#7502)

* Introduce `ReadPlan` to encapsulate the calculation of what rows to decode

* Update parquet/src/arrow/arrow_reader/read_plan.rs

Co-authored-by: Ed Seidl <[email protected]>

---------

Co-authored-by: Ed Seidl <[email protected]>
…` property is enabled (apache#7509)

* RecordBatchDecoder: respect skip_validation property

* Update arrow-ipc/src/reader.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update arrow-ipc/src/reader.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
…r of columns (~ -15%) (apache#7530)

* Unroll sort up to 5 columns

* Unroll sort up to 5 columns

* Unroll sort up to 5 columns

* Unroll sort up to 5 columns

* Unroll sort up to 5 columns

* Unroll sort up to 5 columns

* Unroll sort up to 5 columns

* Update arrow-ord/src/sort.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
…7511)

* Add support for records with int32 encoding of TIME_MILLIS

* Add test

---------

Co-authored-by: Andrew Lamb <[email protected]>
…pache#7522)

* Add single row filter test

* DRI

* Update parquet/src/arrow/async_reader/mod.rs

* Improve comments for other tests
* Factor ArrayDataBuilder into a new struct

* fit
* Support Utf8View for Avro

* code review improvements
apache#7517)

Struct array concatenation previously fell back to concat_fallback, which uses
MutableArrayData. This is an issue because any nested types within the struct
will not benefit from optimized concatenation. Dictionaries within structs,
for example, would not be merged and could result in excessive memory usage.

This commit adds no tests because there are already existing tests for struct
concatenation.
* unit test: filter empty struct

* When filtering struct arrays, always build with a length, since one
exists from the predicate. This prevents panics in situations where a
struct array has empty columns.
* Added support for reading Avro Maps types

* Fixed lint errors, improved readability of `read_blockwise_items`, added `Map` comments and improved `Map` nullability handling in `data_type` in codec.rs
apache#7534)

* fix: Panic in pretty_format function when displaying DurationSecondsArray with i64::MIN / i64::MAX

* Address comments

* Avoid changes for infallible functions

* Fix compilation

* fix docs

---------

Co-authored-by: Andrew Lamb <[email protected]>
…1st stream element is an Err (apache#7492)

* Fix bug and add tests to verify

* Remove unnecessary let assignment.

* Add error callback for `do_put`

* cargo fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>
…che#7427)

* Add support for creating random Decimal128 and Decimal256 arrays

* chore

* chore: clippy
…pache#7481)

* Improve error messages if schema hint mismatches with the parquet schema

* clippy

* Add vestigial metadata check
* chore: qualified results in parquet derive as std::result

* testing: added test for hygienic macro in parquet derive

* suggestion: harden test on parquet derive
friendlymatthew and others added 30 commits July 16, 2025 13:41
…pache#7922)

# Which issue does this PR close?

- Part of apache#7896

# Rationale for this change

In apache#7896, we saw that inserting a
large number of field names takes a long time -- in this case ~45s to
insert 2**24 field names. The bulk of this time is spent just allocating
the strings, but we also see quite a bit of time spent reallocating the
`IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront
which avoids having to reallocate and rehash the entire `IndexSet`
during field name insertion. Using this method requires at least 2
string allocations for each field name -- 1 to declare field names
upfront and 1 to insert the actual field name during object building.

This PR adds a new method `with_field_name_capacity` which allows you to
reserve space in the metadata builder without needing to allocate the
field names themselves upfront. In this case, we see a modest
performance improvement when inserting the field names during object
building.

Before: 
<img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08
43 PM"
src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c"
/>


After:
<img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08
55 PM"
src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3"
/>
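
A minimal std-only sketch of the difference between the two strategies described above, with a plain `HashSet` standing in for the builder's `IndexSet` (names and counts are illustrative only):

```rust
use std::collections::HashSet;

fn main() {
    let names: Vec<String> = (0..1_000).map(|i| format!("field_{i}")).collect();

    // with_field_names: every name is declared upfront, so each string is
    // allocated once here and again when the real field is inserted later.
    let upfront: HashSet<String> = names.iter().cloned().collect();

    // with_field_name_capacity: only the table space is reserved, so each
    // name is allocated a single time, when it is actually inserted, and
    // the set never has to rehash while growing.
    let mut reserved: HashSet<String> = HashSet::with_capacity(names.len());
    for name in names {
        reserved.insert(name);
    }

    assert_eq!(upfront.len(), reserved.len());
}
```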
…che#7914)

# Which issue does this PR close?

- Fixes apache#7907

# Rationale for this change

When trying to append `VariantObject` or `VariantList`s directly on the
`VariantBuilder`, it will panic.


# Changes to the public API

`VariantBuilder` now has these additional methods: 

- `append_object`, will panic if shallow validation fails or the object
has duplicate field names
- `try_append_object`, will perform full validation on the object before
appending

- `append_list`, will panic if shallow validation fails
- `try_append_list`, will perform full validation on the list before
appending

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Closes apache#7893

# What changes are included in this PR?

In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to
create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define
a path to access a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a
`VariantArray`. In the future, this would also cast the values to a
specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.

Current limitations:
- It can only return another VariantArray. Casts are not implemented
yet.
- Only top-level object/list access is supported. It panics on finding a
nested object/list. Needs apache#7914
to fix this.
- Perf is a TODO.
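
A conceptual sketch of the path-traversal idea behind `Variant::get_path`, using stand-in types rather than the actual parquet-variant structs (whose exact shape may differ):

```rust
/// Stand-ins for a variant value and a path step (not the parquet-variant types).
enum Value {
    Int(i64),
    List(Vec<Value>),
    Object(Vec<(String, Value)>),
}

enum PathStep<'a> {
    Field(&'a str),
    Index(usize),
}

/// Walk the path, returning a reference into the existing value
/// (no casting, no copying), mirroring the description above.
fn get_path<'a>(mut v: &'a Value, path: &[PathStep]) -> Option<&'a Value> {
    for step in path {
        v = match (v, step) {
            (Value::Object(fields), PathStep::Field(name)) => {
                &fields.iter().find(|(k, _)| k == name)?.1
            }
            (Value::List(items), PathStep::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(v)
}

fn main() {
    let v = Value::Object(vec![(
        "a".to_string(),
        Value::List(vec![Value::Int(7)]),
    )]);
    let got = get_path(&v, &[PathStep::Field("a"), PathStep::Index(0)]);
    assert!(matches!(got, Some(Value::Int(7))));
}
```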

# Are these changes tested?

Some basic unit tests are added.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Andrew Lamb <[email protected]>
…he#7774)

# Which issue does this PR close?


- Part of apache#7762


# Rationale for this change

As part of apache#7762 I want to
optimize applying filters by adding a new code path.

To ensure that works well, let's ensure the filtered code path is well
covered with tests


# What changes are included in this PR?

1. Add tests for filtering batches with 0.01%, 1%, 10% and 90% and
varying data types


# Are these changes tested?
Only tests, no functional changes


# Are there any user-facing changes?
…ath (apache#7942)

# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Follow on to apache#7919

# Rationale for this change

While reviewing apache#7919 from
@Samyak2 I found I wanted to write some additional tests directly for
`Variant::get_path`

When I started doing that I found it was somewhat awkward to write
examples, so I added some new conversion routines to make it easier.

# What changes are included in this PR?

1. Add doc examples (and thus tests) of `VariantGet` and `VariantPath`
2. Add more documentation

# Are these changes tested?
Yes, by doc examples which run in CI
# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Closes apache#7901 .

# Rationale for this change

Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

# What changes are included in this PR?

There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.

# Are these changes tested?

We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

---------

Signed-off-by: codephage2020 <[email protected]>
…pache#7931)

# Which issue does this PR close?

- Closes #NNN.

This is minor but I can create an issue if needed.

# Rationale for this change

`make_builder` currently errors with `Data type Utf8View is not
currently supported`.

# What changes are included in this PR?

Support `DataType::Utf8View` and `DataType::BinaryView` in
`make_builder`.
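
A hedged sketch of what this enables (assuming the usual `arrow-array` / `arrow-schema` crate layout; via the `arrow` facade the paths are `arrow::array::builder` and `arrow::datatypes`):

```rust
use arrow_array::builder::{make_builder, ArrayBuilder, StringViewBuilder};
use arrow_array::Array;
use arrow_schema::DataType;

fn main() {
    // Previously this returned the "not currently supported" error noted above.
    let mut builder = make_builder(&DataType::Utf8View, 10);
    let sv = builder
        .as_any_mut()
        .downcast_mut::<StringViewBuilder>()
        .expect("Utf8View should yield a StringViewBuilder");
    sv.append_value("hello");
    let array = sv.finish();
    assert_eq!(array.len(), 1);
}
```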

# Are these changes tested?

Only via the exhaustive enum match. It doesn't look like there are any
tests for `make_builder` in that file?

# Are there any user-facing changes?

No
# Rationale for this change

- Closes apache#7948

This PR introduces a custom implementation of `PartialEq` for variant
objects.

According to the spec, field values are not required to be in the same
order as the field IDs, to enable flexibility when constructing Variant
values.

Instead of comparing the raw bytes of 2 variant objects, this
implementation recursively checks whether the field values are equal --
regardless of their order
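
A conceptual sketch of order-insensitive object equality, with string pairs standing in for the recursive Variant field comparison described above:

```rust
use std::collections::HashMap;

/// Two objects are logically equal if they contain the same (name, value)
/// pairs, regardless of the order in which the fields were written.
fn objects_logically_equal(a: &[(&str, &str)], b: &[(&str, &str)]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let bm: HashMap<_, _> = b.iter().cloned().collect();
    a.iter().all(|(k, v)| bm.get(k) == Some(v))
}

fn main() {
    assert!(objects_logically_equal(
        &[("a", "1"), ("b", "2")],
        &[("b", "2"), ("a", "1")]
    ));
    assert!(!objects_logically_equal(&[("a", "1")], &[("a", "2")]));
}
```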
# Which issue does this PR close?

Optimize `partition_validity` function used in sort kernels
    
- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of
`Array::is_valid`
- Avoid capacity checks inside loop by writing to `spare_capacity_mut`
instead of using `push`
- Closes apache#7936.
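
A minimal sketch of the preallocation idea (the real kernel reads a `NullBuffer` and writes through `spare_capacity_mut`; a bool slice and plain pushes stand in here):

```rust
/// Partition row indices into (valid, null) given a validity bitmap,
/// where `validity[i] == true` means row i is valid.
fn partition_validity(validity: &[bool]) -> (Vec<u32>, Vec<u32>) {
    // The null count is already known to Arrow arrays, so both output
    // vectors can be sized exactly once, up front.
    let null_count = validity.iter().filter(|v| !**v).count();
    let mut valid = Vec::with_capacity(validity.len() - null_count);
    let mut nulls = Vec::with_capacity(null_count);
    for (i, is_valid) in validity.iter().enumerate() {
        if *is_valid {
            valid.push(i as u32);
        } else {
            nulls.push(i as u32);
        }
    }
    (valid, nulls)
}

fn main() {
    let (valid, nulls) = partition_validity(&[true, false, true, true, false]);
    assert_eq!(valid, vec![0, 2, 3]);
    assert_eq!(nulls, vec![1, 4]);
}
```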

# Rationale for this change

Microbenchmark results for `sort_kernels` compared to `main`, only
looking at benchmarks matching "nulls to indices":

```
sort i32 nulls to indices 2^10
                        time:   [4.9325 µs 4.9370 µs 4.9422 µs]
                        change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
                        Performance has improved.

sort i32 nulls to indices 2^12
                        time:   [20.096 µs 20.209 µs 20.327 µs]
                        change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
                        Performance has improved.

sort f32 nulls to indices 2^12
                        time:   [26.329 µs 26.366 µs 26.406 µs]
                        change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-10] nulls to indices 2^12
                        time:   [70.667 µs 70.762 µs 70.886 µs]
                        change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[0-100] nulls to indices 2^12
                        time:   [101.98 µs 102.44 µs 102.99 µs]
                        change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
                        No change in performance detected.

sort string[0-400] nulls to indices 2^12
                        time:   [84.952 µs 85.024 µs 85.102 µs]
                        change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] nulls to indices 2^12
                        time:   [72.486 µs 72.664 µs 72.893 µs]
                        change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[100] nulls to indices 2^12
                        time:   [71.354 µs 71.606 µs 71.902 µs]
                        change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[1000] nulls to indices 2^12
                        time:   [73.088 µs 73.195 µs 73.311 µs]
                        change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[10] nulls to indices 2^12
                        time:   [32.592 µs 32.654 µs 32.731 µs]
                        change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view[0-400] nulls to indices 2^12
                        time:   [32.981 µs 33.074 µs 33.189 µs]
                        change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string_view_inlined[0-12] nulls to indices 2^12
                        time:   [28.467 µs 28.496 µs 28.529 µs]
                        change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
                        Performance has improved.

sort string[10] dict nulls to indices 2^12
                        time:   [94.463 µs 94.503 µs 94.542 µs]
                        change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
                        Performance has improved.
```

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

No, the method is internal to the sort kernels.
# Which issue does this PR close?

- Closes apache#7949

# Rationale for this change

I would like it to be easier / more ergonomic to make objects

# What changes are included in this PR?

1. Add `ObjectBuilder::with_field`
2. Add documentation w/ examples
3. Rewrite some tests 
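
A stand-in sketch of the chaining ergonomics this enables; the struct below is illustrative only and is not the parquet-variant `ObjectBuilder`:

```rust
struct ObjectSketch {
    fields: Vec<(String, i64)>,
}

impl ObjectSketch {
    fn new() -> Self {
        Self { fields: Vec::new() }
    }

    /// Insert a field and return the builder, so calls can be chained.
    fn with_field(mut self, name: &str, value: i64) -> Self {
        self.fields.push((name.to_string(), value));
        self
    }
}

fn main() {
    let obj = ObjectSketch::new().with_field("a", 1).with_field("b", 2);
    assert_eq!(obj.fields.len(), 2);
}
```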

# Are these changes tested?

Yes, by doc tests

# Are there any user-facing changes?

Yes a new API
…ntArray (apache#7945)

# Which issue does this PR close?


- Closes apache#7920.

# Are these changes tested?


Tests were already implemented

# Are there any user-facing changes?

None
…che#7956)

# Which issue does this PR close?

- Follow-up to apache#7901

# Rationale for this change

- apache#7934

Introduced a minor regression, in (accidentally?) forbidding the empty
string as a dictionary key. Fix the bug and simplify the code a bit
further while we're at it.

# What changes are included in this PR?

Revert the unsorted dictionary check back to what it had been (it just
uses `Iterator::is_sorted_by` now, instead of
`primitive.slice::is_sorted_by`).

Remove the redundant offset monotonicity check from the ordered
dictionary path, relying on the fact that string slice extraction will
anyway fail if the offsets are not monotonic. Improve the error message
now that it does double duty.
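
A sketch of the simplified check, using `Iterator::is_sorted_by` as described above (the real code runs over the metadata dictionary's strings, not a plain slice):

```rust
/// A dictionary can use the "sorted" fast path when its keys are strictly
/// increasing, i.e. sorted and unique. Requires Rust 1.82+ for is_sorted_by.
fn is_sorted_and_unique(keys: &[&str]) -> bool {
    keys.iter().is_sorted_by(|a, b| a < b)
}

fn main() {
    assert!(is_sorted_and_unique(&["", "a", "b"])); // the empty string is allowed
    assert!(!is_sorted_and_unique(&["a", "a"]));    // duplicates fail
    assert!(!is_sorted_and_unique(&["b", "a"]));    // out of order fails
}
```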

# Are these changes tested?

New unit tests for dictionaries containing the empty string. As a side
effect, we now have at least a little coverage for sorted dictionaries
-- somehow, I couldn't find any existing unit test that creates a sorted
dictionary??

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Part of apache#7911
- Part of  apache#6736
- Follow on to apache#7905


# Rationale for this change

I wrote benchmarks for some changes to the json decoder in
apache#7911, but they are non-trivial.
To keep apache#7911 easier to review I
have pulled the benchmarks out into their own PR

# What changes are included in this PR?

1. Add new JSON benchmark
2. Include the `variant_get` benchmark added in
apache#7919 by @Samyak2
# Are these changes tested?

I tested them manually and clippy CI coverage ensures they compile

# Are there any user-facing changes?

No these are only benchmarks
# Which issue does this PR close?


- Closes apache#7951 .

# Rationale for this change


# What changes are included in this PR?


# Are these changes tested?

Yes
# Are there any user-facing changes?

New API

Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close?
- Follow on to apache#7943
- Part of apache#7948

# Rationale for this change

I found a few more tests I would like to have seen while reviewing
apache#7943

# What changes are included in this PR?

Add some list equality tests

# Are these changes tested?

It is only tests, no functionality changes

# Are there any user-facing changes?

No
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Closes apache#7947 .

# Rationale for this change

Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

# What changes are included in this PR?

There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.

# Are these changes tested?

We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close?


-  Related to apache#6736 


# Rationale for this change

I noticed in apache#7956 that some
Clippy errors were introduced but not caught by CI.

# What changes are included in this PR?

Add `parquet-variant-compute` to the CI for parquet-variant related PRs

# Are these changes tested?

It is only tests


# Are there any user-facing changes?
No
# Which issue does this PR close?

Does not yet close, but contributes towards:

- apache#6356 
- apache#5981 
- apache#1206

# Rationale for this change

See the above issues. And this is a follow up to

* apache#6711
* apache#6873

This was also split out from:
apache#7929

# What changes are included in this PR?

This removes the API to allow preserving `dict_id` set in the `Schema`'s
`Field` within arrow-ipc and arrow-flight. This is in an effort to
remove the `dict_id` field entirely and make it an IPC/flight-only
concern.

# Are these changes tested?

Yes, all existing tests continue to pass.

# Are there any user-facing changes?

Yes, these previously (in 54.0.0) deprecated functions/fields are
removed:

* `arrow_ipc::DictionaryTracker.set_dict_id`
* `arrow_ipc::DictionaryTracker::new_with_preserve_dict_id`
* `arrow_ipc::IpcWriteOptions.with_preserve_dict_id`
* `arrow_ipc::IpcWriteOptions.preserve_dict_id` (function and field)
* `arrow_ipc::schema_to_fb`
* `arrow_ipc::schema_to_bytes`
# Which issue does this PR close?

- Part of apache#4886
- Related to apache#6965

# Rationale for this change

This change introduces support for Avro files generated by systems like
Impala, which have a specific convention for representing nullable
fields. In Avro, nullability is typically represented by a union of a
type and `null`. This PR updates the Avro reader to correctly interpret
these schemas, ensuring proper handling of nullable data and improving
interoperability with Impala-generated data.

# What changes are included in this PR?

This pull request introduces several changes to support Impala-style
nullability in the Avro reader:
- The Avro schema parser has been updated to recognize unions where
`null` is the second type (e.g., `['type', 'null']`) as a nullable field.
- Logic has been added to handle this nullability convention during Avro
decoding.
- New tests are included to verify that Avro files using this
nullability format are read correctly while ensuring that strict mode
properly identifies them.
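
A conceptual sketch of the schema-parsing rule described above, with plain strings standing in for the parsed Avro schema types:

```rust
/// Treat a two-branch Avro union containing "null" as a nullable field,
/// whether "null" comes first (standard) or second (Impala-style).
/// Returns the non-null branch when the union is nullable.
fn union_nullability(branches: &[&str]) -> Option<String> {
    match branches {
        ["null", t] | [t, "null"] => Some(t.to_string()),
        _ => None,
    }
}

fn main() {
    assert_eq!(union_nullability(&["null", "int"]).as_deref(), Some("int"));
    assert_eq!(union_nullability(&["int", "null"]).as_deref(), Some("int")); // Impala-style
    assert_eq!(union_nullability(&["int", "string"]), None);
}
```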

# Are these changes tested?

Yes, I added new test cases covering these changes to the tests named:
`test_nonnullable_impala`, `test_nonnullable_impala_strict`,
`test_nullable_impala` and `test_nullable_impala_strict`.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Connor Sanders <[email protected]>
# Which issue does this PR close?

Part of apache#4886

Completes the breaking down/porting of the changes in
apache#6965, which will be closed
upon merge of this PR.

# Rationale for this change

This change brings over the remaining integration tests present in the
original PR, which validate the reader logic against the files from
`testing/data/avro`. PRs containing this logic have already been merged
(but are not yet released) which these tests now validate.

# What changes are included in this PR?

The following files are now read in:

- alltypes_dictionary.avro
- alltypes_nulls_plain.avro
- binary.avro
- dict-page-offset-zero.avro
- avro/list_columns.avro
- nested_lists.snappy.avro
- single_nan.avro
- datapage_v2.snappy.avro
- nested_records.avro
- repeated_no_annotation.avro

# Are these changes tested?

This PR consists of integration tests validating code merged recently
into this crate. No changes in functionality are included.

# Are there any user-facing changes?

N/A
…pache#7968)

# Which issue does this PR close?

Closes apache#6356

# Rationale for this change

Now that apache#7940 is merged, nothing
useful can be done with the `dict_id` field, therefore, it is now safe
to be removed from this requirement.

This was also split out from:
apache#7467

# What changes are included in this PR?

No longer require the `dict_id` fields of two `Field`s of schemas being
merged to be equal. At this point the `dict_id` is only an IPC
concern; the fact that it is still in the struct definition is just
legacy, marked for removal, and we're going through the proper
procedure of deprecating and replacing the APIs that use it.

# Are these changes tested?

Tests passing.

# Are there any user-facing changes?

No API changes, just a behavior change, that was to be expected and
desired due to the deprecations around the `dict_id` field.

@alamb @adriangb @tustvold
…e#7966)

# Which issue does this PR close?

- Part of apache#4886

- Follow up to apache#7834

# Rationale for this change

The initial Avro reader implementation contained an under-developed and
temporary safeguard to prevent infinite loops when processing records
that consumed zero bytes from the input buffer.

When the `Decoder` reported that zero bytes were consumed, the `Reader`
would advance its cursor to the end of the current data block. While
this successfully prevented an infinite loop, it had the critical side
effect of silently discarding any remaining data in that block, leading
to potential data loss.

This change enhances the decoding logic to handle these zero-byte values
correctly, ensuring that the `Reader` makes proper progress without
dropping data and without risking an infinite loop.

# What changes are included in this PR?

- **Refined Decoder Logic**: The `Decoder` has been updated to
accurately track and report the number of bytes consumed for all values,
including valid zero-length records like `null` or empty `bytes`. This
ensures the decoder always makes forward progress.
- **Removal of Data-Skipping Safeguard**: The logic in the `Reader` that
previously advanced to the end of a block on a zero-byte read has been
removed. The reader now relies on the decoder to report accurate
consumption and advances its cursor incrementally and safely.
- New integration test using a temporary `zero_byte.avro` file created
via this python script:
https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395

# Are these changes tested?

Yes, a new `test_read_zero_byte_avro_file` test was added that reads the
new `zero_byte.avro` file and confirms the update.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. PR to update `test_read_zero_byte_avro_file` once
apache/arrow-testing#109 is merged in.
# Which issue does this PR close?

- Closes apache#7899 .

This PR avoids the extra allocation for the object builder and
the later buffer copy.

# Rationale for this change

Avoid the extra allocation in the object builder, as described in the issue.

# What changes are included in this PR?

This removes the internal `buffer` in `ObjectBuilder`. All data
insertion is done directly to the parent buffer wrapped in
`parent_state`.

The corresponding new fields are added to `ObjectBuilder`.

- add `object_start_offset` in `ObjectBuilder`, which describes the
start offset in the parent buffer for the current object
- Add `has_been_finished` in `ObjectBuilder`, which describes whether
the current object has been finished; it will be used in the `Drop`
function.

This patch modifies the logic of `new`, `finish`, `parent_state`, and
`drop` function according to the change.

In particular, it writes data into the parent buffer directly when
adding a field to the object (i.e., when `insert`/`try_insert` is called).
When finalizing the object (`finish` is called), since the header and
field ids must be placed in front of the data in the buffer, the builder
shifts the already-written data bytes to make the necessary space for the
header and field ids, and then writes them.

In `drop`, if the builder is not finalized before being dropped, it will
truncate the written bytes to roll back the parent buffer to its previous state.
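
A conceptual sketch of the finish-time shift: values are appended directly to the shared buffer, and the header bytes are spliced in front of them only when the object is finalized (the buffer contents and one-byte header are illustrative, not the real encoding):

```rust
/// Make room for the header at `object_start_offset` by shifting the
/// already-written value bytes to the right and writing the header there.
fn finish_object(buffer: &mut Vec<u8>, object_start_offset: usize, header: &[u8]) {
    buffer.splice(
        object_start_offset..object_start_offset,
        header.iter().copied(),
    );
}

fn main() {
    // Values were appended directly to the shared buffer first...
    let mut buffer = vec![0xAA, 0xBB, 0xCC];
    // ...and the "header" is inserted in front of them at finish time.
    finish_object(&mut buffer, 1, &[0x02]);
    assert_eq!(buffer, vec![0xAA, 0x02, 0xBB, 0xCC]);
}
```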

# Are these changes tested?

The logic is covered by the existing tests.

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Closes apache#7686

# Rationale for this change

int96 min/max statistics emitted by arrow-rs are incorrect. 

# What changes are included in this PR?

1. Fix the int96 stats
2. Add round-trip test to verify the behavior

# Not included in this PR:
1. Read stats only from known good writers. This will be implemented
after a new arrow-rs release.

# Are there any user-facing changes?
The int96 min/max statistics will be different and correct.

---------

Co-authored-by: Rahul Sharma <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Alkis Evlogimenos <[email protected]>
# Rationale for this change

If a variant has an unsorted dictionary, you can't assume fields are
unique or ordered by name. This PR updates the logical equality check
for `VariantMetadata` to properly handle this case.

- Closes apache#7952 

It also fixes a bug in apache#7934
where we do a uniqueness check when probing an unsorted dictionary

---------

Co-authored-by: Andrew Lamb <[email protected]>
…and writers (apache#7841)

# Which issue does this PR close?

- Finishes remaining work and closes apache#6661.

# What changes are included in this PR?

This change adds `decimal32` and `decimal64` support to Parquet, JSON
and CSV readers and writers. It does not change the current default
behavior of the Parquet reader which (in the absence of a specification
that says otherwise) will still translate the INT32 physical type with a
logical DECIMAL type into a `decimal128` instead of a `decimal32`.

# Are these changes tested?

Yes.

# Are there any user-facing changes?

The `decimal32` and `decimal64` types are now supported in Parquet, JSON
and CSV readers and writers.

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Matthijs Brobbel <[email protected]>
…#7911)

# Which issue does this PR close?

- part of apache#6736
- Closes apache#7964
- Follow on to apache#7905

# Rationale for this change

In a quest to have the fastest and most efficient Variant implementation,
I would like to avoid copies if at all possible.
Right now, making a VariantArray first requires completing an
individual buffer and then appending it
to the array.

Let's make that faster by having the VariantBuilder append directly into
the buffer


# What changes are included in this PR?

1. Add `VariantBuilder::new_from_existing`
2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers

# Are these changes tested?

1. New unit tests
2. Yes, by existing tests

# Are there any user-facing changes?

Hopefully faster performance

---------

Co-authored-by: Congxian Qiu <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
…overflows (apache#7887)

# Which issue does this PR close?

Closes apache#7886.

# Rationale for this change

Casting large `Decimal256` values to `Float64` can exceed the
representable range of floating point numbers. Previously, this could
result in a panic due to unwrapping a failed conversion.

This PR introduces a safe conversion that saturates overflowing values
to `INFINITY` or `-INFINITY`, following standard floating point
semantics. This ensures stable, predictable behavior without runtime
crashes.

# What changes are included in this PR?

- Introduced a helper function `decimal256_to_f64` that converts `i256`
to `f64`, returning `INFINITY` or `-INFINITY` when the value is out of
range.
- Updated the casting logic for `Decimal256` → `Float64` to use the new
safe conversion.
- Improved inline and module-level documentation to reflect that this
conversion is lossy and saturating.
- Added a unit test `test_cast_decimal256_to_f64_overflow` to validate
overflow behavior.
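
A conceptual sketch of the saturating behavior (the real helper converts arrow's `i256`; `checked` below is a hypothetical stand-in for whatever fallible conversion the cast uses internally):

```rust
/// When a checked conversion fails because the value is out of the f64
/// range, saturate to the appropriately signed infinity instead of panicking.
fn to_f64_saturating(is_negative: bool, checked: Option<f64>) -> f64 {
    match checked {
        Some(f) => f,
        None if is_negative => f64::NEG_INFINITY,
        None => f64::INFINITY,
    }
}

fn main() {
    assert_eq!(to_f64_saturating(false, None), f64::INFINITY);
    assert_eq!(to_f64_saturating(true, None), f64::NEG_INFINITY);
    assert_eq!(to_f64_saturating(false, Some(1.5)), 1.5);
}
```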

# Are there any user-facing changes?

Yes.

- **Behavior Change:** When casting `Decimal256` values that exceed the
`f64` range, users now receive `INFINITY` or `-INFINITY` instead of a
panic.
- **Improved Docs:** Updated documentation clarifies the lossy and
saturating behavior of decimal-to-float casting.
- **Not a Breaking Change:** There are no API changes, but users relying
on panics for overflow detection may observe different behavior.
…r coalesce (apache#7967)

# Which issue does this PR close?

This is a very interesting idea: we only calculate the data buffer
size when we choose to gc, because we almost only care about gc for the
data buffers, not for the other fields (views/nulls).

Since gc only applies to the data buffers, the *2 calculation should also
compare against the data buffer size.

# Rationale for this change

optimize actual_buffer_size to use only data buffer capacity

# What changes are included in this PR?

optimize actual_buffer_size to use only data buffer capacity
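
A conceptual sketch of the heuristic: compaction is triggered only when the data buffers hold more than twice the bytes the views actually reference, and the views and null buffers are left out of the calculation (names are illustrative, not the arrow-select API):

```rust
/// Decide whether a StringView/BinaryView batch is worth compacting.
fn should_gc(data_buffer_capacity: usize, referenced_data_bytes: usize) -> bool {
    data_buffer_capacity > referenced_data_bytes * 2
}

fn main() {
    // A mostly-filtered batch keeps large data buffers alive: worth compacting.
    assert!(should_gc(1_000_000, 10_000));
    // A dense batch referencing most of its data buffers: leave it as-is.
    assert!(!should_gc(1_000_000, 900_000));
}
```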

# Are these changes tested?

The performance improvement for some high-selectivity benchmarks with a low null
ratio is very good, about 2x faster:

```text
cargo bench --bench coalesce_kernels "single_utf8view"
   Compiling arrow-select v55.2.0 (/Users/zhuqi/arrow-rs/arrow-select)
   Compiling arrow-cast v55.2.0 (/Users/zhuqi/arrow-rs/arrow-cast)
   Compiling arrow-string v55.2.0 (/Users/zhuqi/arrow-rs/arrow-string)
   Compiling arrow-ord v55.2.0 (/Users/zhuqi/arrow-rs/arrow-ord)
   Compiling arrow-csv v55.2.0 (/Users/zhuqi/arrow-rs/arrow-csv)
   Compiling arrow-json v55.2.0 (/Users/zhuqi/arrow-rs/arrow-json)
   Compiling arrow v55.2.0 (/Users/zhuqi/arrow-rs/arrow)
    Finished `bench` profile [optimized] target(s) in 13.26s
     Running benches/coalesce_kernels.rs (target/release/deps/coalesce_kernels-bb9750abedb10ad6)
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001
                        time:   [30.946 ms 31.071 ms 31.193 ms]
                        change: [−1.7086% −1.1581% −0.6036%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild

filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01
                        time:   [3.8178 ms 3.8311 ms 3.8444 ms]
                        change: [−4.0521% −3.5467% −3.0345%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

Benchmarking filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 40.
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1
                        time:   [1.9337 ms 1.9406 ms 1.9478 ms]
                        change: [+0.3699% +0.9557% +1.5666%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high severe

filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8
                        time:   [797.60 µs 805.31 µs 813.85 µs]
                        change: [−59.177% −58.412% −57.639%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001
                        time:   [43.742 ms 43.924 ms 44.108 ms]
                        change: [−1.2146% −0.5778% +0.0828%] (p = 0.08 > 0.05)
                        No change in performance detected.

filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01
                        time:   [5.5736 ms 5.5987 ms 5.6247 ms]
                        change: [−0.2381% +0.4740% +1.1711%] (p = 0.18 > 0.05)
                        No change in performance detected.

filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1
                        time:   [2.2963 ms 2.3035 ms 2.3109 ms]
                        change: [−0.9314% −0.5125% −0.0931%] (p = 0.02 < 0.05)
                        Change within noise threshold.

Benchmarking filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8
                        time:   [1.5482 ms 1.5697 ms 1.5903 ms]
                        change: [−45.794% −44.386% −43.000%] (p = 0.00 < 0.05)
                        Performance has improved.

```

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

# Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?


- Follow on to apache#7687

# Rationale for this change

I merged apache#7687 without addressing
one of @emkornfield's suggestions:
https://github.com/apache/arrow-rs/pull/7687/files#r2205393903

# What changes are included in this PR?

Implement the suggestion (restore a comment)

# Are these changes tested?

By CI

# Are there any user-facing changes?

No