# feat: arrow-ipc delta dictionary support #8001
I worked on adding delta dictionary support a while ago with the help of claude code, but I never ended up having time to polish it up for review. Here's the branch, feel free to steal whatever you want: https://github.com/polarsignals/arrow-rs/tree/asubiotto-delta
Thank you for this contribution @JakeDern 🙏
I need some pointers on the best way to test this using only Rust, but am happy to implement any suggestions 🙂. The validation I've done so far involved using the Go ipc writer to dump stream data to a file, which I then read from Rust:
It looks to me like the pre-existing integration suite may already have this. The data is here (it is a git submodule): https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration
Perhaps you could follow the model of a test like `fn read_0_1_7()`, and read the content of https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/4.0.0-shareddict (instead of 0.17.1).
@asubiotto @alamb Thanks for the pointers! It looks like the test data in 4.0.0-shareddict may not contain any delta dictionaries, as this test passes on main as well as on my branch:

```rust
#[test]
fn read_stream_delta_dictionary() {
    let testdata = arrow_test_data();
    let path = "generated_shared_dict";
    let version = "4.0.0-shareddict";
    verify_arrow_file(&testdata, version, path);
    verify_arrow_stream(&testdata, version, path);
}
```

I'll take a crack at cleaning up the writer code from @asubiotto and incorporating it into this PR, then we can produce the data for testing as well.
@asubiotto @alamb I have an initial implementation of both writer and reader ready for some feedback. Thanks @asubiotto for the starter code, some of the additional tests were especially helpful! Something I'm curious to get thoughts on, after digging through both the Go and Rust writer code, is that the ability of the Rust ipc writer to produce delta dictionaries is very limited. The ipc writers in both Go and Rust look at dictionaries on incoming RecordBatches one at a time, comparing the new dictionary with the old to determine whether the new one is a superset of the old. I thought this was odd because the following test I wrote in Rust produces 1 delta dictionary whereas I thought it would produce 2:

```rust
#[test]
fn test_deltas() {
    // Dictionary resets at ["C", "D"]
    let batches: &[&[&str]] = &[&["A"], &["A", "B"], &["C", "D"], &["A", "B", "E"]];
    run_test(batches, false);
}
```

However, since the writer only looks at the dictionary values of the last RecordBatch, it has to completely reset when a batch like `["C", "D"]` arrives whose dictionary is not a superset of the previous one. This was inconsistent with what I'd seen in Go, where similar code did produce two delta dictionaries, despite seeming to follow the same/similar algorithm. So I dug a bit more and wrote some Go code with two different writers and then read it from Rust. Batches 0, 2, and 3 are written using one dictionary builder and batch 1 is written using another:

```go
dictType := &arrow.DictionaryType{
	IndexType: arrow.PrimitiveTypes.Int16,
	ValueType: arrow.BinaryTypes.String,
	Ordered:   false,
}
schema := arrow.NewSchema([]arrow.Field{
	{Name: "foo", Type: dictType},
}, nil)
buf := bytes.NewBuffer([]byte{})
writer := ipc.NewWriter(buf, ipc.WithSchema(schema), ipc.WithDictionaryDeltas(true))
allocator := memory.NewGoAllocator()

// Batch 0: ["A", "B", "C"] using the first dictionary builder.
dict_builder := array.NewDictionaryBuilder(allocator, dictType)
builder := array.NewStringBuilder(allocator)
builder.AppendStringValues([]string{"A", "B", "C"}, []bool{})
dict_builder.AppendArray(array.NewStringData(builder.NewArray().Data()))
record := array.NewRecord(schema, []arrow.Array{
	dict_builder.NewArray(),
}, 3)
if err := writer.Write(record); err != nil {
	panic(err)
}

// Batch 1: ["A", "B", "D"] using a second (reset) dictionary builder.
dict_builder2 := array.NewDictionaryBuilder(allocator, dictType)
builder.AppendStringValues([]string{"A", "B", "D"}, []bool{})
dict_builder2.AppendArray(array.NewStringData(builder.NewArray().Data()))
record2 := array.NewRecord(schema, []arrow.Array{
	dict_builder2.NewArray(),
}, 3)
if err := writer.Write(record2); err != nil {
	panic(err)
}

// Batch 2: ["A", "B", "E"] back on the first dictionary builder.
builder.AppendStringValues([]string{"A", "B", "E"}, []bool{})
dict_builder.AppendArray(array.NewStringData(builder.NewArray().Data()))
record3 := array.NewRecord(schema, []arrow.Array{
	dict_builder.NewArray(),
}, 3)
if err := writer.Write(record3); err != nil {
	panic(err)
}

// Batch 3: ["A", "B", "D"] again on the first dictionary builder.
builder.AppendStringValues([]string{"A", "B", "D"}, []bool{})
dict_builder.AppendArray(array.NewStringData(builder.NewArray().Data()))
record4 := array.NewRecord(schema, []arrow.Array{
	dict_builder.NewArray(),
}, 3)
if err := writer.Write(record4); err != nil {
	panic(err)
}

// write buf out to ~/delta_test/delta.arrow
if err := os.WriteFile("/home/jakedern/delta_test/delta2.arrow", buf.Bytes(), 0644); err != nil {
	panic(fmt.Errorf("failed to write delta file: %w", err))
}
```

Reading the resulting file from Rust (with some `dbg!` statements in `arrow-ipc/src/reader.rs`) prints:

```text
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = false
[arrow-ipc/src/reader.rs:722:9] &dictionary_values = StringArray
[
  "A",
  "B",
  "C",
]
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = false
[arrow-ipc/src/reader.rs:722:9] &dictionary_values = StringArray
[
  "A",
  "B",
  "D",
]
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = false
[arrow-ipc/src/reader.rs:722:9] &dictionary_values = StringArray
[
  "A",
  "B",
  "C",
  "E",
]
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = true
[arrow-ipc/src/reader.rs:725:5] &dictionary_values = StringArray
[
  "D",
]
```

What we see is that writing a RecordBatch from a different builder basically resets the writer's tracked dictionary, forcing a full replacement. The takeaway is that in Go we need some cooperation between the builder and the ipc writer to get delta dictionaries a reasonable amount of the time. If we shovel RecordBatches from different builders into the same ipc writer then we get bad behavior for delta dictionaries. In Rust, we don't have that cooperation.

My questions are:
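To make the writer behavior described above concrete, here is a minimal sketch of the superset check: since delta dictionaries can only append values, "superset" in practice means the old dictionary must be a prefix of the new one. This uses plain string slices instead of arrow arrays, and `delta_len` is a hypothetical helper, not code from this PR:

```rust
// Sketch of the writer-side decision: a new dictionary can be shipped as a
// delta only if the previous dictionary is a prefix of it; otherwise the
// writer must replace (reset) the dictionary entirely.
fn delta_len(old: &[&str], new: &[&str]) -> Option<usize> {
    if new.len() >= old.len() && new[..old.len()] == *old {
        Some(new.len() - old.len()) // number of appended values to emit as a delta
    } else {
        None // not a superset: full dictionary replacement required
    }
}

fn main() {
    // ["A", "B"] -> ["A", "B", "E"] appends one value: delta of ["E"].
    assert_eq!(delta_len(&["A", "B"], &["A", "B", "E"]), Some(1));
    // ["A", "B"] -> ["C", "D"] diverges: the writer must reset.
    assert_eq!(delta_len(&["A", "B"], &["C", "D"]), None);
}
```

This is why comparing only against the last batch's dictionary loses the delta chain as soon as any batch diverges.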
Curious to hear any thoughts. Thank you both for the help so far!
Based on the comments above, it seems like maybe what we need is a way to efficiently detect whether a value is already in the dictionary. The dictionary builders all keep some kind of internal state that allows efficient lookup of this. For example,

Maybe we could refactor this to be something that's reusable by the IPC writer?
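A rough sketch of that idea, with plain std types standing in for the builder internals (all names here are hypothetical, not arrow-rs API):

```rust
use std::collections::HashMap;

// Dictionary builders already keep a value -> key map internally; if that
// state were reusable, the IPC writer could cheaply test membership and emit
// only genuinely new values as a delta.
#[derive(Default)]
struct DictState {
    keys: HashMap<String, usize>, // value -> dictionary key, for O(1) lookup
    values: Vec<String>,          // dictionary values in insertion order
}

impl DictState {
    /// Returns the key for `v`, interning it if unseen.
    fn get_or_insert(&mut self, v: &str) -> usize {
        if let Some(&k) = self.keys.get(v) {
            return k;
        }
        let k = self.values.len();
        self.keys.insert(v.to_string(), k);
        self.values.push(v.to_string());
        k
    }

    /// Values appended since `mark`, i.e. the delta payload for the writer.
    fn delta_since(&self, mark: usize) -> &[String] {
        &self.values[mark..]
    }
}

fn main() {
    let mut state = DictState::default();
    for v in ["A", "B", "A"] {
        state.get_or_insert(v);
    }
    let mark = state.values.len(); // writer flushes the full dictionary ["A", "B"]
    for v in ["A", "B", "E"] {
        state.get_or_insert(v);
    }
    // Next flush only needs a delta of ["E"], regardless of which batch
    // the values arrived in.
    assert_eq!(state.delta_since(mark), ["E"]);
}
```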
Exposing builder state makes the most sense to me. Thanks for taking this on @JakeDern!
@asubiotto the approach I opted to take is to allow accumulating values only on the builder, via a new `finish_preserve_values` API. I also did a little bit of refactoring to get better visibility into the messages that the reader sees. Since we're trying to improve the conditions under which delta dictionaries are emitted (an optimization), we need this visibility to test precisely rather than relying on heuristics like the size of the underlying stream. Feedback would be greatly appreciated! If this approach seems reasonable then I can add the same API to the other dictionary builders.
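As a toy model of the difference (plain std code, not the actual arrow-rs builder API, whose exact signature may differ): a normal `finish` resets the builder's dictionary, while a `finish_preserve_values`-style call keeps the accumulated values so successive batches share one growing dictionary the writer can diff.

```rust
// Hypothetical stand-in for a dictionary builder, to illustrate the two
// finish behaviors described above.
struct ToyDictBuilder {
    values: Vec<String>,
}

impl ToyDictBuilder {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    fn append(&mut self, v: &str) {
        if !self.values.iter().any(|x| x == v) {
            self.values.push(v.to_string());
        }
    }

    /// Finish a batch, resetting the dictionary (the pre-existing behavior).
    fn finish(&mut self) -> Vec<String> {
        std::mem::take(&mut self.values)
    }

    /// Finish a batch while keeping the accumulated values (the new behavior).
    fn finish_preserve_values(&mut self) -> Vec<String> {
        self.values.clone()
    }
}

fn main() {
    let mut b = ToyDictBuilder::new();
    b.append("A");
    b.append("B");
    assert_eq!(b.finish_preserve_values(), ["A", "B"]);
    b.append("E");
    // Values from the first batch are still present, so the writer can emit
    // just the ["E"] suffix as a delta dictionary.
    assert_eq!(b.finish_preserve_values(), ["A", "B", "E"]);
}
```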
## Which issue does this PR close?

## Rationale for this change
Delta dictionaries are not supported by either the arrow-ipc reader or writer. Other languages such as Go do have delta dictionary support, so IPC streams produced by those languages sometimes include delta dictionaries.

This PR adds reader and writer support so that we can consume streams with those messages in Rust.
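Conceptually, the reader-side change boils down to something like the following sketch, which uses `Vec<String>` in place of the actual arrow-ipc dictionary types (`apply_dictionary_batch` is a hypothetical stand-in, not this PR's code):

```rust
// When a dictionary batch arrives with isDelta() == true, its values are
// concatenated onto the previously received dictionary; otherwise the new
// values replace the dictionary entirely.
fn apply_dictionary_batch(current: &mut Vec<String>, values: Vec<String>, is_delta: bool) {
    if is_delta {
        current.extend(values); // delta: append to the existing dictionary
    } else {
        *current = values; // full replacement resets the dictionary
    }
}

fn main() {
    let mut dict: Vec<String> = Vec::new();
    apply_dictionary_batch(&mut dict, vec!["A".into(), "B".into()], false);
    apply_dictionary_batch(&mut dict, vec!["C".into()], true);
    assert_eq!(dict, ["A", "B", "C"]); // delta grew the dictionary
    apply_dictionary_batch(&mut dict, vec!["X".into()], false);
    assert_eq!(dict, ["X"]); // replacement reset it
}
```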
## What changes are included in this PR?
- `read_dictionary_impl` now supports delta dictionaries by concatenating the new values onto the existing dictionary when `isDelta()` is true
- A new `finish_preserve_values` API on `GenericBytesDictionaryBuilder` allows accumulating the total set of values built by the builder
- `StreamReader` is refactored to de-couple parsing individual IPC messages from producing the next record batch, which enables better testing via inspection of the individual messages in the stream
- A new `MessageReader` type handles reading metadata lengths, headers, and message bodies

## Are these changes tested?
Yes, unit test suites were added to cover the delta functionality specifically. Existing unit tests are expected to guard against regressions from the refactoring.
## Are there any user-facing changes?
Yes, there is a new `finish_preserve_values` public method on `GenericBytesDictionaryBuilder`. If we want to move forward with this approach then it will be added to the other dictionary builders too.