feat: arrow-ipc delta dictionary support #8001

Open
wants to merge 10 commits into
base: main

@JakeDern JakeDern commented Jul 25, 2025

Which issue does this PR close?

Rationale for this change

Neither the arrow-ipc reader nor the writer supports delta dictionaries. Other implementations, such as Go, do support them, so IPC streams produced by those implementations sometimes contain delta dictionary messages.

This PR adds reader and writer support so that we can consume streams containing those messages in Rust.

What changes are included in this PR?

  • Update read_dictionary_impl to support delta dictionaries by concatenating the dictionaries if isDelta() is true
  • Update ipc writer to support delta dictionaries
  • Add a finish_preserve_values API to GenericBytesDictionaryBuilder which allows for accumulating the total set of values built by the builder
  • Refactor StreamReader to decouple parsing individual IPC messages from producing the next record batch. This enables better testing via inspection of the individual messages in the stream
  • Introduce a MessageReader type to handle reading metadata lengths, headers and message bodies
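The reader-side change in the first bullet can be sketched in isolation. This is a minimal model using only the standard library, not the actual read_dictionary_impl code: DictionaryMessage and apply_dictionary are hypothetical names, and dictionary values are modeled as plain strings rather than Arrow arrays.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a decoded IPC dictionary batch.
struct DictionaryMessage {
    id: i64,
    is_delta: bool,
    values: Vec<String>,
}

/// Applies one dictionary message to the reader's per-id dictionary state.
fn apply_dictionary(state: &mut HashMap<i64, Vec<String>>, msg: DictionaryMessage) {
    if msg.is_delta {
        // Delta: append the new values to what is already stored for this id,
        // so existing keys keep pointing at the same values.
        // Assumption: a delta for an unknown id starts from an empty dictionary;
        // a real reader might reject that case instead.
        state.entry(msg.id).or_default().extend(msg.values);
    } else {
        // Non-delta: the message replaces the stored dictionary entirely.
        state.insert(msg.id, msg.values);
    }
}
```

In this model a non-delta message carrying ["A", "B"] followed by a delta message carrying ["C"] leaves ["A", "B", "C"] stored for that dictionary id.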

Are these changes tested?

Yes, unit testing suites were added to cover delta functionality specifically. Existing unit tests are expected to also guard against regressions due to refactors.

Are there any user-facing changes?

Yes, there is a new finish_preserve_values public method for GenericBytesDictionaryBuilder. If we want to move forward with this approach then this will be added to the other dictionary builders too.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 25, 2025
@JakeDern JakeDern marked this pull request as draft July 25, 2025 15:53
@asubiotto
Contributor

I worked on adding delta dictionary support a while ago with the help of Claude Code, but I never ended up having time to polish it up for review. Here's the branch, feel free to steal whatever you want: https://github.com/polarsignals/arrow-rs/tree/asubiotto-delta

Contributor

@alamb alamb left a comment

Thank you for this contribution @JakeDern 🙏

I need some pointers on the best way to test this using only rust, but am happy to implement any suggestions 🙂. The validation that I did so far involved using the Go ipc writer to dump stream data to a file which I then read from rust:

It looks to me like the pre-existing integration suite may already have this

The data is here (it is a git submodule) https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration

Perhaps you could follow the model of a test like this:

And read the content of https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/4.0.0-shareddict (instead of 0.17.1)

@JakeDern
Author

@asubiotto @alamb Thanks for the pointers! It looks like the test data in 4.0.0-shareddict may not contain any delta dictionaries as this test passes on main as well as on my branch:

#[test]
fn read_stream_delta_dictionary() {
    let testdata = arrow_test_data();

    let path = "generated_shared_dict";
    let version = "4.0.0-shareddict";
    verify_arrow_file(&testdata, version, path);
    verify_arrow_stream(&testdata, version, path);
}

I'll take a crack at cleaning up the writer code from @asubiotto and incorporating it into this PR, then we can produce the data for testing as well.

@JakeDern JakeDern changed the title feat: arrow-ipc delta dictionary support for reader feat: arrow-ipc delta dictionary support Jul 29, 2025
@JakeDern
Author

JakeDern commented Jul 29, 2025

@asubiotto @alamb I have an initial implementation of both writer and reader ready for some feedback. Thanks @asubiotto for the starter code, some of the additional tests were especially helpful!

One thing I'd like thoughts on, after digging through both the Go and Rust writer code: the ability of the IPC writer in Rust to produce delta dictionaries is very limited.

Both the ipc writers in Go and Rust look at dictionaries on incoming RecordBatches one at a time, doing a comparison of the new dictionary with the old in order to determine if the new is a superset of old. I thought this was odd because the following test I wrote in Rust produces 1 delta dictionary whereas I thought it would produce 2:

#[test]
fn test_deltas() {
    // Dictionary resets at ["C", "D"]
    let batches: &[&[&str]] = &[&["A"], &["A", "B"], &["C", "D"], &["A", "B", "E"]];
    run_test(batches, false);
}

However, since the writer only looks at the dictionary values of the last RecordBatch, it has to completely reset when &["C", "D"] comes in, and again when &["A", "B", "E"] comes in.
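The last-batch comparison described above can be modeled with the standard library alone. This is a sketch under assumptions, not the writer's actual code: DictionaryUpdate, compare_dictionaries and plan_updates are hypothetical names, and "superset" is interpreted as "the previous dictionary is a prefix of the new one", since a delta can only append values.

```rust
/// What a writer would emit for a dictionary in this simplified model.
#[derive(Debug, PartialEq)]
enum DictionaryUpdate {
    /// Dictionary is unchanged; nothing needs to be written.
    Unchanged,
    /// Only the new trailing values are written (isDelta = true).
    Delta(Vec<String>),
    /// The whole dictionary is written again (isDelta = false).
    Replace(Vec<String>),
}

/// Compares the new dictionary against the one from the previous batch only.
fn compare_dictionaries(previous: Option<&[String]>, current: &[String]) -> DictionaryUpdate {
    match previous {
        Some(prev) if prev == current => DictionaryUpdate::Unchanged,
        // A delta is only possible when the old values are an exact prefix.
        Some(prev) if !prev.is_empty() && current.starts_with(prev) => {
            DictionaryUpdate::Delta(current[prev.len()..].to_vec())
        }
        _ => DictionaryUpdate::Replace(current.to_vec()),
    }
}

/// Replays a sequence of per-batch dictionaries and records each decision.
fn plan_updates(batches: &[&[&str]]) -> Vec<DictionaryUpdate> {
    let mut previous: Option<Vec<String>> = None;
    let mut plan = Vec::new();
    for batch in batches {
        let current: Vec<String> = batch.iter().map(|s| s.to_string()).collect();
        plan.push(compare_dictionaries(previous.as_deref(), &current));
        // Only the most recent dictionary is remembered, which is the limitation.
        previous = Some(current);
    }
    plan
}
```

Replaying the four batches from test_deltas through plan_updates gives a full dictionary for ["A"], a delta of ["B"], and then two full replacements, so only one delta is emitted in this model.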

This was inconsistent with what I'd seen in Go where similar code did produce two delta dictionaries, despite seeming to follow the same/similar algorithm.

So I dug a bit more and wrote some Go code with two different writers and then read it from rust. Batches 0, 2, and 3 are written using one dictionary builder and batch 1 is written using another:

	dictType := &arrow.DictionaryType{
		IndexType: arrow.PrimitiveTypes.Int16,
		ValueType: arrow.BinaryTypes.String,
		Ordered:   false,
	}

	schema := arrow.NewSchema([]arrow.Field{
		{Name: "foo", Type: dictType},
	}, nil)

	buf := bytes.NewBuffer([]byte{})
	writer := ipc.NewWriter(buf, ipc.WithSchema(schema), ipc.WithDictionaryDeltas(true))

	allocator := memory.NewGoAllocator()

	dict_builder := array.NewDictionaryBuilder(allocator, dictType)

	builder := array.NewStringBuilder(allocator)
	builder.AppendStringValues([]string{"A", "B", "C"}, []bool{})
	dict_builder.AppendArray(array.NewStringData(builder.NewArray().Data()))
	record := array.NewRecord(schema, []arrow.Array{
		dict_builder.NewArray(),
	}, 3)

	if err := writer.Write(record); err != nil {
		panic(err)
	}

	// Switch to a second, independent dictionary builder (simulates a reset)
	dict_builder2 := array.NewDictionaryBuilder(allocator, dictType)
	builder.AppendStringValues([]string{"A", "B", "D"}, []bool{})
	dict_builder2.AppendArray(array.NewStringData(builder.NewArray().Data()))
	record2 := array.NewRecord(schema, []arrow.Array{
		dict_builder2.NewArray(),
	}, 3)

	if err := writer.Write(record2); err != nil {
		panic(err)
	}

	builder.AppendStringValues([]string{"A", "B", "E"}, []bool{})
	dict_builder.AppendArray(array.NewStringData(builder.NewArray().Data()))
	record3 := array.NewRecord(schema, []arrow.Array{
		dict_builder.NewArray(),
	}, 3)

	if err := writer.Write(record3); err != nil {
		panic(err)
	}

	builder.AppendStringValues([]string{"A", "B", "D"}, []bool{})
	dict_builder.AppendArray(array.NewStringData(builder.NewArray().Data()))
	record4 := array.NewRecord(schema, []arrow.Array{
		dict_builder.NewArray(),
	}, 3)

	if err := writer.Write(record4); err != nil {
		panic(err)
	}

	// Write buf out to /home/jakedern/delta_test/delta2.arrow
	if err := os.WriteFile("/home/jakedern/delta_test/delta2.arrow", buf.Bytes(), 0644); err != nil {
		panic(fmt.Errorf("failed to write delta file: %w", err))
	}

[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = false
[arrow-ipc/src/reader.rs:722:9] &dictionary_values = StringArray
[
  "A",
  "B",
  "C",
]
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = false
[arrow-ipc/src/reader.rs:722:9] &dictionary_values = StringArray
[
  "A",
  "B",
  "D",
]
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = false
[arrow-ipc/src/reader.rs:722:9] &dictionary_values = StringArray
[
  "A",
  "B",
  "C",
  "E",
]
[arrow-ipc/src/reader.rs:717:9] batch.isDelta() = true
[arrow-ipc/src/reader.rs:725:5] &dictionary_values = StringArray
[
  "D",
]

What we see is that writing a RecordBatch from a different builder basically resets the ipc.Writer because it doesn't know about values from the first builder. The insight that I missed the first time is that in Go, there is a dictionary stapled to the RecordBatches by the RecordBatchWriter which contains all of the values produced by that builder. And the ipc writer uses that rather than the RecordBatch values.

The takeaway is that in Go we need some cooperation between the builder and the ipc writer to get delta dictionaries a reasonable amount of the time. If we shovel RecordBatches from different builders into the same ipc writer then we get bad behavior for delta dictionaries.

In Rust, we don't have that dictionary field and therefore our ability to write delta dictionaries is always pretty bad. Unless each batch's dictionary is a superset of the previous batch's, we get no deltas and just waste time comparing dictionaries. Additionally, we reset the internal dictionary after creating every batch in Rust, presumably because we can't do anything with it anyway.

My questions are:

  1. Is there a better way to handle this than requiring cooperation between the dictionary builder and the ipc writers? It was difficult for me to figure out this information and I could imagine that this would surprise a lot of people who expect it to "just work" and who aren't getting delta dictionaries in different circumstances.
  2. Is there a straightforward way, or any desire, to add a similar dictionary field to Rust data? Then we could at least get delta dictionaries under similar conditions as Go writers can.

Curious to hear any thoughts, thank you both for the help so far!

@albertlockett
Contributor

albertlockett commented Jul 31, 2025

Based on the comments above, it seems like what we need is a way to efficiently detect whether a value is already in the dictionary. The dictionary builders all keep some kind of internal state that allows efficient lookup of this. For example,



map: HashMap<Value<V::Native>, usize>,

Maybe we could refactor this to be something that's reusable by the IPC writer?
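One way to picture such reusable state is a value-to-index map that the writer consults for every incoming dictionary. The sketch below is hypothetical: SeenValues and its methods are invented names for illustration, values are modeled as strings, and nothing here is an existing arrow-rs API.

```rust
use std::collections::HashMap;

/// Hypothetical tracker of the values already written for one dictionary id.
#[derive(Default)]
struct SeenValues {
    /// Maps each value to the index it was assigned when first seen.
    index_of: HashMap<String, usize>,
}

impl SeenValues {
    /// Returns the values in `incoming` that have not been seen before,
    /// in first-seen order, and records them with the next free indices.
    fn new_values(&mut self, incoming: &[&str]) -> Vec<String> {
        let mut added = Vec::new();
        for value in incoming {
            if !self.index_of.contains_key(*value) {
                let next = self.index_of.len();
                self.index_of.insert(value.to_string(), next);
                added.push(value.to_string());
            }
        }
        added
    }

    /// Looks up the index assigned to a value, if it has been seen.
    fn index(&self, value: &str) -> Option<usize> {
        self.index_of.get(value).copied()
    }
}
```

With state like this, a batch whose dictionary is ["A", "B", "D"] arriving after ["A", "B", "C"] would only contribute "D" as new. The catch, as far as this sketch goes, is that the keys of the incoming batch would have to be remapped to the accumulated indices, which costs extra work per batch.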

@asubiotto
Contributor

Exposing builder state makes the most sense to me. Thanks for taking this on @JakeDern!

@JakeDern
Author

JakeDern commented Aug 5, 2025

@asubiotto the approach I opted for is to allow accumulating values on the builder only, via a finish_preserve_values API. This was very simple to do, and I think it is closest to the Go implementation, which seems to do this by default. When it is called, the dictionary values are simply copied to the produced Array and the internal de-dup dictionary is preserved. Only the keys are cleared.

I also did a little bit of refactoring to get better visibility into the messages that the reader sees. Since we're trying to improve the conditions under which delta dictionaries are emitted (optimization), we need this visibility to test precisely rather than relying on heuristics like the size of the underlying stream.

Feedback would be greatly appreciated! If this approach seems reasonable, I can add the same finish_preserve_values API to the other dictionary builders as well.
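To make the intended semantics concrete, here is a toy model of a builder with both finishing modes. It is a sketch under assumptions, not the GenericBytesDictionaryBuilder implementation: ToyDictionaryBuilder is a hypothetical type, values are plain strings, and the produced array is modeled as a (keys, values) pair.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model of a string dictionary builder.
#[derive(Default)]
struct ToyDictionaryBuilder {
    /// De-dup state: value -> index into `values`.
    dedup: HashMap<String, usize>,
    values: Vec<String>,
    keys: Vec<usize>,
}

impl ToyDictionaryBuilder {
    /// Appends one value, reusing its key if the value was seen before.
    fn append(&mut self, value: &str) {
        let key = match self.dedup.get(value).copied() {
            Some(k) => k,
            None => {
                let k = self.values.len();
                self.values.push(value.to_string());
                self.dedup.insert(value.to_string(), k);
                k
            }
        };
        self.keys.push(key);
    }

    /// Models a plain finish: returns (keys, values) and resets all state.
    fn finish(&mut self) -> (Vec<usize>, Vec<String>) {
        self.dedup.clear();
        (std::mem::take(&mut self.keys), std::mem::take(&mut self.values))
    }

    /// Models finish_preserve_values: values are copied into the output,
    /// the de-dup state is kept, and only the keys are cleared.
    fn finish_preserve_values(&mut self) -> (Vec<usize>, Vec<String>) {
        (std::mem::take(&mut self.keys), self.values.clone())
    }
}
```

In this model, every dictionary returned by finish_preserve_values extends the previous one, so a writer that compares consecutive dictionaries always finds the old values as a prefix and can emit a delta. With finish, each batch starts from an empty dictionary and that property is lost.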

Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Support delta-encoded dictionaries in the Arrow IPC format
5 participants