@the-other-tim-brown commented Oct 7, 2025

What is the purpose of the pull request

If the existing data files already have field IDs set, and those IDs do not follow the numbering convention that Iceberg uses when it assigns IDs itself, then the table schema will not line up with the files. This is very limiting for users.

Addresses #750

Brief change log

  • Initializes the table with an empty schema to avoid auto-assigning the field IDs. After that, the schema is manually updated along with the partition spec.
  • Updates the schema sync to use a manual update instead of the provided update APIs when the source schema provides its own field IDs.
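
To make the motivation concrete, here is a minimal plain-Java sketch of why reassigning IDs is a problem when columns are matched by field ID. It does not use the Iceberg or XTable APIs: `FieldIdSketch`, `Field`, `reassignIds`, and `resolveById` are hypothetical names, and sequential numbering from 1 is only an assumed stand-in for the auto-assignment described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not an Iceberg or XTable class.
final class FieldIdSketch {
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // Assumed stand-in for auto-assignment: IDs are renumbered from 1 in
  // declaration order and the IDs from the source schema are discarded.
  static List<Field> reassignIds(List<Field> source) {
    List<Field> out = new ArrayList<>();
    int next = 1;
    for (Field f : source) {
      out.add(new Field(next++, f.name));
    }
    return out;
  }

  // Maps each table column name to the name of the file column that carries
  // the same field ID, or null when no file column has that ID.
  static Map<String, String> resolveById(List<Field> table, List<Field> file) {
    Map<Integer, String> fileById = new HashMap<>();
    for (Field f : file) {
      fileById.put(f.id, f.name);
    }
    Map<String, String> resolved = new LinkedHashMap<>();
    for (Field t : table) {
      resolved.put(t.name, fileById.get(t.id));
    }
    return resolved;
  }

  public static void main(String[] args) {
    // Field IDs as they were written into the existing data files.
    List<Field> file = Arrays.asList(new Field(2, "id"), new Field(1, "name"));

    // Reassigned IDs: each table column resolves to the wrong file column.
    System.out.println(resolveById(reassignIds(file), file)); // {id=name, name=id}

    // Preserved IDs: each table column resolves to its own file column.
    System.out.println(resolveById(file, file)); // {id=id, name=name}
  }
}
```

In this sketch, keeping the source IDs is the only change needed for the lookup to line up, which is the effect the manual schema update is meant to achieve.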

Verify this pull request

  • Updated the RunSync testing to create tables with IDs set in the Parquet files, so that the new code path is exercised in an end-to-end test.
  • Added unit tests.

          .build();
  private final Schema icebergSchema =
      new Schema(
          Types.NestedField.required(1, "timestamp_field", Types.TimestampType.withoutZone()),
Contributor

Why does this numbering change? Would it break when this version is rolled out to existing Iceberg targets?

Contributor Author

This is just to prove that the field numbering matches what is provided. Currently the numbers are reassigned when the table is first created.

This can impact existing tables if their schema field IDs are currently wrong, but I think that would also imply that the existing table is already in a broken state.
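
As a rough way to reason about which existing tables fall into that category, the comparison can be sketched in plain Java. This is not part of the PR and does not read real Iceberg metadata or Parquet footers: `FieldIdCheckSketch` and `mismatchedColumns` are hypothetical names, and the two maps are assumed to have been extracted beforehand from the table schema and from a data file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not an Iceberg or XTable class.
final class FieldIdCheckSketch {
  // Returns the columns present on both sides whose field IDs differ,
  // in the iteration order of the table-side map.
  static List<String> mismatchedColumns(
      Map<String, Integer> tableIds, Map<String, Integer> fileIds) {
    List<String> mismatched = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : tableIds.entrySet()) {
      Integer fileId = fileIds.get(entry.getKey());
      if (fileId != null && !fileId.equals(entry.getValue())) {
        mismatched.add(entry.getKey());
      }
    }
    return mismatched;
  }

  public static void main(String[] args) {
    // Assumed inputs: column name to field ID, as seen by the table schema.
    Map<String, Integer> tableIds = new LinkedHashMap<>();
    tableIds.put("timestamp_field", 1);
    tableIds.put("id", 2);

    // Assumed inputs: column name to field ID, as written in a data file.
    Map<String, Integer> fileIds = new LinkedHashMap<>();
    fileIds.put("timestamp_field", 1);
    fileIds.put("id", 7);

    System.out.println(mismatchedColumns(tableIds, fileIds)); // [id]
  }
}
```

An empty result would suggest the table's IDs already agree with the files, in which case preserving the source IDs should not change anything for that table.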

@the-other-tim-brown changed the title from "Fix schema sync for Iceberg tables" to "[750] Fix schema sync for Iceberg tables" Oct 8, 2025
@the-other-tim-brown marked this pull request as ready for review October 8, 2025 01:13
@the-other-tim-brown merged commit 93d4d27 into apache:main Oct 8, 2025
2 checks passed
@the-other-tim-brown deleted the iceberg-schema-ids branch October 8, 2025 19:16