vinishjail97
Contributor

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@vinishjail97
Contributor Author

TODO: Fix tests and add more validations.

@github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Sep 18, 2025
@danny0405 changed the title from "Integrate BUCKET_INDEX for HoodieJavaClient" to "feat: integrate BUCKET_INDEX for HoodieJavaClient" on Sep 18, 2025
Comment on lines 73 to 74
testUpsertsInternal((writeClient, recordRDD, instantTime) -> writeClient.upsertPreppedRecords(recordRDD, instantTime),
true, true, JavaUpgradeDowngradeHelper.getInstance());
Contributor

Add validations on the number of records and buckets after upserts?
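One possible shape for such a validation, as a minimal sketch only: it assumes a plain hash-modulo bucket assignment and per-bucket record counts gathered after the upsert. The class `BucketValidationSketch` and all of its methods are hypothetical illustrations and are not Hudi APIs; Hudi's actual bucket hashing may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Hudi.
class BucketValidationSketch {

  // Assumed bucket assignment: non-negative hash of the record key modulo the bucket count.
  static int bucketIdFor(String recordKey, int numBuckets) {
    return Math.floorMod(recordKey.hashCode(), numBuckets);
  }

  // Counts how many of the given record keys fall into each bucket.
  static Map<Integer, Integer> countPerBucket(List<String> recordKeys, int numBuckets) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (String key : recordKeys) {
      counts.merge(bucketIdFor(key, numBuckets), 1, Integer::sum);
    }
    return counts;
  }

  // Checks observed per-bucket counts against the expected record total and bucket count.
  static void validate(Map<Integer, Integer> observedCounts, int expectedRecords, int numBuckets) {
    int total = 0;
    for (Map.Entry<Integer, Integer> entry : observedCounts.entrySet()) {
      int bucketId = entry.getKey();
      if (bucketId < 0 || bucketId >= numBuckets) {
        throw new IllegalStateException("Bucket id out of range: " + bucketId);
      }
      total += entry.getValue();
    }
    if (total != expectedRecords) {
      throw new IllegalStateException(
          "Expected " + expectedRecords + " records but found " + total);
    }
  }
}
```

In a real test, the observed counts would come from the files written by the upsert rather than from `countPerBucket`.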

return super.getConfigBuilder(schemaStr, HoodieIndex.IndexType.BUCKET)
.withIndexConfig(HoodieIndexConfig.newBuilder()
.withIndexType(HoodieIndex.IndexType.BUCKET)
.withBucketIndexEngineType(HoodieIndex.BucketIndexEngineType.SIMPLE)
Contributor

Also have a test on CONSISTENT_HASHING type?

@github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label and removed the size:M (PR with lines of changes in (100, 300]) label on Sep 30, 2025
@vinishjail97
Contributor Author

@yihua I need help with the following to improve this PR.

  1. We need a better interface that extends org.apache.hudi.table.action.commit.Partitioner and exposes methods like getBucketInfo, so that it can be shared by the Spark, Java, and Flink engines. I have added a TODO interface to get basic tests passing. Can you let me know if there's a better way to reduce duplicate code between Spark, Java, and the other engines?
  2. HoodieBucketIndex supports tagging for INSERT and BULK_INSERT operations, but validateBucketIndexConfig enforces the presence of a recordKey. Can we remove this check to unblock HoodieBucketIndex use cases for append-only tables?
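To make the first point concrete, here is a rough sketch of what a shared, engine-agnostic contract could look like. Everything in it is an assumption for illustration: `PartitionerSketch` is a stand-in for org.apache.hudi.table.action.commit.Partitioner with assumed method shapes, and `BucketInfoSketch`, `BucketAwarePartitionerSketch`, and `SimpleBucketPartitionerSketch` are hypothetical names, not the TODO interface added in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hudi.table.action.commit.Partitioner; method shapes are assumed.
interface PartitionerSketch {
  int getNumPartitions();

  int getPartition(Object key);
}

// Hypothetical value type describing the target of one write partition.
final class BucketInfoSketch {
  final String fileIdPrefix;
  final String partitionPath;

  BucketInfoSketch(String fileIdPrefix, String partitionPath) {
    this.fileIdPrefix = fileIdPrefix;
    this.partitionPath = partitionPath;
  }
}

// Hypothetical shared contract that each engine's bucket partitioner could implement.
interface BucketAwarePartitionerSketch extends PartitionerSketch {
  BucketInfoSketch getBucketInfo(int partitionNumber);
}

// Toy implementation: one write partition per bucket within a single table partition.
final class SimpleBucketPartitionerSketch implements BucketAwarePartitionerSketch {
  private final List<BucketInfoSketch> buckets = new ArrayList<>();

  SimpleBucketPartitionerSketch(String partitionPath, int numBuckets) {
    for (int i = 0; i < numBuckets; i++) {
      // Zero-padded bucket id used as an illustrative file id prefix.
      buckets.add(new BucketInfoSketch(String.format("%08d", i), partitionPath));
    }
  }

  @Override
  public int getNumPartitions() {
    return buckets.size();
  }

  @Override
  public int getPartition(Object key) {
    // Assumed assignment: non-negative hash modulo the bucket count.
    return Math.floorMod(key.hashCode(), buckets.size());
  }

  @Override
  public BucketInfoSketch getBucketInfo(int partitionNumber) {
    return buckets.get(partitionNumber);
  }
}
```

With a contract along these lines, engine-specific code would only need to adapt the partitioner to its own shuffle API, while the bucket lookup logic stays in one place.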

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Labels
size:L PR with lines of changes in (300, 1000]
3 participants