Skip to content

fix: Optimize RowType::hashKind #12999

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

Yuhta
Copy link
Contributor

@Yuhta Yuhta commented Apr 10, 2025

Summary:
Avoid recurively rehash for RowType. This reduces the expression
compilation time significantly for flatmap types commonly seen in ML workload.

Differential Revision: D72803509

Yuhta added 2 commits April 10, 2025 10:44
…ager_flush (facebookincubator#12975)

Summary:

Add fast path for streaming aggregation where we have input rows from same group located together.  For certain functions, we can leverage this property to reduce the number of copy calls and create larger and fewer ranges for copy.  This brings 3x improvements for a specific query shape common in data loading for AI training.

We implement this optimization for `arbitrary` and `array_agg`.  For `arbitrary`, if the input is clustered, we just keep a reference to the input vector and index that is selected; when we extract values from the container, we group all copies from same vector to one `copyRange` call so the cost is minimized.  For `array_agg`, we do similar thing, only track the range (offset and size) where the input will be taken for each group, and do the copy in bulk when we extract value.

There is another optimization to flush the streaming aggregation output whenever there is result available, via a new query config `streaming_aggregation_eager_flush`.  This allows us to minimize the memory used by accumulators.

Differential Revision: D72677410
Summary:
Avoid recurively rehash for `RowType`.  This reduces the expression
compilation time significantly for flatmap types commonly seen in ML workload.

Differential Revision: D72803509
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 10, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72803509

Copy link

netlify bot commented Apr 10, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 41d9504
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67f8039bd577dc000875eb01

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta Are we caching hash for ROW type? Any reason not to compute it in the ctor?

@Yuhta
Copy link
Contributor Author

Yuhta commented Apr 10, 2025

@mbasmanova Not sure how often this is used. Seems only used for expression deduplication so we don't need to pay for this at runtime.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 6b5a5f1.

@prestodb-ci
Copy link

Rebase triggered for oap-project/velox.

zhanglistar pushed a commit to bigo-sg/velox that referenced this pull request Apr 22, 2025
Summary:
Pull Request resolved: facebookincubator#12999

Avoid recurively rehash for `RowType`.  This reduces the expression
compilation time significantly for flatmap types commonly seen in ML workload.

Reviewed By: mbasmanova

Differential Revision: D72803509

fbshipit-source-id: 95f6f0db615b526ac0d349a6a4a3f2155f00e2eb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants