-
Notifications
You must be signed in to change notification settings - Fork 1.3k
fix: Optimize RowType::hashKind #12999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ager_flush (facebookincubator#12975) Summary: Add fast path for streaming aggregation where we have input rows from same group located together. For certain functions, we can leverage this property to reduce the number of copy calls and create larger and fewer ranges for copy. This brings 3x improvements for a specific query shape common in data loading for AI training. We implement this optimization for `arbitrary` and `array_agg`. For `arbitrary`, if the input is clustered, we just keep a reference to the input vector and index that is selected; when we extract values from the container, we group all copies from same vector to one `copyRange` call so the cost is minimized. For `array_agg`, we do similar thing, only track the range (offset and size) where the input will be taken for each group, and do the copy in bulk when we extract value. There is another optimization to flush the streaming aggregation output whenever there is result available, via a new query config `streaming_aggregation_eager_flush`. This allows us to minimize the memory used by accumulators. Differential Revision: D72677410
Summary: Avoid recurively rehash for `RowType`. This reduces the expression compilation time significantly for flatmap types commonly seen in ML workload. Differential Revision: D72803509
This pull request was exported from Phabricator. Differential Revision: D72803509 |
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Are we caching hash for ROW type? Any reason not to compute it in the ctor?
@mbasmanova Not sure how often this is used. Seems only used for expression deduplication so we don't need to pay for this at runtime. |
This pull request has been merged in 6b5a5f1. |
Rebase triggered for oap-project/velox. |
Summary: Pull Request resolved: facebookincubator#12999 Avoid recurively rehash for `RowType`. This reduces the expression compilation time significantly for flatmap types commonly seen in ML workload. Reviewed By: mbasmanova Differential Revision: D72803509 fbshipit-source-id: 95f6f0db615b526ac0d349a6a4a3f2155f00e2eb
Summary:
Avoid recurively rehash for
RowType
. This reduces the expressioncompilation time significantly for flatmap types commonly seen in ML workload.
Differential Revision: D72803509