feat: Add Aggregate::addRawClusteredInput and streaming_aggregation_eager_flush #12975

Closed
Yuhta wants to merge 1 commit into facebookincubator/velox from the export-D72677410 branch

Conversation

@Yuhta (Contributor) commented on Apr 9, 2025

Summary:
Add a fast path for streaming aggregation when input rows from the same group are located next to each other. For certain functions, we can leverage this property to reduce the number of copy calls by creating larger and fewer ranges to copy. This brings a 3x improvement for a specific query shape that is common in data loading for AI training.

We implement this optimization for `arbitrary` and `array_agg`. For `arbitrary`, if the input is clustered, we just keep a reference to the input vector and the selected index; when we extract values from the container, we group all copies from the same vector into one `copyRange` call so the cost is minimized. For `array_agg`, we do a similar thing, except that we only track the range (offset and size) of the input to be taken for each group and perform the copy in bulk when we extract values.
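The following is a minimal, self-contained sketch of the range-tracking idea behind the clustered fast path; it is not the actual Velox accumulator code, and names such as `SourceVector`, `ClusteredRange`, `addClusteredRows`, and `extract` are illustrative. The point it shows is that accumulation only records contiguous (source, offset, size) ranges, and extraction issues one bulk copy per range instead of one copy per row.

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-in for an input vector of int64_t values.
using SourceVector = std::vector<int64_t>;

// One contiguous run of rows taken from a single source vector for one group.
struct ClusteredRange {
  const SourceVector* source; // borrowed; must stay alive until extraction
  int32_t offset;             // first row index in the source
  int32_t size;               // number of consecutive rows
};

// Accumulation only records ranges; no values are copied yet.
void addClusteredRows(
    std::vector<ClusteredRange>& ranges,
    const SourceVector& source,
    int32_t offset,
    int32_t size) {
  // Extend the previous range when the new rows continue it in the same vector.
  if (!ranges.empty() && ranges.back().source == &source &&
      ranges.back().offset + ranges.back().size == offset) {
    ranges.back().size += size;
    return;
  }
  ranges.push_back({&source, offset, size});
}

// Extraction turns each range into a single bulk copy, which is what reduces
// the number of copy calls compared to copying row by row.
std::vector<int64_t> extract(const std::vector<ClusteredRange>& ranges) {
  std::vector<int64_t> out;
  for (const auto& range : ranges) {
    out.insert(
        out.end(),
        range.source->begin() + range.offset,
        range.source->begin() + range.offset + range.size);
  }
  return out;
}
```

In the real operator the sources are Velox vectors and the bulk copies are `copyRange` calls grouped by source vector, but the bookkeeping has the same shape.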

There is another optimization: flush the streaming aggregation output whenever a result is available, enabled via a new query config `streaming_aggregation_eager_flush`. This allows us to minimize the memory used by accumulators.
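As a rough sketch of what the eager-flush decision amounts to (the function name, the `completedGroups` notion, and the batching threshold are assumptions for illustration, not the actual StreamingAggregation code):

```cpp
#include <cstdint>

// With streaming_aggregation_eager_flush enabled, output is produced as soon
// as any group is complete, instead of waiting for a full output batch.
bool shouldFlush(
    bool eagerFlushEnabled,
    int32_t completedGroups,           // groups whose keys cannot reappear
    int32_t preferredOutputBatchRows)  // usual batching threshold
{
  if (eagerFlushEnabled) {
    // Emit whenever a result is available, keeping accumulator memory small.
    return completedGroups > 0;
  }
  return completedGroups >= preferredOutputBatchRows;
}
```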

Differential Revision: D72677410

@facebook-github-bot added the CLA Signed label on Apr 9, 2025
@facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D72677410

netlify bot commented on Apr 9, 2025

Deploy Preview for meta-velox canceled.

Latest commit: c2cedea
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/67f849eb55091100082ee54c

@Yuhta changed the title from "feat: add Aggregate::addRawClusteredInput and streaming_aggregation_eager_flush" to "feat: Add Aggregate::addRawClusteredInput and streaming_aggregation_eager_flush" on Apr 9, 2025
@mbasmanova requested review from xiaoxmeng and rui-mo on April 9, 2025 15:45
@mbasmanova (Contributor) commented

@Yuhta Jimmy, thank you for the optimization. Would you update the PR description to share some findings about why this is a useful optimization? "3x improvements for a specific query shape common in data loading for AI training."

@rui-mo (Collaborator) left a comment

Thanks. Added two nits.

@Yuhta force-pushed the export-D72677410 branch from f639c66 to fa6d5ad on April 10, 2025 17:41
Yuhta added a commit to Yuhta/velox that referenced this pull request Apr 10, 2025
Yuhta added a commit to Yuhta/velox that referenced this pull request Apr 10, 2025
@mbasmanova (Contributor) left a comment


Thank you, Jimmy.

Yuhta added a commit to Yuhta/velox that referenced this pull request Apr 10, 2025
@Yuhta force-pushed the export-D72677410 branch from fa6d5ad to a987bc4 on April 10, 2025 21:27
Yuhta added a commit to Yuhta/velox that referenced this pull request Apr 10, 2025
@Yuhta force-pushed the export-D72677410 branch from a987bc4 to aec840e on April 10, 2025 22:26
Yuhta added a commit to Yuhta/velox that referenced this pull request Apr 10, 2025
@Yuhta force-pushed the export-D72677410 branch from aec840e to ed0ccd1 on April 10, 2025 22:34
@Yuhta force-pushed the export-D72677410 branch from ed0ccd1 to c2cedea on April 10, 2025 22:44
@facebook-github-bot (Contributor) commented

This pull request has been merged in afb236a.

@prestodb-ci commented

Rebase triggered for oap-project/velox.

zhanglistar pushed a commit to bigo-sg/velox that referenced this pull request Apr 22, 2025