@anandpatel9998 anandpatel9998 commented Oct 4, 2025

Description

The current cardinality aggregator selects DirectCollector over OrdinalsCollector when the relative memory overhead of OrdinalsCollector (compared to DirectCollector) is high. Because of this relative-memory check, DirectCollector is selected for high-cardinality aggregation queries, even though DirectCollector is slower than OrdinalsCollector. This default leads to higher search latency even when the OpenSearch process has enough available memory to use the ordinals collector for faster queries.
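
To make the selection rule above concrete, here is a minimal sketch of a relative-overhead check. It is not the OpenSearch code: the class name, the cost formulas (one bit per distinct term for ordinals, `2^precision` bytes for the direct collector's counting structure), and the `maxRelativeOverhead` parameter are all assumptions chosen for illustration.

```java
/** Illustrative only: hypothetical names and cost formulas, not the OpenSearch implementation. */
final class CollectorChoiceSketch {

    /** Assumed per-bucket cost of an ordinals bitset: one bit per distinct term. */
    static long ordinalsBytes(long maxOrd) {
        return (maxOrd + 7) / 8;
    }

    /** Assumed per-bucket cost of the direct collector's fixed-size counting structure. */
    static long directBytes(int precision) {
        return 1L << precision;
    }

    /** Picks ordinals only when its extra memory is a small fraction of the direct cost. */
    static String pick(long maxOrd, int precision, double maxRelativeOverhead) {
        return ordinalsBytes(maxOrd) < directBytes(precision) * maxRelativeOverhead
            ? "ordinals"
            : "direct";
    }

    public static void main(String[] args) {
        // Low cardinality: the bitset is tiny compared to the counting structure.
        System.out.println(pick(1_000, 14, 0.25));
        // High cardinality: the bitset dominates, so the slower direct path is chosen.
        System.out.println(pick(10_000_000, 14, 0.25));
    }
}
```

Under these assumptions the decision depends only on the ratio of the two costs, never on how much memory is actually free, which is the behavior the description objects to.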

The memory requirement of a nested aggregation cannot be determined up front, because buckets are created dynamically as we traverse the matching document IDs. To overcome this limitation, this change creates a hybrid collector that starts with OrdinalsCollector and switches to DirectCollector if the memory used by OrdinalsCollector grows beyond a certain threshold. When the hybrid collector switches from OrdinalsCollector to DirectCollector, it reuses the aggregation data already computed by OrdinalsCollector, so the result does not have to be rebuilt with DirectCollector.
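
The switch-over described above can be sketched roughly as follows. This is a simplified model, not the code in this PR: the class and method names are hypothetical, the memory estimate is an assumed "one bitset per bucket" formula, and exact `HashSet`s stand in for whatever counting structure the real direct collector uses.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a toy hybrid collector, not the OpenSearch CardinalityAggregator. */
final class HybridCardinalitySketch {
    private final String[] terms;        // ordinal -> term dictionary
    private final long thresholdBytes;   // hypothetical memory budget for the ordinals phase
    private final Map<Integer, BitSet> ordinalsByBucket = new HashMap<>();
    private final Map<Integer, Set<String>> valuesByBucket = new HashMap<>();
    private boolean direct = false;

    HybridCardinalitySketch(String[] terms, long thresholdBytes) {
        this.terms = terms;
        this.thresholdBytes = thresholdBytes;
    }

    void collect(int bucket, int ordinal) {
        if (direct) {
            // Direct phase: record the value itself.
            valuesByBucket.computeIfAbsent(bucket, b -> new HashSet<>()).add(terms[ordinal]);
            return;
        }
        // Ordinals phase: flip one bit per seen ordinal.
        ordinalsByBucket.computeIfAbsent(bucket, b -> new BitSet(terms.length)).set(ordinal);
        if (estimatedOrdinalBytes() > thresholdBytes) {
            switchToDirect();
        }
    }

    /** Assumed estimate: every bucket holds a bitset with one bit per term. */
    private long estimatedOrdinalBytes() {
        long bytesPerBucket = (terms.length + 7) / 8;
        return ordinalsByBucket.size() * bytesPerBucket;
    }

    /** Carries over what the ordinals phase already computed, then frees the bitsets. */
    private void switchToDirect() {
        for (Map.Entry<Integer, BitSet> e : ordinalsByBucket.entrySet()) {
            Set<String> values = valuesByBucket.computeIfAbsent(e.getKey(), b -> new HashSet<>());
            BitSet seen = e.getValue();
            for (int ord = seen.nextSetBit(0); ord >= 0; ord = seen.nextSetBit(ord + 1)) {
                values.add(terms[ord]);
            }
        }
        ordinalsByBucket.clear();
        direct = true;
    }

    boolean usingDirect() {
        return direct;
    }

    int cardinality(int bucket) {
        if (direct) {
            Set<String> values = valuesByBucket.get(bucket);
            return values == null ? 0 : values.size();
        }
        BitSet seen = ordinalsByBucket.get(bucket);
        return seen == null ? 0 : seen.cardinality();
    }

    public static void main(String[] args) {
        String[] terms = "a b c d e f g h i j k l m n o p".split(" "); // 16 terms, 2 bytes per bucket
        HybridCardinalitySketch agg = new HybridCardinalitySketch(terms, 4);
        agg.collect(0, 0);
        agg.collect(0, 1);
        agg.collect(1, 1);
        agg.collect(2, 5); // third bucket pushes the estimate past the budget
        agg.collect(0, 2);
        System.out.println(agg.usingDirect() + " " + agg.cardinality(0));
    }
}
```

The point of the sketch is the `switchToDirect` step: documents collected before the threshold was crossed are not revisited, because their ordinals are translated into the direct representation once.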

Signed-off-by: Anand Pravinbhai Patel [email protected]

Related Issues

Resolves #19260

Check List

  • [ Done ] Functionality includes testing.
  • [ Not Applicable ] API changes companion pull request created, if applicable.
  • [ Is it required ? ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Oct 4, 2025

❌ Gradle check result for a2f5dd7: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot commented Oct 4, 2025

❌ Gradle check result for 41a9e69: FAILURE

github-actions bot commented Oct 5, 2025

❌ Gradle check result for c142ac4: null

github-actions bot commented Oct 6, 2025

❌ Gradle check result for 88989f3: FAILURE

github-actions bot commented Oct 6, 2025

❌ Gradle check result for c142ac4: FAILURE

github-actions bot commented Oct 6, 2025

❌ Gradle check result for 06ce5c3: null

github-actions bot commented Oct 7, 2025

❌ Gradle check result for fc328a2: FAILURE

@owaiskazi19
Member

There are multiple failures for the tasks :rest-api-spec:yamlRestTest, :qa:smoke-test-multinode:integTest, and :qa:mixed-cluster:v3.2.0#mixedClusterTest. Can you run these tasks locally and verify?

github-actions bot commented Oct 7, 2025

❌ Gradle check result for 3d9432e: FAILURE

@anandpatel9998
Author

Okay, I was missing a config update in one test file, which was causing additional test failures. Now tests are failing only for MixedClusterClientYamlTestSuiteIT. I think they are most probably failing because an older version of the OpenSearch code is used, which uses the ordinals collector. I am curious how MixedClusterClientYamlTestSuiteIT is executed. If it uses an older OpenSearch version, how can I fix this test, given that this change alters behavior by prioritizing the hybrid collector over OrdinalsCollector?

@owaiskazi19
Member

This is the failing test

REPRODUCE WITH: ./gradlew ':qa:mixed-cluster:v3.2.0#mixedClusterTest' --tests 'org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT' -Dtests.method='test {p0=search.aggregation/170_cardinality_metric_unsigned/profiler string}' -Dtests.seed=8BF0A05CBE79F283 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=pa-IN -Dtests.timezone=America/North_Dakota/New_Salem -Druntime.java=24

MixedClusterClientYamlTestSuiteIT > test {p0=search.aggregation/170_cardinality_metric_unsigned/profiler string} FAILED
    java.lang.AssertionError: Failure at [search.aggregation/170_cardinality_metric_unsigned:285]: profile.shards.0.aggregations.0.debug.ordinals_collectors_used didn't match expected value:
    profile.shards.0.aggregations.0.debug.ordinals_collectors_used: expected Integer [0] but was Integer [1]
        at __randomizedtesting.SeedInfo.seed([8BF0A05CBE79F283:3A49F8610859F7B]:0)

@anandpatel9998
Author

@owaiskazi19, yes, I was able to reproduce the failures for MixedClusterClientYamlTestSuiteIT. I think they are most probably failing because an older version of the OpenSearch code is used, which uses the ordinals collector. I am curious how MixedClusterClientYamlTestSuiteIT is executed. If it uses an older OpenSearch version, how can I fix this test, given that this change alters behavior by prioritizing the hybrid collector over OrdinalsCollector?

@owaiskazi19
Member

owaiskazi19 commented Oct 7, 2025

We need version-aware logic in the test for this scenario, something like the following, so that the check only runs against the newer version:

- skip:
    version: "<version_here>"
    reason: "Old version used OrdinalsCollector"
- match:
    profile.shards.0.aggregations.0.debug.ordinals_collectors_used: 0

@anandpatel9998
Author

Thanks for the suggestion @owaiskazi19

I am wondering whether that will help, since the test may still fail if one process is running without the latest commit's changes. Can you help me understand how mixed-cluster tests execute?

@owaiskazi19
Member

owaiskazi19 commented Oct 7, 2025

Mixed-cluster tests run against mixed-version clusters to ensure that newer versions interoperate correctly with older nodes. The :qa:mixed-cluster task spins up a test cluster composed of nodes on different versions (old and new). The tests then validate behavior across upgrades or during rolling restarts.
There is also a blog post on the BWC framework: https://opensearch.org/blog/bwc-testing-for-opensearch/
You can also try conditional matching:

- is_one_of: 
    profile.shards.0.aggregations.0.debug.ordinals_collectors_used: [0, 1]

github-actions bot commented Oct 7, 2025

❕ Gradle check result for 4ee0fd1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov bot commented Oct 7, 2025

Codecov Report

❌ Patch coverage is 87.27273% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.06%. Comparing base (39b7a59) to head (a34c044).
⚠️ Report is 11 commits behind head on main.

Files with missing lines                                Patch %   Lines
...ch/aggregations/metrics/CardinalityAggregator.java   92.68%    1 Missing and 2 partials ⚠️
...va/org/opensearch/search/DefaultSearchContext.java   83.33%    2 Missing ⚠️
.../org/opensearch/search/internal/SearchContext.java   0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19524      +/-   ##
============================================
+ Coverage     73.00%   73.06%   +0.05%     
+ Complexity    70534    70522      -12     
============================================
  Files          5719     5719              
  Lines        323260   323310      +50     
  Branches      46816    46818       +2     
============================================
+ Hits         235993   236217     +224     
+ Misses        68224    67995     -229     
- Partials      19043    19098      +55     

@anandpatel9998
Author

Thanks, @owaiskazi19, for your suggestions. Adding the skip filter fixed the mixed-cluster tests.

github-actions bot commented Oct 8, 2025

❕ Gradle check result for 522a92b: UNSTABLE

github-actions bot commented Oct 8, 2025

❌ Gradle check result for 6375b70: null

github-actions bot commented Oct 8, 2025

❌ Gradle check result for b666de2: FAILURE

github-actions bot commented Oct 8, 2025

❌ Gradle check result for e9e7fe0: FAILURE

github-actions bot commented Oct 9, 2025

❌ Gradle check result for 5848513: FAILURE

github-actions bot commented Oct 9, 2025

✅ Gradle check result for a34c044: SUCCESS
