@owaiskazi19
Member

@owaiskazi19 owaiskazi19 commented Sep 16, 2025

Description

Convert agentic query translator processor to system-generated processor

{
    "query": {
        "agentic": {
            "query_text": "List all species",
            "agent_id": "jf-WUZkBl9tG0YB4F8-A",
            "query_fields": ["species", "petal_length_in_cm"]
        }
    }
}

Related Issues

Part of #1525

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@owaiskazi19 owaiskazi19 force-pushed the system-search-pipeline branch from aaa9d44 to 4af6137 Compare September 16, 2025 08:42
  }
  throw new IllegalStateException(
-     "Agentic search query must be used as top-level query, not nested inside other queries. Should be used with agentic_query_translator search processor"
+     "Agentic search query must be processed by the agentic_query_translator system processor before query execution. "
Collaborator

I think we can add a validation in the AgenticSearchQueryBuilder to ensure the processor is enabled in the cluster setting cluster.search.enabled_system_generated_factories.

Member Author

@owaiskazi19 owaiskazi19 Sep 16, 2025

I have added a new generic method to NeuralSearchSettingsAccessor that all system-generated processors can use to verify whether they are enabled in the factories setting.
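
The validation discussed in this thread could look roughly like the sketch below. It is only an illustration: the class AgenticQueryGuard, the method ensureProcessorEnabled, and the error message are hypothetical and not taken from the PR. The enabled-check is passed in as a Predicate so the sketch stays independent of OpenSearch classes; in the plugin it would presumably delegate to the accessor method mentioned above.

```java
import java.util.function.Predicate;

/** Hypothetical guard; class and method names are illustrative only. */
final class AgenticQueryGuard {
    // Processor type as it appears in the exception message in this PR.
    static final String PROCESSOR_TYPE = "agentic_query_translator";
    // Cluster setting named in the review comment.
    static final String SETTING_KEY = "cluster.search.enabled_system_generated_factories";

    private AgenticQueryGuard() {}

    /** Throws IllegalStateException if the check reports the processor as disabled. */
    static void ensureProcessorEnabled(Predicate<String> isEnabled) {
        if (!isEnabled.test(PROCESSOR_TYPE)) {
            throw new IllegalStateException(
                "Agentic search query requires the " + PROCESSOR_TYPE
                    + " system processor; enable it in " + SETTING_KEY
            );
        }
    }
}
```

Failing early with a message that names the setting tells the user what to change, instead of surfacing a generic "query was not processed" error at execution time.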

@owaiskazi19
Member Author

Integration tests are failing because of opensearch-project/ml-commons#4172

@owaiskazi19 owaiskazi19 reopened this Sep 16, 2025
* @return true if the processor is enabled in cluster settings
*/
public boolean isSystemGenerateProcessorEnabled(String processor) {
String enabledFactories = String.valueOf(clusterService.getClusterSettings().get(SYSTEM_GENERATED_PIPELINE_SETTINGS));
Collaborator

ENABLED_SYSTEM_GENERATED_FACTORIES_SETTING is a list of strings, so I think it's better to read the value through ENABLED_SYSTEM_GENERATED_FACTORIES_SETTING.get(clusterService.getSettings()), which returns a list of strings.

We also need to check whether it contains "*", which enables all the system processors.
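
A minimal sketch of the check described in this comment, assuming the enabled factories have already been read as a list of strings. The class SystemProcessorSettings and its isEnabled method are hypothetical names for illustration and are not part of the PR or of OpenSearch.

```java
import java.util.List;

/** Hypothetical helper; names are illustrative and not taken from the PR. */
final class SystemProcessorSettings {
    // "*" in the setting enables every system-generated processor.
    private static final String WILDCARD = "*";

    private SystemProcessorSettings() {}

    /**
     * Returns true when the processor type is listed explicitly or the
     * list contains the wildcard entry.
     */
    static boolean isEnabled(List<String> enabledFactories, String processorType) {
        if (enabledFactories == null || processorType == null) {
            return false;
        }
        return enabledFactories.contains(WILDCARD) || enabledFactories.contains(processorType);
    }
}
```

Working on the typed list avoids matching against the string form of the whole setting, where a substring check could accept a processor whose name is only part of another entry.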

@owaiskazi19
Member Author

After discussing this with the team again, we have decided to stay with the current approach of a user-defined pipeline, to avoid overloading the request with the agent ID. I will keep this PR open and add another processor to it later. Keeping this on hold until then.

@owaiskazi19 owaiskazi19 marked this pull request as draft September 16, 2025 22:45
@heemin32
Collaborator

After discussing this with the team again, we have decided to stay with the current approach of a user-defined pipeline, to avoid overloading the request with the agent ID. I will keep this PR open and add another processor to it later. Keeping this on hold until then.

Could you tell us more about the concern regarding agent ID overloading?

@owaiskazi19
Member Author

Could you tell us more about the concern regarding agent ID overloading?

For every search request, we need to pass an agent_id:

{
    "query": {
        "agentic": {
            "query_text": "List all species",
            "agent_id": "jf-WUZkBl9tG0YB4F8-A",
            "query_fields": ["species", "petal_length_in_cm"]
        }
    }
}

A pipeline, by contrast, is a one-time setup. We can still offer a system-generated pipeline for users who want to use an agentic query for small use cases, but overall a user-defined, one-time pipeline would be the better choice here.

@heemin32
Collaborator

A pipeline, by contrast, is a one-time setup. We can still offer a system-generated pipeline for users who want to use an agentic query for small use cases, but overall a user-defined, one-time pipeline would be the better choice here.

That’s debatable. If putting agent_id in the query isn’t a good idea, then couldn't the same be said about query_fields? Both can just be set in the pipeline. Personally, I feel passing agent_id in the query is simpler than creating and attaching a processor.
As the API is called by programs rather than by people, passing agent_id every time is not that bad, IMO.
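
To illustrate the point about programmatic callers, here is a small sketch of a client that builds the agentic query body once per request. The class AgenticQueryRequest and its build method are hypothetical; the field names and sample values come from the request body shown in this PR. The escaping handles only backslashes and double quotes, which is an assumption that holds for the sample values but not for arbitrary input.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical client-side helper; not part of the PR. */
final class AgenticQueryRequest {
    private AgenticQueryRequest() {}

    /** Builds the JSON body for an agentic query with the agent_id inline. */
    static String build(String queryText, String agentId, List<String> queryFields) {
        String fields = queryFields.stream()
            .map(f -> "\"" + escape(f) + "\"")
            .collect(Collectors.joining(", "));
        return "{\"query\": {\"agentic\": {"
            + "\"query_text\": \"" + escape(queryText) + "\", "
            + "\"agent_id\": \"" + escape(agentId) + "\", "
            + "\"query_fields\": [" + fields + "]}}}";
    }

    // Escapes only backslashes and double quotes; enough for this sketch.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A caller would keep the agent ID in configuration and pass it to build on each request, so the per-request cost is one extra field rather than any manual work.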

@owaiskazi19
Member Author

owaiskazi19 commented Sep 17, 2025

I feel passing agent_id in the query is simpler than creating and attaching a processor. As the API is called by programs rather than by people, passing agent_id every time is not that bad, IMO.

Agreed, and we can provide both options: a system-generated processor and a user-defined processor. I will add both types in this same PR. Another reason to keep it in a separate pipeline is that other processors, such as RAG or ML inference, can be attached to the same pipeline to summarize the results.

then couldn't the same be said about query_fields?

It's an optional field that provides better context.

3 participants