feat: Vertex AI Search Tuning sample notebook #1944


Open
wants to merge 30 commits into main

Conversation

JincheolKim

Contributing a sample notebook for the "Vertex AI Search Tuning" feature to help users try it out.


  • [x] Follow the CONTRIBUTING Guide.
  • [x] You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • [x] Your Pull Request title follows the https://www.conventionalcommits.org/ specification.
  • [x] Ensure the tests and linter pass (run nox -s format from the repository root to format).
  • [x] Appropriate docs were updated (if necessary).

This PR does not fix an existing issue.


@JincheolKim JincheolKim requested a review from a team as a code owner April 7, 2025 03:35

@gemini-code-assist gemini-code-assist bot left a comment


Hello @JincheolKim, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a new sample notebook for Vertex AI Search Tuning. The notebook guides users through the process of tuning Vertex AI Search with sample data, including preparing data in JSONL and TSV formats, configuring a search tuning job, and submitting it to Vertex AI. The notebook also includes steps for creating a data store and search app, and testing the tuned search app endpoint. The author also included FAQ documents from the Kubernetes project, and transformed them into PDF format for use with Vertex AI Search.

Highlights

  • New Notebook: Adds a new notebook, sample-search-tuning.ipynb, demonstrating Vertex AI Search Tuning.
  • Data Preparation: Includes code to generate training and test datasets from Kubernetes FAQ markdown files, converting them to JSONL and TSV formats.
  • Datastore and Search App Creation: Provides steps and code for creating a data store and search app using the Vertex AI Search SDK.
  • Search Tuning Configuration: Demonstrates how to configure and submit a search tuning job to Vertex AI.
  • PDF Conversion: Includes steps to convert Markdown files to PDF format for compatibility with Vertex AI Search.
  • Clean Up: Includes code to clean up resources created during the notebook execution.

Changelog

  • search/vertexai-search-options/vertexai-search-tuning/sample-search-tuning.ipynb
    • Initial commit of the sample notebook.
    • Includes sections for data preparation, datastore creation, search app creation, search tuning configuration, and testing.
    • Provides code examples and explanations for each step of the process.
    • Includes clean up steps to remove created resources.
  • search/vertexai-search-options/vertexai-search-tuning/tuning_data/FAQ-Kubernetes-Client.md
    • Added FAQ document for Kubernetes Client
  • search/vertexai-search-options/vertexai-search-tuning/tuning_data/FAQ.md
    • Added FAQ document for Kubernetes
  • search/vertexai-search-options/vertexai-search-tuning/tuning_data/README.md
    • Added README document for Kubernetes
  • search/vertexai-search-options/vertexai-search-tuning/tuning_data/query_file.jsonl
    • Added query file for Kubernetes
  • search/vertexai-search-options/vertexai-search-tuning/tuning_data/test_data.tsv
    • Added test data file for Kubernetes
  • search/vertexai-search-options/vertexai-search-tuning/tuning_data/training_data.tsv
    • Added training data file for Kubernetes
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


In clouds of data, search does reside,
Relevance sought, with nowhere to hide.
Tune the LLM's ear,
Make the answers clear,
Find the right doc, with knowledge applied.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The notebook introduces a workflow for tuning Vertex AI Search using FAQ data from the Kubernetes project. It covers data preparation, datastore updates, search app rebuilding, and evaluation. The notebook is comprehensive, but a few areas could be improved for clarity and robustness.

Summary of Findings

  • TODO comments: The notebook contains a TODO comment that should be addressed by uncommenting the variables or providing a default implementation.
  • Inconsistent code execution: The notebook contains code that is commented out for macOS but not for Linux, which could lead to confusion for users on different operating systems.
  • Missing Error Handling: The notebook lacks explicit error handling for cloud storage operations, which could lead to unexpected behavior if these operations fail.

Merge Readiness

The notebook provides a good starting point for users to understand Vertex AI Search Tuning. However, addressing the TODO comment and adding error handling would improve the notebook's completeness and robustness. I am unable to directly approve this pull request, and recommend that others review and approve this code before merging. Given the presence of a high severity comment, I recommend that the pull request not be merged until this is addressed.

Comment on lines 1050 to 1055
"# TODO(developer): Uncomment these variables before running the sample.\n",
"data_store_id = f\"{SEARCH_DATASTORE_ID}\"\n",
"corpus_data_path = f\"{TUNING_DATA_PATH_REMOTE}/corpus_file.jsonl\"\n",
"query_data_path = f\"{TUNING_DATA_PATH_REMOTE}/query_file.jsonl\"\n",
"train_data_path = f\"{TUNING_DATA_PATH_REMOTE}/training_data.tsv\"\n",
"test_data_path = f\"{TUNING_DATA_PATH_REMOTE}/test_data.tsv\""


high

This section is marked as TODO. Either uncomment these variables or provide a default implementation. Leaving them commented will cause the notebook to fail if a user tries to run it without modification. It's also not clear what the user should set these variables to.

data_store_id = f"{SEARCH_DATASTORE_ID}"
corpus_data_path = f"{TUNING_DATA_PATH_REMOTE}/corpus_file.jsonl"
query_data_path = f"{TUNING_DATA_PATH_REMOTE}/query_file.jsonl"
train_data_path = f"{TUNING_DATA_PATH_REMOTE}/training_data.tsv"
test_data_path = f"{TUNING_DATA_PATH_REMOTE}/test_data.tsv"
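
For context, these variables feed the search tuning request. A hedged sketch of how they are typically passed to the Discovery Engine SDK follows (variable names match the quoted cell; PROJECT_ID is assumed to be set elsewhere, and the notebook's actual call may differ):

from google.cloud import discoveryengine

client = discoveryengine.SearchTuningServiceClient()

# Submit a long-running search tuning job against the data store.
operation = client.train_custom_model(
    request=discoveryengine.TrainCustomModelRequest(
        data_store=f"projects/{PROJECT_ID}/locations/global/collections/default_collection/dataStores/{data_store_id}",
        model_type="search-tuning",
        gcs_training_input=discoveryengine.TrainCustomModelRequest.GcsTrainingInput(
            corpus_data_path=corpus_data_path,
            query_data_path=query_data_path,
            train_data_path=train_data_path,
            test_data_path=test_data_path,
        ),
    )
)
print(operation.operation.name)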

Comment on lines 506 to 513
"# For Linux\n",
"!pandoc --pdf-engine=pdflatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf\n",
"!pandoc --pdf-engine=pdflatex FAQ.md -o FAQ.pdf\n",
"!pandoc --pdf-engine=pdflatex README.md -o README.pdf\n",
"## For MacOS\n",
"#!pandoc --pdf-engine=xelatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf\n",
"#!pandoc --pdf-engine=xelatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf\n",
"#!pandoc --pdf-engine=xelatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf"


medium

Consider using the same approach for macOS as for Linux, or provide conditional execution based on the OS to avoid confusion. What's the reasoning for commenting out these lines for macOS?

Comment on lines 1148 to 1149
"!gcloud storage cp \"{TUNING_DATA_PATH_LOCAL}/*.pdf\" \"{SEARCH_DATASTORE_PATH_REMOTE}\"\n",
"!gcloud storage ls \"{SEARCH_DATASTORE_PATH_REMOTE}\""


medium

Consider adding error handling to the cloud storage copy operations. If these operations fail, the notebook may not function correctly, and the user will not know why.
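
For illustration, one way to add that error handling is to run the copy through subprocess and fail loudly on a non-zero exit code (a minimal sketch; the helper name copy_to_gcs is hypothetical, and the two path variables are taken from the quoted cell):

import subprocess

def copy_to_gcs(source: str, destination: str) -> None:
    # Run the copy and surface gcloud's error output instead of failing silently.
    result = subprocess.run(
        ["gcloud", "storage", "cp", source, destination],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"gcloud storage cp failed: {result.stderr}")

copy_to_gcs(f"{TUNING_DATA_PATH_LOCAL}/*.pdf", SEARCH_DATASTORE_PATH_REMOTE)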

Comment on lines 1284 to 1285
" ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n",
" if location != \"global\"\n",


medium

Consider adding error handling around the Discovery Engine client configuration. If the endpoint is misconfigured, the notebook may not function correctly, and the user will not know why.

@JincheolKim
Author

Could you give me guidance on resolving the failures from the

Check spelling
Lint Code Base

checks during the PR process? Thank you very much.

Collaborator

@holtskinner holtskinner left a comment


Remove all of the duplicate files.

Move the Notebook to a folder called search/tuning.

Title it vertex-ai-search-tuning.ipynb

Change the Image link to be https://storage.googleapis.com/github-repo/generative-ai/search/tuning/images/key_user_journey_search_tuning.png

Remove all of the tuning data from this PR, and use the data from this public GCS bucket: gs://github-repo/generative-ai/search/tuning/tuning_data/. Either have users pull directly from this bucket for the tuning job, or have them create their own bucket and use gsutil cp to copy the data into it from this one, as in the sketch below.
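
For example, the copy into a user-owned bucket could be a single notebook cell (illustrative only; YOUR_BUCKET is a placeholder for the user's own bucket):

!gsutil cp -r gs://github-repo/generative-ai/search/tuning/tuning_data gs://YOUR_BUCKET/tuning_data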

Also, be sure to fix the spelling and lint errors

@JincheolKim
Copy link
Author

JincheolKim commented Apr 9, 2025

Please create a subfolder named "awesome_rlhf" under gs://github-repo/generative-ai/search/tuning. I will move all of the data into the public bucket. Thank you.

@JincheolKim
Copy link
Author

gs://github-repo/generative-ai/search/tuning

Please create a subfolder named "awesome_rlhf" under the gs://github-repo/generative-ai/search/tuning folder, or could you give me Create/Delete/Read/Write rights on that folder so I can create the subfolder myself? I need to move all of the test data into the subfolder to make the asset comply with your comments.

Jincheol Kim and others added 6 commits April 11, 2025 17:12
…ets were added.

1. Removed all of the duplicate files.
2. Moved the Notebook to a folder called search/tuning.
3. Retitled the notebook to "vertex-ai-search-tuning.ipynb".
4. Changed the Image link to be "https://storage.googleapis.com/github-repo/generative-ai/search/tuning/images/key_user_journey_search_tuning.png"
5. Removed all of the tuning data from this PR and moved it to the public GCS bucket gs://github-repo/generative-ai/search/tuning/tuning_data/. The notebook logic has been updated to match these changes.
…ets were added.

1. Added jincheolkim@ to CODEOWNERS for search/tuning.
2. Moved the contributed assets to search/tuning and removed all of the old files under the old path vertexai-search-options/vertexai-search-tuning from the working tree.
3. Removed all of the test data from the repository.
…ets were added.

1. Misspellings in the text were corrected.
…ets were added.

1. The spell-checking exception (vertexai-search-tuning.ipynb) has been added.
…ets were added.

1. The spell-checking exception is corrected again.
2. Added 4 exception words (FHIR, Jincheol, reidx, tscore).
@JincheolKim
Author

All of the requested corrections and actions are completed and committed. Please review them again. Thank you.

@JincheolKim JincheolKim requested a review from holtskinner April 14, 2025 07:51
@holtskinner holtskinner changed the title feat: Vertex AI Search Tuning sample notebook and its datasets were added. feat: Vertex AI Search Tuning sample notebook Apr 14, 2025
Comment on lines 343 to 345
"def generate_source_dataset(\n",
" source_file, corpus_filepath, query_filepath, cleanup_at_start=True\n",
"):\n",
Collaborator


Add typing for function parameters and return values.
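
For illustration, a fully typed signature could look like this (the None return type is an assumption, since the function body is not shown in the diff):

def generate_source_dataset(
    source_file: str,
    corpus_filepath: str,
    query_filepath: str,
    cleanup_at_start: bool = True,
) -> None: ...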

Author


Added descriptions of the helper function arguments, return values, and behavior.

Comment on lines 346 to 347
" import logging\n",
" import os\n",
Collaborator


Move all imports to a dedicated section at the top of the notebook
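
For illustration, a consolidated imports cell might look like this (the exact set of imports depends on the rest of the notebook):

import logging
import os
import platform

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine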

Author


All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".

Comment on lines 576 to 583
"# For Linux\n",
"!pandoc --pdf-engine=pdflatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf\n",
"!pandoc --pdf-engine=pdflatex FAQ.md -o FAQ.pdf\n",
"!pandoc --pdf-engine=pdflatex README.md -o README.pdf\n",
"## For macOS\n",
"#!pandoc --pdf-engine=xelatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf\n",
"#!pandoc --pdf-engine=xelatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf\n",
"#!pandoc --pdf-engine=xelatex FAQ-Kubernetes-Client.md -o FAQ-Kubernetes-Client.pdf"
Collaborator


You can check with Python which platform the user is running on, so the user doesn't need to uncomment lines.

This notebook has an example: https://github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/storytelling/storytelling.ipynb
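
A minimal sketch of that platform check (assuming the three Markdown files from the quoted cell; the engine choice mirrors the original comments):

import platform
import subprocess

# Use xelatex on macOS (Darwin) and pdflatex elsewhere, as in the original cell.
engine = "xelatex" if platform.system() == "Darwin" else "pdflatex"
for md_file in ["FAQ-Kubernetes-Client.md", "FAQ.md", "README.md"]:
    pdf_file = md_file.removesuffix(".md") + ".pdf"
    subprocess.run(
        ["pandoc", f"--pdf-engine={engine}", md_file, "-o", pdf_file],
        check=True,  # Raise immediately if pandoc fails.
    )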

Author

@JincheolKim JincheolKim Apr 16, 2025


Added a conditional branch based on the value of the platform.system() function.

"!git clone https://gitlab.com/jincheolkim/awesome-rlhf.git\n",
Collaborator


Should users be cloning from your Gitlab repo? Will public users have permission for this? Can this be replaced with the data from the GCS bucket?

Author


Corrected as you suggested. All of the data will be downloaded from the GCS bucket.

Comment on lines 655 to 656
"from google.api_core.client_options import ClientOptions\n",
"from google.cloud import discoveryengine\n",
Collaborator


Move all imports to a dedicated section at the top of the notebook.

Author


All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".

Comment on lines 667 to 672
" client_options = (\n",
" ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n",
" if location != \"global\"\n",
" else None\n",
" )\n",
"\n",
Collaborator


I'd recommend making a single ClientOptions object that all of these methods can use.

Also, move helper functions to a dedicated section at the top so that the rest of the notebook can just focus on the key steps, not the full code needed.
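
For illustration, the single shared ClientOptions object could be defined once near the top (LOCATION is assumed to be set in the notebook's configuration cell):

from google.api_core.client_options import ClientOptions

# None selects the global endpoint; regional locations need an explicit endpoint.
client_options = (
    ClientOptions(api_endpoint=f"{LOCATION}-discoveryengine.googleapis.com")
    if LOCATION != "global"
    else None
)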

Author


client_options is now defined once in the "Helper functions to facilitate the following steps" section in the earlier part of the notebook.

Comment on lines 778 to 795
"# Create a client\n",
"client = discoveryengine.DocumentServiceClient(client_options=client_options)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "33e54f26212a"
},
"outputs": [],
"source": [
"# The full resource name of the search engine branch.\n",
"# e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}\n",
"parent = client.branch_path(\n",
" project=PROJECT_ID,\n",
" location=LOCATION,\n",
" data_store=SEARCH_DATASTORE_ID,\n",
Collaborator


Since you have the other VAIS calls in dedicated methods, can you make this one a dedicated method as well?

Author


The logic was implemented as the function "import_documents", alongside the other helper functions in the "Helper Functions" section.
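
For reference, a hedged sketch of what such an import_documents helper could look like (it assumes the shared client_options object from earlier; the notebook's actual implementation may differ):

def import_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: str,
) -> str:
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    # Documents are imported into the default branch of the data store.
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    operation = client.import_documents(
        request=discoveryengine.ImportDocumentsRequest(
            parent=parent,
            gcs_source=discoveryengine.GcsSource(
                input_uris=[gcs_uri], data_schema="content"
            ),
            reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
        )
    )
    # Return the long-running operation name so the caller can poll it.
    return operation.operation.name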

Comment on lines 298 to 299
"! mkdir $TUNING_DATA_PATH_LOCAL\n",
"! gcloud storage cp $TUNING_DATA_PATH_SOURCE/* $TUNING_DATA_PATH_LOCAL"
Collaborator


Why is this data being copied locally? Is that absolutely needed? Could the GCS path be given directly to VAIS?

Author


This data is used locally to extract the corpus and query JSONL files and the training and test TSV files. The files are therefore downloaded to a local folder, preprocessed, and then uploaded to the data store again. This logic is intentional.

Comment on lines 1286 to 1287
"!gcloud storage cp \"{TUNING_DATA_PATH_LOCAL}/*.pdf\" \"{SEARCH_DATASTORE_PATH_REMOTE}\"\n",
"!gcloud storage ls \"{SEARCH_DATASTORE_PATH_REMOTE}\""
Collaborator


Similar to above, do these files need to be copied locally?

Author


Same as above. This part uploads the PDF files corresponding to the JSONL and TSV files so that the index is generated in addition to the base dataset.

Comment on lines 1432 to 1436
"def delete_engine(\n",
" project_id: str,\n",
" location: str,\n",
" engine_id: str,\n",
") -> str:\n",
Collaborator


Same as above, move these helper methods to a dedicated section at the top
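
For reference, a hedged sketch of the delete_engine helper after the move (again assuming the shared client_options; the real notebook code may differ):

def delete_engine(
    project_id: str,
    location: str,
    engine_id: str,
) -> str:
    client = discoveryengine.EngineServiceClient(client_options=client_options)
    name = client.engine_path(
        project=project_id,
        location=location,
        collection="default_collection",
        engine=engine_id,
    )
    # Deletion is a long-running operation; return its name for polling.
    operation = client.delete_engine(name=name)
    return operation.operation.name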

Author


This "delete_engine" helper function was also moved to the section "Helper Functions".

Jincheol Kim added 5 commits April 16, 2025 18:27
…#1944

1. Added descriptions of the helper function arguments, return values, and behavior (generate_source_dataset, generate_training_test_dataset).
2. All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".
3. In the cell executing pandoc to transform .md text files to PDF files, a conditional branch on the value of the platform.system() function was added.
4. Parts that downloaded from a personal GitLab repository were removed and corrected to download from the GCS bucket gs://github-repo/generative-ai/search/tuning.
5. All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".
6. client_options is defined once in the "Helper functions to facilitate the following steps" section in the earlier part of the notebook.
7. The logic in the section "Creating a data store for a search app with the cloud storage bucket with PDF documents" was implemented as the function "import_documents", alongside the other helper functions in the "Helper Functions" section.
8. The "delete_engine" helper function was also moved to the "Helper Functions" section.
…atform#1944

1. Added descriptions of the helper function arguments, return values, and behavior (generate_source_dataset, generate_training_test_dataset).
2. All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".
3. In the cell executing pandoc to transform .md text files to PDF files, a conditional branch on the value of the platform.system() function was added.
4. Parts that downloaded from a personal GitLab repository were removed and corrected to download from the GCS bucket gs://github-repo/generative-ai/search/tuning.
5. All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".
6. client_options is defined once in the "Helper functions to facilitate the following steps" section in the earlier part of the notebook.
7. The logic in the section "Creating a data store for a search app with the cloud storage bucket with PDF documents" was implemented as the function "import_documents", alongside the other helper functions in the "Helper Functions" section.
8. The "delete_engine" helper function was also moved to the "Helper Functions" section.
Jincheol Kim and others added 5 commits April 17, 2025 12:25
1. Indentation was adjusted in function doc comments.
2. Blank lines were added before comments that immediately follow function signatures.
…com:JincheolKim/generative-ai into vertexai-search-tuning-sample_V_1_0_0_ALPHA
@JincheolKim JincheolKim requested a review from holtskinner April 17, 2025 05:55
@JincheolKim
Author

All of the requested fixes have been made and committed. Your review and merge would be greatly appreciated.


  1. Added descriptions of the helper function arguments, return values, and behavior (generate_source_dataset, generate_training_test_dataset).
  2. All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".
  3. In the cell executing pandoc to transform .md text files to PDF files, a conditional branch on the value of the platform.system() function was added.
  4. Parts that downloaded from a personal GitLab repository were removed and corrected to download from the GCS bucket gs://github-repo/generative-ai/search/tuning.
  5. All of the imports were aggregated into a cell in the section "Set Google Cloud project information and initialize Vertex AI SDK for Python".
  6. client_options is defined once in the "Helper functions to facilitate the following steps" section in the earlier part of the notebook.
  7. The logic in the section "Creating a data store for a search app with the cloud storage bucket with PDF documents" was implemented as the function "import_documents", alongside the other helper functions in the "Helper Functions" section.
  8. The "delete_engine" helper function was also moved to the "Helper Functions" section.
