About

Product categorization project that uses the Gemini embeddings API to assign subcategories to products by comparing their embeddings with those of products that already have a subcategory.
See notebook here.

Objective

The objective was to understand how we can reliably assign subcategories to new products using existing product data through embeddings.
The categorization is based on embeddings similarities: the subcategory assigned to a new product is the subcategory of the existing product whose embeddings are the closest to the new product's embeddings.
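The assignment rule above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes cosine similarity as the measure of closeness (the metric is not stated here), and the function names, vectors, and subcategory labels are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_subcategory(new_embedding, catalog):
    """Return the subcategory of the catalog entry closest to new_embedding.

    catalog is a list of (embedding, subcategory) pairs for existing products.
    """
    _, best_subcategory = max(
        catalog,
        key=lambda item: cosine_similarity(new_embedding, item[0]),
    )
    return best_subcategory


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
catalog = [
    ([0.9, 0.1, 0.0], "Headphones"),
    ([0.1, 0.8, 0.2], "Kitchen"),
    ([0.0, 0.2, 0.9], "Toys"),
]

print(assign_subcategory([0.8, 0.2, 0.1], catalog))  # prints "Headphones"
```

Real embeddings have hundreds of dimensions, but the rule is the same: pick the single closest existing product and reuse its subcategory.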

Embeddings API

In this project, the embeddings are product names transformed into lists of float values, known as vectors.
The closer the embeddings of two values are, the closer their semantic meaning.
For this reason, embeddings can be used to group data based on their semantic similarity.
For more details, refer to Google's embeddings guide.

The Gemini embeddings API offers the task_type parameter to enable the user to specify what the embeddings will be used for.

In this project, two task types were used to compare their reliability:

semantic similarity
classification

For the same value, different task types produce different embeddings.
See the TaskType API reference page.
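As a rough sketch of how the two task types could be compared for one product name: the helper below calls an embedding function once per task type. The `gemini_embed` function assumes the `google.generativeai` Python package is installed and configured with an API key, and the model name `models/embedding-001` is an assumption, not necessarily what the notebook uses. The demonstration at the bottom uses a stand-in function so the snippet runs without network access.

```python
TASK_TYPES = ("semantic_similarity", "classification")


def embed_with_task_types(text, embed_fn, task_types=TASK_TYPES):
    """Return {task_type: embedding} for one text, one call per task type."""
    return {task_type: embed_fn(text, task_type) for task_type in task_types}


def gemini_embed(text, task_type):
    """Embed text with the Gemini API (assumes the package and an API key are set up)."""
    import google.generativeai as genai  # imported lazily; not needed for the demo below

    response = genai.embed_content(
        model="models/embedding-001",  # assumed model name
        content=text,
        task_type=task_type,
    )
    return response["embedding"]


def stand_in_embed(text, task_type):
    """Hypothetical offline replacement: returns a tiny fake vector."""
    return [float(len(text)), float(len(task_type))]


vectors = embed_with_task_types("wireless mouse", stand_in_embed)
print(sorted(vectors))
```

With a configured API key, passing `gemini_embed` instead of `stand_in_embed` would return one real embedding per task type for the same product name.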

Testing plan

To check the reliability of this process, tests were run on products that already had a subcategory.
Pass/Fail conditions are as follows:

Pass: the new subcategory assigned to a product is identical to its default subcategory
Fail: the new subcategory assigned to a product is different from its default subcategory

See the testing steps below:

Load source file with all the products and their subcategory
The file is divided into 2 parts:
- dataset, used as a baseline to assign a new subcategory to new products
- testing subset, containing products with existing subcategories, used to check if the new subcategory assigned is identical to the existing subcategory
Generate embeddings for each product of the dataset
Generate embeddings for the product of the testing subset
Compare the embeddings of the testing subset product to the embeddings of the dataset products
Return the dataset product with the closest embeddings to the embeddings of the testing subset product
Assign the returned dataset product's subcategory to the testing subset product
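The steps above, together with the Pass/Fail conditions, can be sketched as a small test loop. This is an illustration under assumptions: embeddings are taken as already generated, cosine similarity is assumed as the closeness measure, and the product names, vectors, and field names are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def run_tests(dataset, testing_subset):
    """Assign each test product the subcategory of its closest dataset product.

    Both arguments are lists of dicts with 'name', 'embedding', 'subcategory'.
    """
    results = []
    for product in testing_subset:
        closest = max(
            dataset,
            key=lambda d: cosine_similarity(product["embedding"], d["embedding"]),
        )
        results.append({
            "name": product["name"],
            "default": product["subcategory"],
            "assigned": closest["subcategory"],
            # Pass only if the assigned subcategory matches the default one.
            "passed": closest["subcategory"] == product["subcategory"],
        })
    return results


# Hypothetical 2-dimensional embeddings, for illustration only.
dataset = [
    {"name": "Product 1", "embedding": [1.0, 0.0], "subcategory": "A"},
    {"name": "Product 2", "embedding": [0.0, 1.0], "subcategory": "B"},
    {"name": "Product 3", "embedding": [0.7, 0.7], "subcategory": "C"},
]
testing_subset = [
    {"name": "Product 10", "embedding": [0.6, 0.8], "subcategory": "B"},
    {"name": "Product 11", "embedding": [0.95, 0.05], "subcategory": "A"},
]

results = run_tests(dataset, testing_subset)
passed = sum(r["passed"] for r in results)
print(f"{passed} of {len(results)} passed")  # prints "1 of 2 passed"
```

In this toy data, Product 10 fails because its closest neighbour belongs to subcategory C while its default subcategory is B.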

See the simplified diagram below

In this simplified diagram, we can see the test failed for Product 10 since the new subcategory returned (C) is different from the default subcategory (B).

Results

The test results indicate that product categorization is more reliable with the semantic similarity task_type than with the classification task_type.

On a sample of 100 products:

66 passed the tests with the semantic similarity task_type
53 passed the tests with the classification task_type

Given the small scale of this project, the next step would be to experiment on a larger scale to yield better results by leveraging the following elements:

large dataset (1M+ products)
large testing subset (1000+ products)
batch method to generate embeddings in shorter processing time
vector database to store/query embeddings

Output example

In the output example below, we can see that the test has failed for product #6 since the new subcategory is different from the default subcategory.

Material

The material used is an Amazon product dataset csv file (source).
