Test out BERTopic to get meaningful topic segmentations of a query dataset #291

@Gautam-Rajeev

Goal:

Get an accurate list of topics (about 20 at most) for an agricultural dataset of around 20k unique queries using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.

Description

Segregate the given dataset into topics using BERTopic.
The quality of the clusters is difficult to measure, so for now they will have to be inspected and verified manually.
Any suggestions for measuring this better are welcome.
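
As one possible starting point for measuring cluster quality beyond manual inspection, the sketch below computes two common automatic signals from each topic's top words: topic diversity (how little the topics' top words overlap) and NPMI coherence (how often a topic's top words appear in the same query). This is a minimal illustration and not a method prescribed by this ticket; the function names and the whitespace-tokenized input format are assumptions, and the scores should be read alongside manual checks rather than replace them.

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Fraction of distinct words among the top words of all topics (1.0 = no overlap)."""
    all_words = [word for words in topics for word in words]
    return len(set(all_words)) / len(all_words) if all_words else 0.0


def npmi_coherence(top_words, tokenized_docs):
    """Mean NPMI over all pairs of a topic's top words, based on document co-occurrence."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return 0.0
    doc_sets = [set(doc) for doc in tokenized_docs]

    def prob(*words):
        # Share of documents that contain every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: lowest possible NPMI
        elif p12 == 1:
            scores.append(1.0)   # assumption: treat "in every document" as 1.0 to avoid 0/0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0
```

Scores near 1 suggest the top words of a topic tend to occur together in the same queries, while scores near -1 suggest they rarely do. Because the queries are single sentences, co-occurrence counts will be sparse, so comparing scores between runs is likely more informative than reading any single value in isolation.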

One can also use simple TF-IDF, Topic2vec or LDA if they form better clusters. Each query is a single-sentence question, not a paragraph.

Implementation Details

It'll include the following:

  • This will just be a Colab notebook documenting the analysis
  • If the results are good, this will be used as a classification model for similar queries
  • Intuitive clusters that may form (usable as initial seeds if required)
    - paddy pest management
    - paddy seed selection
    - how to cultivate ____ crop
    - pest management for ____ crop
    - best variety of seed for ____ crop
    - wheat cultivation practices
    - Scheme available from the govt
    - wheat management and cultivation
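
The points above could be sketched roughly as follows, using BERTopic's guided topic modeling (`seed_topic_list`) and topic reduction (`nr_topics`). This is a hedged starting point rather than the expected solution: the seed keywords are an assumption derived from the intuitive clusters listed above, the helper names `load_unique_queries` and `fit_topics` are hypothetical, and the case-insensitive de-duplication is one possible preprocessing choice.

```python
import csv
import io

# Seed keywords derived from the intuitive clusters in this ticket.
# The exact word choices are an assumption and should be tuned against the data.
SEED_TOPICS = [
    ["paddy", "pest", "management"],
    ["paddy", "seed", "selection"],
    ["wheat", "cultivation", "practices"],
    ["government", "scheme"],
]


def load_unique_queries(csv_file, column="questioninEnglish"):
    """Read one column from an open CSV file, dropping blanks and case-insensitive duplicates."""
    seen, queries = set(), []
    for row in csv.DictReader(csv_file):
        text = (row.get(column) or "").strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            queries.append(text)
    return queries


def fit_topics(queries, max_topics=20):
    """Fit a guided BERTopic model and reduce it to roughly `max_topics` topics."""
    # Imported lazily so the loader above works without the dependency
    # (requires `pip install bertopic`).
    from bertopic import BERTopic

    model = BERTopic(seed_topic_list=SEED_TOPICS, nr_topics=max_topics)
    topics, _ = model.fit_transform(queries)
    return model, topics
```

After fitting, `model.get_topic_info()` gives a table of topics with their sizes and representative words for manual review, and `model.transform(new_queries)` assigns unseen queries to the learned topics, which is the route to reusing the model as a classifier if the clusters look good.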

Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. You can ask questions and propose solutions in the comments. Relevant points and the ticket will be assigned to the best PR raised.

Other links

Medium

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

Python, BERT, ML

Category

Feature

Mentor(s)

@GautamR-Samagra

Complexity

Low
