Description
Goal:
Get an accurate list of topics (about 20 at most) for an agricultural dataset of about 20k unique queries using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.
Description
Segregate the given dataset into topics using BERTopic.
The validity of the clusters is difficult to measure and, for now, will have to be inspected and verified manually.
Any suggestions for measuring this better are welcome.
One can also use simple TF-IDF, Topic2vec or LDA if they form better clusters. Each query is a single-sentence question, not a paragraph.
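As one possible answer to the request for a better measure, topic coherence can complement manual inspection. The sketch below computes average NPMI (normalized pointwise mutual information) over the top words of a single topic, using document-level co-occurrence. It is only a suggestion, not a method prescribed by this ticket: the function name `npmi_coherence` is hypothetical, and it assumes the queries have already been lowercased and split into tokens.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    documents is a list of token lists (one per query). Each pair scores
    between -1 (the words never co-occur) and 1 (the words only ever
    occur together); higher averages suggest a more coherent topic.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    scores = []
    for w1, w2 in combinations(top_words, 2):
        n1 = sum(1 for d in doc_sets if w1 in d)
        n2 = sum(1 for d in doc_sets if w2 in d)
        n12 = sum(1 for d in doc_sets if w1 in d and w2 in d)
        if n12 == 0:
            # The pair never appears in the same query.
            scores.append(-1.0)
            continue
        p1, p2, p12 = n1 / n_docs, n2 / n_docs, n12 / n_docs
        if p12 == 1.0:
            # Both words appear in every query, so the pair is uninformative.
            scores.append(0.0)
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data for illustration only (not taken from the real dataset).
docs = [
    ["paddy", "pest"],
    ["paddy", "pest"],
    ["wheat", "seed"],
    ["wheat", "seed"],
]
coherent = npmi_coherence(["paddy", "pest"], docs)
incoherent = npmi_coherence(["paddy", "wheat"], docs)
```

With BERTopic, the top words of a topic could be taken from `topic_model.get_topic(topic_id)`, which returns (word, score) pairs. A score like this is a rough signal for comparing runs or models (BERTopic vs. TF-IDF vs. LDA) and would not replace the manual check.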
Implementation Details
It will include the following:
- This will just be a Colab notebook documenting the analysis
- If results are good, this will be used as a classification model for similar queries
- Intuitive clusters that may form (to be used as initial seeds if required):
  - paddy pest management
  - paddy seed selection
  - how to cultivate ____ crop
  - pest management for ____ crop
  - best variety of seed for ____ crop
  - wheat cultivation practices
  - schemes available from the government
  - wheat management and cultivation
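The intuitive clusters above could be passed to BERTopic as seed topics. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper names `build_seed_topics` and `fit_guided_topics`, the CSV path, and the exact seed keywords are all hypothetical, and the crop list is only an example. It relies on BERTopic's `seed_topic_list` and `nr_topics` parameters and reads only the 'questioninEnglish' column.

```python
def build_seed_topics(crops):
    """Expand the '____ crop' templates into seed keyword lists.

    The keyword choices are illustrative assumptions derived from the
    cluster names in this ticket.
    """
    seeds = [["government", "scheme"]]
    for crop in crops:
        seeds.append([crop, "pest", "management"])
        seeds.append([crop, "seed", "variety"])
        seeds.append([crop, "cultivation", "practices"])
    return seeds


def fit_guided_topics(csv_path, crops, nr_topics=20):
    """Fit a seeded BERTopic model on the unique English questions."""
    # Imported lazily so build_seed_topics works without these packages.
    import pandas as pd
    from bertopic import BERTopic

    df = pd.read_csv(csv_path)  # csv_path is a placeholder for the dataset file
    docs = (
        df["questioninEnglish"].dropna().astype(str).drop_duplicates().tolist()
    )
    topic_model = BERTopic(
        seed_topic_list=build_seed_topics(crops),  # nudges topics toward the seeds
        nr_topics=nr_topics,  # reduce the fitted topics to roughly this many
    )
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, docs, topics


seed_topics = build_seed_topics(["paddy", "wheat"])
```

In a notebook, one might call `fit_guided_topics("queries.csv", ["paddy", "wheat"])` and then review `topic_model.get_topic_info()` by hand. Seeds only bias the model toward these themes; they do not guarantee that each seed becomes its own topic.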
Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. Questions and proposed solutions can be posted in the comments. Relevant points and the ticket will be assigned to the best PR raised.
Other links
Product Name
AI Tools
Organization Name
SamagraX
Domain
NA
Tech Skills Needed
Python, BERT, ML
Category
Feature
Mentor(s)
@GautamR-Samagra
Complexity
Low