Identify Attendees Based on Voice #56
Comments
Will add speaker identification once
@sujithatzackriya I can work on speaker identification. Do you have any approaches in mind other than the pyannote-whisper project mentioned in this issue's description?
Observations and solutions
I was thinking of using WhisperX, which is built on the faster-whisper backend and adds enhancements such as VAD preprocessing, word-level timestamps, and speaker diarization: https://github.com/m-bain/whisperX
I identified this during some initial research a while back. These are the advantages and disadvantages I see.
Advantages
Disadvantages
We chose whisper.cpp for the efficiency of C++, and moving to Python might introduce memory issues.
Alternative solutions
Moving the backend completely to Rust is also worth exploring, since libraries such as mistral.rs, pyanote.rs, and whisper-rs are available. This needs careful research and experimentation.
My thoughts
Since we are using LLMs, identifying the speaker from the transcript is possible. Personally, I have never had an issue with the lack of speaker diarization. It would be really nice to understand how this can be helpful.
Hi,
Thanks so much for all the hard work on this project!
I would like to suggest a feature to identify attendees based on their voice from audio recordings.
The idea is to:
• Detect who is speaking
• Match voices to known attendees
• Generate labeled transcripts based on attendee
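To make the "match voices to known attendees" step concrete, here is a rough sketch of one common way to do it: compare a voice embedding for each detected speaker against embeddings enrolled for known attendees, using cosine similarity. Everything in it is an assumption for illustration: the function names, the 0.7 threshold, and the tiny three-dimensional vectors are hypothetical, and real embeddings would come from a speaker-embedding model, which is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_attendee(embedding, enrolled, threshold=0.7):
    """Return the enrolled attendee whose reference embedding is most similar.

    `enrolled` maps attendee name -> reference embedding (hypothetical format).
    Voices scoring below `threshold` (an assumed value) are labeled "unknown".
    """
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name if best_name is not None else "unknown"


# Toy data only; real embeddings have hundreds of dimensions.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(match_attendee([0.9, 0.1, 0.0], enrolled))
print(match_attendee([0.0, 0.0, 1.0], enrolled))
```

The threshold is what lets the tool say "unknown" for a guest who was never enrolled, rather than forcing every voice onto the closest known attendee.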
Possible Approach
The pyannote-whisper project looks like a great starting point that could be incorporated, as it combines:
• whisper for transcription
• pyannote-audio for speaker diarization
This could be adapted to recognize specific attendees and tag their speech in recordings.
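As a rough illustration of how the two outputs could be combined, the sketch below labels each transcript segment with the speaker whose diarization turns overlap it the most in time. This is only a sketch under assumptions: the dictionary shapes for `segments` and `turns`, the `label_segments` name, and the `SPEAKER_00` style labels are hypothetical stand-ins, not the actual output formats of whisper or pyannote-audio, and it is not necessarily how pyannote-whisper does the merge.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Attach a speaker label to each transcript segment.

    segments: list of {"start", "end", "text"} dicts (assumed shape).
    turns:    list of {"start", "end", "speaker"} dicts (assumed shape).
    A segment gets the speaker with the largest total overlap, or "unknown".
    """
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else "unknown"
        labeled.append({**seg, "speaker": speaker})
    return labeled


# Toy data for illustration only.
segments = [
    {"start": 0.0, "end": 2.0, "text": "Shall we start?"},
    {"start": 2.0, "end": 5.0, "text": "Yes, here is the agenda."},
]
turns = [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 6.0, "speaker": "SPEAKER_01"},
]
for item in label_segments(segments, turns):
    print(f'{item["speaker"]}: {item["text"]}')
```

Mapping the anonymous diarization labels to real attendee names would be a separate step on top of this.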