Identify Attendees Based on Voice #56

Open
tyrwinn opened this issue Mar 21, 2025 · 3 comments
tyrwinn commented Mar 21, 2025

Hi,

Thanks so much for all the hard work on this project!

I would like to suggest a feature to identify attendees based on their voice from audio recordings.

The idea is to:
• Detect who is speaking
• Match voices to known attendees
• Generate labeled transcripts based on attendee

Possible Approach

The pyannote-whisper project looks like a promising starting point to build on, as it combines:
• whisper for transcription
• pyannote-audio for speaker diarization

This could be adapted to recognize specific attendees and tag their speech in recordings.
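To illustrate the core step such an adaptation needs, here is a minimal sketch of aligning transcription with diarization: given transcript segments (as whisper produces, with start/end times) and speaker turns (as pyannote-audio produces), assign each segment the speaker whose turn overlaps it most. The dict shapes are simplified assumptions, not the exact output format of either library:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(transcript_segments, speaker_turns):
    """Attach a speaker label to each transcript segment.

    transcript_segments: [{"start": float, "end": float, "text": str}, ...]
    speaker_turns:       [{"start": float, "end": float, "speaker": str}, ...]

    Each segment gets the speaker whose turn overlaps it the most in time,
    or "UNKNOWN" if no turn overlaps it at all.
    """
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in speaker_turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Matching real attendees (rather than anonymous SPEAKER_00-style labels) would then be a separate step on top, e.g. comparing voice embeddings against enrolled attendee profiles.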

@sujithatzackriya
Collaborator

Will add speaker identification once:

  1. The ollama generation issue is fixed
  2. The basic DB and related logic are improved


ilyamochalov commented Apr 1, 2025

@sujithatzackriya I can work on speaker identification. Do you have any approaches in mind other than the pyannote-whisper mentioned in this issue's description?

@sujithatzackriya
Collaborator

@ilyamochalov

Observations and solutions

I was thinking of using WhisperX, which builds on the faster-whisper backend and adds enhancements such as VAD pre-processing, word-level timestamps, and speaker diarization.

https://github.com/m-bain/whisperX

I identified this during initial research some time back. The advantages and disadvantages I see are:

Advantages

  1. Good community support
  2. Uses the faster-whisper backend, which has strong benchmark scores

Disadvantages

  1. Written in Python
  2. Uses more RAM

We chose whisper.cpp for the efficiency of C++; moving to Python might introduce memory issues.

Alternative solutions

Moving the backend completely to Rust is also worth exploring, since libraries such as mistral.rs, pyannote-rs, and whisper-rs are available, but this needs careful research and experimentation.

My thoughts

Since we are already using LLMs, identifying speakers from the transcript content itself is possible. Personally, I have never missed having speaker diarization, so it would be really helpful to understand the use cases where this feature matters.
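To make the transcript-based idea concrete, here is a toy, non-LLM stand-in: it maps diarized labels (e.g. SPEAKER_00) to attendee names by scanning each speaker's text for a self-introduction. An LLM prompt over the transcript could do the same far more robustly; the regex, names, and segment shape below are all hypothetical illustrations:

```python
import re

# Toy self-introduction patterns ("I'm Alice", "This is Bob").
# An LLM would generalize well beyond these fixed phrasings.
INTRO_RE = re.compile(r"\b(?:I'm|I am|[Tt]his is)\s+([A-Z][a-z]+)")

def name_speakers(labeled_segments):
    """Map diarization labels to names found in self-introductions.

    labeled_segments: [{"speaker": str, "text": str}, ...]

    Returns a dict like {"SPEAKER_00": "Alice"}; labels whose text
    never contains an introduction are left out of the mapping.
    """
    mapping = {}
    for seg in labeled_segments:
        if seg["speaker"] in mapping:
            continue  # first introduction wins for each label
        m = INTRO_RE.search(seg["text"])
        if m:
            mapping[seg["speaker"]] = m.group(1)
    return mapping
```

This only attaches names when someone introduces themselves, which is exactly the gap that voice-profile matching (or an LLM with meeting context, e.g. the attendee list) would fill.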
