This project provides real-time speech-to-text transcription using Faster Whisper and voice activity detection (VAD). It efficiently detects speech, transcribes audio, and supports seamless real-time processing with optional Kafka-based post-processing.
- Advanced Voice Activity Detection: Detects speech and pauses with high accuracy using the Silero VAD model.
- Real-Time Parallel Processing: Ensures recording and transcription run simultaneously without interruptions.
- Post-Processing Integration: Supports custom actions after transcription, including Kafka-based text-to-speech (TTS) integration.
- Cross-Device Compatibility: Works with various audio devices, including user-defined favorites.
The application employs a Producer-Consumer Model:
- Producer (Recording):
  - Records audio chunks in real time.
  - Uses VAD to detect speech segments.
  - Sends detected audio files to a processing queue.
- Consumer (Transcription):
  - Processes audio chunks from the queue.
  - Transcribes speech using Faster Whisper.
  - Outputs incremental and grouped transcription results.
- Kafka Integration:
  - Transcription results can be sent to a Kafka topic.
  - A Kafka consumer listens for messages on the topic and converts text to speech using a TTS API.
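A minimal sketch of this model using Python's standard `queue` and `threading` modules (the file names and function bodies are illustrative, not the project's actual code):

```python
# Minimal producer-consumer sketch: one thread enqueues audio file paths,
# another dequeues and transcribes them. File names and function bodies are
# illustrative, not the project's actual code.
import queue
import threading

audio_queue = queue.Queue()
STOP = None  # sentinel telling the consumer to shut down

def producer():
    # In the real app: record chunks, run VAD, write speech segments to disk.
    for path in ["segment_001.wav", "segment_002.wav"]:
        audio_queue.put(path)
    audio_queue.put(STOP)

def consumer():
    while True:
        path = audio_queue.get()
        if path is STOP:
            break
        # In the real app: transcribe `path` with Faster Whisper here.
        print(f"Transcribing {path}...")

threading.Thread(target=producer, daemon=True).start()
consumer()
```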
- Audio Input: The user selects a microphone, or one is auto-detected.
- Voice Activity Detection (see the sketch after this list):
  - Identifies speech and pauses.
  - Saves speech segments to temporary files.
  - Sends file paths to a queue for transcription.
- Parallel Processing:
  - Recording and transcription occur in separate threads.
- Post-Processing:
  - Transcription results are sent to Kafka for downstream applications like TTS.
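For reference, loading the Silero VAD model and extracting speech timestamps typically looks like this (a sketch based on the public `snakers4/silero-vad` examples; the audio file name is illustrative):

```python
# Sketch of Silero VAD usage, following the public snakers4/silero-vad examples;
# the audio file name is illustrative.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio("recording.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # e.g. [{'start': 4000, 'end': 28000}, ...] in samples
```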
- NVIDIA GPU with CUDA support (e.g., RTX 3090 or higher recommended).
- Supported microphone (e.g., Elgato Wave XLR, Jabra SPEAK 410).
- Operating System: Ubuntu 20.04+ or Windows 10+.
- Python 3.8+.
- NVIDIA CUDA Toolkit and cuDNN.
Follow these steps to install the CUDA Toolkit (12.6.3) and cuDNN (9.6.0) on Ubuntu:
```bash
wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
sudo sh cuda_12.6.3_560.35.05_linux.run

wget https://developer.download.nvidia.com/compute/cudnn/9.6.0/local_installers/cudnn-local-repo-ubuntu2204-9.6.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.6.0_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-9.6.0/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn
```
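After installation, a quick way to confirm the GPU is visible from Python (assuming PyTorch is installed, as it is needed for the Silero VAD model):

```python
# Quick sanity check that CUDA is usable; assumes PyTorch is installed.
import torch

if torch.cuda.is_available():
    print(f"CUDA OK: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available - check the driver, CUDA Toolkit, and cuDNN install.")
```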
Clone the repository:

```bash
git clone https://github.com/Teachings/sts.git
cd sts
```
Install the dependencies:
```bash
pip install -r requirements.txt
```
- Start the transcription application:

  ```bash
  python stt/main.py
  ```

- Start the Kafka consumer for text-to-speech:

  ```bash
  python kafka_consumer.py
  ```

- Select a microphone from the listed devices when prompted.
- Speak into the microphone to see real-time transcription and hear the synthesized speech.
Navigate to the `kafka` directory and use `docker-compose` to start the Kafka services:

```bash
cd kafka
docker-compose up -d
```
- Verify GPU and CUDA Compatibility:
  - Ensure the system has a CUDA-enabled NVIDIA GPU.
  - Install compatible versions of the CUDA Toolkit and cuDNN.
- Clone the Repository:
  - Clone the project to the target system.
  - Install dependencies using `pip`.
- Configure Audio Devices:
  - Use `sounddevice.query_devices()` to list available microphones (see the sketch after this list).
  - Modify `config.yml` to specify preferred devices if necessary.
- Run the Application:
  - Start the transcription and Kafka consumer as described above.
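For example, listing devices and inspecting the chosen one with `sounddevice` (the device index below is illustrative; pick one from the printed table):

```python
# List audio devices and inspect a chosen input device.
# The index below is illustrative; pick one from the printed table.
import sounddevice as sd

print(sd.query_devices())  # indexed table of all input/output devices

device_index = 1  # replace with your microphone's index
info = sd.query_devices(device_index, "input")
print(f"Using input device: {info['name']}")
```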
- CUDA Errors:
  - Verify that CUDA and cuDNN are correctly installed.
  - Ensure the Faster Whisper model is configured to use `float16` for GPU acceleration (see the sketch after this list).
- No Audio Devices Found:
  - Confirm that the microphone is properly connected and recognized by the OS.
  - Run `sounddevice.query_devices()` to check available devices.
- Kafka Connection Issues:
  - Ensure that the Kafka broker is running and accessible.
  - Verify that the `BROKER` and `TOPIC_NAME` in `kafka_consumer.py` match the Kafka setup.
- Slow Transcription:
  - Ensure that the application is utilizing the GPU.
  - Use `nvidia-smi` to monitor GPU utilization.
- Check logs for detailed error messages.
- Use `nvidia-smi` to ensure GPU resources are allocated to the application.
- Verify that the correct microphone is selected.
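As a reference point for the `float16` setting, a minimal GPU-backed Faster Whisper setup looks like this (model size and audio path are illustrative):

```python
# Minimal Faster Whisper GPU setup; model size and audio path are illustrative.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("speech_segment.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```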
This sub-project implements text-to-speech message handling, session management, real-time agent decision-making, and aggregation of transcriptions for final summaries. It’s built around Kafka for message passing, Postgres for storing transcriptions, and LLM-based agents (using ollama) to make decisions on how to process user requests.
- Transcriptions are published to `transcriptions.all`.
- PersistenceProcessor stores each transcription in Postgres, tagged by user ID and timestamp.
- RealTimeProcessor analyzes each transcription in real time:
  - Checks if it needs an immediate action (turn on lights, do a web search, etc.).
  - Determines whether to create, destroy, or ignore session boundaries.
- SessionProcessor listens for session commands (`CREATE`, `DESTROY`, `NO_ACTION`) on `sessions.management`.
  - Maintains the sessions table in Postgres, including a timeout mechanism for long sessions.
  - Triggers an AggregatorProcessor when a session is destroyed.
- AggregatorProcessor gathers all transcriptions from the final session window and compiles them into a summary stored in the sessions table.
- IntentProcessor can perform text-to-speech or other post-processing on "action required" events.
This design allows each piece to be independently developed, tested, and scaled.
```
intent_analysis/
├── run_real_time_processor.py
├── run_aggregator_processor.py
├── run_session_processor.py
├── run_persistence_processor.py
├── init_db.py
├── run_intent_processor.py
├── run_transcription_processor.py
├── services/
│   └── tts_service.py
├── core/
│   ├── database.py
│   ├── logger.py
│   ├── base_processor.py
│   ├── base_avro_processor.py
│   └── config_loader.py
├── agents/
│   ├── realtime_agent.py
│   ├── session_management_agent.py
│   └── decision_agent.py
├── processors/
│   ├── transcription_processor.py
│   ├── aggregator_processor.py
│   ├── intent_processor.py
│   ├── persistence_processor.py
│   ├── session_processor.py
│   └── real_time_processor.py
└── config/
    ├── prompts.yml
    └── config.yml
```
- agents/: LLM-based logic (e.g., RealTimeAgent, SessionManagementAgent, DecisionAgent).
- processors/: Kafka consumers that orchestrate reading/writing messages, calling agents, storing data, etc. (see the sketch after this list).
- core/: Reusable utilities (database connections, logging, base classes, config loading).
- services/: Additional micro-services or integration logic (e.g., TTS).
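To make the processor pattern concrete, here is a minimal sketch of such a consumer loop (using `confluent-kafka` with plain JSON for brevity; the real processors build on the Avro-based base classes in `core/`, and the broker address and group id here are illustrative):

```python
# Minimal sketch of the processor pattern: consume a topic, act on each message.
# Uses confluent-kafka with plain JSON for brevity; the real processors build on
# the Avro base classes in core/. Broker address and group id are illustrative.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["transcriptions.all"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Hand the event to an agent, store it in Postgres, etc.
        print(f"{event['user']}: {event['text']}")
finally:
    consumer.close()
```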
All configs are in `config/config.yml`. Key sections include:
- kafka: Contains broker addresses, consumer groups, and topic names.
- db: Postgres connection parameters.
- ollama: LLM settings (model name, host URL, etc.).
- text_to_speech: TTS endpoints, API keys, etc.
- app: Additional project-wide settings.
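Loading this file follows the usual PyYAML pattern (a minimal sketch; the project's own loader lives in `core/config_loader.py` and may differ):

```python
# Minimal config-loading sketch; core/config_loader.py may differ in detail.
import yaml

with open("config/config.yml") as f:
    config = yaml.safe_load(f)

kafka_cfg = config["kafka"]  # broker addresses, consumer groups, topic names
db_cfg = config["db"]        # Postgres connection parameters
```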
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure Kafka + Postgres are running (e.g., via Docker Compose or your preferred setup).

- Initialize Database:

  ```bash
  python init_db.py
  ```

  - This creates the `transcriptions` and `sessions` tables if they don't exist.

- Start Processors (in separate terminals or background processes):

  ```bash
  python run_persistence_processor.py
  python run_real_time_processor.py
  python run_session_processor.py
  python run_aggregator_processor.py
  ```

  - (Optional) Start `run_intent_processor.py` if you want TTS on agent actions.
  - (Optional) Start `run_transcription_processor.py` if you're using the older DecisionAgent approach.

- Publish Transcriptions to `transcriptions.all`:

  - Typically from a speech-to-text system that sends messages like the following (see the producer sketch after this list):

    ```json
    {
      "timestamp": "2024-12-21 23:00:21",
      "text": "The user is seeking assistance with accessibility features.",
      "user": "mukul"
    }
    ```

  - This triggers the pipeline:
    - PersistenceProcessor saves it to DB.
    - RealTimeProcessor decides if it's an immediate command or session action.

- Observe logs in each processor for real-time debugging information (`info`, `debug`, `error` logs).
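A minimal sketch of publishing such a message (using `confluent-kafka` with plain JSON; the broker address is illustrative, and the real pipeline may serialize with Avro instead):

```python
# Sketch of publishing a transcription event to transcriptions.all.
# Broker address is illustrative; the real pipeline may use Avro, not JSON.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
event = {
    "timestamp": "2024-12-21 23:00:21",
    "text": "The user is seeking assistance with accessibility features.",
    "user": "mukul",
}
producer.produce("transcriptions.all", value=json.dumps(event).encode("utf-8"))
producer.flush()  # block until delivery completes
```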
- Speech-to-text publishes JSON to `transcriptions.all`.
- PersistenceProcessor saves every record in DB.
- RealTimeProcessor:
  - Uses RealtimeAgent to see if any immediate action is needed.
    - If yes, publishes to `transcriptions.agent.action`.
  - Uses SessionManagementAgent to see if we should `CREATE`/`DESTROY` or do `NO_ACTION` about sessions.
    - If `CREATE` or `DESTROY`, publishes to `sessions.management`.
- SessionProcessor:
  - Consumes from `sessions.management`, updates the `sessions` table, and if the session is ended, publishes a message to `aggregations.request`.
- AggregatorProcessor:
  - Consumes from `aggregations.request`, fetches transcriptions from the DB within the session window, and updates `sessions.summary`.
- IntentProcessor (optional):
  - Consumes from `transcriptions.agent.action`, reads the `reasoning` field, and performs TTS playback.
- Add new processors by creating a subclass of `BaseAvroProcessor`, hooking into new or existing topics (see the sketch after this list).
- Add new LLM agents for specialized tasks (similar to `RealtimeAgent` or `SessionManagementAgent`).
- Modify DB via migrations or `initialize_database` for additional columns/tables.
- Adjust session logic or timeouts in `SessionProcessor` to suit your needs.
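As a rough illustration of the first point (the actual `BaseAvroProcessor` interface is defined in `core/base_avro_processor.py`; the constructor arguments and hook method name below are hypothetical):

```python
# Purely illustrative: the real BaseAvroProcessor interface lives in
# core/base_avro_processor.py; the constructor signature and the
# handle_message hook shown here are hypothetical.
from core.base_avro_processor import BaseAvroProcessor


class AlertProcessor(BaseAvroProcessor):
    """Hypothetical processor that watches a new topic for alert events."""

    def __init__(self, config):
        super().__init__(config, topic="alerts.incoming")  # hypothetical signature

    def handle_message(self, message):  # hypothetical hook name
        print(f"Alert received: {message}")
```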
Review the `readme.md` file in the `intent_analysis` subfolder for the detailed architecture of this application.