This project provides real-time speech-to-text transcription using Faster Whisper and voice activity detection (VAD). It efficiently detects speech, transcribes audio, and supports seamless real-time processing with optional Kafka-based post-processing.
- Advanced Voice Activity Detection: Detects speech and pauses with high accuracy using the Silero VAD model.
- Real-Time Parallel Processing: Ensures recording and transcription run simultaneously without interruptions.
- Post-Processing Integration: Supports custom actions after transcription, including Kafka-based text-to-speech (TTS) integration.
- Cross-Device Compatibility: Works with various audio devices, including user-defined favorites.
The application employs a Producer-Consumer Model:
- Producer (Recording):
  - Records audio chunks in real time.
  - Uses VAD to detect speech segments.
  - Sends detected audio files to a processing queue.
- Consumer (Transcription):
  - Processes audio chunks from the queue.
  - Transcribes speech using Faster Whisper.
  - Outputs incremental and grouped transcription results.
- Kafka Integration:
  - Transcription results can be sent to a Kafka topic.
  - A Kafka consumer listens for messages on the topic and converts text to speech using a TTS API.
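A minimal sketch of this model using Python's standard `queue` and `threading` modules (the file names and function bodies are illustrative, not the project's actual code):

```python
# Minimal producer-consumer sketch: one thread enqueues audio file paths,
# another dequeues and transcribes them. File names and function bodies are
# illustrative, not the project's actual code.
import queue
import threading

audio_queue = queue.Queue()
STOP = None  # sentinel telling the consumer to shut down

def producer():
    # In the real app: record chunks, run VAD, write speech segments to disk.
    for path in ["segment_001.wav", "segment_002.wav"]:
        audio_queue.put(path)
    audio_queue.put(STOP)

def consumer():
    while True:
        path = audio_queue.get()
        if path is STOP:
            break
        # In the real app: transcribe `path` with Faster Whisper here.
        print(f"Transcribing {path}...")

threading.Thread(target=producer, daemon=True).start()
consumer()
```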
- Audio Input: The user selects a microphone, or one is auto-detected.
- Voice Activity Detection (see the sketch after this list):
  - Identifies speech and pauses.
  - Saves speech segments to temporary files.
  - Sends file paths to a queue for transcription.
- Parallel Processing:
  - Recording and transcription occur in separate threads.
- Post-Processing:
  - Transcription results are sent to Kafka for downstream applications like TTS.
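For reference, loading the Silero VAD model and extracting speech timestamps typically looks like this (a sketch based on the public `snakers4/silero-vad` examples; the audio file name is illustrative):

```python
# Sketch of Silero VAD usage, following the public snakers4/silero-vad examples;
# the audio file name is illustrative.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio("recording.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # e.g. [{'start': 4000, 'end': 28000}, ...] in samples
```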
- NVIDIA GPU with CUDA support (e.g., RTX 3090 or higher recommended).
- Supported microphone (e.g., Elgato Wave XLR, Jabra SPEAK 410).
- Operating System: Ubuntu 20.04+ or Windows 10+.
- Python 3.8+.
- NVIDIA CUDA Toolkit and cuDNN.
Follow these steps to install the CUDA Toolkit (12.6.3) and cuDNN (9.6.0) on Ubuntu:
```bash
wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
sudo sh cuda_12.6.3_560.35.05_linux.run

wget https://developer.download.nvidia.com/compute/cudnn/9.6.0/local_installers/cudnn-local-repo-ubuntu2204-9.6.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.6.0_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-9.6.0/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn
```
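After installation, a quick way to confirm the GPU is visible from Python (assuming PyTorch is installed, as it is needed for the Silero VAD model):

```python
# Quick sanity check that CUDA is usable; assumes PyTorch is installed.
import torch

if torch.cuda.is_available():
    print(f"CUDA OK: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available - check the driver, CUDA Toolkit, and cuDNN install.")
```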
Clone the repository:

```bash
git clone https://github.com/Teachings/sts.git
cd sts
```
Install the dependencies:
```bash
pip install -r requirements.txt
```
- Start the transcription application:

  ```bash
  python stt/main.py
  ```

- Start the Kafka consumer for text-to-speech:

  ```bash
  python kafka_consumer.py
  ```

- Select a microphone from the listed devices when prompted.
- Speak into the microphone to see real-time transcription and hear the synthesized speech.
Navigate to the `kafka` directory and use `docker-compose` to start the Kafka services:

```bash
cd kafka
docker-compose up -d
```
- Verify GPU and CUDA Compatibility:
  - Ensure the system has a CUDA-enabled NVIDIA GPU.
  - Install compatible versions of the CUDA Toolkit and cuDNN.
- Clone the Repository:
  - Clone the project to the target system.
  - Install dependencies using `pip`.
- Configure Audio Devices:
  - Use `sounddevice.query_devices()` to list available microphones (see the sketch after this list).
  - Modify `config.yml` to specify preferred devices if necessary.
- Run the Application:
  - Start the transcription and Kafka consumer as described above.
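For example, listing devices and inspecting the chosen one with `sounddevice` (the device index below is illustrative; pick one from the printed table):

```python
# List audio devices and inspect a chosen input device.
# The index below is illustrative; pick one from the printed table.
import sounddevice as sd

print(sd.query_devices())  # indexed table of all input/output devices

device_index = 1  # replace with your microphone's index
info = sd.query_devices(device_index, "input")
print(f"Using input device: {info['name']}")
```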
- CUDA Errors:
  - Verify that CUDA and cuDNN are correctly installed.
  - Ensure the Faster Whisper model is configured to use `float16` for GPU acceleration (see the sketch after this list).
- No Audio Devices Found:
  - Confirm that the microphone is properly connected and recognized by the OS.
  - Run `sounddevice.query_devices()` to check available devices.
- Kafka Connection Issues:
  - Ensure that the Kafka broker is running and accessible.
  - Verify that the `BROKER` and `TOPIC_NAME` in `kafka_consumer.py` match the Kafka setup.
- Slow Transcription:
  - Ensure that the application is utilizing the GPU.
  - Use `nvidia-smi` to monitor GPU utilization.
- Check logs for detailed error messages.
- Use `nvidia-smi` to ensure GPU resources are allocated to the application.
- Verify that the correct microphone is selected.
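As a reference point for the `float16` setting, a minimal GPU-backed Faster Whisper setup looks like this (model size and audio path are illustrative):

```python
# Minimal Faster Whisper GPU setup; model size and audio path are illustrative.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("speech_segment.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```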
This sub-project implements text-to-speech message handling, session management, real-time agent decision-making, and aggregation of transcriptions for final summaries. It’s built around Kafka for message passing, Postgres for storing transcriptions, and LLM-based agents (using ollama) to make decisions on how to process user requests.
- Transcriptions are published to `transcriptions.all`.
- PersistenceProcessor stores each transcription in Postgres, tagged by user ID and timestamp.
- RealTimeProcessor analyzes each transcription in real time:
  - Checks if it needs an immediate action (turn on lights, do a web search, etc.).
  - Determines whether to create, destroy, or ignore session boundaries.
- SessionProcessor listens for session commands (`CREATE`, `DESTROY`, `NO_ACTION`) on `sessions.management`.
  - Maintains the sessions table in Postgres, including a timeout mechanism for long sessions.
  - Triggers an AggregatorProcessor when a session is destroyed.
- AggregatorProcessor gathers all transcriptions from the final session window and compiles them into a summary stored in the sessions table.
- IntentProcessor can perform text-to-speech or other post-processing on "action required" events.
This design allows each piece to be independently developed, tested, and scaled.
```
intent_analysis/
├── run_real_time_processor.py
├── run_aggregator_processor.py
├── run_session_processor.py
├── run_persistence_processor.py
├── init_db.py
├── run_intent_processor.py
├── run_transcription_processor.py
├── services/
│   └── tts_service.py
├── core/
│   ├── database.py
│   ├── logger.py
│   ├── base_processor.py
│   ├── base_avro_processor.py
│   └── config_loader.py
├── agents/
│   ├── realtime_agent.py
│   ├── session_management_agent.py
│   └── decision_agent.py
├── processors/
│   ├── transcription_processor.py
│   ├── aggregator_processor.py
│   ├── intent_processor.py
│   ├── persistence_processor.py
│   ├── session_processor.py
│   └── real_time_processor.py
└── config/
    ├── prompts.yml
    └── config.yml
```
- agents/: LLM-based logic (e.g., RealTimeAgent, SessionManagementAgent, DecisionAgent).
- processors/: Kafka consumers that orchestrate reading/writing messages, calling agents, storing data, etc. (see the sketch after this list).
- core/: Reusable utilities (database connections, logging, base classes, config loading).
- services/: Additional micro-services or integration logic (e.g., TTS).
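To make the processor pattern concrete, here is a minimal sketch of such a consumer loop (using `confluent-kafka` with plain JSON for brevity; the real processors build on the Avro-based base classes in `core/`, and the broker address and group id here are illustrative):

```python
# Minimal sketch of the processor pattern: consume a topic, act on each message.
# Uses confluent-kafka with plain JSON for brevity; the real processors build on
# the Avro base classes in core/. Broker address and group id are illustrative.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["transcriptions.all"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Hand the event to an agent, store it in Postgres, etc.
        print(f"{event['user']}: {event['text']}")
finally:
    consumer.close()
```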
All configs are in `config/config.yml`. Key sections include:
- kafka: Contains broker addresses, consumer groups, and topic names.
- db: Postgres connection parameters.
- ollama: LLM settings (model name, host URL, etc.).
- text_to_speech: TTS endpoints, API keys, etc.
- app: Additional project-wide settings.
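Loading this file follows the usual PyYAML pattern (a minimal sketch; the project's own loader lives in `core/config_loader.py` and may differ):

```python
# Minimal config-loading sketch; core/config_loader.py may differ in detail.
import yaml

with open("config/config.yml") as f:
    config = yaml.safe_load(f)

kafka_cfg = config["kafka"]  # broker addresses, consumer groups, topic names
db_cfg = config["db"]        # Postgres connection parameters
```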
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure Kafka + Postgres are running (e.g., via Docker Compose or your preferred setup).

- Initialize Database:

  ```bash
  python init_db.py
  ```

  - This creates the `transcriptions` and `sessions` tables if they don't exist.

- Start Processors (in separate terminals or background processes):

  ```bash
  python run_persistence_processor.py
  python run_real_time_processor.py
  python run_session_processor.py
  python run_aggregator_processor.py
  ```

  - (Optional) Start `run_intent_processor.py` if you want TTS on agent actions.
  - (Optional) Start `run_transcription_processor.py` if you're using the older DecisionAgent approach.

- Publish Transcriptions to `transcriptions.all`:

  - Typically from a speech-to-text system that sends messages like the following (see the producer sketch after this list):

    ```json
    {
      "timestamp": "2024-12-21 23:00:21",
      "text": "The user is seeking assistance with accessibility features.",
      "user": "mukul"
    }
    ```

  - This triggers the pipeline:
    - PersistenceProcessor saves it to DB.
    - RealTimeProcessor decides if it's an immediate command or session action.

- Observe logs in each processor for real-time debugging information (`info`, `debug`, `error` logs).
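A minimal sketch of publishing such a message (using `confluent-kafka` with plain JSON; the broker address is illustrative, and the real pipeline may serialize with Avro instead):

```python
# Sketch of publishing a transcription event to transcriptions.all.
# Broker address is illustrative; the real pipeline may use Avro, not JSON.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
event = {
    "timestamp": "2024-12-21 23:00:21",
    "text": "The user is seeking assistance with accessibility features.",
    "user": "mukul",
}
producer.produce("transcriptions.all", value=json.dumps(event).encode("utf-8"))
producer.flush()  # block until delivery completes
```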
- Speech-to-text publishes JSON to `transcriptions.all`.
- PersistenceProcessor saves every record in DB.
- RealTimeProcessor:
  - Uses RealtimeAgent to see if any immediate action is needed.
    - If yes, publishes to `transcriptions.agent.action`.
  - Uses SessionManagementAgent to see if we should `CREATE`/`DESTROY` or do `NO_ACTION` about sessions.
    - If `CREATE` or `DESTROY`, publishes to `sessions.management`.
- SessionProcessor:
  - Consumes from `sessions.management`, updates the `sessions` table, and if the session is ended, publishes a message to `aggregations.request`.
- AggregatorProcessor:
  - Consumes from `aggregations.request`, fetches transcriptions from the DB within the session window, and updates `sessions.summary`.
- IntentProcessor (optional):
  - Consumes from `transcriptions.agent.action`, reads the `reasoning` field, and performs TTS playback.
- Add new processors by creating a subclass of `BaseAvroProcessor`, hooking into new or existing topics (see the sketch after this list).
- Add new LLM agents for specialized tasks (similar to `RealtimeAgent` or `SessionManagementAgent`).
- Modify DB via migrations or `initialize_database` for additional columns/tables.
- Adjust session logic or timeouts in `SessionProcessor` to suit your needs.
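As a rough illustration of the first point (the actual `BaseAvroProcessor` interface is defined in `core/base_avro_processor.py`; the constructor arguments and hook method name below are hypothetical):

```python
# Purely illustrative: the real BaseAvroProcessor interface lives in
# core/base_avro_processor.py; the constructor signature and the
# handle_message hook shown here are hypothetical.
from core.base_avro_processor import BaseAvroProcessor


class AlertProcessor(BaseAvroProcessor):
    """Hypothetical processor that watches a new topic for alert events."""

    def __init__(self, config):
        super().__init__(config, topic="alerts.incoming")  # hypothetical signature

    def handle_message(self, message):  # hypothetical hook name
        print(f"Alert received: {message}")
```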
Review the `readme.md` file in the `intent_analysis` subfolder for the detailed architecture of this application.