feat(vad): implement native voice activity detection for Linux #846

wchest · 2025-09-29T17:01:06Z

Summary

Implements native voice activity detection for Linux using CPAL audio capture and the Silero VAD model, providing better platform integration and reliability compared to browser-based audio processing.

Motivation

While the existing web-based VAD works well, browser audio APIs can have limitations on Linux systems. This native implementation provides:

Platform Integration: Direct OS-level audio capture using CPAL instead of browser MediaRecorder API
Device Control: OS-level device enumeration and selection without browser restrictions
Audio Reliability: Native audio streams eliminate browser-specific audio issues
Local Processing: All audio processing happens locally, maintaining the project's privacy-first approach

Implementation Details

Architecture

Service Layer: New NativeVadService alongside existing VadService with identical interface
Dynamic Selection: Runtime service selection based on user settings (recording.vad.useNative)
Rust Backend: CPAL audio capture + Silero VAD processing in src-tauri/src/recorder/vad.rs
Event Communication: Tauri events for speech start/end with embedded audio data
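The speech start/end event flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in vad.rs: `VadEvent`, `SpeechTracker`, and `push` are hypothetical names, the per-chunk speech probability is assumed to come from the Silero model, and the Tauri emit call is left out. Ending speech on the first below-threshold chunk is a simplification.

```rust
/// Hypothetical event type; the real payload shape in vad.rs is not shown in this PR.
#[derive(Debug, PartialEq)]
enum VadEvent {
    SpeechStart,
    /// Carries the captured samples so the frontend never has to read a file.
    SpeechEnd { samples: Vec<i16> },
}

/// Turns per-chunk speech probabilities into start/end events.
struct SpeechTracker {
    threshold: f32,
    in_speech: bool,
    buffer: Vec<i16>,
}

impl SpeechTracker {
    fn new(threshold: f32) -> Self {
        Self {
            // Keep the threshold inside the 0.1-0.9 range offered by the settings slider.
            threshold: threshold.clamp(0.1, 0.9),
            in_speech: false,
            buffer: Vec::new(),
        }
    }

    /// Feed one audio chunk plus the model's speech probability for that chunk.
    fn push(&mut self, chunk: &[i16], probability: f32) -> Option<VadEvent> {
        let is_speech = probability >= self.threshold;
        match (self.in_speech, is_speech) {
            // Silence -> speech: start buffering and report the start.
            (false, true) => {
                self.in_speech = true;
                self.buffer.extend_from_slice(chunk);
                Some(VadEvent::SpeechStart)
            }
            // Still speaking: keep buffering, nothing to report.
            (true, true) => {
                self.buffer.extend_from_slice(chunk);
                None
            }
            // Speech -> silence: hand the buffered audio over with the end event.
            (true, false) => {
                self.in_speech = false;
                Some(VadEvent::SpeechEnd {
                    samples: std::mem::take(&mut self.buffer),
                })
            }
            (false, false) => None,
        }
    }
}
```

Keeping start and end as separate events lets the UI switch its icon as soon as speech begins, while the audio only travels once, with the end event.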

Key Features

Configurable sensitivity slider (0.1-0.9 threshold) in recording settings
Automatic session cleanup to prevent conflicts
OS-level device enumeration and selection
Proper state management matching web VAD behavior
Direct file handling without frontend permission issues
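One way the automatic session cleanup listed above could work is to stop any active session before starting a new one. The sketch below assumes that approach; `SessionManager`, `VadSession`, and the integer session ids are hypothetical and do not come from the PR, and stopping the real audio stream is reduced to dropping the session value.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a running capture session.
struct VadSession {
    id: u64,
}

#[derive(Default)]
struct Inner {
    active: Option<VadSession>,
    next_id: u64,
}

/// Allows at most one active session at a time.
#[derive(Default)]
struct SessionManager {
    inner: Mutex<Inner>,
}

impl SessionManager {
    /// Starts a new session and returns (new id, id of the session it replaced, if any).
    fn start(&self) -> (u64, Option<u64>) {
        let mut inner = self.inner.lock().unwrap();
        // Taking the old session out drops it, which is where a real
        // backend would stop the audio stream and release the device.
        let replaced = inner.active.take().map(|old| old.id);
        inner.next_id += 1;
        let id = inner.next_id;
        inner.active = Some(VadSession { id });
        (id, replaced)
    }

    /// Stops the active session, returning its id if there was one.
    fn stop(&self) -> Option<u64> {
        self.inner.lock().unwrap().active.take().map(|s| s.id)
    }
}
```

Calling `start` twice in a row replaces the first session instead of leaving two captures competing for the same device.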

Technical Implementation

CPAL: Cross-platform audio library for native audio capture
16kHz Preference: Optimal sample rate for Silero VAD model
Event Architecture: Separate speech start/end events for accurate UI state transitions
Embedded Audio: File contents included in events to avoid Tauri filesystem permission issues
Platform Gating: Linux-only initially for focused testing and validation
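The 16kHz preference can be read as "use 16 kHz when the device supports it, otherwise fall back". The sketch below shows that selection under assumptions: `RateRange` is a hypothetical stand-in for the supported sample-rate ranges a device reports, and the closest-rate fallback is an illustrative choice, since the PR does not describe what happens when 16 kHz is unavailable.

```rust
/// Hypothetical mirror of one supported sample-rate range reported by a device (in Hz).
#[derive(Debug, Clone, Copy, PartialEq)]
struct RateRange {
    min_hz: u32,
    max_hz: u32,
}

/// Sample rate preferred for the Silero VAD model, per the PR description.
const PREFERRED_HZ: u32 = 16_000;

/// Picks 16 kHz when any range covers it, otherwise the nearest range boundary.
fn pick_sample_rate(ranges: &[RateRange]) -> Option<u32> {
    if ranges
        .iter()
        .any(|r| r.min_hz <= PREFERRED_HZ && PREFERRED_HZ <= r.max_hz)
    {
        return Some(PREFERRED_HZ);
    }
    // Assumed fallback: the supported boundary closest to 16 kHz.
    // The caller would then need to resample before running the model.
    ranges
        .iter()
        .flat_map(|r| [r.min_hz, r.max_hz])
        .min_by_key(|hz| hz.abs_diff(PREFERRED_HZ))
}
```

With a device that only offers 44.1 to 48 kHz, this returns 44100 and leaves resampling to the caller; with no supported ranges at all it returns `None`.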

User Experience

Settings Integration

Native VAD toggle in recording settings (Linux only)
Real-time sensitivity slider with live value display
Dynamic description text showing active VAD implementation
Switching between web and native VAD modes from settings (per the commit notes, a page reload is needed to apply the change)

Testing

✅ Toggle native VAD on/off in settings
✅ Sensitivity slider functionality across full range
✅ Device enumeration and selection
✅ End-to-end recording and transcription workflow
✅ Session management and cleanup
✅ UI state transitions and icon accuracy
✅ Error handling and graceful fallbacks

Breaking Changes

None. This is purely additive:

Existing web VAD remains default and unchanged
Native VAD is opt-in via settings checkbox
All existing functionality preserved

Dependencies

Added voice_activity_detector = "0.2.1" to provide Silero VAD model integration.

Files Changed

src-tauri/src/recorder/vad.rs - New native VAD implementation
src/lib/services/native-vad.ts - TypeScript service wrapper
src/lib/settings/settings.ts - Added VAD configuration options
src/routes/(config)/settings/recording/+page.svelte - UI controls and descriptions
Various integration files for service selection and query handling

Future Considerations

Test and potentially expand to other platforms after Linux validation
Consider additional audio processing configuration options
Explore integration with other native audio features

This implementation maintains Epicenter's local-first philosophy while providing Linux users with improved audio processing reliability through native platform integration.

This adds native voice activity detection for Linux using the Silero VAD model, providing better speech detection performance compared to the web-based VAD.

Key features:
- Configurable sensitivity slider (0.1-0.9 threshold)
- Automatic session cleanup to prevent conflicts
- Event-based communication between Rust backend and TypeScript frontend
- Proper state management matching web VAD behavior
- Device enumeration support for consistent UI

Technical implementation:
- Uses voice_activity_detector crate with Silero v5 model
- CPAL for audio capture with 16kHz sample rate preference
- Separate events for speech start/end with proper timing
- File contents embedded in events to bypass permission issues
- Dynamic service selection based on user settings

UI improvements:
- Fixed icon mapping: ear (👂) for listening, chat bubble (💬) for speech detected
- Sensitivity slider only shown when native VAD is enabled
- Settings require page reload to apply VAD mode changes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

… VAD

The VAD mode description now dynamically shows whether native Silero VAD or web-based VAD is being used, providing accurate information to users about the underlying implementation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Will Chester and others added 2 commits September 27, 2025 23:21
