wchest commented Sep 29, 2025

Summary

Implements native voice activity detection (VAD) for Linux using CPAL audio capture and the Silero VAD model, providing better platform integration and reliability than browser-based audio processing.

Motivation

While the existing web-based VAD works well, browser audio APIs can have limitations on Linux systems. This native implementation provides:

  • Platform Integration: Direct OS-level audio capture using CPAL instead of browser MediaRecorder API
  • Device Control: OS-level device enumeration and selection without browser restrictions
  • Audio Reliability: Native audio streams eliminate browser-specific audio issues
  • Local Processing: All audio processing happens locally, maintaining the project's privacy-first approach

Implementation Details

Architecture

  • Service Layer: New NativeVadService alongside existing VadService with identical interface
  • Dynamic Selection: Runtime service selection based on user settings (recording.vad.useNative)
  • Rust Backend: CPAL audio capture + Silero VAD processing in src-tauri/src/recorder/vad.rs
  • Event Communication: Tauri events for speech start/end with embedded audio data
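
The runtime selection described above could look roughly like the sketch below. It is illustrative only and is not the code from this PR: the `VadLike` interface, the stub factory, and `selectVadService` are hypothetical names. Only the `recording.vad.useNative` settings key and the Linux-only gating come from this description.

```typescript
// Hypothetical shared interface; the real VadService surface may differ.
interface VadLike {
  readonly kind: "web" | "native";
  startActiveListening(): Promise<void>;
  stopActiveListening(): Promise<void>;
}

// Stub standing in for VadService ("web") and NativeVadService ("native").
function createStubVad(kind: "web" | "native"): VadLike {
  return {
    kind,
    async startActiveListening() {},
    async stopActiveListening() {},
  };
}

// Assumed settings shape: a flat object keyed by the setting name.
type Settings = { "recording.vad.useNative": boolean };

// Use the native service only when the user opted in AND the platform is Linux;
// every other combination falls back to the existing web VAD.
function selectVadService(settings: Settings, platform: string): VadLike {
  const useNative = settings["recording.vad.useNative"] && platform === "linux";
  return createStubVad(useNative ? "native" : "web");
}
```

Because both services expose the same interface, callers do not need to know which implementation they received.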

Key Features

  • Configurable sensitivity slider (0.1-0.9 threshold) in recording settings
  • Automatic session cleanup to prevent conflicts
  • OS-level device enumeration and selection
  • Proper state management matching web VAD behavior
  • Direct file handling without frontend permission issues
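
As a rough illustration of the sensitivity setting, the sketch below clamps a slider value to the 0.1-0.9 range and compares it against a per-frame speech probability. The function names, the 0.5 fallback, and the assumption that the threshold is compared directly against the model's output probability are all illustrative and are not taken from this PR.

```typescript
const MIN_THRESHOLD = 0.1;
const MAX_THRESHOLD = 0.9;
const FALLBACK_THRESHOLD = 0.5; // assumed default, not stated in the PR

// Keep the slider value inside the supported range; fall back on invalid input.
function clampThreshold(value: number): number {
  if (Number.isNaN(value)) return FALLBACK_THRESHOLD;
  return Math.min(MAX_THRESHOLD, Math.max(MIN_THRESHOLD, value));
}

// Treat a frame as speech when its probability meets the clamped threshold.
function isSpeech(probability: number, threshold: number): boolean {
  return probability >= clampThreshold(threshold);
}
```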

Technical Implementation

  • CPAL: Cross-platform audio library for native audio capture
  • 16 kHz Preference: Capture at 16 kHz when the device supports it, a sample rate the Silero VAD model accepts natively
  • Event Architecture: Separate speech start/end events for accurate UI state transitions
  • Embedded Audio: File contents included in events to avoid Tauri filesystem permission issues
  • Platform Gating: Linux-only initially for focused testing and validation
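
To make the event architecture concrete, here is a minimal frontend-side sketch of how separate start/end events might drive UI state and carry the recorded audio. The event names, payload fields, state names, and the byte-array encoding of the audio are assumptions for illustration. The description only states that speech start and speech end are separate events and that the file contents are embedded in the event.

```typescript
type VadState = "IDLE" | "LISTENING" | "SPEECH_DETECTED";

// Hypothetical payloads; the real event names and fields may differ.
type VadEvent =
  | { type: "speech-start" }
  | { type: "speech-end"; audio: number[] }; // embedded file bytes (assumed encoding)

// Speech start moves the UI to SPEECH_DETECTED, speech end returns it to LISTENING.
// Events that arrive after the session has stopped (IDLE) are ignored.
function nextState(state: VadState, event: VadEvent): VadState {
  if (state === "IDLE") return state;
  return event.type === "speech-start" ? "SPEECH_DETECTED" : "LISTENING";
}

// Extract the embedded audio so the frontend never has to read the file from disk.
function toBytes(event: VadEvent): Uint8Array | null {
  return event.type === "speech-end" ? Uint8Array.from(event.audio) : null;
}

// Icons for the two active states, following the mapping in the commit message.
const ICONS: Record<Exclude<VadState, "IDLE">, string> = {
  LISTENING: "👂",
  SPEECH_DETECTED: "💬",
};
```

Keeping the transition logic in a pure function like `nextState` makes the UI state easy to test without a running audio session.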

User Experience

Settings Integration

  • Native VAD toggle in recording settings (Linux only)
  • Real-time sensitivity slider with live value display
  • Dynamic description text showing active VAD implementation
  • Seamless switching between web and native VAD modes

Testing

  • ✅ Toggle native VAD on/off in settings
  • ✅ Sensitivity slider functionality across full range
  • ✅ Device enumeration and selection
  • ✅ End-to-end recording and transcription workflow
  • ✅ Session management and cleanup
  • ✅ UI state transitions and icon accuracy
  • ✅ Error handling and graceful fallbacks

Breaking Changes

None. This is purely additive:

  • Existing web VAD remains default and unchanged
  • Native VAD is opt-in via settings checkbox
  • All existing functionality preserved

Dependencies

Added voice_activity_detector = "0.2.1" to provide Silero VAD model integration.

Files Changed

  • src-tauri/src/recorder/vad.rs - New native VAD implementation
  • src/lib/services/native-vad.ts - TypeScript service wrapper
  • src/lib/settings/settings.ts - Added VAD configuration options
  • src/routes/(config)/settings/recording/+page.svelte - UI controls and descriptions
  • Various integration files for service selection and query handling

Future Considerations

  • Test and potentially expand to other platforms after Linux validation
  • Consider additional audio processing configuration options
  • Explore integration with other native audio features

This implementation maintains Epicenter's local-first philosophy while providing Linux users with improved audio processing reliability through native platform integration.

Will Chester and others added 2 commits September 27, 2025 23:21
This adds native voice activity detection for Linux using the Silero VAD model,
providing better speech detection performance compared to the web-based VAD.

Key features:
- Configurable sensitivity slider (0.1-0.9 threshold)
- Automatic session cleanup to prevent conflicts
- Event-based communication between Rust backend and TypeScript frontend
- Proper state management matching web VAD behavior
- Device enumeration support for consistent UI

Technical implementation:
- Uses voice_activity_detector crate with Silero v5 model
- CPAL for audio capture with 16kHz sample rate preference
- Separate events for speech start/end with proper timing
- File contents embedded in events to bypass permission issues
- Dynamic service selection based on user settings

UI improvements:
- Fixed icon mapping: ear (👂) for listening, chat bubble (💬) for speech detected
- Sensitivity slider only shown when native VAD is enabled
- Settings require page reload to apply VAD mode changes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… VAD

The VAD mode description now dynamically shows whether native Silero VAD
or web-based VAD is being used, providing accurate information to users
about the underlying implementation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>