Carry on a spoken conversation with large language models! This project uses Whisper speech-to-text to transcribe the user's voice, sends the transcript to the LLM, and pipes the result to Piper text-to-speech. A client device captures audio and sends it over a socket to the server, which does all of the processing; the final TTS output is then sent back to the client device.
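As a rough illustration of that client-server flow, here is a minimal sketch (the address, framing, and audio format are assumptions for illustration; the real protocol is defined by the client and server code):

```python
# Illustrative client-side sketch of the socket flow (the real framing
# and audio format are defined by the project, not by this code).
import socket

SERVER = ("192.168.1.10", 5432)  # matches --ip-address / --port on the server

with socket.create_connection(SERVER) as sock:
    # Stream captured microphone audio to the server...
    with open("recording.raw", "rb") as mic:   # stand-in for a live mic
        while chunk := mic.read(4096):
            sock.sendall(chunk)
    sock.shutdown(socket.SHUT_WR)              # tell the server we're done

    # ...then collect whatever TTS audio the server sends back.
    with open("reply.raw", "wb") as out:
        while chunk := sock.recv(4096):
            out.write(chunk)
```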
This program only works on Linux, specifically Ubuntu/Debian and Arch systems.
Make sure you have an OpenAI-API-compatible server, such as Ollama, running for the LLM. Then pass the model to start.sh with --llm-model (this is shown in the configuring section). Run the setup.sh script on both the client and the server. Next, run start.sh on the server first, followed by the client.
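For example, assuming Ollama as the backend and the scripts in the repository root (the model tag below is just the one used later in the configuring section):

```bash
# Pull the model so Ollama can serve it (any OpenAI-API-compatible
# backend works; this tag is the example from the configuring section)
ollama pull hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q4_K_M

# Run on both the client and the server:
./setup.sh

# Start the server first, then the client on the client device:
./start.sh
```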
- 100% offline, open source, and private
- Wake word detection: 'Hey Jarvis'
- Hands-free interaction
- Client-server model
- Fully multilingual pipeline
- Streamed responses
On both the client and the server, configuration is done by opening 'start.sh' and passing the corresponding flags to 'main.py', like so:
python3 main.py --stt-cuda --stt-model large-v3-turbo --llm-model hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q4_K_M --ip-address 0.0.0.0 --port 5432
Piper models can be changed in src/llm-voice-assistant-server/piper-models/piper-models.json. Only one model can be enabled per language. Make sure all models for your language are disabled by setting their enabled and auto_start fields to false, then set those fields to true on a different model.
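As a rough sketch (the voice names and surrounding structure are illustrative guesses; only the enabled and auto_start fields are taken from the file itself), switching English voices might look like:

```json
{
  "en_US-lessac-medium": { "enabled": false, "auto_start": false },
  "en_GB-alan-medium":   { "enabled": true,  "auto_start": true }
}
```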
The language has to be supported by the STT (Whisper languages), the LLM (Llama 3.1 languages), and the TTS (Piper languages). This means that by default the application supports the following languages: English, German, French, Italian, Portuguese, and Spanish. However, more languages can easily be added by switching to a large language model with broader coverage, such as Mistral Small.
The output text generated by the LLM is chunked into sentences and run against Lingua to detect the language. A TTS model for that language is then downloaded and cached for future use. Unfortunately, this means that the first time a language is used, generating a response back to the user is extremely slow. This is done to save on memory, as loading all of the TTS models can take ~2.5 GB.
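A simplified sketch of that per-sentence routing (the Lingua calls are the real lingua-py API; the Piper loading helper and the sample sentences are illustrative stand-ins):

```python
# Sketch: detect each sentence's language with Lingua, then lazily
# download/load and cache a Piper voice for that language on first use.
from lingua import Language, LanguageDetectorBuilder

LANGUAGES = [Language.ENGLISH, Language.GERMAN, Language.FRENCH,
             Language.ITALIAN, Language.PORTUGUESE, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*LANGUAGES).build()

tts_models = {}  # cache: Language -> loaded Piper voice

def load_piper_voice(lang):
    # Hypothetical stand-in: the real code downloads the Piper voice
    # for `lang` here -- the slow first-use step mentioned above.
    return f"piper-voice-for-{lang.name.lower()}"

def voice_for(sentence):
    lang = detector.detect_language_of(sentence) or Language.ENGLISH
    if lang not in tts_models:               # first use of this language
        tts_models[lang] = load_piper_voice(lang)
    return tts_models[lang]

# LLM output is chunked into sentences before synthesis:
for sentence in ["How are you today?", "Mir geht es gut, danke."]:
    print(voice_for(sentence), "->", sentence)
```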
- Cancel audio playback when saying the wake word 'Hey Jarvis'.
- Properly close the server with CTRL + C.
- Make a proper setup.sh for both the client and server.
- Clear LLM chat history by saying some variation of 'Hey Jarvis, clear chat history'.
- Clean up code.