Carry on a spoken conversation with large language models! This project uses Whisper speech-to-text to transcribe the user's voice, sends the transcript to the LLM, and pipes the result to Piper text-to-speech. A client device captures audio and sends it over a socket to the server, which does all of the processing; the final TTS output is then sent back to the client device.
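As a rough illustration of that client-server flow, here is a minimal sketch (the address, framing, and audio format are assumptions for illustration; the real protocol is defined by the client and server code):

```python
# Illustrative client-side sketch of the socket flow (the real framing
# and audio format are defined by the project, not by this code).
import socket

SERVER = ("192.168.1.10", 5432)  # matches --ip-address / --port on the server

with socket.create_connection(SERVER) as sock:
    # Stream captured microphone audio to the server...
    with open("recording.raw", "rb") as mic:   # stand-in for a live mic
        while chunk := mic.read(4096):
            sock.sendall(chunk)
    sock.shutdown(socket.SHUT_WR)              # tell the server we're done

    # ...then collect whatever TTS audio the server sends back.
    with open("reply.raw", "wb") as out:
        while chunk := sock.recv(4096):
            out.write(chunk)
```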
This program only works on Linux, specifically Ubuntu/Debian and Arch systems.
Make sure you have an OpenAI-API-compatible server, such as Ollama, running for the LLM. Then pass the model to start.sh with --llm-model (this is shown in the configuring section). Run the setup.sh script on both the client and the server. Next, run start.sh on the server first, followed by the client.
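For example, assuming Ollama as the backend and the scripts in the repository root (the model tag below is just the one used later in the configuring section):

```bash
# Pull the model so Ollama can serve it (any OpenAI-API-compatible
# backend works; this tag is the example from the configuring section)
ollama pull hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q4_K_M

# Run on both the client and the server:
./setup.sh

# Start the server first, then the client on the client device:
./start.sh
```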
- 100% offline, open source, and private
- Wake word detection: 'Hey Jarvis'
- Hands-free interaction
- Client-server model
- Fully multilingual pipeline
- Streamed responses
On both the client and the server, configuration is done by opening 'start.sh' and passing the corresponding flags to 'main.py', like so:
python3 main.py --stt-cuda --stt-model large-v3-turbo --llm-model hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q4_K_M --ip-address 0.0.0.0 --port 5432
Piper models can be changed in src/llm-voice-assistant-server/piper-models/piper-models.json. Only one model can be enabled per language. Make sure all models for your language are disabled by setting their enabled and auto_start fields to false, then set those fields to true on a different model.
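As a rough sketch (the voice names and surrounding structure are illustrative guesses; only the enabled and auto_start fields are taken from the file itself), switching English voices might look like:

```json
{
  "en_US-lessac-medium": { "enabled": false, "auto_start": false },
  "en_GB-alan-medium":   { "enabled": true,  "auto_start": true }
}
```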
The language has to be supported by the STT (Whisper languages), the LLM (Llama 3.1 languages), and the TTS (Piper languages). This means that by default the application supports the following languages: English, German, French, Italian, Portuguese, and Spanish. However, more languages can easily be added by switching to a large language model with broader coverage, such as Mistral Small.
The output text generated by the LLM is chunked into sentences and run against Lingua to detect the language. A TTS model for that language is then downloaded and cached for future use. Unfortunately, this means that the first time a language is used, generating a response back to the user is extremely slow. This is done to save on memory, as loading all of the TTS models can take ~2.5 GB.
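A simplified sketch of that per-sentence routing (the Lingua calls are the real lingua-py API; the Piper loading helper and the sample sentences are illustrative stand-ins):

```python
# Sketch: detect each sentence's language with Lingua, then lazily
# download/load and cache a Piper voice for that language on first use.
from lingua import Language, LanguageDetectorBuilder

LANGUAGES = [Language.ENGLISH, Language.GERMAN, Language.FRENCH,
             Language.ITALIAN, Language.PORTUGUESE, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*LANGUAGES).build()

tts_models = {}  # cache: Language -> loaded Piper voice

def load_piper_voice(lang):
    # Hypothetical stand-in: the real code downloads the Piper voice
    # for `lang` here -- the slow first-use step mentioned above.
    return f"piper-voice-for-{lang.name.lower()}"

def voice_for(sentence):
    lang = detector.detect_language_of(sentence) or Language.ENGLISH
    if lang not in tts_models:               # first use of this language
        tts_models[lang] = load_piper_voice(lang)
    return tts_models[lang]

# LLM output is chunked into sentences before synthesis:
for sentence in ["How are you today?", "Mir geht es gut, danke."]:
    print(voice_for(sentence), "->", sentence)
```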
- Cancel audio playback when saying the wake word 'Hey Jarvis'.
- Properly close the server with CTRL + C.
- Make a proper setup.sh for both the client and server.
- Clear LLM chat history by saying some variation of 'Hey Jarvis, clear chat history'.
- Clean up code.