A fully offline, privacy-focused voice assistant that runs locally on your machine. It uses Whisper for Speech-to-Text, Gemma (via llama.cpp) for intelligence, and Coqui XTTS for high-quality, clonable Text-to-Speech.
- Offline Reliability: No internet connection required after initial model downloads.
- Voice Cloning: Clone any voice using just a few seconds of audio samples.
- Low Latency: Optimized for reasonably fast CPU inference (GPU recommended for TTS).
- Privacy: No audio or text leaves your device.
- Python 3.9+
- Git
- Basic knowledge of terminal/command prompt.
pip install -r requirements.txtNote: You may need to install PyTorch separately depending on your hardware (CUDA/CPU).
- Clone
whisper.cppor download the prebuilt binary for your OS. - Place the
main.exe(Windows) ormain(Linux/Mac) insideasr/whisper.cpp/. - Download a model (e.g.,
ggml-base.en.bin) and place it inasr/whisper.cpp/models/.
- Clone
llama.cppor download the prebuilt binary. - Place
main.exeinsidellm/llama.cpp/. - Download the Gemma GGUF model (e.g.,
gemma-2b-it.gguf). - Place the model in
llm/llama.cpp/models/.
The TTS python library will automatically download the XTTS-v2 model on first run.
- Record 3-5 audio samples (wav format, approx 5-10 seconds each) of the voice you want to clone.
- Place them in the
client_voice/directory.
Edit config.yaml to match your exact paths if they differ from the defaults.
Run the assistant:
python assistant.py- Wait for initialization.
- When it says "Listening...", speak into your microphone.
- Stop speaking to trigger processing.
- The assistant will reply in the cloned voice.
- "Whisper binary not found": Ensure
asr/whisper.cpp/main.exeexists. - "Llama binary not found": Ensure
llm/llama.cpp/main.exeexists. - Slow TTS: XTTS is heavy. A GPU is highly recommended. For CPU, expect significant delay.
assistant.py: Main entry point.modules/: Wrappers for binary interactions.utils/: Audio and Text helpers.config.yaml: System configuration.