Athena Demo v3

An updated demo of a 100% local, privacy-first conversational LLM chatbot with speech-to-text and text-to-speech interfaces. The design goal was maximum conversational realism and engagement in a fully local laptop form factor. Let me know what you think in the comments! It's about 95% glitch-free, and I'm working on fixing the remaining 5%.

In this demo, the STT, TTS, and non-expert LLM layers run on a *single* RTX 5000 Pro Mobile GPU with 24 GB VRAM total. The MoE experts of Qwen3.5-397B-A17B are offloaded to the P-Cores and E-Cores of the CPU (16 cores total on an Intel Core Ultra 9 285HX) and system RAM (DDR5-4000). There are no Python dependencies. All components are built in C++ for speed.
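Since the experts live in system RAM, decode speed is likely bounded by how fast the active expert weights can be streamed from DDR5. The following is a rough back-of-the-envelope sketch of that ceiling, not a measurement. The channel count, the bits per weight, and the share of the 17B active parameters that sits in RAM are all assumptions that the description above does not state.

```cpp
// Peak theoretical DRAM bandwidth in bytes per second.
// mega_transfers_per_s: 4000 for DDR5-4000.
// channels: number of 64-bit memory channels (assumed, not stated above).
double peak_dram_bandwidth(double mega_transfers_per_s, int channels) {
    const double bytes_per_transfer = 8.0;  // a 64-bit channel moves 8 bytes per transfer
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels;
}

// Upper bound on tokens per second if every generated token has to stream
// `params_read_from_ram` weights, stored at `bits_per_weight`, from system RAM.
// Both inputs are hypothetical values supplied by the caller.
double max_tokens_per_second(double bandwidth_bytes_per_s,
                             double params_read_from_ram,
                             double bits_per_weight) {
    const double bytes_per_token = params_read_from_ram * bits_per_weight / 8.0;
    return bandwidth_bytes_per_s / bytes_per_token;
}
```

For example, with two channels the peak is 64 GB/s. If all 17B active parameters were read from RAM at a hypothetical 3.5 bits per weight, the ceiling would be roughly 8.6 tokens per second. Keeping the non-expert layers on the GPU reduces the bytes read from RAM per token, so the real ceiling depends on that split.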

I've recorded a live packet capture in the bottom right of the screen to show that no traffic is moving in or out of the laptop.

*** Since the last demo, I've upgraded the LLM to Qwen3.5-397B-A17B UD-Q3_K_XL. I've also added SSE streaming for the output, reducing conversational latency to near real-time. Lastly, I've added an interruptibility feature: I can naturally interject in the conversation, and Athena remembers what she said before the interruption and can pick the conversation back up after responding. Some additional tweaks to intelligence were also made in the system prompt for even greater conversational realism. ***
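To illustrate what consuming an SSE stream involves, here is a minimal sketch of an incremental parser for the text/event-stream format. It is not the code used in Athena; the class name `SseParser` and its `feed` method are hypothetical, and only the `data` field is handled. Network reads can split an event anywhere, so the parser buffers partial lines and only emits an event once the terminating blank line arrives.

```cpp
#include <string>
#include <vector>

// Incremental parser for a Server-Sent Events (text/event-stream) body.
// Feed it raw chunks as they arrive; it returns the data payload of every
// event completed so far. Hypothetical helper, "data" field only.
class SseParser {
public:
    std::vector<std::string> feed(const std::string& chunk) {
        std::vector<std::string> events;
        buffer_ += chunk;
        std::size_t pos;
        while ((pos = buffer_.find('\n')) != std::string::npos) {
            std::string line = buffer_.substr(0, pos);
            buffer_.erase(0, pos + 1);
            if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF

            if (line.empty()) {
                // A blank line ends the current event: dispatch it if it has data.
                if (has_data_) events.push_back(data_);
                data_.clear();
                has_data_ = false;
            } else if (line.rfind("data:", 0) == 0) {
                std::string value = line.substr(5);
                if (!value.empty() && value.front() == ' ') value.erase(0, 1);
                if (has_data_) data_ += '\n';  // multiple data lines are joined with newlines
                data_ += value;
                has_data_ = true;
            }
            // Comment lines (starting with ':') and other fields are ignored here.
        }
        return events;
    }

private:
    std::string buffer_;     // bytes of a line that has not been terminated yet
    std::string data_;       // data accumulated for the event in progress
    bool has_data_ = false;  // whether the event in progress has any data line
};
```

Each payload returned by `feed` can be handed to the next stage as soon as it arrives, instead of waiting for the full response. That is where the latency reduction of streaming comes from.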

Components include:

1) Qwen3.5-397B-A17B UD-Q3_K_XL (GGUF) - LLM running on a (very) customized talk-llama.cpp example from GGML.org's whisper.cpp. Customizations include the ability to set KV cache quantization levels, as well as additional Qwen3.5 generation parameters (repeat-penalty, presence-penalty) to optimize text generation. Context is the maximum-supported 131,072 tokens - enough for a few hours of conversation.
2) Whisper-small (GGUF) model for accurate STT, running on talk-llama.cpp.
3) Orpheus-3B-ft UD-Q4_K_XL (GGUF) - A leading local text-to-speech model with the popular "Tara" voice, running on llama-server from GGML.org's llama.cpp. Includes the capability to generate emotive tags, e.g. laugh, chuckle, sigh, etc.
4) Custom-written "orpheus-speak" C++ app to rapidly convert the speech tokens generated by the Orpheus TTS to audio using an optimized snac24_dynamic_fp16 (community-sourced) decoder running on ONNX Runtime. The decoder stays warm between utterances, and audio WAV data is written directly to and played from RAM in 1-3 sentence chunks, allowing for accurate and (relatively) rapid audio generation across long text blocks.
5) An *extensively* A/B tested system prompt allowing for natural-sounding, engaging conversations, compiled into talk-llama.cpp.
6) A launcher shell script optimizing context and generation parameters across all neural nets (LLM, STT, TTS, decoder) running on the GPU.
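The 1-3 sentence chunking mentioned in component 4 can be sketched roughly as follows. This is an illustrative assumption about how such chunking might work, not the actual orpheus-speak code; the function names `split_sentences` and `chunk_for_tts` are hypothetical. The splitter is deliberately naive: it breaks at '.', '!' or '?' followed by whitespace, so abbreviations such as "e.g." would be split too.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into sentences at '.', '!' or '?' followed by whitespace or end of text.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        // Skip whitespace between sentences.
        if (current.empty() && std::isspace(static_cast<unsigned char>(c))) continue;
        current += c;
        const bool terminator = (c == '.' || c == '!' || c == '?');
        const bool at_boundary = (i + 1 == text.size()) ||
                                 std::isspace(static_cast<unsigned char>(text[i + 1]));
        if (terminator && at_boundary) {
            sentences.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) sentences.push_back(current);  // trailing text without a terminator
    return sentences;
}

// Group sentences into chunks of at most `max_sentences` each, so that audio
// for the first chunk can start playing while later chunks are still generated.
std::vector<std::string> chunk_for_tts(const std::string& text, std::size_t max_sentences) {
    std::vector<std::string> chunks;
    if (max_sentences == 0) return chunks;
    const std::vector<std::string> sentences = split_sentences(text);
    for (std::size_t i = 0; i < sentences.size(); i += max_sentences) {
        std::string chunk;
        for (std::size_t j = i; j < sentences.size() && j < i + max_sentences; ++j) {
            if (!chunk.empty()) chunk += ' ';
            chunk += sentences[j];
        }
        chunks.push_back(chunk);
    }
    return chunks;
}
```

Calling `chunk_for_tts("Hi there. How are you? I am fine! Bye", 2)` yields two chunks: "Hi there. How are you?" and "I am fine! Bye". Smaller chunks lower the time to first audio, while larger ones give the TTS model more context for prosody.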

Video "Athena Demo v3" from the channel Igor Barshteyn


Video information

June 14, 2026, 8:33:18

00:19:05

Igor Barshteyn
