Athena Demo v3
An updated demo of a 100% local, privacy-first conversational LLM chatbot with speech-to-text and text-to-speech interfaces. The design goal was maximum conversational realism and engagement in a laptop form factor, with nothing leaving the machine. Let me know what you think in the comments! It's about 95% glitch-free, and I'm working on fixing the remaining 5%.
In this demo, the STT, TTS, and non-expert LLM layers run on a *single* RTX 5000 Pro Mobile GPU with 24 GB VRAM total. The MoE experts of Qwen3.5-397B-A17B are offloaded to the P-Cores and E-Cores of the CPU (16 cores total on an Intel Core Ultra 9 285HX) and system RAM (DDR5-4000). There are no Python dependencies. All components are built in C++ for speed.
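The GPU/CPU split described above can be sketched as a routing rule over tensor names. This is a minimal illustration, not the actual mechanism used in the demo: `place_tensor` and `Device` are hypothetical names, and the pattern assumes expert tensors are named in the style llama.cpp GGUF files commonly use (for example `blk.0.ffn_up_exps.weight`).

```cpp
#include <regex>
#include <string>

// Hypothetical device tag for illustration only.
enum class Device { GPU, CPU };

// Route routed-expert (MoE) tensors to the CPU and everything else
// (attention, norms, shared layers) to the GPU.
// Assumption: expert tensors carry "ffn_gate_exps", "ffn_up_exps" or
// "ffn_down_exps" in their names.
Device place_tensor(const std::string& name) {
    static const std::regex expert_pattern(R"(\.ffn_(gate|up|down)_exps\.)");
    return std::regex_search(name, expert_pattern) ? Device::CPU : Device::GPU;
}
```

The point of this split is that only a small subset of experts is active per token, so the bulky expert weights can sit in system RAM while the layers used on every token stay in VRAM.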
I've recorded a live packet capture in the bottom right of the screen to show that no traffic is moving in and out of the laptop.
*** Since the last demo, I've upgraded the LLM to Qwen3.5-397B-A17B UD-Q3_K_XL. I've also added SSE streaming for the output, reducing conversational latency to near real-time. Lastly, I've added an interruptibility feature: I can naturally interject in the conversation, and Athena remembers what she said before the interruption and can pick the conversation back up after responding. I also made some additional tweaks to the system prompt for even greater conversational realism. ***
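To show why SSE streaming cuts latency, here is a minimal sketch of an incremental Server-Sent Events parser: it hands over each `data:` payload as soon as its event is complete, so downstream speech generation need not wait for the full reply. `SseParser` is a hypothetical helper written for this illustration and is not taken from the Athena code; it handles only the `data:` field and ignores `event:`, `id:`, `retry:`, and comment lines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Incremental parser for a Server-Sent Events (SSE) byte stream.
// Hypothetical helper for illustration only.
class SseParser {
public:
    // Feed bytes as they arrive from the network. Returns the data
    // payloads of every event completed by this chunk.
    std::vector<std::string> feed(const std::string& chunk) {
        std::vector<std::string> events;
        buffer_ += chunk;
        std::size_t pos;
        while ((pos = buffer_.find('\n')) != std::string::npos) {
            std::string line = buffer_.substr(0, pos);
            buffer_.erase(0, pos + 1);
            if (!line.empty() && line.back() == '\r') line.pop_back();
            if (line.empty()) {
                // A blank line terminates the current event.
                if (has_data_) events.push_back(data_);
                data_.clear();
                has_data_ = false;
            } else if (line.compare(0, 5, "data:") == 0) {
                std::string value = line.substr(5);
                // One leading space after the colon is not part of the value.
                if (!value.empty() && value.front() == ' ') value.erase(0, 1);
                if (has_data_) data_ += '\n';  // multiple data lines join with \n
                data_ += value;
                has_data_ = true;
            }
            // Other fields and ":" comment lines are ignored in this sketch.
        }
        return events;
    }

private:
    std::string buffer_;  // bytes received but not yet split into lines
    std::string data_;    // payload of the event being assembled
    bool has_data_ = false;
};
```

Because the parser keeps partial lines in its buffer, it can be fed network reads of any size; a payload split across two reads is emitted once its terminating blank line arrives.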
Components include:
1) Qwen3.5-397B-A17B UD-Q3_K_XL (GGUF) - LLM running on a (very) customized talk-llama.cpp example from GGML.org's whisper.cpp. Customizations include the ability to set KV cache quantization levels, as well as additional Qwen3.5 generation parameters (repeat-penalty, presence-penalty) to optimize text generation. Context is the maximum-supported 131,072 tokens - enough for a few hours of conversation.
2) Whisper-small (GGUF) model for accurate STT, running on talk-llama.cpp.
3) Orpheus-3B-ft UD-Q4_K_XL (GGUF) - A leading local text-to-speech model with the popular "Tara" voice, running on llama-server from GGML.org's llama.cpp. Includes the capability to generate emotive tags such as laugh, chuckle, and sigh.
4) Custom-written "orpheus-speak" C++ app to rapidly convert the speech tokens generated by the Orpheus TTS to audio, using an optimized snac24_dynamic_fp16 (community-sourced) decoder running on ONNX Runtime. The decoder stays warm between utterances, and WAV audio data is written directly to and played from RAM in 1-3 sentence chunks, allowing for accurate and (relatively) rapid audio generation across long text blocks.
5) An *extensively* A/B tested system prompt allowing for natural-sounding, engaging conversations, compiled into talk-llama.cpp.
6) A launcher shell script optimizing context and generation parameters across all neural nets (LLM, STT, TTS, decoder) running on the GPU.
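The 1-3 sentence chunking mentioned in item 4 could look roughly like the sketch below. This is an assumption about one way to do it, not the code in orpheus-speak: `chunk_sentences` is a hypothetical function, and it uses a simple rule (terminal punctuation followed by whitespace or end of text) that keeps tokens like "Qwen3.5" intact but does not handle abbreviations.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split text into chunks of at most max_sentences sentences each.
// Hypothetical helper for illustration only.
std::vector<std::string> chunk_sentences(const std::string& text,
                                         std::size_t max_sentences = 3) {
    std::vector<std::string> chunks;
    if (max_sentences == 0) max_sentences = 1;
    auto is_end = [](char c) { return c == '.' || c == '!' || c == '?'; };
    auto is_space = [](char c) { return c == ' ' || c == '\n'; };

    std::string current;
    std::size_t count = 0;
    std::size_t i = 0;
    const std::size_t n = text.size();
    while (i < n) {
        const char c = text[i++];
        current += c;
        if (!is_end(c)) continue;
        // Absorb runs of terminal punctuation such as "?!" or "...".
        while (i < n && is_end(text[i])) current += text[i++];
        // A sentence ends only before whitespace or at the end of the text.
        if (i < n && !is_space(text[i])) continue;
        ++count;
        while (i < n && is_space(text[i])) ++i;  // skip inter-sentence whitespace
        if (count == max_sentences) {
            chunks.push_back(current);
            current.clear();
            count = 0;
        } else if (i < n) {
            current += ' ';  // single space between sentences in a chunk
        }
    }
    if (!current.empty()) chunks.push_back(current);  // trailing partial chunk
    return chunks;
}
```

Each returned chunk can then be sent to the TTS model on its own, so audio for the first sentences can start playing while later chunks are still being generated.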
Video "Athena Demo v3" from the channel Igor Barshteyn
Video information
June 14, 2026, 8:33:18
00:19:05
