Gradium's on-device, CPU-only text-to-speech for private voice AI - Voice AI Space Barcelona
Timothé Duval & Constance Grisoni - Gradium
The Gradium team presents their work as a voice foundation model lab spun off from the Kyutai research lab, which pioneered early live speech-to-speech translation. They frame good voice AI around two pillars. The first is quality: natural, human-sounding flow, expressivity and emotion control, and robustness on hard cases like email addresses, phone numbers, and URLs. The second is scalability, which depends on inference speed, predictable cost economics, and privacy. They note they still run a cascaded pipeline (ASR, then LLM or translation, then TTS) rather than full duplex, since production speech-to-speech remains hard.

Privacy is their central focus, motivating a push toward local inference. They introduce Gradium Phonon, their first on-device text-to-speech model, which runs entirely on CPU across smartphones, tablets, Macs, and laptops, requires no server, and works even in airplane mode. Phonon supports voice cloning from 10 seconds of audio and covers five languages (English, French, German, Spanish, Portuguese), with more languages and custom models coming. It is lightweight: roughly 100 million parameters, a 100–200MB app footprint, and under 500MB of memory. The team says it improves sharply over the prior version and beats Kokoro/Google even in English.

Constance runs live demos on a budget MacBook Air showing about 30–35ms latency (using Duolingo's "Lily" voice), a slower ~210–250ms on a Raspberry Pi, multilingual generation, and an interactive on-device game with cloned character voices. They explain that the main use cases are mobile games and consumer apps such as language-learning apps, where per-user API costs are prohibitive. They acknowledge that local ASR isn't good enough yet but is being worked on, and tease on-device live translation from CEO Neil Zeghidour. The session closes with their licensing model (a flat per-user-per-month fee for unlimited usage, often mixed with API access for heavier users), plus Q&A on device latency and pricing.
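As a rough illustration of the cascaded pipeline mentioned above (ASR, then LLM or translation, then TTS), here is a minimal Python sketch. It is not Gradium's implementation or API: the `CascadedPipeline` class and the three stub functions are hypothetical placeholders that only show how the stages chain, with text as the interface between them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedPipeline:
    """Hypothetical three-stage voice pipeline: audio -> text -> text -> audio."""
    asr: Callable[[bytes], str]      # speech recognition stage
    respond: Callable[[str], str]    # LLM or translation stage
    tts: Callable[[str], bytes]      # speech synthesis stage

    def run(self, audio_in: bytes) -> bytes:
        text_in = self.asr(audio_in)        # 1. transcribe the input audio
        text_out = self.respond(text_in)    # 2. produce the reply or translation
        return self.tts(text_out)           # 3. synthesize the output audio


# Stand-in stages (assumptions for illustration only): "audio" here is just
# UTF-8 bytes, so the example runs without any model.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(asr=fake_asr, respond=fake_respond, tts=fake_tts)
result = pipeline.run(b"hello")
```

Because each stage is an independent callable, any one of them (for example the TTS stage) could in principle run on-device while the others stay on a server; a full-duplex system would not have these clean text boundaries.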
Recorded during a Voice AI Space event. See past and upcoming events at https://events.voiceaispace.com
Video "Gradium's on-device, CPU-only text-to-speech for private voice AI - Voice AI Space Barcelona" from the Voice AI Space channel
Video information
June 15, 2026, 14:54:44
00:17:28