Gradium's on-device, CPU-only text-to-speech for private voice AI - Voice AI Space Barcelona

Timothé Duval & Constance Grisoni - Gradium

The Gradium team presents the company as a voice foundation model lab spun off from the Kyutai research lab, which pioneered early live speech-to-speech translation. They frame good voice AI around two pillars. The first is quality: natural, human-sounding flow, expressivity and emotion control, and robustness on hard cases such as email addresses, phone numbers, and URLs. The second is scalability, which depends on inference speed, predictable cost economics, and privacy. They note that they still run a cascaded pipeline (ASR, then LLM or translation, then TTS) rather than full duplex, since production speech-to-speech remains hard.

Privacy is their central focus and motivates a push toward local inference. They introduce Gradium Phonon, their first on-device text-to-speech model, which runs entirely on CPU across smartphones, tablets, Macs, and laptops, requires no server, and works even in airplane mode. Phonon supports voice cloning from a 10-second sample and covers five languages (English, French, German, Spanish, Portuguese), with more languages and custom models coming. It is lightweight: roughly 100 million parameters, a 100–200MB app footprint, and under 500MB of memory. The team reports a sharp improvement over the prior version and says it beats Kokoro and Google, even in English.

Constance runs live demos on a budget MacBook Air showing about 30–35ms latency (using Duolingo's "Lily" voice), a slower ~210–250ms on a Raspberry Pi, multilingual generation, and an interactive on-device game with cloned character voices. The main use cases they describe are mobile games and consumer apps such as language-learning apps, where per-user API costs are prohibitive. They acknowledge that local ASR is not good enough yet but is being worked on, and tease on-device live translation from CEO Neil Zeghidour.

The session closes with their licensing model (a flat per-user-per-month fee for unlimited usage, often mixed with API access for heavier users), followed by Q&A on device latency and pricing.
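As a rough sanity check on the size figures quoted in the summary (about 100 million parameters and a 100–200MB app footprint), the sketch below estimates the weight size at a few storage precisions. The precisions are assumptions chosen for illustration; the summary does not say how Phonon's weights are stored, and the helper name `weights_size_mb` is hypothetical.

```python
def weights_size_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate size of the model weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


# "Roughly 100 million parameters", as quoted in the talk summary.
PARAMS = 100_000_000

# Assumed storage precisions; the summary does not state which one is used.
for label, nbytes in [("8-bit", 1), ("16-bit", 2), ("32-bit", 4)]:
    print(f"{label}: {weights_size_mb(PARAMS, nbytes):.0f} MB")
```

Under these assumptions, 8-bit or 16-bit weights alone would fall within the quoted 100–200MB range, while 32-bit weights would not. An app footprint can also include code and other assets, so this is only an order-of-magnitude check.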

Recorded during a Voice AI Space event. See past and upcoming events at https://events.voiceaispace.com

