Unmute.sh, the most modular voice AI around

Speak to an AI using Kyutai's real-time speech-to-text and text-to-speech.

Check out the demo at unmute.sh and read more about it on the Kyutai site.

This was my first project at Kyutai. I joined in March 2025, and at that point they/we had speech-to-text and text-to-speech models nearly ready. What I had to figure out was how to showcase the models with a demo.

The original idea that Kyutai had was to create something like an interactive NotebookLM. We also had some wackier ideas, like an AI dating show. In the end, we settled on simplifying the original idea by removing the RAG component and just having a plain AI voice chat.

Of course, people are now used to realistic-sounding voice chat. What we wanted to emphasize is that Unmute is a cascaded system: the LLM is separate from the speech models (unlike, say, Sesame, which seems to fuse the LLM and the TTS), so you can swap out the voice or the LLM easily without having to retrain anything.
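To make the modularity concrete, here is a minimal sketch of what a cascaded turn looks like. This is an illustration, not the real Unmute code: the function names and interfaces are made up, and a production system would stream audio rather than pass complete buffers. The point is that each stage sits behind a plain interface, so swapping the LLM or the voice means changing one argument.

```python
# Hypothetical sketch of one turn in a cascaded voice-chat pipeline
# (not Kyutai's actual implementation).
from typing import Callable

def voice_chat_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],       # speech-to-text: audio -> transcript
    llm: Callable[[str], str],         # any text-in/text-out model
    tts: Callable[[str, str], bytes],  # text + voice name -> audio
    voice: str = "default",
) -> bytes:
    """Transcribe the user, generate a reply, synthesize it."""
    user_text = stt(audio_in)
    reply_text = llm(user_text)
    return tts(reply_text, voice)

# Toy stand-ins that show the plumbing; each is independently replaceable.
fake_stt = lambda audio: audio.decode()
fake_llm = lambda text: f"You said: {text}"
fake_tts = lambda text, voice: f"[{voice}] {text}".encode()

out = voice_chat_turn(b"hello", fake_stt, fake_llm, fake_tts, voice="alice")
```

Because no stage is trained jointly with another, replacing `fake_llm` with a different model, or passing a different `voice`, requires no retraining of the speech components.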

After we released the site in May, it took us a bit over a month to wrap up the project and open-source all the models.

The TTS model gets its voice from a 10-second sample: a small voice encoder turns the sample into an embedding, which conditions the TTS via cross-attention. That means we have the option of not releasing the voice encoder, to prevent people from using arbitrary voices for impersonation, or at least to make it more difficult. We discussed at length whether to release voice cloning and in the end decided to err on the safe side and keep it to ourselves.
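The cross-attention conditioning described above can be sketched in a few lines. Everything here is an assumption for illustration: the single-head form, the random projections, and all dimensions are made up, not Kyutai's architecture. The idea it demonstrates is that the TTS hidden states act as queries attending over the voice encoder's output frames, so speaker identity flows in through attention rather than being baked into the TTS weights.

```python
# Illustrative single-head cross-attention for voice conditioning
# (shapes and projections are invented for the example).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(tts_hidden, voice_frames, d_k=16, seed=0):
    """Queries come from the TTS; keys/values come from the voice encoder."""
    rng = np.random.default_rng(seed)
    T, D = tts_hidden.shape        # TTS timesteps x model dim
    S, E = voice_frames.shape      # voice-sample frames x encoder dim
    Wq = rng.normal(size=(D, d_k))
    Wk = rng.normal(size=(E, d_k))
    Wv = rng.normal(size=(E, D))
    q = tts_hidden @ Wq
    k = voice_frames @ Wk
    v = voice_frames @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (T, S): weights over voice frames
    return tts_hidden + attn @ v            # residual add of voice information

# Made-up sizes: 8 TTS steps of dim 32, a 10 s sample as 50 frames of dim 24.
out = cross_attend(np.zeros((8, 32)), np.ones((50, 24)))
```

Withholding the voice encoder (the component producing `voice_frames`) breaks this path for arbitrary new voices while leaving the released TTS fully usable with precomputed voice embeddings.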

I also set up a voice donation project for people who wanted more voices: you could submit a sample of your own voice, and it would be released anonymously for use with the TTS. As of December 2025, we had collected 174 voices this way. Thank you to everyone who contributed!