Back in May (2023), Meta released their Massively Multilingual Speech (MMS) model [github]. The model is significant because it covers 1,107 languages, including a vast number of low-resource languages, and it reportedly achieves lower transcription error rates than OpenAI’s Whisper model.

For text-to-speech, the model actually reuses VITS, another model in Meta’s fairseq toolkit, which was released back in June of 2021. VITS came with some pretty impressive TTS demos, so I wanted to see how well they worked in my own hands and to compare the new MMS model against the older VITS one.

Unfortunately the MMS TTS model doesn’t generate Chinese or Japanese. The only other language it supports that I’m familiar with is French. Here’s a short story in French from ChatGPT:

Un garçon nommé Thomas découvre une boîte mystérieuse dans son jardin. En l’ouvrant, il trouve une lettre qui révèle qu’il est en réalité un robot créé par ses parents ingénieurs. Au début choqué, Thomas accepte finalement sa véritable identité et décide de vivre sa vie en embrassant sa nature de robot. Il devient un modèle pour les autres en prouvant que l’acceptation de soi est essentielle, peu importe notre apparence ou notre origine.

(In English: A boy named Thomas discovers a mysterious box in his garden. Opening it, he finds a letter revealing that he is actually a robot built by his engineer parents. Shocked at first, Thomas eventually accepts his true identity and decides to live his life embracing his robot nature. He becomes a role model for others, proving that self-acceptance is essential, no matter our appearance or our origins.)

And here’s the audio the French model generated:

The largest problem, to my ear, is that it ignores commas and otherwise has somewhat unnatural prosody. The pronunciation seems decent and rather human-like. It also does a nice job eliding words together, which is very typical of French.

For the purposes of testing TTS in English, I asked ChatGPT for a joke:

A robot walks into a bar and sits down. The bartender says, “We don’t serve robots here.” The robot replies, “Oh, don’t worry, I’m not here to drink. I just came to charge my batteries.”

Here’s the MMS audio using the single-voice model, which I generated using this Colab notebook, which I adapted from their original notebook:

The prosody doesn’t seem very natural. Let’s compare it to the original VITS model:

I think the VITS model is actually quite a bit better. Perhaps making the model massively multilingual costs it some fidelity in English.

We can also generate the same joke using the different speakers from the VITS multi-speaker model. Based on the model config file, it looks like there may be 109 speakers. Here are the first 20:

I generated all of these in parallel, so the shorter clips initially had padding at the end that sounds like a buzz, which I manually trimmed.
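The buzz comes from batching: clips generated in parallel are padded to the length of the longest one, and the padding region decodes to noise. Trimming each clip back to its true length removes it. A minimal numpy sketch, assuming the per-clip sample counts are available from the model's output lengths (the `trim_padded` helper is hypothetical, not part of any library):

```python
import numpy as np

def trim_padded(batch, lengths):
    """Cut each padded waveform in a batch back to its true sample count.

    batch: (n_clips, max_samples) array of generated audio, padded at the end
    lengths: true number of samples for each clip
    """
    return [row[:n] for row, n in zip(batch, lengths)]

# Two clips padded to the same length; the second has 2 samples of "buzz" padding.
batch = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.0, 0.0]])
clips = trim_padded(batch, [4, 2])
```

Here `clips[1]` keeps only its first two samples, dropping the padded tail that would otherwise play as a buzz.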

To me, most of them sound as if they have a bit of an Indian accent. I’m not sure why.

Here are the next 20 voices:

I was curious whether there was something about the text I chose that resulted in worse-quality TTS. So here I use some new text:

I tried generating some more speech with this text using voice 77, which sounded pretty good in their voice conversion demo.

Natural selection is a fundamental concept in evolutionary biology proposed by Charles Darwin. It refers to the process by which certain heritable traits become more or less common in a population over successive generations due to their impact on reproductive success.

And here’s the robot joke with the same voice (77):

It’s strange, because the natural selection passage is enunciated quite a bit more clearly than the robot joke.

Here’s another example of voice 77:

The three-body problem is a famous problem in classical mechanics that deals with predicting the motion of three celestial bodies under their mutual gravitational influence. It refers to the challenge of determining the exact positions and velocities of three point masses, such as stars or planets, at any given time, based on their initial conditions and the laws of motion and gravity.

Here’s another example:

Curds are the solid part of cheese formed when milk coagulates, while whey is the liquid left after the curds are removed.

How does it do with a very short example?

A penny saved is a penny earned.