Speech synthesis algorithms used in Plaits

There are actually three speech synthesis algorithms in Plaits:

NaiveSpeechSynth

A pulse train (at f0) into a mild band-pass filter, then into a bank of band-pass filters tuned to formant frequencies. There’s no specific source for this, it’s just the kind of “recipe” for vowel synthesis one would expect to patch on a Moog modular system in the 70s :smiley: The band-pass filters are not steep enough to make any intelligible sound, so it remains vaguely vowel-ly…
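
A minimal sketch of that recipe in C++ (not the actual Plaits code: the formant frequencies and Q values are purely illustrative, the pulse train is not band-limited, and the intermediate mild band-pass stage is omitted):

```cpp
#include <cmath>
#include <vector>

constexpr float kPi = 3.14159265358979f;

// Standard biquad band-pass (RBJ cookbook style), used as a crude formant filter.
struct BandPass {
  float b0, b1, b2, a1, a2;
  float x1 = 0.f, x2 = 0.f, y1 = 0.f, y2 = 0.f;
  void Init(float frequency, float q, float sample_rate) {
    const float w0 = 2.f * kPi * frequency / sample_rate;
    const float alpha = std::sin(w0) / (2.f * q);
    const float a0 = 1.f + alpha;
    b0 = alpha / a0;
    b1 = 0.f;
    b2 = -alpha / a0;
    a1 = -2.f * std::cos(w0) / a0;
    a2 = (1.f - alpha) / a0;
  }
  float Process(float x) {
    const float y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
    x2 = x1; x1 = x;
    y2 = y1; y1 = y;
    return y;
  }
};

int main() {
  const float kSampleRate = 48000.f;
  const float f0 = 110.f;
  // Made-up formant frequencies for an "ah"-ish vowel.
  const float formant_hz[3] = { 700.f, 1220.f, 2600.f };
  BandPass formants[3];
  for (int i = 0; i < 3; ++i) {
    formants[i].Init(formant_hz[i], 8.f, kSampleRate);
  }

  float phase = 0.f;
  std::vector<float> out(4800);
  for (float& sample : out) {
    phase += f0 / kSampleRate;
    if (phase >= 1.f) phase -= 1.f;
    // Naive pulse train at f0.
    const float pulse = phase < 0.05f ? 1.f : 0.f;
    float y = 0.f;
    for (auto& formant : formants) {
      y += formant.Process(pulse);
    }
    sample = 0.3f * y;
  }
}
```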

SAMSpeechSynth

This is an implementation of SAM.

Here’s how it works: we sum three sine waves (one for each formant), enveloped by a decreasing ramp (at f0), the phase of the sine waves being reset whenever the ramp resets. This is very roughly equivalent to directly synthesizing, in the time domain, the waveform of a pulse processed by a bank of three resonant band-pass filters (in which case the envelope would have to be a decreasing exponential, different for each of the three formants). Adjusting the amplitude and frequency of each sine wave makes for different combinations of formants.
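
A rough sketch of that principle (again, not the actual SAM or Plaits code: the formant frequencies and amplitudes are made up, and nothing is band-limited here):

```cpp
#include <cmath>
#include <vector>

int main() {
  const float kSampleRate = 48000.f;
  const float kTwoPi = 6.28318530717958f;
  const float f0 = 110.f;
  // Made-up formant frequencies and amplitudes.
  const float formant_hz[3] = { 800.f, 1150.f, 2900.f };
  const float formant_amplitude[3] = { 1.0f, 0.6f, 0.2f };

  float ramp_phase = 0.f;
  float formant_phase[3] = { 0.f, 0.f, 0.f };

  std::vector<float> out(4800);
  for (float& sample : out) {
    ramp_phase += f0 / kSampleRate;
    if (ramp_phase >= 1.f) {
      ramp_phase -= 1.f;
      // Hard-sync the three "formant oscillators" to the glottal period.
      for (float& p : formant_phase) p = 0.f;
    }
    // Decreasing ramp envelope at f0, shared by the three formants.
    const float envelope = 1.f - ramp_phase;
    float y = 0.f;
    for (int i = 0; i < 3; ++i) {
      formant_phase[i] += formant_hz[i] / kSampleRate;
      if (formant_phase[i] >= 1.f) formant_phase[i] -= 1.f;
      y += formant_amplitude[i] * std::sin(kTwoPi * formant_phase[i]);
    }
    sample = 0.33f * envelope * y;
  }
}
```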

SAM (and its Shruthi, Ambika and Braids implementations) does this with a set of super low-resolution look-up tables (16 8-bit samples per sine wave, for 16 different amplitude levels).
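
For illustration, the kind of table this implies is tiny, something like this (the values here are computed rather than copied from any ROM):

```cpp
#include <cmath>
#include <cstdint>

// One 16-sample cycle of a sine wave, pre-scaled to 16 amplitude levels
// and stored as signed 8-bit values.
int8_t sine_lut[16][16];  // [amplitude level][sample index within the cycle]

int main() {
  for (int level = 0; level < 16; ++level) {
    for (int i = 0; i < 16; ++i) {
      const float s = std::sin(6.2831853f * static_cast<float>(i) / 16.f);
      sine_lut[level][i] =
          static_cast<int8_t>(s * 127.f * static_cast<float>(level) / 15.f);
    }
  }
}
```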

In Plaits, I ditched the look-up tables and used a band-limited sawtooth (and the phases of the sine waves are reset to the correct phase corresponding to the sub-sample transition, not to 0). So it’s a hi-fi implementation of a lo-fi algorithm :smiley:

LPCSpeechSynth

This is a direct implementation of the TI TMS5100 (Speak and Spell chip) algorithm.

White noise, or a wide-bandwidth periodic signal, serves as the excitation of an order-10 all-pole filter (implemented as an order-10 lattice filter).
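
In sketch form, such a lattice filter looks roughly like this (the reflection coefficients below are placeholders, not actual TI or Plaits data):

```cpp
#include <array>
#include <cstdlib>

constexpr int kLpcOrder = 10;

// All-pole lattice filter: k holds the 10 reflection coefficients of the
// current frame, b holds the backward prediction errors (the filter state).
struct LatticeFilter {
  std::array<float, kLpcOrder + 1> b{};

  float Process(float excitation, const std::array<float, kLpcOrder>& k) {
    float f = excitation;
    // Walk the lattice from the highest-order stage down to the output.
    // b[kLpcOrder] is written but never read; it just keeps the indexing simple.
    for (int i = kLpcOrder - 1; i >= 0; --i) {
      f -= k[i] * b[i];
      b[i + 1] = b[i] + k[i] * f;
    }
    b[0] = f;
    return f;
  }
};

int main() {
  LatticeFilter filter;
  // Placeholder reflection coefficients (|k| < 1 keeps the filter stable).
  const std::array<float, kLpcOrder> k = {
    0.5f, -0.3f, 0.2f, -0.1f, 0.05f, 0.f, 0.f, 0.f, 0.f, 0.f };
  for (int i = 0; i < 200; ++i) {
    // White noise excitation, as for an unvoiced frame.
    const float noise = 2.f * static_cast<float>(std::rand()) / RAND_MAX - 1.f;
    filter.Process(noise, k);
  }
}
```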

The noise and pulse energy, along with the 10 coefficients, are stored in ROM tables derived from a speech analysis program (which, for each small frame of signal, searches for the energies and coefficients that minimize the error between the synthesized signal and the original speech recording). I used data from the original ROMs, and made my own data (the colors, the synthesis vocabulary…) by running the analysis process on a Mac OS X voice (so that’s speech resynthesis of a speech synthesis voice).

Specificities of the Plaits implementation: one can interpolate between one set of coefficients and the next (rather than jumping directly to the new values); and the excitation pulse is the original TI function passed through minimum-phase reconstruction (the goal is to keep the same spectrum but get something more compact in the time domain, so as to reach higher pitches without dealing with overlap issues), replayed with band-limiting techniques. Also, there’s a whole variable sample-rate playback layer on top of that, to allow formant shifting.
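
The frame-to-frame interpolation can be sketched like this (the frame layout and names are hypothetical, not the actual ROM format):

```cpp
#include <array>

constexpr int kLpcOrder = 10;

// Hypothetical per-frame LPC data: an energy value and 10 reflection coefficients.
struct LpcFrame {
  float energy;
  std::array<float, kLpcOrder> k;
};

// Linearly blend two consecutive frames; t goes from 0 to 1 over the frame duration.
LpcFrame Interpolate(const LpcFrame& from, const LpcFrame& to, float t) {
  LpcFrame result;
  result.energy = from.energy + (to.energy - from.energy) * t;
  for (int i = 0; i < kLpcOrder; ++i) {
    result.k[i] = from.k[i] + (to.k[i] - from.k[i]) * t;
  }
  return result;
}

int main() {
  LpcFrame previous{};
  LpcFrame next{};
  previous.energy = 0.8f;
  previous.k[0] = 0.5f;
  next.energy = 0.2f;
  next.k[0] = 0.3f;
  // A quarter of the way from the previous frame to the next one.
  const LpcFrame blended = Interpolate(previous, next, 0.25f);
  (void)blended;
}
```

A nice property of working with reflection coefficients is that any set with |k| < 1 gives a stable filter, and a linear interpolation of two such sets stays within that range, so the crossfade can’t make the filter blow up.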


Hi Emilie, and anyone else who may be able to answer this :slight_smile:,

I have only recently been getting into the more technical side of this, and have previously only been fiddling with synthesis from the perspective of what I like to hear. So this is probably silly, but please bear with me! I was not sure what “f0” meant, and after poking around, I suspect that it is the “fundamental frequency”. Is that correct?

All the best,
Bob


Yes, that’s the fundamental frequency!


Awesome! Thank you for the clarification. I appreciate your time :slight_smile:

This is great, and the references are great. Also I laughed when I saw that Atari manual for S.A.M.

[Screenshot of the Atari S.A.M. manual]
