Speech synthesis

Wednesday, November 9, 2011

Lau Kwan Yuen

In this day and age, a machine speaking to us is no surprise. Stephen Hawking, for example, relies on such a “speaking machine” to communicate with others. Even mobile phones and electronic dictionaries have this ability. The technology is called speech synthesis. This survey focuses on the description, principles and applications of speech synthesis.

Speech synthesis is the artificial production of human speech. The first computer-based speech synthesizers were invented in the 1950s. A synthesizer can be implemented in hardware or in software, and the computer systems used for speech synthesis come in several types.

Speech can be synthesized by linking together pieces of recorded speech that are stored in a database. Alternatively, a completely synthetic voice output can be created from a model of the human voice and its characteristics.
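
The first approach, linking stored recordings, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the database holds one "recording" per word, the sample values are placeholders rather than real audio, and the names UNIT_DATABASE and synthesize_by_concatenation are hypothetical.

```python
# Toy "database" of recorded units: each word maps to a list of audio samples.
# Real systems store waveforms of recorded human speech; these short lists are
# placeholder values for illustration only.
UNIT_DATABASE = {
    "good": [0.1, 0.3, 0.2],
    "morning": [0.0, -0.2, -0.1, 0.1],
}


def synthesize_by_concatenation(words, database):
    """Link the stored recordings for each word into one sample sequence."""
    samples = []
    for word in words:
        if word not in database:
            raise KeyError(f"no recording stored for {word!r}")
        samples.extend(database[word])
    return samples


speech = synthesize_by_concatenation(["good", "morning"], UNIT_DATABASE)
```

A practical system would typically store much smaller units than whole words, so that words missing from the database can still be built from parts.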

In 1968, the first text-to-speech (TTS) system was created. A TTS system converts normal language text into speech, while some systems can also turn symbolic linguistic representations, such as phonetic transcriptions, into speech. A TTS system has two parts, a front-end and a back-end. The front-end converts raw text containing symbols such as numbers and abbreviations into the equivalent written-out words (text normalization). It then assigns a phonetic transcription to each word and divides the text into phrases. The back-end, in turn, converts the phonetic transcriptions into sound.
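
The two front-end steps described above, text normalization and phonetic transcription, might look roughly like the following Python sketch. The lookup tables, the phoneme symbols and the function names are all hypothetical and far smaller than anything a real system would use; phrase breaking and the back-end are left out.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PRONUNCIATIONS = {  # informal phoneme symbols, not a standard alphabet
    "doctor": ["D", "AA", "K", "T", "ER"],
    "lau": ["L", "AW"],
    "two": ["T", "UW"],
}


def normalize(text):
    """Text normalization: expand abbreviations and digits into plain words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdecimal():
            # Simplification: numbers are read out digit by digit.
            words.extend(DIGITS[int(d)] for d in token)
        else:
            # Strip punctuation and anything else that is not a letter.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words):
    """Pair each word with its phonetic transcription (None if unknown)."""
    return [(w, PRONUNCIATIONS.get(w)) for w in words]


def front_end(text):
    """Run normalization, then transcription, as a TTS front-end would."""
    return to_phonemes(normalize(text))
```

Calling front_end("Dr. Lau 2") would first produce the words "doctor", "lau" and "two", and then attach the phoneme list stored for each of them. A back-end would take these phoneme lists as its input.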

Some applications demand higher synthesis quality. The quality of a speech synthesizer is judged by how closely its output resembles a human voice and by how easily it can be understood. These two properties, naturalness and intelligibility, are the most important qualities of a synthesis system. Different synthesis methods have different features and uses. Some of them are discussed here.

One synthesis method that provides good naturalness is unit selection. The reason is that it applies only a small amount of digital signal processing to the recorded speech, and such processing often makes speech sound unnatural. Some systems nevertheless use a little signal processing to smooth the waveform at the points where units are joined.
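
One simple way to smooth the waveform at a join is a linear crossfade. The Python sketch below is an assumed illustration of that idea, not the method of any particular system; the function name crossfade_join and the overlap parameter are hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, blending `overlap` samples across the boundary."""
    if overlap <= 0:
        # No smoothing requested: plain concatenation.
        return list(left) + list(right)
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the units")
    blended = []
    for i in range(overlap):
        # The weight of the right-hand unit rises linearly across the overlap.
        weight = (i + 1) / (overlap + 1)
        a = left[len(left) - overlap + i]
        b = right[i]
        blended.append((1 - weight) * a + weight * b)
    return list(left[:len(left) - overlap]) + blended + list(right[overlap:])
```

The tail of the first unit fades out while the head of the second fades in, so the result is shorter than a plain concatenation by the number of overlapping samples.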

Another method is formant synthesis. It does not use any human speech samples. In other words, the speech is created entirely artificially, using additive synthesis and an acoustic model. The system uses the model to simulate parameters such as voicing, noise levels and fundamental frequency.
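
As a much-simplified sketch of this idea, the Python code below builds a vowel-like sound by additive synthesis: it sums the harmonics of a fundamental frequency and gives more weight to harmonics near a few formant peaks. The formant frequencies, the bandwidth, the shape of the weighting curve and the function names are all assumptions chosen for illustration, and noise and voicing changes are not modelled.

```python
import math


def formant_gain(freq, formants, bandwidth=100.0):
    """Weight a harmonic by how close it lies to any of the formant peaks."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)


def synthesize_vowel(f0=120.0, formants=(700.0, 1200.0, 2600.0),
                     duration=0.2, sample_rate=16000):
    """Additive synthesis: sum harmonics of f0 shaped by formant peaks."""
    n_samples = round(duration * sample_rate)
    # Use only harmonics up to half the sample rate (the Nyquist limit).
    n_harmonics = int((sample_rate / 2) // f0)
    gains = [formant_gain(k * f0, formants) for k in range(1, n_harmonics + 1)]
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(g * math.sin(2 * math.pi * k * f0 * t)
                    for k, g in enumerate(gains, start=1))
        samples.append(value)
    # Scale so that the loudest sample has magnitude 1.
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]
```

Changing f0 changes the pitch of the result, while changing the formant frequencies changes which vowel it resembles. No recorded speech is involved at any point.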

Various computer operating systems have adopted speech synthesis. For instance, two very popular mobile operating systems, Apple iOS and Android, have added support for it. On iOS, which is used on the iPhone and iPad, VoiceOver provides speech synthesis as an accessibility feature for users with certain disabilities.

Microsoft Windows has included Narrator, a text-to-speech utility for people with visual impairments, since Windows 2000. In addition, the CoolSpeech program can be run on Windows to read aloud text from web pages and text documents.

Moreover, speech synthesis systems are used in many entertainment products. For example, some e-books can be read aloud through a speaker for convenience. This allows people with reading disabilities or visual impairments to listen to the words in the book, so that they too can enjoy reading.

A more innovative application of speech synthesis is Vocaloid, a singing synthesizer application by Yamaha Corporation. The software lets users synthesize singing and create their own songs with a virtual singer by typing in a melody and lyrics. It combines synthesis technology with specially recorded vocals of voice actors or singers. Users can change the stress of the pronunciation, the vibrato, the dynamics and the tone of the voice. Vocaloid is sold as “a singer in a box”, a replacement for an actual singer. The application of speech synthesis is therefore not confined to assistive uses.

To conclude, speech synthesis is a significant technology in our daily lives. It not only aids people with visual or speech impairments, but also brings us new forms of entertainment. The quality of synthesized voices is expected to improve in the future, so that an artificial voice becomes more similar to an actual human voice. Listeners will then be able to recognize synthesized speech more easily and accurately.


Everyday Computing and the Internet -- HKU CCST9003 Common Core
