Science and technology | Speech recognition

Watch what you say

Better automated acquisition of speech may be more about seeing than hearing

Jan 23rd 2015|SEATTLE

"IF HE were proven to be malfunctioning, I wouldn't see how we'd have any choice but disconnection." In the film "2001", Frank Poole, an astronaut played by Gary Lockwood, considers what should be done with HAL, the homicidal computer in charge of the ship. HAL learns of his human masters’ plan to unplug him by lip-reading their conversation through a window—a strategy that several researchers and companies are getting closer to realising. Their goal is less about spaceship-driving robots and more about improving the performance of voice-controlled helpers such as Apple’s Siri and Microsoft’s Cortana.

No matter how good voice-recognition software becomes, it will always be hostage to its sonic environment. Ask your digital assistant to dial a number in a quiet office and it might hear the right numbers. Try again near a busy road or at a noisy party and you will probably be disappointed. If only your phone could simply read your lips.

Ahmad Hassanat, an artificial-intelligence researcher at Mu’tah University, in Jordan, has been trying to teach a computer program to do just that. Previous attempts to get computers to lip-read have focused, understandably enough, on the shape and movement of the lips as they produce phonemes (individual sounds like "b", "ng" or "th"). Such shapes-of-sounds are called visemes. The problem is that there are just a dozen visemes for the 40 to 50 phonemes in English; "pan" and "banned", for example, look remarkably similar to a lip-reader. That makes it rather taxing to reconstruct words from visemes alone. Instead, Dr Hassanat has been trying for the past few years to detect the visual signature of entire words all at once, using the appearance of the tongue and teeth as well as the lips.
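
To see why a dozen visemes struggle to cover 40 to 50 phonemes, consider a rough sketch in Python. The viseme groupings, the phoneme spellings of the two words and the rule that merges a repeated mouth shape into one are all illustrative assumptions, not the classes or method of any real lip-reading system.

```python
# Hypothetical phoneme-to-viseme table: these groupings are illustrative
# assumptions, not the classes used by any particular lip-reading system.
PHONEME_TO_VISEME = {
    "p": "lips-closed", "b": "lips-closed", "m": "lips-closed",
    "f": "lip-to-teeth", "v": "lip-to-teeth",
    "t": "tongue-behind-teeth", "d": "tongue-behind-teeth",
    "n": "tongue-behind-teeth",
    "ae": "open-vowel",
}


def viseme_sequence(phonemes):
    """Map phonemes to visemes, merging consecutive repeats of one mouth shape."""
    shapes = []
    for phoneme in phonemes:
        shape = PHONEME_TO_VISEME[phoneme]
        # Assumption: two identical shapes in a row look like one held shape.
        if not shapes or shapes[-1] != shape:
            shapes.append(shape)
    return shapes


# Rough phoneme spellings (also an assumption) of the two example words.
PAN = ["p", "ae", "n"]
BANNED = ["b", "ae", "n", "d"]

print(viseme_sequence(PAN))
print(viseme_sequence(BANNED))
print(viseme_sequence(PAN) == viseme_sequence(BANNED))
```

Under these assumptions both words reduce to the same three mouth shapes, so a system working from visemes alone has nothing with which to tell them apart.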

His method has had some success. In a paper published late last year, Dr Hassanat describes training his system by filming 10 women and 15 men of different ethnicities as they read passages of text. The computer first compared these recordings to a text it knew, then tried to guess what the speakers were saying in a second video. When the computer was allowed to use the same person’s training speech, it was fairly accurate: around 75% across all subjects and up to 97% for one speaker. But when the person’s own training video was excluded from the analysis, which is the position of a digital assistant meeting a user it has never been trained on, the program’s accuracy plunged to 33% on average and to as little as 15% in some cases (moustaches and beards, it seems, are particularly confusing to the system).

Another idea is not to focus on the mouth. In 2013, Yasuhiro Oikawa, an engineer at Waseda University in Japan, filmed a speaker’s throat with a high-speed camera capable of shooting 10,000 frames a second. The approach measures tiny, fleeting vibrations in the skin caused by the very act of speaking. The precise frequencies present in the vibrations can then, in principle, be used to reconstruct the word being spoken. So far, however, Dr Oikawa’s team has managed to map the visual vibrations of just a single Japanese word.
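
How frequencies might be pulled out of such footage can be sketched in a few lines of Python. This is a generic illustration, not Dr Oikawa’s method: the "footage" is a synthetic sine wave standing in for skin displacement measured frame by frame, the 200Hz tone and the clip length are hypothetical, and the analysis is a plain discrete Fourier transform. A camera shooting 10,000 frames a second can, by the sampling theorem, capture vibrations of up to 5,000Hz.

```python
import cmath
import math

FRAME_RATE = 10_000  # frames per second, the camera speed quoted in the article
TONE_HZ = 200.0      # hypothetical vibration frequency, chosen for illustration
N_FRAMES = 500       # 0.05 seconds of footage (an assumption)

# Synthetic stand-in for the skin displacement measured in each frame.
signal = [
    math.sin(2 * math.pi * TONE_HZ * i / FRAME_RATE) for i in range(N_FRAMES)
]


def dominant_frequency(samples, sample_rate):
    """Return the frequency in Hz of the strongest discrete-Fourier-transform bin."""
    n = len(samples)
    best_bin, best_magnitude = 0, 0.0
    # Only bins below half the sample rate (the Nyquist limit) are meaningful.
    for k in range(1, n // 2):
        coefficient = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        if abs(coefficient) > best_magnitude:
            best_bin, best_magnitude = k, abs(coefficient)
    return best_bin * sample_rate / n


print(dominant_frequency(signal, FRAME_RATE))
```

Real speech contains many frequencies that shift from moment to moment, which is one reason why turning such measurements back into words is far harder than recovering a single steady tone.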

The best results seem to come when the approach is used at closer quarters. VocalZoom is an Israeli start-up whose idea is to point a low-power laser beam at a speaker’s cheek to measure vibrations, and use those to infer the frequencies of speech. The system combines those results with ordinary speech audio from a microphone, subtracting unwanted ambient noise or other talkers and leaving just the cheek-wobble frequencies.

Earlier this month, the firm took its technology to CES, a big trade show and a notoriously ear-splitting environment, and impressed the tech press. But it is not yet ready for the mass market. The prototype system is currently larger than the smartphones it is intended to be built into, and tempting manufacturers into adding components to ever-slimmer, ever-sleeker handsets will not be easy. The company may have more luck getting its technology into cars, another industry increasingly reliant on voice control; VocalZoom claims to be in early talks with a big carmaker. Perhaps the company will even get its kit into space-faring vehicles.
