Technology Quarterly | Speech recognition

Watch what you say

Better automated recognition of speech may be more about seeing than about hearing

“IF HE were proven to be malfunctioning, I wouldn’t see how we’d have any choice but disconnection.” In the film “2001”, Frank Poole, an astronaut played by Gary Lockwood, considers what should be done with HAL, the homicidal computer in charge of the ship. HAL learns of his human masters’ plan to unplug him by lip-reading their conversation through a window—an idea that researchers and companies are getting closer to realising. Their goal is less about spaceship-driving robots and more about improving voice-controlled helpers such as Apple’s Siri and Microsoft’s Cortana.

No matter how good voice-recognition software becomes, it will always be hostage to its sonic environment. Ask your digital assistant to dial a number in a quiet office and it might hear the right numbers. Try again near a busy road or at a noisy party and you will probably be disappointed. If only your phone could read your lips.

Ahmad Hassanat, a researcher in artificial intelligence at Mu’tah University, in Jordan, has been trying to teach a computer program to do just that. Previous attempts to get computers to lip-read have focused, understandably enough, on the shape and movement of the lips as they produce phonemes (individual sounds like “b”, “ng” or “th”). Such shapes-of-sounds are called visemes. The problem is that there are just a dozen visemes for the 40 to 50 phonemes in English; “pan” and “ban”, for example, look remarkably similar to a lip-reader. That makes it rather taxing to reconstruct words from visemes alone. Instead, Dr Hassanat has been trying for the past few years to detect the visual signature of entire words, using the appearance of the tongue and teeth as well as the lips.
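
To see why visemes alone make lip-reading so ambiguous, consider a toy sketch in Python. The phoneme-to-viseme mapping below is invented for illustration, not drawn from Dr Hassanat’s work, but it shows how “pan” and “ban” collapse into the same visual sequence.

```python
# Toy illustration: several phonemes share one viseme, so visually a
# lip-reader cannot tell certain words apart. The mapping is made up.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "ae": "open",
    "n": "alveolar", "t": "alveolar", "d": "alveolar",
}

def visemes(phonemes):
    """Collapse a phoneme sequence into the viseme sequence a lip-reader sees."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

pan = visemes(["p", "ae", "n"])
ban = visemes(["b", "ae", "n"])
print(pan == ban)  # True: "pan" and "ban" look identical on the lips
```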

His method has had some success. In a paper published late last year, Dr Hassanat described how he had trained his system by filming ten women and 16 men of different ethnicities as they read passages of text. The computer first compared these recordings with a text it knew, then tried to guess what the subjects were saying in a second video. When it was allowed to learn from the same person’s training footage, it was fairly accurate, correctly identifying around 75% of spoken words across all subjects and up to 97% for one speaker. But when each person’s own training video was excluded from the analysis, mimicking the situation a digital assistant faces with an unfamiliar user, accuracy fell to 33% on average and as low as 15% in some cases (moustaches and beards, it seems, are particularly confusing to the system).
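
The gap between those two figures comes from how the system is tested. The sketch below, which uses made-up features and a generic classifier rather than the paper’s actual pipeline, illustrates the two regimes: training and testing on the same speakers, versus holding each speaker out entirely.

```python
# A rough sketch of speaker-dependent v speaker-independent evaluation,
# with invented visual features; this is not Dr Hassanat's pipeline.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(260, 20))           # pretend visual features, one row per spoken word
y = rng.integers(0, 10, size=260)        # word labels
speakers = np.repeat(np.arange(26), 10)  # 26 speakers, 10 words each

clf = KNeighborsClassifier(n_neighbors=3)

# Speaker-dependent: the same speaker appears in training and test data.
clf.fit(X, y)
print("speaker-dependent accuracy:", clf.score(X, y))

# Speaker-independent: each speaker's videos are held out entirely,
# mimicking an assistant that has never seen the user before.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print("speaker-independent accuracy:", np.mean(scores))
```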

Another idea is not to focus on the mouth at all. In 2013 Yasuhiro Oikawa, an engineer at Waseda University in Japan, used a high-speed camera, capable of shooting 10,000 frames a second, to film a speaker’s throat. The footage captures tiny, fleeting vibrations in the skin caused by the act of speaking. The precise frequencies present in those vibrations can then, in principle, be used to reconstruct the word being spoken. So far, however, Dr Oikawa’s team has managed to map the visual vibrations of just a single Japanese word.
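
In outline, the trick is to treat the filmed skin as a vibration sensor. The sketch below uses a synthetic signal rather than real footage, but shows how a per-frame measurement sampled at 10,000 frames a second could be turned into frequencies with a Fourier transform.

```python
# Minimal sketch of the idea, not Dr Oikawa's pipeline: treat the brightness
# of a patch of throat skin in each high-speed frame as a vibration signal,
# then read off its dominant frequency with an FFT. The 150 Hz tone is synthetic.
import numpy as np

fps = 10_000                               # frames per second of the camera
t = np.arange(0, 0.1, 1 / fps)             # 0.1 s of footage
skin_signal = np.sin(2 * np.pi * 150 * t)  # stand-in for per-frame skin displacement

spectrum = np.abs(np.fft.rfft(skin_signal))
freqs = np.fft.rfftfreq(len(skin_signal), d=1 / fps)
print("dominant frequency: %.0f Hz" % freqs[np.argmax(spectrum)])  # ~150 Hz
```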

The best results come when a system does more than just passively watch. VocalZoom, an Israeli startup, points a low-power laser beam at a speaker’s cheek to measure the skin’s vibrations and infer from them the frequencies of that person’s speech. The system combines those measurements with ordinary audio from a microphone, subtracting ambient noise and other talkers and keeping only the frequencies that match the cheek’s wobble.
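
VocalZoom has not published its algorithm, but the principle can be sketched: use the laser-measured vibration as a reference for which frequencies belong to the talker, and suppress the rest of the microphone signal. The toy example below, with synthetic signals, illustrates the idea.

```python
# Crude sketch of the principle (VocalZoom's actual algorithm is proprietary):
# keep only those frequencies in the microphone signal that also appear in the
# laser-measured cheek vibration, and suppress everything else.
import numpy as np

fs = 16_000
t = np.arange(0, 0.5, 1 / fs)
speech = np.sin(2 * np.pi * 220 * t)         # pretend the talker's voice
noise = 0.8 * np.sin(2 * np.pi * 1_000 * t)  # pretend party noise
mic = speech + noise
laser = speech + 0.05 * np.random.default_rng(1).normal(size=t.size)  # cheek vibration

mic_spec = np.fft.rfft(mic)
laser_spec = np.abs(np.fft.rfft(laser))
mask = laser_spec > 0.1 * laser_spec.max()   # bins where the cheek actually vibrates
cleaned = np.fft.irfft(mic_spec * mask, n=t.size)
# The 1 kHz noise falls outside the mask and is largely removed.
```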

In January the firm took its technology to CES, a giant technology trade show in Las Vegas and a notoriously ear-splitting environment, and impressed the tech press. But the system is not yet ready for the mass market. The prototype is still larger than the smartphones it is intended to be built into, and tempting manufacturers into adding components to ever-slimmer, ever-sleeker handsets will not be easy. The company may have more luck getting its technology into cars, another industry increasingly reliant on voice control; VocalZoom claims to be in early talks with a big carmaker. Perhaps the company will, one day, even get its kit into space-faring vehicles.

This article appeared in the Technology Quarterly section of the print edition under the headline "Watch what you say"
