Did you know that?
When you tell your phone to search for something, play a song or guide you to a destination, chances are a company is recording it. (Apple, Google, Microsoft and Amazon emphasize that they anonymize user data to protect customer privacy.) When you ask Alexa what the weather is or the latest football score, the gadget uses the queries to improve its understanding of natural language (although “she” isn’t listening to your conversations unless you say her name). By design, Alexa gets smarter as you use her.
In the beginning
Not so long ago, voice recognition was comically rudimentary. An early version of Microsoft’s technology running in Windows transcribed “mom” as “aunt” during a 2006 demo before an auditorium of analysts and investors. When Apple debuted Siri five years back, the personal assistant’s gaffes were widely mocked because it, too, routinely spat out incorrect results or didn’t hear the question correctly. When asked if Gillian Anderson is British, Siri provided a list of English restaurants. Now Microsoft says its speech engine makes the same number or fewer errors than professional transcribers, Siri is winning grudging respect, and Alexa has given us a tantalizing glimpse of the future.
Much of that progress owes a debt to the magic of neural networks, a form of artificial intelligence based loosely on the architecture of the human brain. Neural networks learn without being explicitly programmed but generally require an enormous breadth and diversity of data. The more a speech recognition engine consumes, the better it gets at understanding different voices and the closer it gets to the eventual goal of having a natural conversation in many languages and situations.
Hence the global scramble to capture a multitude of voices. “The more data we shove in our systems the better it performs,” says Andrew Ng, Baidu’s chief scientist. “This is why speech is such a capital-intensive exercise; not a lot of organizations have this much data.”
One of the key challenges is getting the technology conversant with multiple languages, accents and dialects. Nowhere, perhaps, is this more crucial than in China. Seeking to harvest dialects from all over the country, Baidu launched a marketing campaign during Chinese New Year earlier this year. In two weeks, the company recorded more than 1,000 hours of speech to plug into its computers. Many people did it for free simply because they were proud of their hometown dialects.
Another challenge: teaching voice recognition technology to pick up commands over background noise—the clamor of happy hour, say, or the cacophony of a sports stadium. Microsoft has deployed an Xbox app called Voice Studio to harvest conversation over the din of users shooting villains or watching movies. The company offered rewards including points and digital apparel for avatars and lured hundreds of subjects willing to contribute their game chatter to Microsoft’s speech efforts. The data was used to create the Brazilian Portuguese version of Cortana, released earlier this year.
Companies are also designing voice recognition systems for specific situations. Microsoft has been testing technology that can answer travelers’ queries without being distracted by the constant barrage of flight announcements at airports. The company’s technology is also being used in an automated ordering system for McDonald’s drive-thrus. Trained to ignore scratchy audio, screaming kids and “ums,” it can spit out a complicated order, getting even the condiments right. Amazon is conducting tests in automobiles, challenging Alexa to work well with road noise and open windows.
Even as companies scour the world for data, they’re figuring out ways to improve voice recognition with less of it. The technology being tested at McDonald’s is more accurate than other systems that use much more data, says Xuedong Huang, Microsoft’s chief speech scientist, who has been working on voice recognition at the company for more than two decades. “You can always have breakthroughs even without using the most data.”
Google generally subscribes to a less-is-more philosophy, deploying a piecemeal approach that uses unintelligible units of sound to build words and phrases. With its speech recognition system, the company aims to solve multiple problems with just one change. For its data sets, Google strings together tens of thousands of audio snippets that are typically two to five seconds long. The process requires less computing power and can be more easily tested and tweaked.
Voice technology in our daily lives
Voice recognition has come a long way in the past few years. But it’s still not good enough to popularize the technology for everyday use and usher in a new era of human-machine interaction, allowing us to talk with all our gadgets—cars, washing machines, televisions. Despite advances in speech recognition, most people continue to swipe, tap and click. It will probably remain so for the foreseeable future.
Today we took a look at the challenges and innovations of voice technologies. When will it be possible to speak naturally to your digital assistant? No one really knows. Neural networks remain mysterious even to those who understand them best. And much of the work is trial and error; make a tweak here and you’re never quite sure what will happen there. Based on the current technology and methods, the process will probably take years. Still, you never know when a breakthrough will arrive, catapulting research forward and turning Alexa and Siri into true conversationalists.
Reference: Jing Cao and Dina Bass, “Why Google, Microsoft and Amazon Love the Sound of Your Voice,” Bloomberg Technology, https://www.bloomberg.com/news/articles/2016-12-13/why-google-microsoft-and-amazon-love-the-sound-of-your-voice
BITNINE GLOBAL INC., THE COMPANY SPECIALIZING IN GRAPH DATABASE