Alexa and Siri, listen up! Researchers are teaching machines to really hear us

Credit: Pixabay / Public Domain CC0

University of Virginia cognitive scientist Per Sederberg has a fun experiment you can try at home. He takes out his smartphone and, using a voice assistant such as Google’s, says the word “octopus” as slowly as he can.

Your device will struggle to repeat what you just said. It might give a nonsensical answer, or it might offer something close but still off the mark, like “foot kiss.”

The point, Sederberg said, is that when it comes to receiving auditory signals the way humans and other animals do, current artificial intelligence remains somewhat hard of hearing, despite all the computing power devoted to the task by heavyweights like Google, DeepMind, IBM and Microsoft.

The results can range from comical and slightly frustrating to downright alienating for those with speech disabilities.

But using the latest breakthroughs in neuroscience as a model, collaborative research at UVA has converted existing AI neural networks into a technology that can truly listen to us, regardless of how fast we speak.

The deep learning tool is called SITHCon, and by generalizing over its input, it can understand words spoken at speeds different from those the network was trained on.

This new capability will not only change the end-user experience; it has the potential to change how artificial neural networks “think,” allowing them to process information more efficiently. And that could change everything in an industry constantly seeking to increase processing capacity, minimize data storage and reduce AI’s huge carbon footprint.

Sederberg, an associate professor of psychology who serves as director of the cognitive science program at UVA, collaborated with graduate student Brandon Jacques to program a working demo of the technology in collaboration with researchers from Boston University and Indiana University.

“We have shown that we can decode speech, in particular speech at varying speeds, better than any model we know of,” said Jacques, the paper’s first author.

Sederberg added: “We see ourselves as a messy group of misfits. We solved this problem that the big groups at Google, DeepMind and Apple haven’t.”

The research was presented Tuesday at the high-profile International Conference on Machine Learning, or ICML, in Baltimore.

Current AI training: Hearing overload

For decades, but even more so in the last 20 years, companies have built complex artificial neural networks into machines to try to mimic the way the human brain recognizes a changing world. These programs do more than power basic searches and online shopping; they also specialize in stock market forecasting, diagnosing medical conditions, and monitoring national security threats, among many other applications.

“Basically, we’re trying to discover meaningful patterns in the world around us,” Sederberg said. “These models will help us make decisions about how to behave and how to adapt to our environment so that we can get as many rewards as possible.”

Programmers used the brain as the initial inspiration for the technology, hence the name “neural networks”.

“Early AI researchers took the basic properties of neurons and how they connect together and recreated them with computer code,” Sederberg said.

For complex problems like teaching machines to understand language, however, programmers have unwittingly taken a different path from how the brain actually works, he said. They have failed to pivot as the understanding of neuroscience has developed.

“The way these large companies tackle the problem is to throw computational resources at it,” the professor explained. “So they make the neural networks bigger. A field originally inspired by the brain has turned into an engineering problem.”

Basically, programmers feed in a multitude of different voices using different words at different speeds and train large networks through a process called backpropagation. Programmers know the answers they want to get, so they keep feeding the continuously refined information back in a loop. The AI then begins to give due weight to the aspects of the input that translate into accurate answers. The sounds become characters of text.

“You do it many millions of times,” Sederberg said.
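The loop described above can be sketched in a few lines. This is a toy illustration of backpropagation on invented data, not code from any of the companies mentioned or from the SITHCon project; the network size, the fake “audio features” and the labels are all made up for the example.

```python
import numpy as np

# Toy backpropagation: a tiny two-layer network learns to map two
# made-up "sound" feature vectors to the desired answers 1 and 0.
rng = np.random.default_rng(0)

X = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]])  # fake audio features
y = np.array([[1.0], [0.0]])                        # the known answers

W1 = rng.normal(size=(3, 4)) * 0.5  # input -> hidden weights
W2 = rng.normal(size=(4, 1)) * 0.5  # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):          # "you do it many millions of times"
    h = sigmoid(X @ W1)           # forward pass
    out = sigmoid(h @ W2)
    err = out - y                 # compare with the known answers
    # backward pass: push the error back through each layer
    grad_out = err * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ grad_out    # nudge weights toward better answers
    W1 -= 0.5 * X.T @ grad_h

print(np.round(out.ravel(), 2))   # predictions move toward [1, 0]
```

After a few thousand repetitions the weights settle so that each input reliably produces its target answer, which is the “due weight” the article describes.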

Although the training data sets that serve as input have improved, as have computation speeds, the process is still far from ideal, as programmers add more layers to capture more nuance and complexity: so-called “deep” or “convolutional” learning.

More than 7,000 languages are spoken worldwide today. Variations arise with accents and dialects, deeper or higher-pitched voices and, of course, faster or slower speech. As competitors create better products, a computer has to process information for every one of these variations.

This has real consequences for the environment. In 2019, a study found that the carbon dioxide emissions from the energy required to train a single large deep-learning model were equivalent to the lifetime footprint of five cars.

Three years later, datasets and neural networks continued to grow.

How the brain really hears speech

The late Howard Eichenbaum of Boston University coined the term “time cells,” the phenomenon on which this new AI research is built. Neuroscientists studying time cells in mice, and then humans, have shown that there are spikes in neural activity when the brain interprets time-based input, such as sound. Residing in the hippocampus and other parts of the brain, these individual neurons capture specific intervals, data points that the brain examines and interprets in relation to one another. The cells sit alongside the so-called “place cells,” which help us form mental maps.

Time cells help the brain create a unified understanding of sound, regardless of how quickly or slowly the information arrives.

“If I say ‘ooooooc-tooooo-pussssssss,’ you’ve probably never heard anyone say ‘octopus’ at that speed, but you can understand it anyway because the way your brain processes that information is what’s called ‘scale invariant,’” Sederberg said. “What that basically means is that if you’ve heard something and learned to decode that information at one scale, then if that information now comes in a little faster or a little slower, or even a lot slower, you’ll still get it.”

The main exception to the rule, he said, is information that arrives hyper-fast. This data will not always be translated. “You lose bits of information,” he said.

Cognitive scientist Marc Howard’s laboratory at Boston University builds on the discovery of time cells. Howard, who has collaborated with Sederberg for over 20 years, studies how people make sense of the events in their lives, then converts that understanding into math.

Howard’s equation describing auditory memory involves a timeline. The timeline is built from time cells firing in sequence. Crucially, the equation predicts that the timeline blurs in a particular way as a sound recedes into the past, because the brain’s memory of an event becomes less precise over time.

“So there’s a specific redundancy scheme that encodes what happened at a given period in the past, and the information gets blurrier and blurrier the further into the past you go,” Sederberg said. “The interesting thing is that Marc and a postdoc working in Marc’s laboratory figured out mathematically what this should look like. Then neuroscientists started finding evidence for it in the brain.”
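The blurring Sederberg describes can be sketched numerically. In this toy model, all parameters are invented for illustration and are not taken from Howard’s published equations: time cells have log-spaced preferred delays, and each cell’s tuning width grows in proportion to its delay, so a distant event is represented by roughly as many cells as a recent one, but those cells cover a much wider stretch of real time.

```python
import numpy as np

# Toy logarithmically compressed timeline: cell "preferred delays" are
# log-spaced, and each cell's tuning width grows with its delay.
centers = np.geomspace(1.0, 64.0, num=20)   # log-spaced delays (time units)
widths = 0.3 * centers                       # blur grows with delay

def timeline(event_age):
    """Activity of each time cell for an event `event_age` units in the past."""
    return np.exp(-0.5 * ((event_age - centers) / widths) ** 2)

def active_span(act):
    """Preferred delays of the cells responding above half their peak."""
    return centers[act > 0.5 * act.max()]

recent, distant = timeline(2.0), timeline(32.0)

# Roughly the same number of cells fires in both cases (a scale-free code),
# but for the distant event they span a much wider window of real time.
print(len(active_span(recent)), round(float(np.ptp(active_span(recent))), 1))
print(len(active_span(distant)), round(float(np.ptp(active_span(distant))), 1))
```

The constant cell count is the scale invariance from the quote above; the growing time span is the blur that Howard’s equation predicts.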

Time adds context to sounds and is part of what gives meaning to what we are told. Howard said the math works out perfectly.

“Time cells in the brain seem to obey this equation,” Howard said.

UVA encodes the voice decoder

About five years ago, Sederberg and Howard recognized that the AI field could benefit from such brain-inspired representations. Working with Howard’s lab and in consultation with Zoran Tiganj and colleagues at Indiana University, Sederberg’s Computational Memory Lab began building and testing models.

About three years ago, Jacques made the key breakthrough that let him code the resulting proof of concept. The algorithm features a form of compression that can be unpacked as needed, much as a zip file on a computer compresses and archives large files. The machine stores the “memory” of a sound only at a resolution that will be useful later, saving storage space.

“Because the information is logarithmically compressed, when the input changes speed the pattern doesn’t completely change; it just shifts,” Sederberg said.
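That shift can be shown in a few lines of code. The sketch below is illustrative only, with an invented signal, and is not SITHCon’s actual implementation: sample a signal at log-spaced times, and a version of it spoken twice as slowly produces the same stored pattern, slid over by a fixed number of cells.

```python
import numpy as np

# Log-spaced "memory" sample times: each is a constant ratio r past the last.
r = 2 ** 0.25                       # ratio between neighboring sample times
taus = 0.1 * r ** np.arange(40)     # log-spaced sample times

def sound(t):
    """A made-up smooth audio envelope standing in for a spoken word."""
    return np.exp(-(np.log(t / 1.5)) ** 2)

normal = sound(taus)                # the word at normal speed
slow = sound(taus / r ** 4)         # the same word spoken 2x slower (r**4 == 2)

# The slow version is the normal pattern shifted by exactly 4 cells:
print(np.allclose(slow[4:], normal[:-4]))  # True
```

Because a speed change only translates the pattern along the log axis, a network that has learned the pattern at one speed can recognize it at others, which is the generalization SITHCon exploits.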

The AI training for SITHCon was benchmarked against a pre-existing resource freely available to researchers, the “temporal convolutional network.” The goal was to convert a network trained to listen only at certain speeds into one that could generalize beyond them.

The process started with a basic language, Morse code, which uses bursts of long and short tones to represent dots and dashes, then moved on to an open-source set of English speakers saying the numbers 1 through 9 as input.

In the end, no further training was required. Once the AI recognized speech at one speed, it could not be fooled if a speaker drew the words out.

“We have shown that SITHCon can generalize to speech sped up or slowed down, while other models failed to decode information at speeds they did not see during training,” Jacques said.

Now UVA has decided to make its code freely available to advance the field. The team says the approach should adapt to any neural network that translates speech.

“We will publish and release all the code because we believe in open science,” Sederberg said. “The hope is that companies will see it, get really excited and say they would like to fund our continued work. We’ve harnessed a fundamental way the brain processes information, combining power and efficiency, and we’ve only scratched the surface of what these AI models can do.”

But knowing that they have built a better mousetrap, are the researchers concerned about how the new technology can be used?

Sederberg said he is optimistic that AI that hears better will be handled ethically, as all technology should be, in theory.

“Right now, these companies are running into computational bottlenecks as they try to build more powerful and useful tools,” he said. “You have to hope the positives outweigh the negatives. If you can offload more thought processes onto computers, that will make us a more productive world, for better or for worse.”

Jacques, a new father, said: “It’s exciting to think that our work could spawn a new direction in AI.”
