New AI model shows how machines can learn vision, language and sound together


Most of us have watched television with the sound muted at one time or another. While it’s usually possible to follow the story at least to some degree, the lack of an audio track tends to limit our ability to fully appreciate what’s going on.

Likewise, it’s easy to miss a lot of information just by listening to sounds coming from another room. The multimodality of the combination of image, sound and other details greatly improves our understanding of what is happening, whether on television or in the real world.

It seems the same is true for artificial intelligence. A new question-answering model called MERLOT RESERVE makes zero-shot predictions out of the box, revealing strong multimodal common-sense understanding. It was recently developed by a team from the Allen Institute for Artificial Intelligence (AI2), the University of Washington and the University of Edinburgh.

Part of a new generation of artificial intelligence applications that enable semantic search, analysis and question answering (QA), the system was trained by having it “watch” 20 million YouTube videos. The capabilities it demonstrates are already being commercialized by startups such as Twelve Labs and Clip.

MERLOT RESERVE (RESERVE for short) stands for Multimodal Event Representation Learning Over Time, with Re-entrant Supervision of Events, and builds on the team’s previous MERLOT model. It was pre-trained on millions of videos, learning from the combined input of their images, audio and transcripts. Individual images allow the system to learn spatial information, while training at the video level gives it temporal information, teaching it the relationships between elements that change over time.
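To make the idea of a joint image–audio–transcript input concrete, here is a minimal sketch (not the authors’ code) of how one video segment’s frame features, audio features and transcript tokens might be projected into a single shared sequence for a transformer encoder. All module names, dimensions and the tiny two-layer encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

HIDDEN = 256     # shared embedding width (assumed)
N_FRAMES = 8     # frames sampled from one segment (assumed)

class JointSegmentEncoder(nn.Module):
    def __init__(self, vocab_size=30522, img_dim=512, audio_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, HIDDEN)      # project per-frame image features
        self.audio_proj = nn.Linear(audio_dim, HIDDEN)  # project per-frame audio features
        self.text_emb = nn.Embedding(vocab_size, HIDDEN)
        # one learned embedding per modality so the model can tell the streams apart
        self.modality_emb = nn.Embedding(3, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_feats, audio_feats, text_ids):
        # img_feats: (B, N_FRAMES, img_dim), audio_feats: (B, N_FRAMES, audio_dim),
        # text_ids: (B, T) token ids of the transcript snippet
        img = self.img_proj(img_feats) + self.modality_emb.weight[0]
        aud = self.audio_proj(audio_feats) + self.modality_emb.weight[1]
        txt = self.text_emb(text_ids) + self.modality_emb.weight[2]
        seq = torch.cat([img, aud, txt], dim=1)   # one joint sequence per segment
        return self.encoder(seq)                  # (B, 2*N_FRAMES + T, HIDDEN)

# toy usage with random stand-in features
enc = JointSegmentEncoder()
out = enc(torch.randn(2, N_FRAMES, 512), torch.randn(2, N_FRAMES, 128),
          torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 32, 256])
```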

“The way AI processes things is going to be different from the way humans do,” said computer scientist and project lead Rowan Zellers. “But there are some general principles that will be hard to avoid if we want to build robust AI systems. I think multimodality is definitely in that bucket.”

Rowan Zellers, researcher at the University of Washington and the Allen Institute for Artificial Intelligence.

Because we live in a dynamic world, the team wanted to explore building machines that learn vision, language and sound together. In one of the examples from the paper, we see someone cooking popcorn. From the images and dialogue alone, one can imagine the sounds that might accompany them: uncooked kernels skittering across the metal surface of a pot eventually give way to energetic “pops” as they burst into fluffy white popcorn.

Such prediction is known as “learning from re-entry”, where time-locked correlations allow one modality to educate others. Some developmental psychologists have theorized that this is how we learn visual and world knowledge, often without a teacher. It is also the basis of the name RESERVE: Re-entrant Supervision of Events.

The model is trained on 40-second video segments in which snippets of text and audio are “hidden” from the system. RESERVE learns by selecting the correct hidden snippet from four multiple-choice options, and then picking, from four candidate rationales, the one that justifies its answer.
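A minimal sketch of this kind of multiple-choice objective is shown below: the segment is encoded with one snippet hidden, and the model must pick the true snippet out of four candidates by similarity. The shapes, the dot-product scorer and the encoder-as-black-box are assumptions for illustration, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def masked_snippet_loss(masked_segment_repr, candidate_reprs, target_idx):
    """
    masked_segment_repr: (B, H)    pooled representation at the hidden span
    candidate_reprs:     (B, 4, H) embeddings of four candidate snippets
                         (the true one plus three distractors)
    target_idx:          (B,)      index of the true snippet for each example
    """
    # score each candidate by dot product with the masked-position representation
    scores = torch.einsum('bh,bkh->bk', masked_segment_repr, candidate_reprs)
    # standard multiple-choice cross-entropy over the four options
    return F.cross_entropy(scores, target_idx)

# toy usage with random tensors standing in for encoder outputs
B, H = 4, 256
loss = masked_snippet_loss(torch.randn(B, H), torch.randn(B, 4, H),
                           torch.randint(0, 4, (B,)))
print(float(loss))
```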

This approach has allowed RESERVE not only to achieve state-of-the-art results through its semi-supervised training, but also to make strong zero-shot predictions. For example, a question like “What is the person doing?” can be manually or automatically rewritten as a statement such as “The person is [MASK].” The model then performs a multiple-choice prediction over a set of provided options such as “cook popcorn” or “eat popcorn”.
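The following is a minimal sketch of that zero-shot recipe: the question is rewritten as a masked statement, each answer option fills the mask, and a scoring function ranks the completed statements against the video. The template and the `score_statement` callable are hypothetical stand-ins, not the released model’s API.

```python
def rewrite_question(question: str) -> str:
    # hand-written template for illustration; the paper allows manual or automatic rewrites
    if question == "What is the person doing?":
        return "The person is [MASK]."
    raise ValueError("no template for this question")

def zero_shot_answer(question, options, video, score_statement):
    template = rewrite_question(question)
    # fill the mask with each option and let the model score the resulting statement
    scores = [score_statement(video, template.replace("[MASK]", opt))
              for opt in options]
    return options[max(range(len(options)), key=scores.__getitem__)]

# toy usage with a dummy scorer that simply prefers longer statements
answer = zero_shot_answer(
    "What is the person doing?",
    ["cook popcorn", "eat popcorn"],
    video=None,
    score_statement=lambda video, statement: len(statement),
)
print(answer)  # "cook popcorn" under the dummy scorer
```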

RESERVE was fine-tuned on several large-scale datasets used for cognitive-level visual understanding: VCR, TVQA, and Kinetics-600. It achieved state-of-the-art performance, outperforming previous work by 5%, 7% and 1.5% respectively. By incorporating audio, the model reaches 91.1% accuracy on Kinetics-600.

VCR (Visual Commonsense Reasoning) is a large-scale dataset without audio that is used for visual understanding at the cognitive level. TVQA is a large-scale video question-answering dataset based on six popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House MD, Grey’s Anatomy, and Castle). Finally, Kinetics-600 is a collection of 650,000 video clips that cover hundreds of classes of human actions.

According to the paper, which will be presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in June, RESERVE also brings significant efficiency gains over competing models. For example, it requires one-fifth of the floating-point operations used by the VisualBERT multimodal model.

The project team predicts that the pre-trained video models could one day help visually impaired or deaf users or be used to extract information about video viewing trends. However, they also recognize that the datasets used to train RESERVE introduce unavoidable biases that need to be addressed.

Beyond spoken words, audio can provide a lot of additional contextual information. That shouldn’t surprise us, given our own experience, but it is striking that AI performance also improves significantly. This may be because the time-synchronized extra information lets the model establish new statistical correlations.

“Audio is a lot of things. It’s not just voice, it’s also sound effects and hearing those sound effects improves your understanding of the world,” Zellers observed.

“Another thing is tone of voice, the dynamics of human communication. If you’re just looking at the words, without the audio context, you’re missing a lot. But if someone says that word with a specific emotion, then the model can do a lot better. And in fact, we find that it does.”

MERLOT and RESERVE are part of the Mosaic team at AI2, which focuses on developing systems that can measure and develop machine common sense. Machine common sense has been an area of interest in the field of artificial intelligence for decades. Being able to factor in and anticipate real-world relationships between different objects and processes would make our AI tools much more useful to us.

However, it’s not enough to simply load a bunch of facts and rules about how the world works into a system and expect it to perform; the world is too complex for that. We, on the other hand, learn by interacting with our environment through our various senses from the moment we are born, gradually building an understanding of what is happening in the world and why. Some machine common-sense projects take a similar approach. For MERLOT and RESERVE, the incorporation of additional modalities provides extra information, just as our senses do.

“I think in the medium to long term, what really excites me is AI communicating with us in multiple modalities like audio and gesture so it can relate to what we do,” Zellers observed. The authors of the paper, “MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound,” are Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi and Yejin Choi. A RESERVE demo is available on AI2.

James G. Williams