Meta AI announces first self-supervised algorithm that works for speech, vision, and text

Meta AI has announced ‘data2vec’, the first high-performance self-supervised algorithm that works for speech, vision, and text.

Meta AI said in a blog post that it had applied data2vec separately to speech, images and text, and found it outperformed the previous best single-purpose algorithms for computer vision and speech. “It also represents a new paradigm of holistic self-supervised learning, where new research improves multiple modalities rather than just one,” it said in the post.

data2vec also does not rely on contrastive learning or on reconstructing the input example. In addition to helping accelerate progress in AI, it brings the field closer to building machines that learn seamlessly about different aspects of the world around them. “It will enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what today’s systems can do,” said Meta AI.

As part of the announcement, Meta has shared the data2vec code and pretrained models so that others in the research community can build on them.

How data2vec Works

Meta AI has spelled out how data2vec works. Much of AI is still based on supervised learning, which works exclusively with labeled data. But it’s simply not possible to collect labeled data for all the things we would like machines to do. For example, while researchers have done a lot of work in creating large-scale labeled data sets for English speech and text, it is not feasible to do this for the literally thousands of languages spoken on the planet.

Self-supervision enables computers to learn about the world just by observing it and then figuring out the structure of images, speech, or text. Having machines that don’t need to be explicitly taught to classify images or understand spoken language is simply much more scalable.

Research in self-supervised learning today is almost always focused on one particular modality. So, researchers working on one modality often take a very different approach from those working on another. For text, researchers train models to fill in blanks in sentences. Speech models, however, need to learn an inventory of the basic sounds of speech in order to predict missing sounds. In computer vision, models are often trained to assign similar representations to a color image of a cow and the same image flipped upside down, so it associates the two much more closely than it would with an unrelated image, such as that of a duck.
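To make the text-side objective concrete, the sketch below shows a toy version of the “fill in the blanks” task in PyTorch; the model size, masking rate, and vocabulary are illustrative assumptions, not Meta’s implementation.

# Toy sketch of masked-token prediction for text: hide some tokens and train
# the model to recover them. All sizes and names here are illustrative only.
import torch
import torch.nn as nn

VOCAB_SIZE, DIM, MASK_ID = 1000, 64, 0   # toy vocabulary; id 0 reserved for the mask token

embed = nn.Embedding(VOCAB_SIZE, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
to_vocab = nn.Linear(DIM, VOCAB_SIZE)    # predicts a distribution over discrete tokens

tokens = torch.randint(1, VOCAB_SIZE, (8, 16))     # batch of 8 sequences, length 16
mask = torch.rand(tokens.shape) < 0.15             # mask roughly 15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = to_vocab(encoder(embed(corrupted)))       # (8, 16, VOCAB_SIZE)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss only on masked positions
loss.backward()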

Algorithms also predict different units for each modality: pixels or visual tokens for images, words for text, and learned inventories of sounds for speech. A collection of pixels is very different from an audio waveform or a passage of text, and because of this, algorithm design has been tied to a specific modality, and algorithms continue to function differently for each.

But data2vec simplifies this by training models to predict their own representations of the input data, regardless of the modality. By focusing on these representations — the layers of a neural network — instead of predicting visual tokens, words, or sounds, a single algorithm can work with completely different types of input. This removes the dependence on modality-specific targets in the learning task, claimed Meta AI.
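In code terms, this amounts to swapping a classification loss over a discrete, modality-specific inventory for a regression loss on continuous vectors. The short contrast below is a sketch only; the shapes, loss choices, and names are illustrative assumptions, not Meta’s code.

# Hypothetical contrast between a discrete, modality-specific target and a
# data2vec-style continuous target; shapes and names are illustrative only.
import torch
import torch.nn.functional as F

hidden = torch.randn(8, 16, 64)            # student outputs at masked positions (batch, seq, dim)

# Modality-specific objective: classify into a fixed discrete inventory
# (words, visual tokens, or learned speech units).
vocab_logits = torch.nn.Linear(64, 1000)(hidden)
discrete_targets = torch.randint(0, 1000, (8, 16))
loss_discrete = F.cross_entropy(vocab_logits.flatten(0, 1), discrete_targets.flatten())

# data2vec-style objective: regress onto continuous teacher representations,
# so the same loss applies whether the input was an image, audio, or text.
teacher_latents = torch.randn(8, 16, 64)   # stand-in for the teacher's layer representations
loss_latent = F.smooth_l1_loss(hidden, teacher_latents)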

Our method uses a teacher network to first compute target representations from an image, a piece of text, or a speech utterance. Next, we mask part of the input and repeat the process with a student network, which then predicts the latent representations of the teacher. The student model has to predict representations of the full input data even though it has a view of only some of the information. The teacher network is identical to the student model but with weights that are slightly out of date.

– Meta AI
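A compact sketch of that teacher-student scheme is shown below, assuming a generic Transformer encoder, a simple masking rule, and an exponential-moving-average update for the teacher; all of these choices are placeholders rather than the released data2vec implementation.

# Hedged sketch of the teacher/student scheme described above: the teacher is a
# slowly updated copy of the student, and the student predicts the teacher's
# latent representations for a partially masked input. All modules and
# constants are illustrative placeholders, not the released data2vec code.
import copy
import torch
import torch.nn as nn

DIM, EMA_DECAY, MASK_PROB = 64, 0.999, 0.5

student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=4)
teacher = copy.deepcopy(student)                       # same architecture, slightly stale weights
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def training_step(features):                           # features: (batch, seq, DIM) from any modality
    with torch.no_grad():
        targets = teacher(features)                    # target representations of the *full* input

    mask = torch.rand(features.shape[:2]) < MASK_PROB  # hide part of the input from the student
    masked = features.masked_fill(mask.unsqueeze(-1), 0.0)

    preds = student(masked)
    loss = nn.functional.smooth_l1_loss(preds[mask], targets[mask])  # regress the masked latents

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                              # teacher tracks the student via EMA
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(EMA_DECAY).add_(s, alpha=1 - EMA_DECAY)
    return loss.item()

# Example: one step on random "features" standing in for embedded image, speech, or text input.
print(training_step(torch.randn(8, 16, DIM)))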
