Google’s DeepMind has published a paper proposing a family of machine learning models that aims to do more work with far less costly and time-consuming training.
The upside, the tech giant claims, is massive cost savings, as training is quickly becoming prohibitively expensive. The downside is that it’s no small task to combine visual learning with a language model.
The model family, called Flamingo, is a set of distinct few-shot visual language models (VLMs), as opposed to a more monolithic model such as GPT-3. Google’s DeepMind team says it outperforms all previous few-shot learning approaches, and even approaches fine-tuned on orders of magnitude more data.
A preprint of DeepMind’s academic paper on the subject describes Flamingo as designed to take combined text and image inputs and arrive at a text-only answer, leaving a fair bit of wiggle room for the models to do some interpretation. DeepMind uses an in-house dataset it created especially for multimodal ML research. All the data is unlabeled and was retrieved from the public internet: 43.3 million instances consisting of 185 million images and 182GB of text.
Put simply, here is what Flamingo makes possible: rather than being trained on a task, it is shown only a few examples of it (identifying an animal, solving a math problem, counting types of animals in an image, and so on). Having been shown what sort of inference its users want, it is given another image and asked to return text explaining that input.
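The few-shot setup described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepMind’s code: the `Example` class, the `build_few_shot_prompt` function, the `<image>` placeholder and the file names are all hypothetical, chosen only to show how a handful of image-and-text examples might be interleaved ahead of a new image.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image sits in the text stream.
IMAGE_TOKEN = "<image>"


@dataclass
class Example:
    image_path: str  # hypothetical path to an example image
    text: str        # the text that accompanies that image


def build_few_shot_prompt(
    support: List[Example], query_image: str, query_prefix: str = ""
) -> Tuple[str, List[str]]:
    """Interleave the example (image, text) pairs, then append the query image.

    Returns the text prompt and the ordered list of images it refers to.
    """
    parts: List[str] = []
    images: List[str] = []
    for ex in support:
        parts.append(f"{IMAGE_TOKEN}{ex.text}")
        images.append(ex.image_path)
    # The query image comes last; the model would be asked to continue the text.
    parts.append(f"{IMAGE_TOKEN}{query_prefix}")
    images.append(query_image)
    return "\n".join(parts), images


# Two examples establish the task; the third image is the one to be described.
prompt, images = build_few_shot_prompt(
    [
        Example("chinchilla.jpg", "This is a chinchilla."),
        Example("shiba.jpg", "This is a shiba."),
    ],
    query_image="flamingo.jpg",
    query_prefix="This is",
)
```

The point of the sketch is that nothing is retrained: the task is conveyed entirely by the examples placed in front of the new image.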
DeepMind based Flamingo on its own recently released, pre-trained 70-billion-parameter Chinchilla language model. DeepMind “fused” the Chinchilla LM with visual learning elements “by adding novel architecture components in between” that keep the pre-trained components isolated and frozen, giving it the 80-billion-parameter Flamingo VLM.
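To make the idea of slotting new components in between frozen ones more concrete, here is a toy Python sketch. It is an assumption-laden illustration rather than Flamingo’s architecture: `frozen_block`, `new_visual_block` and `gated_layer` are hypothetical stand-ins, and the tanh gate starting at zero is one plausible way to add a new component without disturbing a pre-trained one.

```python
import math
from typing import List


def frozen_block(x: List[float]) -> List[float]:
    """Stand-in for a pre-trained language model layer; its weights never change."""
    return [2.0 * v + 1.0 for v in x]


def new_visual_block(visual: List[float], weight: float) -> List[float]:
    """Stand-in for a newly added, trainable component that brings in visual features."""
    return [weight * v for v in visual]


def gated_layer(
    x: List[float], visual: List[float], gate: float, weight: float
) -> List[float]:
    """Add gated visual information to the input, then run the frozen layer.

    With gate == 0.0, tanh(gate) is 0, so the output equals frozen_block(x):
    the pre-trained behavior is untouched until training moves the gate.
    """
    mixed = [
        xi + math.tanh(gate) * vi
        for xi, vi in zip(x, new_visual_block(visual, weight))
    ]
    return frozen_block(mixed)


text_features = [1.0, 2.0]
visual_features = [5.0, 5.0]

# Gate closed: identical to the frozen language model on its own.
at_init = gated_layer(text_features, visual_features, gate=0.0, weight=0.3)

# Gate opened: visual information now influences the output.
after_training = gated_layer(text_features, visual_features, gate=1.0, weight=0.3)
```

In this sketch only `gate` and `weight` would be trained, which is why the expensive pre-trained part can stay frozen.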
“A single Flamingo model can achieve state-of-the-art results on a wide array of tasks, performing competitively with approaches requiring task-specific fine-tuning on orders of magnitude more examples, and often requiring hand-engineered ‘tricks,’” DeepMind’s Flamingo contributors said.
The potential uses of this machine learning model are readily apparent, and aren’t restricted to what Flamingo is able to do with data – the model could also help the general state of machine learning, which is facing a problem of growing energy and computing needs to train newer models. According to one estimate, a single Google BERT training session emitted the same amount of carbon as a trans-American jet flight.
DeepMind did not mention the energy costs of training a Flamingo model, though it does describe it as “computationally expensive to train.”
On the other hand, the paper said that Flamingo can be rapidly adapted to low-resource settings and for low-resource tasks, such as evaluating data for PII, social biases, stereotypes and other elements that can lead toward the oft-encountered issue of AI bias.
Despite that, Flamingo might not be anywhere near ready for prime time, and not because the model itself is bad: DeepMind admits limitations in few-shot training, namely that there are too many variables to account for when a training dataset is so small.
“There is no ‘golden’ few-shot method that would work well in all scenarios,” said the researchers behind Flamingo. ®