Allen Institute for AI researchers propose GPV-2: a web-supervised concept extension for general-purpose vision models


While much of the work in computer vision has focused on developing task-specific models, there has recently been a push toward more general vision systems. General Purpose Vision systems (GPVs), unlike specialized models, aim to learn a wide range of tasks natively, to generalize learned skills and concepts to new skill-concept combinations, and to learn new tasks quickly.

GPVs are currently trained and evaluated on heavily supervised datasets, such as COCO and VISUALGENOME, which expose the models to a variety of skill-concept combinations. Learning localization on COCO, for example, exposes a model to 80 concepts for that skill. In this paradigm, expanding the model’s conceptual vocabulary requires collecting fully supervised task data for each new concept.

Scaling up today’s manually annotated datasets to cover more than 10,000 concepts is impractical because high-quality annotation is so costly. Recent research from the Allen Institute for AI and the University of Illinois proposes an alternative with three parts: learn skills like localization and VQA from today’s vision datasets; learn a large number of concepts from image search engine data; and use architectures that can effectively transfer the learned concepts across the learned skills.

Image search engines combine text from the surrounding web pages, visual features extracted from the images, and click data collected from millions of users searching and selecting relevant results every day to deliver surprisingly good results for millions of queries. They often return high-quality, distraction-free, action-focused images that can be used to learn effective visual representations of concepts.

Significantly, searches can scale to thousands of queries quickly and inexpensively. The researchers spent just over $150 to collect a dataset of over a million images, dubbed WEB10K, which covers around 10,000 nouns, 300 verbs and 150 adjectives, as well as thousands of noun-verb and noun-adjective combinations.
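To make the idea of concept queries concrete, here is a minimal sketch of how nouns, noun-verb pairs and noun-adjective pairs could be turned into search queries that keep their labels attached. The function name `build_queries`, the record fields and the example words are all hypothetical; the article does not describe the actual query format or collection pipeline used for WEB10K.

```python
def build_queries(nouns, noun_verb_pairs, noun_adj_pairs):
    """Return de-duplicated query records, each keeping its concept labels.

    Hypothetical helper for illustration only; not the WEB10K pipeline.
    """
    records = []
    seen = set()

    def add(query, noun, verb=None, adj=None):
        # Skip queries that were already generated.
        if query not in seen:
            seen.add(query)
            records.append({"query": query, "noun": noun, "verb": verb, "adj": adj})

    for noun in nouns:
        add(noun, noun)
    for noun, verb in noun_verb_pairs:
        add(f"{noun} {verb}", noun, verb=verb)
    for noun, adj in noun_adj_pairs:
        add(f"{adj} {noun}", noun, adj=adj)
    return records


queries = build_queries(
    nouns=["dog", "bicycle"],
    noun_verb_pairs=[("dog", "running")],
    noun_adj_pairs=[("bicycle", "red")],
)
for record in queries:
    print(record["query"])
```

Because every record carries the noun, verb and adjective that produced the query, each image returned for that query could inherit those words as weak labels without any manual annotation.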

Although search engine data provides strong supervision only for the classification task, the research shows that current GPVs, GPV-1 and VL-T5, can learn concepts from web data and improve on other skills such as captioning. The researchers extend these models by proposing GPV-2, a powerful general-purpose vision system that supports a wider range of modalities (and therefore tasks).

GPV-2 can take an image, a task description and a bounding box (which lets the user point to a specific object or region of interest) as inputs, and can output text for any bounding box or for the full image. Thanks to these input and output modalities, GPV-2 can support a wide range of skills, including vision skills such as classification and localization, vision-language skills such as VQA and captioning, and specialized skills such as classification in context and human-object interaction detection.
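The input and output signature described above can be pictured with a small sketch. The class names, field names and task prompts below are assumptions made for illustration; they are not GPV-2's actual API or prompt wording.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class GPVRequest:
    """Hypothetical request: image, task description, optional query box."""
    image_path: str
    task_description: str
    query_box: Optional[Box] = None


@dataclass
class GPVResponse:
    """Hypothetical response: text for the full image and/or text per box."""
    image_text: Optional[str] = None
    boxes: List[Box] = field(default_factory=list)
    box_texts: List[str] = field(default_factory=list)


def example_requests(image_path: str) -> Dict[str, GPVRequest]:
    """Show how different skills could share one request shape."""
    return {
        "classification": GPVRequest(image_path, "What is this object?"),
        "localization": GPVRequest(image_path, "Locate the dog."),
        "vqa": GPVRequest(image_path, "What color is the bicycle?"),
        "captioning": GPVRequest(image_path, "Describe the image."),
        # Pointing at a region turns plain classification into
        # classification in context.
        "classification_in_context": GPVRequest(
            image_path, "What is this object?", query_box=(10.0, 20.0, 110.0, 220.0)
        ),
    }


requests = example_requests("photo.jpg")
```

The point of the sketch is only that a single request shape, with an optional box, is enough to express all of the skills listed above; the task is selected by the text description rather than by a task-specific head.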

All tasks in GPV-2 are cast as scoring, ranking or generation using the same text decoder applied to one or more image regions, so that all tasks share the same weights and representations. The researchers also propose a simple recalibration approach that down-weights labels that are overrepresented in training.
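As an illustration of what down-weighting overrepresented labels can look like, the sketch below applies a generic log-prior correction to a set of label scores. This is a standard technique swapped in for illustration and is not necessarily the recalibration mechanism used in GPV-2; the function name `recalibrate`, the `alpha` parameter and the example counts are all assumptions.

```python
import math


def recalibrate(scores, train_counts, alpha=1.0):
    """Subtract alpha * log(prior) from each label's log-score.

    Generic prior correction for illustration; not GPV-2's exact method.
    scores:       dict mapping label -> log-score from the model
    train_counts: dict mapping label -> number of training examples
    alpha:        strength of the correction (0 disables it)
    """
    total = sum(train_counts.values())
    num_labels = len(scores)
    adjusted = {}
    for label, score in scores.items():
        # Add-one smoothing gives labels unseen in training a small prior.
        prior = (train_counts.get(label, 0) + 1) / (total + num_labels)
        adjusted[label] = score - alpha * math.log(prior)
    return adjusted


# "dog" is common in the (made-up) training counts, "axolotl" is absent.
raw_scores = {"dog": -1.0, "axolotl": -1.2}
counts = {"dog": 1000}

uncorrected = recalibrate(raw_scores, counts, alpha=0.0)
corrected = recalibrate(raw_scores, counts, alpha=1.0)
print(max(uncorrected, key=uncorrected.get))
print(max(corrected, key=corrected.get))
```

With the correction disabled the frequent label wins; with it enabled, the penalty on the frequent label lets the rare label with a nearly equal raw score come out on top.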


The team evaluates these GPVs on three benchmarks: (i) the COCO-SCE and COCO benchmarks, which are designed to test skill-concept transfer and overall skill competence on the 80 primary COCO concepts across five skills; (ii) a new benchmark, DCE, which is based on the OPENIMAGES and VISUALGENOME datasets and assesses a broader set of concepts for the same five skills, now on 492 OPENIMAGES concepts instead of the 80 in COCO; and (iii) the WEB10K dataset, composed of images gathered from image search.

The results suggest that web data benefits all three GPVs. Additionally, GPV-2 beats both GPV-1 and VL-T5 on all these tasks and shows significant gains when using web data, especially for captioning and classification. In a zero-shot setting, GPV-2 also performs well on downstream tasks such as action and visual attribute recognition. Finally, the researchers show how GPV-2 can be chained to perform niche tasks such as human-object interaction detection without requiring task-specific architecture changes.

In summary, the main contributions are (a) WEB10K, a new web data source for learning over 10,000 visual concepts, with a human-verified VQA benchmark; (b) a demonstration of concept transfer from WEB10K to other tasks; (c) DCE, a benchmark covering five tasks and around 500 concepts for evaluating GPVs; and (d) GPV-2, an architecture that supports box and text modalities in both input and output and improves skill-concept transfer.

GPV-2 facilitates concept transfer from web data to skills, but the results also show that additional work is needed, especially for tasks such as VQA or localization, where progress could come from new architectures or new training protocols. The ability to handle other input modalities (e.g., video) and outputs (e.g., segmentation) would allow GPV-2 to serve even more tasks. Recent work in this area has shown promise and offers the possibility of extending the benefits of web data to a wider range of tasks.


As the vision community develops more general models, it becomes increasingly important to find effective ways to learn a wide range of skills and concepts. The proposed work revisits web-supervised learning in the context of GPVs, demonstrating that learning skills from task-specific datasets and concepts from the web is a cost-effective and efficient way to expand the set of concepts a model knows.

The code to download the WEB10K dataset is available on GitHub.




James G. Williams