A team of AI researchers proposes “GLIPv2”: a unified framework for learning vision-language (VL) representations that serves both localization tasks and VL understanding tasks

Source: https://arxiv.org/pdf/2206.05836v1.pdf
This article is written as a summary by Marktechpost Staff based on the paper 'GLIPv2: Unifying Localization and Vision-Language Understanding'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub.


With advances in object detection and recognition, understanding the context of elements in an image has become increasingly important. For example, if an umbrella and a person are recognized in an image, it is useful to know whether the person is carrying the umbrella. The search for solutions to such problems has increased interest in the development of general-purpose vision systems.

General-purpose vision systems, also known as vision foundation models, tackle multiple vision tasks simultaneously, such as image classification, object detection, and vision-language (VL) understanding. The integration of localization tasks (e.g., object detection and segmentation) with VL understanding tasks (e.g., VQA and image captioning) is particularly relevant. Unifying localization and understanding is a long-standing challenge that aims for mutual benefit, a streamlined pre-training procedure, and lower pre-training costs. However, the two types of tasks appear very different: localization tasks are vision-only and require fine-grained outputs (e.g., bounding boxes or pixel masks), while VL understanding tasks emphasize fusing the two modalities and require high-level semantic outputs (e.g., answers or captions).

Researchers prior to this study attempted to unify these tasks with a simple multitask approach, in which a low-level visual encoder is shared across tasks and two separate high-level branches are built for VL localization and understanding, respectively. In this design, the localization branch remains vision-only and does not benefit from the rich semantics of vision-language data.
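To make that baseline concrete, here is a minimal sketch (hypothetical, not the paper's code) of a shared low-level encoder with two separate high-level branches. The module names and dimensions are placeholders; the point is that the detection branch never sees the text, which is exactly the limitation the authors highlight.

```python
# Hypothetical sketch of the naive multitask baseline: one shared visual
# backbone, two disjoint heads. Not the paper's architecture.
import torch
import torch.nn as nn

class NaiveMultitaskModel(nn.Module):
    def __init__(self, visual_dim=256, text_dim=256, num_classes=80):
        super().__init__()
        # Shared low-level visual encoder (stand-in for a CNN/ViT backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, visual_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
        )
        # Branch 1: localization head (vision-only), predicts boxes + class logits.
        self.det_head = nn.Conv2d(visual_dim, 4 + num_classes, kernel_size=1)
        # Branch 2: VL understanding head, fuses pooled image and text features.
        self.vl_head = nn.Linear(visual_dim + text_dim, text_dim)

    def forward(self, images, text_feats=None):
        feats = self.backbone(images)            # shared low-level features
        det_out = self.det_head(feats)           # localization branch: no text input
        vl_out = None
        if text_feats is not None:
            pooled = feats.mean(dim=(2, 3))      # global image feature
            vl_out = self.vl_head(torch.cat([pooled, text_feats], dim=-1))
        return det_out, vl_out
```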


In this study, “VL grounding” is identified as a skill that requires both localization and understanding: it involves understanding an input sentence and localizing the mentioned entities in the image (see Figure 1). As a unified model for VL localization and understanding tasks, a grounded VL understanding model (GLIPv2) is constructed.

Localization + VL understanding = grounded VL understanding. Localization tasks involve both localization and semantic classification, and the classification part can be reformulated as a VL understanding problem by casting it as region-word matching; detection data is thereby converted into VL grounding data. The vast amount of VL understanding data (image-text pairs) can likewise be self-converted into VL grounding data. Accordingly, GLIPv2 uses a unified pre-training procedure in which all task data is converted into grounding data, and GLIPv2 is pre-trained to perform grounded VL understanding.
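A simple way to picture this conversion is sketched below: detection category names are concatenated into a text prompt, and a region is classified by matching its feature against the word features of each category name. This is a hedged illustration of the idea, not GLIPv2's actual implementation; the feature tensors are random stand-ins for the outputs of real image and text encoders.

```python
# Hedged sketch: detection data reformulated as grounding data, with
# classification expressed as region-word matching. Features are toy stand-ins.
import torch

categories = ["person", "umbrella", "car", "bicycle"]

# Detection annotations are rewritten as a grounding-style text prompt.
prompt = ". ".join(categories)   # "person. umbrella. car. bicycle"

num_regions, dim = 5, 256
region_feats = torch.randn(num_regions, dim)      # stand-in region features
word_feats = torch.randn(len(categories), dim)    # stand-in category-word features

# Classification as matching: region-word alignment scores replace the
# usual fixed-size classification logits.
alignment_scores = region_feats @ word_feats.T    # (num_regions, num_categories)
pred_category = alignment_scores.argmax(dim=-1)
print(prompt)
print(pred_category)
```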

A stronger VL grounding task: inter-image region-word contrastive learning. GLIP proposes phrase grounding as its pre-training task, but the authors argue that it is a relatively easy task that does not fully exploit the information in the grounding data.

Source: https://arxiv.org/pdf/2206.05836v1.pdf

In Figure 1, for example, the phrase grounding task only requires the model to match a given image region to one of the three phrases in the text input, i.e., the “green”, “pink striped”, or “plain white” umbrella. This one-of-three choice is relatively easy, requiring only an understanding of colors, and it discards much of the information in the grounding data: the umbrellas are not other colors, such as black or yellow, and the objects in these regions are umbrellas rather than other categories, such as cars or bicycles.
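The inter-image region-word contrastive objective mentioned above can be pictured roughly as follows: region features from every image in a batch are scored against word features from every text in the batch, so negatives come from other images and captions as well, not just a region's own caption. This is a simplified, hedged sketch with random stand-in features and a toy positive-pair assignment, not the paper's loss implementation.

```python
# Hedged sketch of inter-image region-word contrastive learning: regions from
# all images in a batch are contrasted against words from all texts in the
# batch, giving many more (and harder) negatives than within-image matching.
import torch
import torch.nn.functional as F

batch, regions_per_img, words_per_text, dim = 4, 8, 12, 256

region_feats = torch.randn(batch * regions_per_img, dim)  # all regions in the batch
word_feats = torch.randn(batch * words_per_text, dim)     # all words in the batch

# For each region, the index of its matching word across the whole batch would
# come from the grounding annotations (random here for the sketch).
positive_word_idx = torch.randint(0, batch * words_per_text, (batch * regions_per_img,))

# Similarities between every region and every word in the batch.
logits = region_feats @ word_feats.T / dim ** 0.5

# Contrastive loss: each region must pick out its positive word among all
# words from all images, so negatives span the whole batch.
loss = F.cross_entropy(logits, positive_word_idx)
print(loss.item())
```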

GLIPv2 yields mutual benefits between VL localization and understanding:

1) Experimental results show that a single GLIPv2 model achieves performance close to SoTA on various localization and understanding tasks.

2) GLIPv2 demonstrates improved zero-shot and few-shot transfer for open-world object detection and instance segmentation, evaluated on the LVIS dataset and the “Object Detection in the Wild” (ODinW) benchmark, thanks to the semantically rich annotations of image-text data.

3) GLIPv2 enables language-guided detection and segmentation, with new SoTA performance on the Flickr30K-entities phrase grounding and PhraseCut referring expression segmentation tasks.

4) Because GLIPv2 is inherently a grounding model, it yields VL understanding models with strong grounding ability that are self-explanatory and easy to debug. For example, when fine-tuned on VQA, GLIPv2 can answer questions while localizing the mentioned entities.

The code is available on GitHub, and a demo is available as a Colab notebook.

James G. Williams