Allen Institute for AI (AI2) Researchers Release “GRIT”: A General Robust Image Task Benchmark to Evaluate Computer Vision Model Performance

Most computer vision (CV) models are trained and evaluated on a small number of concepts, under the strong assumption that the images and annotations in the training and test sets are similarly distributed. As a result, despite significant advances in CV, the flexibility and generality of vision systems still fall short of humans' ability to learn from diverse sources and generalize to new data sources and tasks.

The lack of a consistent methodology and benchmark for measuring performance under distribution shift is one of the barriers to developing more flexible, general-purpose computer vision systems.

A new study by a team from the Allen Institute for AI (AI2) and the University of Illinois presents the General Robust Image Task Benchmark (GRIT), a unified benchmark for assessing the performance and robustness of CV models across a variety of image prediction tasks, concepts, and data sources. GRIT comprises seven tasks: object categorization, object localization, referring expressions, visual question answering, semantic segmentation, person keypoint estimation, and surface normal estimation. These tasks test a range of visual skills, including the ability to make predictions for concepts drawn from data sources or tasks not seen during training, robustness to image perturbations, and calibration of confidence.

Most existing benchmarks evaluate models only in an i.i.d. setting. In GRIT's restricted track, by contrast, models are tested on data sources and concepts that they are not allowed to see during training. To perform tasks involving new concepts, a model must be able to transfer skills across concepts.

Source: https://arxiv.org/pdf/2204.13653.pdf

GRIT also assesses robustness to image perturbations. Performance on each task is evaluated on a collection of examples with and without 20 distinct types of image distortions of varying strength, such as JPEG compression, motion blur, Gaussian noise, and universal adversarial perturbations.
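To make this setup concrete, the following Python sketch scores a model on clean and perturbed copies of each example and compares the averages. It is an illustration only: `model`, `metric_fn`, and the two example perturbations are hypothetical stand-ins, not GRIT's actual distortion pipeline or evaluation code.

```python
# Minimal sketch of clean-vs-perturbed evaluation in the spirit of GRIT.
# `model` and `metric_fn` are hypothetical stand-ins supplied by the caller.
import io
import numpy as np
from PIL import Image


def jpeg_compress(img: Image.Image, quality: int = 20) -> Image.Image:
    """Re-encode the image as low-quality JPEG to simulate compression artifacts."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def gaussian_noise(img: Image.Image, sigma: float = 15.0) -> Image.Image:
    """Add zero-mean Gaussian noise of strength `sigma` to the pixel values."""
    arr = np.asarray(img, dtype=np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0, 255)
    return Image.fromarray(noisy.astype(np.uint8))


def robustness_gap(model, metric_fn, samples, perturbations):
    """Average per-sample scores on clean images and on their perturbed copies."""
    clean_scores, perturbed_scores = [], []
    for image, target in samples:
        clean_scores.append(metric_fn(model(image), target))
        for perturb in perturbations:
            perturbed_scores.append(metric_fn(model(perturb(image)), target))
    return float(np.mean(clean_scores)), float(np.mean(perturbed_scores))
```

The difference between the two returned means gives a rough sense of how much a model degrades under distortion, which is the kind of gap GRIT's robustness evaluation is designed to expose.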

GRIT encourages the construction of large-scale vision models while still allowing fair comparison of models built with modest computational resources. GRIT's two tracks, Restricted and Unrestricted, help achieve this goal. The Unrestricted track makes it possible to build large models with few limits on the training data that may be used. The Restricted track, with its rich but fixed set of training data sources, lets researchers focus on skill-concept transfer and efficient learning. By limiting training data to publicly available sources, the Restricted track also levels the playing field in terms of computational resource requirements and access to the same data.

The following design principles guide all design decisions in the development of GRIT:

  1. Well-defined tasks: Researchers choose vision and vision-and-language tasks with a clear task definition and unambiguous ground truth. They do not include captioning, for example, because there are many valid ways to caption an image. For VQA, where a question may have multiple plausible answers, they follow the standard VQA evaluation methodology of collecting several candidate answers and applying answer-text normalization to reduce ambiguity as much as possible.
  2. Generality and robustness tests: Each task includes evaluation samples drawn from multiple data sources, covering concepts not found in the task's training data, with and without image distortions. This assesses the ability to transfer knowledge across data sources and concepts, as well as robustness to visual distortion.
  3. Conceptual diversity and balance: Task samples are chosen to represent a wide variety of equally distributed concepts. The team further grouped the objects (noun concepts) into 24 concept categories (e.g. animals, food, tools).
  4. Per-sample evaluation: All metrics are computed at the sample level so that they can be averaged over different subsets of the data (a minimal sketch of this, together with the calibration check in the next item, follows this list).
  5. Knowledge and calibration assessment: For each prediction, models must also output a confidence score, which is used to analyze what the model knows, how badly it is misinformed, and how well calibrated its beliefs are.
  6. Use of existing datasets: To ensure that annotations and tasks are well validated, the team sources tasks from existing, well-established datasets wherever possible and selects annotations from held-out or previously unused portions of those sources.
  7. Level playing field: A restricted track with a fixed set of publicly available training data sources is offered. It provides a fair comparison of submissions and enables researchers with limited data and computational resources to participate and contribute novel, robust, and efficient learning methods.
  8. Promoting unified models: All submissions must report the total number of parameters used in their models, which also serves as a simple, though imprecise, indicator of computational and sample efficiency. While participants may use entirely separate models for each task, they are encouraged to prefer models that share parameters across tasks and use fewer parameters overall.

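As a rough illustration of design principles 4 and 5 above, the sketch below computes per-sample scores, averages them over subsets (such as concept groups), and checks calibration of the reported confidence scores with a simple expected-calibration-error estimate. The record fields (`score`, `confidence`, `concept_group`) are assumed names for illustration and do not reflect GRIT's actual submission format or its official calibration metric.

```python
# Illustrative per-sample aggregation and calibration check; field names are assumptions.
import numpy as np


def subset_means(records, subset_key):
    """Average per-sample scores over subsets (e.g., concept group or data source)."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[subset_key], []).append(rec["score"])
    return {name: float(np.mean(scores)) for name, scores in groups.items()}


def expected_calibration_error(records, num_bins=10):
    """Bin samples by reported confidence and compare confidence to mean correctness."""
    conf = np.array([rec["confidence"] for rec in records], dtype=float)
    correct = np.array([rec["score"] for rec in records], dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece


# Toy usage with two hypothetical per-sample records.
records = [
    {"score": 1.0, "confidence": 0.9, "concept_group": "animals"},
    {"score": 0.0, "confidence": 0.7, "concept_group": "tools"},
]
print(subset_means(records, "concept_group"))
print(expected_calibration_error(records))
```

Because every metric is stored per sample, the same records can be re-aggregated over any subset of interest, such as unseen concepts, new data sources, or distorted images, without rerunning the model.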
The researchers believe these advances can be combined into more robust, general-purpose systems that, without requiring architectural changes, can withstand the distribution-shift issues that plague vision and vision-language models in the open world.

This article is written as a summary by Marktechpost staff based on the research paper 'GRIT: General Robust Image Task Benchmark'. All credit for this research goes to the researchers of this project. Check out the paper, GitHub, project, and reference article.


James G. Williams