Outside Knowledge Visual Question Answering Version 2
Abstract

Visual question answering (VQA) lies at the intersection of language and vision research. It functions as a building block for multimodal conversational AI and serves as a testbed for assessing a model's capability for open-domain scene understanding. While progress in this area was initially accelerated by the 2015 release of the popular and large "VQA" dataset, new datasets are required to sustain this research momentum. For example, the 2019 Outside Knowledge VQA dataset "OK-VQA" extends VQA with more challenging questions that require complex, factual, and commonsense knowledge. However, in our analysis, we found that 41.4% of the dataset needed to be corrected and 10.6% needed to be removed. This paper describes the analysis, corrections, and removals completed and presents a new dataset: OK-VQA Version 2.0. To gain insight into the impact of these changes on OK-VQA research, the paper presents results from state-of-the-art models retrained on the new dataset. The side-by-side comparisons show that one method in particular, the Knowledge Augmented Transformer for Vision-and-Language, extends its relative lead over competing methods.

Download & Citation
Authors
Benjamin Z. Reichman
Anirudh Sundar
Christopher Richardson
Tamara Zubatiy
Prithwijit Chowdhury
Aaryan Shah
Jack Truxal
Micah Grimes
Dristi Shah
Woo Ju Chee
Saif Punjwani
Atishay Jain
Larry Heck
Examples