SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

[AAAI-22] Main Track
Abstract: Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as strong reasoning ability. Recently, multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs into commonsense reasoning. To exploit the scene graph structure at the model level, we propose a multi-hop graph transformer that regularizes attention interaction among hops. For pre-training, we propose a scene-graph-aware pre-training method that leverages structural knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs from textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost over state-of-the-art methods and demonstrate the efficacy of each proposed component.
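The abstract's multi-hop graph transformer constrains cross-modality attention using the scene-graph structure. The snippet below is a minimal, hypothetical sketch of one way such a hop-based constraint could be realized: it is not the authors' implementation, and the function names, the single-head attention, and the max_hops parameter are illustrative assumptions.

```python
# Illustrative sketch (assumptions, not the SGEITL implementation):
# restrict self-attention between scene-graph nodes to k-hop neighborhoods.
import torch
import torch.nn.functional as F


def hop_distance_mask(adj: torch.Tensor, max_hops: int) -> torch.Tensor:
    """Boolean mask allowing attention only between nodes reachable within
    `max_hops` edges of the (symmetrized) scene graph; self-loops included."""
    n = adj.size(0)
    reach = torch.eye(n, dtype=torch.bool) | adj.bool()
    power = adj.bool()
    for _ in range(max_hops - 1):
        power = (power.float() @ adj) > 0  # nodes reachable in one more hop
        reach = reach | power
    return reach


def masked_self_attention(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over node features x
    (num_nodes, dim), with disallowed pairs masked out before the softmax."""
    d = x.size(-1)
    scores = (x @ x.transpose(0, 1)) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ x


# Toy scene graph with 4 objects: person-cup, cup-table, table-chair.
adj = torch.tensor([[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=torch.float)
adj = ((adj + adj.t()) > 0).float()  # symmetrize so attention is mutual

feats = torch.randn(4, 8)
out = masked_self_attention(feats, hop_distance_mask(adj, max_hops=2))
```

In a layered model, one could vary max_hops per layer so that earlier layers attend only to immediate neighbors and deeper layers see wider graph context; this layer-wise scheduling is likewise an assumption for illustration.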


Sessions where this paper appears

  • Poster Session 6 (Red 4)

  • Poster Session 7 (Red 4)