Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
11-2012
Abstract
In real life, it is easier to provide a visual cue when asking a question about a possibly unfamiliar topic, for example, asking the question, “Where was this crop circle found?”. Providing an image of the instance is far more convenient than texting a verbose description of its visual properties, especially when the name of the query instance is not known. Nevertheless, having to identify the visual instance before processing the question and eventually returning the answer makes multimodal question-answering technically challenging. This paper addresses the problem of visual-to-text naming through the paradigm of answering-by-search in a two-stage computational framework, composed of instance search (IS) and similar question ranking (QR). In IS, names of the instances are inferred from similar visual examples searched through a million-scale image dataset. To recall instances of non-planar and non-rigid shapes, spatial configurations are incorporated that emphasize topological consistency while allowing for local variations in matches. In QR, the candidate names of the instance are statistically identified from the search results and directly utilized to retrieve similar questions from community-contributed QA (cQA) archives. By parsing questions into syntactic trees, a fuzzy matching between the inquirer’s question and cQA questions is performed to locate answers and recommend related questions to the inquirer. The proposed framework is evaluated on a wide range of visual instances (e.g., fashion, art, food, pet, logo, and landmark) over various QA categories (e.g., factoid, definition, how-to, and opinion).
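The two-stage pipeline in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: all function names are hypothetical, the image-search step is replaced by pre-retrieved tags of similar images, and simple string similarity stands in for the paper's syntactic-tree matching.

```python
from collections import Counter
from difflib import SequenceMatcher

def infer_instance_names(retrieved_tags, top_n=1):
    """Stage 1 (IS), simplified: statistically identify candidate
    instance names from tags of visually similar images returned
    by a (here, assumed) large-scale image search."""
    counts = Counter(tag for tags in retrieved_tags for tag in tags)
    return [name for name, _ in counts.most_common(top_n)]

def rank_similar_questions(user_question, candidate_names, cqa_questions):
    """Stage 2 (QR), simplified: retrieve cQA questions that mention
    a candidate name, then rank them by fuzzy similarity to the
    inquirer's question (a stand-in for syntactic-tree matching)."""
    hits = [q for q in cqa_questions
            if any(name in q.lower() for name in candidate_names)]
    return sorted(hits,
                  key=lambda q: SequenceMatcher(
                      None, user_question.lower(), q.lower()).ratio(),
                  reverse=True)

# Toy walk-through: tags attached to the top-ranked similar images
# for a snapped photo of an unknown instance.
tags_of_similar_images = [["crop circle", "field"],
                          ["crop circle"],
                          ["wheat", "crop circle"]]
names = infer_instance_names(tags_of_similar_images)
ranked = rank_similar_questions(
    "Where was this found?", names,
    ["Where was the first crop circle found?",
     "How do I bake bread?"])
print(names[0])   # crop circle
print(ranked[0])  # Where was the first crop circle found?
```

The sketch mirrors the framework's flow: visual search yields candidate names, and those names bridge the image modality to text retrieval over a cQA archive.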
Keywords
multimedia question answering, similar question search, visual instance search
Discipline
Graphics and Human Computer Interfaces | Theory and Algorithms
Research Areas
Intelligent Systems and Optimization
Publication
Proceedings of the 20th ACM international conference on Multimedia, MM 2012, Nara, Japan, October 29 - November 2
First Page
609
Last Page
618
ISBN
9781450310895
Identifier
10.1145/2393347.2393432
Publisher
ACM
City or Country
Nara, Japan
Citation
ZHANG, Wei; PANG, Lei; and NGO, Chong-wah.
Snap-and-ask: Answering multimodal question by naming visual instance. (2012). Proceedings of the 20th ACM international conference on Multimedia, MM 2012, Nara, Japan, October 29 - November 2. 609-618.
Available at: https://ink.library.smu.edu.sg/sis_research/6441
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.