Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

May 2025

Abstract

The rapid advancement and proliferation of pre-trained vision-language models (VLMs) have opened unprecedented opportunities and challenges in artificial intelligence (AI). This dissertation critically evaluates the performance and limitations of pre-trained VLMs, particularly in complex contexts that test the bounds of their capabilities. Our focus is twofold: to assess the extent of bias embedded in these models, and to scrutinize their reasoning abilities, highlighting both parallels and disparities between machine and human cognition.

We begin with a targeted investigation into the stereotypical biases ingrained in VLMs, uncovering the subtle yet pervasive ways in which these biases manifest. To support this analysis, we introduce VLStereoSet, a carefully curated dataset designed to probe and quantify the susceptibility of models to stereotypical bias. Our probing tasks and tailored metrics provide a multi-faceted framework for measuring stereotypical bias across several dimensions, covering both overall tendencies and specific intra- and inter-modal biases. In experiments on six representative VLMs, we document the prevalence of stereotypical biases, ranging from gender biases to a spectrum of social stereotypes, revealing their widespread presence in state-of-the-art models.

Building on this foundation, the second strand of our research examines the reasoning capacities of VLMs, particularly their ability to make sense of scenarios that deviate from common sense and conventional knowledge. Whereas human cognition readily interprets and rationalizes counter-intuitive scenarios, current models struggle to do the same. We therefore introduce ROME (reasoning beyond commonsense knowledge), a challenging dataset crafted to test the ability of VLMs to reason in unconventional and unexpected situations. Our experimental results expose the gap between human-like reasoning and the current capabilities of AI models, underscoring the need to enhance the reasoning faculties of AI systems so that they can think beyond the confines of common sense toward a more nuanced understanding of the world.

Extending this investigation, we examine the challenge of underspecification in VLMs, where ambiguous statements require contextual disambiguation through visual grounding. To this end, we introduce FOCUS (Fully Observed Context with Underspecified Sentences), a dataset designed to probe and analyze underspecification. The dataset, comprising 2000 image-text pairs, enables a robust evaluation by pairing each instance with two image representations to assess model biases toward specific interpretations. Empirical results across a range of VLMs, including proprietary models such as GPT-4o-mini, reveal systematic failures in resolving underspecified inputs, highlighting a critical gap in model interpretability. These findings underscore the need to develop AI systems that align more closely with human interpretative processes, ensuring more reliable and context-aware reasoning in real-world applications.

In summary, this dissertation provides a comprehensive evaluation of pre-trained VLMs, critically examining their susceptibility to bias and their capacity for advanced reasoning. Through rigorous experimentation and analysis, we identify the shortcomings of these models and point toward future advancements that bring us closer to realizing the full potential of AI.

Degree Awarded

PhD in Computer Science

Discipline

Programming Languages and Compilers

Supervisor(s)

JIANG, Jing

First Page

1

Last Page

100

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
