Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

9-2024

Abstract

Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few studies have gauged their ability to generate visual instruction tuning data. This paper explores the potential of empowering MLLMs to generate data independently, without relying on GPT-4. We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. We also outline two modes of data generation, task-agnostic and task-specific, which enable controllable output. We demonstrate that training LLaVA1.5 with a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks. Likewise, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, we find that: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; and (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations.

Keywords

Large Language Models, LLMs, Data generation pipeline, Data generators, MLLMs, Multimodal Large Language Models

Discipline

Artificial Intelligence and Robotics | Computer Sciences

Research Areas

Data Science and Engineering; Intelligent Systems and Optimization

Publication

18th European Conference on Computer Vision (ECCV 2024) : Milan, Italy, September 29 - October 4

Identifier

10.48550/arXiv.2312.06731

Publisher

European Conference on Computer Vision

City or Country

Milan, Italy

Citation

ZHAO, Henry Hengyuan; ZHOU, Pan; and SHOU, Mike Zheng. Genixer : Empowering multimodal Large Language Models as a powerful data generator. (2024). 18th European Conference on Computer Vision (ECCV 2024) : Milan, Italy, September 29 - October 4.
Available at: https://ink.library.smu.edu.sg/sis_research/9600

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Artificial Intelligence and Robotics Commons

Research Collection School Of Computing and Information Systems

Genixer : Empowering multimodal Large Language Models as a powerful data generator
