Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
10-2025
Abstract
Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to resolve data conflicts through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only 1.16× the inference time of the standard LoRA method (with injection into the query and value projection layers), and just 73% of the inference time of a 4-expert LoRA-MoE. Extensive experiments on various downstream tasks and general MLLM benchmarks validate the effectiveness of our proposed methods. Our project page is publicly available at https://github.com/pengkun-jiao/DualLoRA
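The abstract's dual-subspace idea can be illustrated with a minimal sketch: a frozen base projection augmented by a "skill" low-rank branch plus a "task" branch whose ranks are gated. This is a hypothetical NumPy reading of the abstract, not the authors' implementation; all names (`lora_branch`, `dual_lora_forward`, `rank_gate`) and the sigmoid rank-gating are assumptions.

```python
import numpy as np

def lora_branch(x, A, B, scale=1.0):
    # Standard low-rank update: (x A^T) B^T, with A: (r, d_in), B: (d_out, r).
    return scale * (x @ A.T) @ B.T

def dual_lora_forward(x, W, A_s, B_s, A_t, B_t, rank_gate):
    """Base projection plus a holistic 'skill' branch and a gated 'task' branch.

    rank_gate: (r,) logits; sigmoid(rank_gate) selects which ranks of the
    task subspace are active (one possible reading of 'rank-rectified').
    """
    base = x @ W.T                      # frozen pretrained projection
    skill = lora_branch(x, A_s, B_s)    # holistic knowledge retention
    h = x @ A_t.T                       # project into task rank space
    h = h * (1.0 / (1.0 + np.exp(-rank_gate)))  # per-rank rectification
    task = h @ B_t.T                    # locally activated task update
    return base + skill + task
```

With both B matrices initialized to zero (as in standard LoRA), the adapted layer starts out identical to the frozen base projection.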
Discipline
Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2025 International Conference on Computer Vision, Honolulu, Hawaii, October 19-23
First Page
1
Last Page
10
City or Country
Honolulu, Hawai'i, USA
Citation
JIAO, Pengkun; ZHU, Bin; CHEN, Jingjing; NGO, Chong-wah; and JIANG, Yugang.
From holistic to localized: Local enhanced adapters for efficient visual instruction fine-tuning. (2025). Proceedings of the 2025 International Conference on Computer Vision, Honolulu, Hawaii, October 19-23. 1-10.
Available at: https://ink.library.smu.edu.sg/sis_research/10473
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.