Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
2-2023
Abstract
Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune the pre-trained models to adapt to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt to downstream tasks with a joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
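The document-level cross-modal contrastive loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it is a generic symmetric InfoNCE-style loss written with NumPy. The function names, the temperature value of 0.07, and the assumption that row i of each embedding matrix comes from the same document are all illustrative choices, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch (illustrative only).

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to come from the same document (the positive pair).
    temperature: hypothetical default, not a value from the paper.
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    text_emb = l2_normalize(np.asarray(text_emb, dtype=np.float64))

    # Pairwise similarities: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])

    # Image-to-text: each image should pick its own text among all texts.
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each text should pick its own image among all images.
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    aligned = np.eye(4)
    shuffled = np.roll(aligned, 1, axis=0)
    print(cross_modal_contrastive_loss(aligned, aligned))
    print(cross_modal_contrastive_loss(aligned, shuffled))
```

In a joint objective of the kind the abstract describes, a loss like this would be added to the task-specific supervised loss, presumably with a weighting factor; how AETNet weights and combines its terms is not stated in this record.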
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Publication
AAAI '23: 37th AAAI Conference on Artificial Intelligence, Washington, DC, February 7-14
Volume
37
First Page
2590
Last Page
2598
ISBN
9781577358800
Publisher
AAAI Press
City or Country
Washington
Citation
WANG, Lei; HE, Jiabang; XU, Xing; LIU, Ning; and LIU, Hui.
Alignment-enriched tuning for patch-level pre-trained document image models. (2023). AAAI '23: 37th AAAI Conference on Artificial Intelligence, Washington, DC, February 7-14. 37, 2590-2598.
Available at: https://ink.library.smu.edu.sg/sis_research/9318
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.