Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

2-2023

Abstract

Alignment between image and text has shown promising improvements on patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune the pre-trained models adaptive to downstream tasks with alignment objectives and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) upon pre-trained document image models, to adapt downstream tasks with the joint task-specific supervised and alignment-aware contrastive objective. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in the following three aspects: 1) document-level alignment by leveraging the cross-modal and intra-modal contrastive loss; 2) global-local alignment for modeling localized and structural information in document images; and 3) local-level alignment for more accurate patch-level information. Experiments on various downstream tasks show that AETNet can achieve state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Publication

AAAI '23: 37th AAAI Conference on Artificial Intelligence, Washington, DC, February 7-14

Volume

37

First Page

2590

Last Page

2598

ISBN

9781577358800

Publisher

AAAI Press

City or Country

Washington

Citation

WANG, Lei; HE, Jiabang; XU, Xing; LIU, Ning; and LIU, Hui. Alignment-enriched tuning for patch-level pre-trained document image models. (2023). AAAI '23: 37th AAAI Conference on Artificial Intelligence, Washington, DC, February 7-14. 37, 2590-2598.
Available at: https://ink.library.smu.edu.sg/sis_research/9318

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Included in

Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons

Research Collection School Of Computing and Information Systems

Alignment-enriched tuning for patch-level pre-trained document image models
