Attribute-centric cross-modal alignment for weakly supervised text-based person re-ID
Publication Type
Journal Article
Publication Date
9-2025
Abstract
Weakly supervised text-based person re-identification (Text-ReID) aims to match person images with textual descriptions in the absence of identity annotations during training. Traditional approaches, which rely solely on global features, overlook the rich, fine-grained information present in both the text and image modalities. Moreover, merely aligning features at the semantic level is insufficient because the two modalities occupy significantly different feature representation spaces. Existing methods also neglect the information inequality caused by person-irrelevant factors in images. In this paper, we introduce a novel framework, Attribute-Centric Cross-modal Alignment (ACCA), designed to overcome these issues. Our approach centers on two components: visual-text attribute alignment and prediction distribution alignment. To capture fine-grained information without identity labels, we develop a visual-text attribute alignment method based on momentum contrastive learning that embeds visual and textual attribute features in a unified space. We further propose a negative sample filtering and enrichment strategy that builds robust, comprehensive negative attribute sample spaces to support this alignment. In addition, we establish two label-free prediction distribution alignment methods that encourage modality-invariant feature representations. The first, bias-reduction distribution alignment, aligns features and predictions within each text-image pair using semantic information from the text, reducing the impact of person-irrelevant factors in images. The second, global-attribute distribution alignment, strengthens the interaction between global and local prediction distributions across the visual and textual modalities. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets validate the superior performance of our method across all standard benchmarks.
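To make the momentum-contrastive attribute alignment described above concrete, the sketch below shows one common way such an objective is implemented in PyTorch: a gradient-updated query encoder for visual attribute features, an EMA-updated key encoder for textual attribute features, and a queue of past keys as negatives. This is a minimal illustration in the spirit of the abstract, not the authors' code; all names and hyperparameters (`dim`, `queue_size`, `momentum`, `temperature`, the linear encoder stubs) are assumptions.

```python
import torch
import torch.nn.functional as F


class MomentumAttributeAligner(torch.nn.Module):
    """Hypothetical MoCo-style alignment of visual/textual attribute features."""

    def __init__(self, dim=256, queue_size=4096, momentum=0.999, temperature=0.07):
        super().__init__()
        self.m, self.t = momentum, temperature
        # Query encoder (updated by gradients) and key encoder (updated by EMA).
        self.enc_q = torch.nn.Linear(dim, dim)
        self.enc_k = torch.nn.Linear(dim, dim)
        self.enc_k.load_state_dict(self.enc_q.state_dict())
        for p in self.enc_k.parameters():
            p.requires_grad = False
        # Queue of past textual-attribute keys, used as negative samples.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.enc_q.parameters(), self.enc_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        n, i = keys.shape[0], int(self.ptr)
        self.queue[i:i + n] = keys  # assumes queue_size is a multiple of the batch
        self.ptr[0] = (i + n) % self.queue.shape[0]

    def forward(self, vis_attr, txt_attr):
        """vis_attr, txt_attr: (B, dim) paired attribute features."""
        q = F.normalize(self.enc_q(vis_attr), dim=1)       # visual queries
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.enc_k(txt_attr), dim=1)   # textual keys
        l_pos = (q * k).sum(dim=1, keepdim=True)           # (B, 1) positive logits
        l_neg = q @ self.queue.t()                         # (B, K) negative logits
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        self._enqueue(k)
        return F.cross_entropy(logits, labels)
```

In a real system, the negative filtering and enrichment strategy from the paper would operate on this queue, pruning false negatives (entries sharing attributes with the query) before the logits are computed.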
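The label-free prediction distribution alignment can likewise be illustrated with a simple loss. One standard formulation, sketched below under the assumption of a shared class space C for both modalities (e.g. pseudo-identities or attribute bins; this choice is not specified in the abstract), is a symmetric KL divergence between the visual and textual prediction distributions of each pair.

```python
import torch.nn.functional as F


def distribution_alignment_loss(vis_logits, txt_logits, tau=1.0):
    """Encourage modality-invariant predictions without identity labels.

    vis_logits, txt_logits: (B, C) logits for a paired image/text batch.
    tau: softmax temperature (hypothetical hyperparameter).
    """
    log_p_v = F.log_softmax(vis_logits / tau, dim=1)
    log_p_t = F.log_softmax(txt_logits / tau, dim=1)
    # Symmetric KL, computed in log space for numerical stability.
    kl_vt = F.kl_div(log_p_v, log_p_t, log_target=True, reduction="batchmean")
    kl_tv = F.kl_div(log_p_t, log_p_v, log_target=True, reduction="batchmean")
    return 0.5 * (kl_vt + kl_tv)
```

The paper's bias-reduction and global-attribute variants would refine this basic idea, respectively weighting the alignment by text-derived semantics to suppress person-irrelevant image factors, and coupling global with attribute-level distributions.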
Keywords
Pedestrians, Visualization, Semantics, Training, Hair, Contrastive learning, Artificial intelligence, Lighting, Legged locomotion, Identification of persons
Discipline
Graphics and Human Computer Interfaces
Research Areas
Intelligent Systems and Optimization
Publication
IEEE Transactions on Multimedia
First Page
1
Last Page
15
ISSN
1520-9210
Identifier
10.1109/TMM.2025.3608947
Publisher
Institute of Electrical and Electronics Engineers
Citation
XU, Jiajia; CAI, Weiwei; XU, Xuemiao; XIE, Yi; ZHANG, Huaidong; and HE, Shengfeng.
Attribute-centric cross-modal alignment for weakly supervised text-based person re-ID. (2025). IEEE Transactions on Multimedia. 1-15.
Available at: https://ink.library.smu.edu.sg/sis_research/10809
Additional URL
https://doi.org/10.1109/TMM.2025.3608947