Attribute-centric cross-modal alignment for weakly supervised text-based person re-ID

Publication Type

Journal Article

Publication Date

9-2025

Abstract

Weakly supervised text-based person re-identification (Text-ReID) confronts the challenge of matching target person images with textual descriptions without identity annotations during training. Traditional approaches, which rely solely on global features, overlook the rich, fine-grained information within both the text and image modalities. Moreover, merely aligning features at the semantic level is insufficient because of the significant differences between the feature representation spaces of the two modalities. Existing methods also neglect the information inequality caused by person-irrelevant factors in images. In this paper, we introduce a novel framework called Attribute-Centric Cross-modal Alignment (ACCA), specifically designed to overcome these issues. Our approach concentrates on two main aspects: visual-text attribute alignment and prediction distribution alignment. To effectively capture fine-grained information without identity labels, we implement a visual-text attribute alignment method based on momentum contrastive learning to synchronize visual and textual attribute features within a unified embedding space. We also propose a unique strategy for negative sample filtering and enrichment, creating robust and comprehensive negative attribute sample spaces to support the attribute alignment. Additionally, we establish two label-free prediction distribution alignment methods that encourage the learning of invariant feature representations across modalities. The first, bias-reduction distribution alignment, aligns features and predictions within each text-image pair by exploiting semantic information from the text, reducing the impact of person-irrelevant factors in images. The second, global-attribute distribution alignment, enhances the interaction between global and local prediction distributions across the visual and textual modalities. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate that our method achieves superior performance across all standard benchmarks.
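For readers unfamiliar with momentum contrastive alignment across modalities, the sketch below illustrates the general idea in PyTorch: visual and textual attribute features are projected into a shared embedding space, contrastive keys come from momentum (EMA) copies of the projection heads, and a queue of past keys supplies negative attribute samples. This is a minimal, assumption-laden illustration (ALBEF/MoCo-style symmetric InfoNCE), not the paper's ACCA implementation; all module names, queue sizes, and hyperparameters are hypothetical.

```python
# Minimal sketch of momentum contrastive visual-text attribute alignment.
# Hypothetical names and settings; not the authors' ACCA code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAlignment(nn.Module):
    def __init__(self, dim=256, queue_size=4096, m=0.999, tau=0.07):
        super().__init__()
        self.m, self.tau = m, tau
        # Gradient-updated projection heads for each modality.
        self.vis_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        # Momentum (EMA) copies that produce the contrastive keys.
        self.vis_proj_m = copy.deepcopy(self.vis_proj)
        self.txt_proj_m = copy.deepcopy(self.txt_proj)
        for p in list(self.vis_proj_m.parameters()) + list(self.txt_proj_m.parameters()):
            p.requires_grad = False
        # Queues of past keys serving as negative attribute samples.
        self.register_buffer("vis_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for net, net_m in [(self.vis_proj, self.vis_proj_m), (self.txt_proj, self.txt_proj_m)]:
            for p, p_m in zip(net.parameters(), net_m.parameters()):
                p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, vis_k, txt_k):
        # Assumes queue_size is a multiple of the batch size.
        n, ptr = vis_k.size(0), int(self.ptr)
        self.vis_queue[ptr:ptr + n] = vis_k
        self.txt_queue[ptr:ptr + n] = txt_k
        self.ptr[0] = (ptr + n) % self.vis_queue.size(0)

    def forward(self, vis_attr, txt_attr):
        """vis_attr, txt_attr: (B, dim) pooled attribute features of matched image-text pairs."""
        self._momentum_update()
        v_q = F.normalize(self.vis_proj(vis_attr), dim=1)
        t_q = F.normalize(self.txt_proj(txt_attr), dim=1)
        with torch.no_grad():
            v_k = F.normalize(self.vis_proj_m(vis_attr), dim=1)
            t_k = F.normalize(self.txt_proj_m(txt_attr), dim=1)

        def info_nce(q, k, neg_queue):
            pos = (q * k).sum(dim=1, keepdim=True)   # (B, 1) matched-pair similarity
            neg = q @ neg_queue.t()                   # (B, K) similarity to queued negatives
            logits = torch.cat([pos, neg], dim=1) / self.tau
            target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
            return F.cross_entropy(logits, target)

        # Symmetric image-to-text and text-to-image alignment loss.
        loss = info_nce(v_q, t_k, self.txt_queue) + info_nce(t_q, v_k, self.vis_queue)
        self._enqueue(v_k, t_k)
        return loss
```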

Keywords

Pedestrians, Visualization, Semantics, Training, Hair, Contrastive learning, Artificial intelligence, Lighting, Legged locomotion, Identification of persons

Discipline

Graphics and Human Computer Interfaces

Research Areas

Intelligent Systems and Optimization

Publication

IEEE Transactions on Multimedia

First Page

1

Last Page

15

ISSN

1520-9210

Identifier

10.1109/TMM.2025.3608947

Publisher

Institute of Electrical and Electronics Engineers

Additional URL

https://doi.org/10.1109/TMM.2025.3608947
