Bridging the modality gap: Advancing multimodal human pose estimation with modality-adaptive pose estimator and novel benchmark datasets
Publication Type
Conference Proceeding Article
Publication Date
4-2025
Abstract
Visual and infrared images represent two indispensable modalities that complement each other, offering unique insights into human pose estimation under different lighting conditions. However, existing efforts have predominantly focused on a single modality, leading to significant challenges when transitioning to multimodal environments. The performance degradation observed in state-of-the-art models on multimodal images can be attributed to the substantial modality gap and the absence of multimodal benchmarks. To address this critical gap, we introduce novel visible-infrared multimodal human pose datasets in which the images of the two modalities are well balanced and accurately labeled. Leveraging these datasets, we establish a comprehensive benchmark to facilitate rigorous analysis and enhancement of multimodal human pose estimation techniques. Our findings underscore the limitations that modality variance imposes on state-of-the-art methods. To overcome this challenge, we propose a method-agnostic scheme called Modality-Adaptive Pose Estimation, designed to integrate seamlessly into existing approaches. By employing Modality-Specific Batch Normalization and a Modality-Adaptive Loss, our approach enhances feature interactions between the two modalities, yielding superior performance. Extensive experiments conducted with popular baseline methods demonstrate the efficacy of our proposed approach in achieving state-of-the-art results on both modalities. We believe that our benchmarks offer a robust platform for investigating robustness and will significantly contribute to advancing research in this field.
Keywords
Human pose estimation, Multimodal, Visible, Infrared, Benchmark
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Intelligent Systems and Optimization
Publication
Proceedings of the 13th International Conference, CVM 2025, Hong Kong, China, April 19-21
First Page
125
Last Page
153
Identifier
10.1007/978-981-96-5815-2_8
Publisher
Springer
City or Country
Cham
Citation
XIA, Jiangnan; ZHANG, Zhiyuan; GUO, Yanyin; WU, Qilong; LI, Yi; CHENG, Jianghan; and LI, Junwei.
Bridging the modality gap: Advancing multimodal human pose estimation with modality-adaptive pose estimator and novel benchmark datasets. (2025). Proceedings of the 13th International Conference, CVM 2025, Hong Kong, China, April 19-21. 125-153.
Available at: https://ink.library.smu.edu.sg/sis_research/10173
Additional URL
https://doi.org/10.1007/978-981-96-5815-2_8