Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2025
Abstract
In an NBA game scenario, consider the challenge of locating and analyzing the 3D poses of players performing a user-specified action, such as attempting a shot. Traditional 3D human pose estimation (3DHPE) methods often fall short in such complex, multi-person scenes due to their lack of semantic integration and reliance on isolated pose data. To address these limitations, we introduce Language-Driven 3D Human Pose Estimation (L3DHPE), a novel approach that extends 3DHPE to general multi-person contexts by incorporating detailed language descriptions. We present Panoptic-L3D, the first dataset designed for L3DHPE, featuring 3,838 linguistic annotations for 1,476 individuals across 588 videos, with 6,035 masks and 91k frame-level 3D skeleton annotations. Additionally, we propose Cascaded Pose Perception (CPP), a benchmarking method that simultaneously performs language-driven mask segmentation and 3D pose estimation within a unified model. CPP first learns 2D pose information, utilizes a body fusion module to aid in mask segmentation, and employs a mask fusion module to mitigate mask noise before outputting 3D poses. Extensive evaluation of CPP and existing benchmarks on Panoptic-L3D demonstrates the necessity of this novel task and dataset for advancing 3DHPE. Our dataset is available at https://languagedriven3dposeestimation.github.io/.
Keywords
3D human pose estimation, text-motion interaction
Discipline
Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces
Areas of Excellence
Digital transformation
Publication
MM '25: Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, October 27-31
First Page
12761
Last Page
12768
Identifier
10.1145/3746027.3758216
Publisher
ACM
City or Country
New York
Citation
SHEN, Tingrui; LIU, Bangzhen; FAN, Zhirun; ZHANG, Shiting; PAN, Weifeng; FAN, Sun; CAO, Dan; and HE, Shengfeng.
Language-driven 3D human pose estimation in multi-person scenarios: A new dataset and approach. (2025). MM '25: Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, October 27-31. 12761-12768.
Available at: https://ink.library.smu.edu.sg/sis_research/10793
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3746027.3758216