Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

10-2025

Abstract

In an NBA game scenario, consider the challenge of locating and analyzing the 3D poses of players performing a user-specified action, such as attempting a shot. Traditional 3D human pose estimation (3DHPE) methods often fall short in such complex, multi-person scenes due to their lack of semantic integration and reliance on isolated pose data. To address these limitations, we introduce Language-Driven 3D Human Pose Estimation (L3DHPE), a novel approach that extends 3DHPE to general multi-person contexts by incorporating detailed language descriptions. We present Panoptic-L3D, the first dataset designed for L3DHPE, featuring 3,838 linguistic annotations for 1,476 individuals across 588 videos, with 6,035 masks and 91k frame-level 3D skeleton annotations. Additionally, we propose Cascaded Pose Perception (CPP), a benchmarking method that simultaneously performs language-driven mask segmentation and 3D pose estimation within a unified model. CPP first learns 2D pose information, utilizes a body fusion module to aid in mask segmentation, and employs a mask fusion module to mitigate mask noise before outputting 3D poses. Extensive evaluation of CPP and existing benchmarks on Panoptic-L3D demonstrates the necessity of this novel task and dataset for advancing 3DHPE. Our dataset is available at https://languagedriven3dposeestimation.github.io/.

Keywords

3D human pose estimation, text-motion interaction

Discipline

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Areas of Excellence

Digital transformation

Publication

MM '25: Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, October 27-31

First Page

12761

Last Page

12768

Identifier

10.1145/3746027.3758216

Publisher

ACM

City or Country

New York

Additional URL

https://doi.org/10.1145/3746027.3758216
