Despite progress in human motion capture, existing multi-view methods often struggle to estimate the 3D pose and shape of multiple closely interacting people. This difficulty arises from their reliance on accurate 2D joint estimates, which are hard to obtain under the occlusions and body contact that close interaction produces. To address this, we propose a novel method that leverages the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering losses, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.
Given a dynamic scene captured by a sparse set of RGB cameras, our goal is to estimate the 3D pose and shape of multiple people, even when they interact closely. To address this challenging task, our key idea is to first reconstruct a personalized avatar for each individual in the scene and then use these avatars as a strong prior, alternately refining appearance and pose.
We first introduce an efficient pipeline to create avatars of multiple people in a scene. Specifically, we leverage an accelerated neural radiance field to represent the shape and appearance of each individual in canonical space and deform it at an interactive rate. We then adapt layered volume rendering to our pipeline, which composites the rendering of avatars into one image, thus enabling direct learning from multi-view video inputs.
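The compositing step described above can be illustrated for a single camera ray. The following is a minimal NumPy sketch, not the authors' implementation: it assumes each avatar has already been sampled along the ray, merges all samples by depth, and applies standard volume-rendering alpha compositing. The function name `composite_layered_ray` and the `far` parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np


def composite_layered_ray(depths, densities, colors, layer_ids, num_layers, far):
    """Composite samples from several avatars along one ray (illustrative only).

    depths:     (N,) sample depths along the ray, in any order
    densities:  (N,) non-negative volume densities
    colors:     (N, 3) RGB value per sample
    layer_ids:  (N,) integer index of the avatar that produced each sample
    num_layers: number of avatars in the scene
    far:        depth at which the ray ends (assumed >= max depth)
    """
    depths = np.asarray(depths, dtype=float)
    densities = np.asarray(densities, dtype=float)
    colors = np.asarray(colors, dtype=float)
    layer_ids = np.asarray(layer_ids, dtype=int)

    # Merge the samples of all avatars into one front-to-back sequence.
    order = np.argsort(depths)
    t, sigma = depths[order], densities[order]
    rgb, ids = colors[order], layer_ids[order]

    # Length of the interval each sample covers; the last one runs to `far`.
    deltas = np.diff(np.append(t, far))
    alpha = 1.0 - np.exp(-sigma * deltas)

    # Transmittance: fraction of light that reaches each sample unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = trans * alpha

    pixel = weights @ rgb  # composited RGB for this ray
    # Per-avatar opacity, usable as a soft silhouette for each person.
    masks = np.bincount(ids, weights=weights, minlength=num_layers)
    return pixel, masks


# Two avatars on one ray: avatar 0 (red) stands in front of avatar 1 (blue).
pixel, masks = composite_layered_ray(
    depths=[2.0, 1.0],
    densities=[50.0, 50.0],
    colors=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
    layer_ids=[1, 0],
    num_layers=2,
    far=3.0,
)
```

Because the samples are sorted jointly rather than per avatar, the nearer avatar occludes the farther one, which is what lets a single composited image be compared directly with the captured frame.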
Thanks to the learned avatar prior for each individual, we can enhance 3D pose optimization via a combination of RGB and silhouette rendering losses. While previous work relies heavily on noisy 2D joint detections, we show that employing such pixel-wise color and silhouette information can substantially improve precision and robustness. Moreover, a collision loss is introduced to avoid interpenetration. Finally, we alternate between avatar learning and pose optimization to obtain complete and accurate 3D human poses.
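To make the combination of loss terms concrete, here is a rough NumPy sketch of how such an objective could be assembled. The exact formulations and weights are not given in the text above, so everything here is an assumption: mean squared error stands in for the RGB and silhouette terms, the collision term simply penalizes 3D points that two avatars both occupy, and the names `pose_objective`, `w_rgb`, `w_sil`, and `w_col` are hypothetical.

```python
import numpy as np


def rgb_loss(rendered, observed):
    """Pixel-wise color error (assumed here to be mean squared error)."""
    return float(np.mean((rendered - observed) ** 2))


def silhouette_loss(rendered_mask, observed_mask):
    """Pixel-wise mask error (assumed here to be mean squared error)."""
    return float(np.mean((rendered_mask - observed_mask) ** 2))


def collision_loss(occ_a, occ_b):
    """Penalize shared occupancy of two avatars (an assumed formulation).

    occ_a, occ_b: occupancy values in [0, 1] of two avatars, queried at
    the same set of 3D points. The product is large only where both
    avatars claim the same point, i.e. where they interpenetrate.
    """
    return float(np.mean(occ_a * occ_b))


def pose_objective(rendered, observed, rendered_mask, observed_mask,
                   occ_a, occ_b, w_rgb=1.0, w_sil=1.0, w_col=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_rgb * rgb_loss(rendered, observed)
            + w_sil * silhouette_loss(rendered_mask, observed_mask)
            + w_col * collision_loss(occ_a, occ_b))


# Toy check: a perfect rendering with two non-overlapping avatars.
image = np.zeros((2, 2, 3))
mask = np.ones((2, 2))
occ_a = np.array([1.0, 1.0, 0.0, 0.0])
occ_b = np.array([0.0, 0.0, 1.0, 1.0])
total = pose_objective(image, image, mask, mask, occ_a, occ_b)
```

In the alternating scheme described above, an objective of this kind would be minimized over the poses with the avatars held fixed, after which the avatars are refit with the poses held fixed.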
@inproceedings{lu2024avatarpose,
title={AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos},
author={Feichi Lu and Zijian Dong and Jie Song and Otmar Hilliges},
booktitle={ECCV},
year={2024}
}