HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

CVPR 2026

Mingjin Chen¹*, Junhao Chen³*, Zhaoxin Fan²†, Yujian Lee⁴, Zichen Dang¹,
Lili Wang⁵, Yawen Cui¹, Lap-Pui Chau¹, Yi Wang¹†

¹Dept. of EEE, The Hong Kong Polytechnic University ²Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University
³Tsinghua University ⁴Beijing Normal-Hong Kong Baptist University ⁵State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University

* Equal Contribution † Corresponding Author

arXiv Paper Video Demo Code (Coming Soon)

Abstract

Recent methods have made notable progress in the visual quality of hand-object interaction (HOI) video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware HOI video synthesis conditioned on explicit 3D representations. HVG-3D has two core components: (i) a diffusion-based video generation architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during synthesis; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. At inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.

Method Overview

HVG-3D features a 3D-aware diffusion architecture and a hybrid pipeline for constructing input and condition signals from both real and simulated data.

Figure. Overview of the HVG-3D framework. The model takes a single RGB image and a 3D condition signal (point cloud or tracking sequence) as input. A dedicated 3D ControlNet encodes geometric and motion cues, which are injected into the video diffusion backbone for high-fidelity, temporally consistent hand-object interaction video generation.
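
The conditioning path described in the caption can be illustrated with a toy sketch. This is not the HVG-3D implementation: the function names (`encode_3d_condition`, `project`, `inject`, `zero_weights`), the 6-D centroid-plus-displacement feature, and the additive injection are all hypothetical stand-ins chosen for illustration. The zero-initialized projection follows the general ControlNet design, and it is an assumption that HVG-3D uses the same initialization.

```python
from typing import List

Point = List[float]   # [x, y, z]
Frame = List[Point]   # tracked 3D points in one frame
Vector = List[float]


def encode_3d_condition(tracks: List[Frame]) -> List[Vector]:
    """Toy stand-in for a 3D condition encoder (hypothetical).

    For each frame, returns a 6-D feature: the point centroid (a crude
    geometric cue) followed by the centroid displacement from the previous
    frame (a crude motion cue; zero for the first frame).
    """
    features = []
    prev = None
    for frame in tracks:
        n = len(frame)
        centroid = [sum(p[k] for p in frame) / n for k in range(3)]
        if prev is None:
            motion = [0.0, 0.0, 0.0]
        else:
            motion = [c - q for c, q in zip(centroid, prev)]
        features.append(centroid + motion)
        prev = centroid
    return features


def project(weights: List[Vector], x: Vector) -> Vector:
    """Apply a linear map given as a list of rows (out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def inject(hidden: List[Vector], cond: List[Vector],
           weights: List[Vector]) -> List[Vector]:
    """Add projected condition features to per-frame backbone features."""
    return [
        [h + r for h, r in zip(h_t, project(weights, c_t))]
        for h_t, c_t in zip(hidden, cond)
    ]


def zero_weights(out_dim: int, in_dim: int) -> List[Vector]:
    """Zero-initialized projection, as in the original ControlNet design."""
    return [[0.0] * in_dim for _ in range(out_dim)]


# Two frames, two tracked points each; the points shift by +1 along x.
tracks = [
    [[0.0, 0.0, 1.0], [2.0, 0.0, 1.0]],
    [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]],
]
hidden = [[0.5, -0.5], [1.0, 2.0]]   # placeholder backbone features
cond = encode_3d_condition(tracks)
out = inject(hidden, cond, zero_weights(2, 6))
```

With the zero-initialized projection, `out` equals `hidden`, so the control branch starts as a no-op and only influences the backbone once its projection weights are trained away from zero.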

Video Demo

HVG-3D generates realistic, 3D-consistent hand-object interaction videos from a single image and a 3D control signal, which can come from either real-world data or a simulator.

Demo video showcasing HVG-3D's ability to synthesize temporally coherent hand-object interaction videos conditioned on explicit 3D representations.