DemoGen: Synthetic Demonstration Generation
for Data-Efficient Visuomotor Policy Learning

1Tsinghua Embodied AI Lab @ IIIS, Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai AI Lab

* indicates equal contributions


The O.O.D. generalization capabilities of visuomotor policies empowered by DemoGen-generated synthetic demonstrations, given only one human-collected demonstration per task.


Abstract


Visuomotor policies have shown great promise in robotic manipulation but often require substantial amounts of human-collected data for effective performance. A key reason underlying the data demands is their limited spatial generalization capability, which necessitates extensive data collection across different object configurations. In this work, we present DemoGen, a low-cost, fully synthetic approach for automatic demonstration generation. Using only one human-collected demonstration per task, DemoGen generates spatially augmented demonstrations by adapting the demonstrated action trajectory to novel object configurations. Visual observations are synthesized by leveraging 3D point clouds as the modality and rearranging the subjects in the scene via 3D editing. Empirically, DemoGen significantly enhances policy performance across a diverse range of real-world manipulation tasks, showing its applicability even in challenging scenarios involving deformable objects, dexterous hand end-effectors, and bimanual platforms. Furthermore, DemoGen can be extended to enable additional out-of-distribution capabilities, including disturbance resistance and obstacle avoidance.


DemoGen Methods


DemoGen adapts the actions in the source demonstration to novel object configurations by incorporating ideas from Task and Motion Planning (TAMP). Specifically, the source trajectory is decomposed in an object-centric manner into motion segments that travel through free space and skill segments that involve on-object manipulation through contact. During generation, each skill segment is transformed as a whole, while the motion segments are replanned via motion planning. The corresponding visual observations are synthesized by choosing 3D point clouds as the observation modality and rearranging the objects and robot end-effectors in the scene via 3D editing.
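To make the procedure concrete, below is a minimal Python sketch of this generation loop, intended as an illustration rather than a reproduction of our implementation: trajectories are simplified to sequences of 3D end-effector positions, the object's pose change is reduced to a translation, the motion planner is replaced by straight-line interpolation, and only the object (not the end-effector) is rearranged in a single point cloud.

# Minimal sketch of the DemoGen generation loop (simplifying assumptions:
# translation-only transforms, linear interpolation instead of a motion planner).
import numpy as np

def transform_skill_segment(segment, delta):
    # Skill segments (on-object manipulation) are transformed as a whole,
    # following the object's displacement to the novel configuration.
    return segment + delta

def replan_motion_segment(start, goal, n_steps):
    # Stand-in for motion planning: reconnect the free-space motion with a
    # straight line between the new start and the new goal.
    return np.linspace(start, goal, n_steps)

def edit_point_cloud(cloud, object_mask, delta):
    # 3D editing of the observation: move the object's points to the novel
    # configuration while leaving the rest of the scene untouched.
    edited = cloud.copy()
    edited[object_mask] += delta
    return edited

def generate_demo(segments, cloud, object_mask, delta):
    # segments: list of (kind, waypoints) with kind in {"motion", "skill"}.
    new_traj, prev_end = [], None
    for kind, waypoints in segments:
        if kind == "skill":
            new_seg = transform_skill_segment(waypoints, delta)
        else:
            goal = waypoints[-1] + delta  # end where the next skill segment now starts
            start = prev_end if prev_end is not None else waypoints[0]
            new_seg = replan_motion_segment(start, goal, len(waypoints))
        new_traj.append(new_seg)
        prev_end = new_seg[-1]
    return np.concatenate(new_traj), edit_point_cloud(cloud, object_mask, delta)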



DemoGen for Spatial Generalization

Simulated Experiments

The effectiveness of DemoGen is verified on 8 modified MetaWorld tasks with enlarged object randomization ranges. We report the maximum/average success rates over 3 seeds for visuomotor policies trained on DemoGen-generated datasets built from only one source demonstration per task. The results indicate that DemoGen can maintain policy performance while reducing the human effort for data collection by over 20x.

Real-World Protocols

We adhere to a rigorous protocol for evaluating spatial generalization. (a) Workspace setups. (b) The full-size evaluation workspace covering the full extent reachable by the robot end-effectors on the table surface. (c) The DemoGen generation strategy targeting the evaluated configurations along with small-range perturbations.

Tasks & Source Demonstrations

We successfully apply DemoGen to a diverse range of tasks on single-arm & bimanual platforms, using gripper & dexterous-hand end-effectors, from third-person & egocentric visual observation viewpoints, and with rigid-body & deformable/fluid objects. To minimize human effort, we collect only one source demonstration per task for subsequent demonstration generation.


Evaluation Videos

For quantitative evaluation, we conduct a total of 530 policy rollouts on the 8 tasks with fully randomized object configurations within the feasible workspace range. Here, we provide videos of 4 successful rollouts and 1 failed rollout for each of the 8 tasks.

Quantitative Results

Compared with the source demonstrations, DemoGen-generated datasets enable the agents to respond more adaptively to the diverse evaluated configurations, resulting in significantly higher success rates. Additionally, we visualize spatial heatmaps of success rates over the evaluated configurations, which show diminished success rates on configurations farther away from the demonstrated ones. We attribute this decline to the visual mismatch problem caused by single-view observations.

Time Cost for Generating Real-World Demonstrations

DemoGen generates data in a cost-effective, fully synthetic manner: it takes only 22 seconds to generate 2214 demonstration trajectories, i.e., ~147k observation-action pairs. In contrast, MimicGen generates demonstrations via expensive on-robot rollouts, which hinders its deployment in real-world scenarios.




DemoGen for Disturbance Resistance

Augmentation for Disturbance Resistance (ADR)


To mimic the recovery process from external disturbances, we develop a specialized generation strategy called Augmentation for Disturbance Resistance (ADR), in which asynchronous transformations are applied to the disturbed object and the robot end-effector.
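As an illustrative reading of this strategy (not a verbatim description of our implementation), the sketch below shifts the object's points from the disturbance timestep onward, while the end-effector trajectory catches up to the shifted configuration over a recovery window; the timestep index, window length, and translation-only transform are assumptions made for brevity.

# Minimal sketch of asynchronous transformation for ADR.
import numpy as np

def apply_adr(traj, clouds, object_mask, delta, t_disturb, recovery_steps):
    # traj: (T, 3) end-effector positions; clouds: list of (N, 3) point clouds.
    traj = traj.copy()
    new_clouds = []
    for t, cloud in enumerate(clouds):
        cloud = cloud.copy()
        if t >= t_disturb:
            cloud[object_mask] += delta  # the object jumps at the disturbance
        new_clouds.append(cloud)
    # The end-effector blends toward the shifted trajectory over the recovery
    # window, imitating a re-approach to the disturbed object.
    for t in range(t_disturb, len(traj)):
        alpha = min(1.0, (t - t_disturb) / max(recovery_steps, 1))
        traj[t] = traj[t] + alpha * delta
    return traj, new_clouds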

Evaluation Videos

For quantitative comparison, we conduct policy rollouts using both the DemoGen w/ ADR-enabled policy and the regular DemoGen-enabled policy, manually dragging the pizza crust twice toward the neighboring cross markers. We repeat each setting 5 times to produce reliable results. Here, we provide the videos for all 50 policy rollouts.

Quantitative Results


For quantitative evaluation, we take pictures of the pizza crust after sauce spreading and calculate the sauce coverage on the crust via color thresholding. Additionally, we report a normalized sauce coverage score, where 0 represents no operation taken and 100 corresponds to human expert performance. The ADR strategy significantly outperforms the baseline strategy designed for spatial generalization and even approaches human expert performance. This highlights that the ability to resist disturbances does not emerge naturally but is acquired through targeted disturbance-involved demonstrations.
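For reference, the sketch below illustrates the kind of computation behind this metric; the HSV threshold values and the crust mask are hypothetical stand-ins, since the protocol only specifies color thresholding followed by normalization against the no-operation and human-expert coverage levels.

# Minimal sketch of the sauce coverage metric (threshold values are assumed).
import cv2
import numpy as np

def sauce_coverage(image_bgr, crust_mask):
    # Fraction of crust pixels whose color falls in an assumed red HSV band.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    sauce = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255))
    return (sauce > 0)[crust_mask].mean()

def normalized_score(coverage, coverage_noop, coverage_expert):
    # Map raw coverage to a 0-100 scale: 0 = no operation, 100 = human expert.
    return 100.0 * (coverage - coverage_noop) / (coverage_expert - coverage_noop)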

Robustness under Multiple Disturbances

Still starting from only one human-collected source demonstration, we show that the DemoGen w/ ADR-enabled policy remains robust under up to 5 consecutive random disturbances.



DemoGen for Obstacle Avoidance

Augmentation for Obstacle Avoidance


To generate obstacle-involved demonstrations, we augment the real-world point cloud observations by sampling points from simple geometries, such as boxes and cones, and fusing these points into the original scene. Obstacle-avoiding trajectories are generated by a motion planning tool, ensuring collision-free actions.
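A minimal sketch of the point-sampling and fusion step is given below; the geometry dimensions and sampling densities are illustrative assumptions, and the motion-planning step that produces the obstacle-avoiding trajectory is omitted.

# Minimal sketch of obstacle augmentation: sample points from simple
# geometries and fuse them into the observed scene point cloud.
import numpy as np

def sample_box_points(center, size, n=512, rng=None):
    # Uniformly sample points inside an axis-aligned box of the given size.
    rng = rng or np.random.default_rng()
    return center + (rng.random((n, 3)) - 0.5) * size

def sample_cone_points(apex, height, radius, n=512, rng=None):
    # Sample points inside an upright cone with its apex at the top
    # (uniform in height, uniform over each cross-section disk).
    rng = rng or np.random.default_rng()
    z = rng.random(n) * height
    r = radius * (z / height) * np.sqrt(rng.random(n))
    theta = rng.random(n) * 2 * np.pi
    pts = np.stack([r * np.cos(theta), r * np.sin(theta), -z], axis=1)
    return apex + pts

def fuse_obstacle(scene_cloud, obstacle_points):
    # Insert the synthetic obstacle points into the original scene.
    return np.concatenate([scene_cloud, obstacle_points], axis=0)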


Evaluation Videos

For evaluation, we position obstacles of diverse shapes in the middle of the workspace. The DemoGen-enabled policy successfully avoids various unseen obstacles. Notably, in scenarios without obstacles, the agent follows the lower trajectory observed in the source demonstration, indicating its responsiveness to environmental variations.



Empirical Study: Spatial Generalization of Visuomotor Policies

(Left) Qualitative visualization of the spatial effective range. The grid maps display discretized tabletop workspaces from a bird's-eye view under different demonstration configurations. In general, the spatial effective range of visuomotor policies can be approximated by the union of the areas surrounding the demonstrated object placements.
(Right) Quantitative benchmarking of spatial generalization capacity. Both 3D representations and pre-trained 2D visual encoders contribute to improved spatial generalization. However, they do not fundamentally solve the spatial generalization problem: the capability is still acquired through demonstrations that extensively traverse the workspace.


Limitation: The Visual Mismatch Problem

As objects move through 3D space, their appearance changes due to variations in perspective. Under the constraint of a single-view observation, synthetic demonstrations consistently reflect a fixed side of the object's appearance seen in the source demonstration. This discrepancy causes a visual mismatch between the synthetic and real-captured data.



Acknowledgments

We would like to give special thanks to Galaxea Inc. for providing the R1 robot, and to Jianning Cui, Ke Dong, Haoyin Fan, and Yixiu Li for their technical support. We also thank Gu Zhang, Han Zhang, and Songbo Hu for hardware setup and data collection, Yifeng Zhu and Tianming Wei for discussions on the controllers in the simulator, and Widyadewi Soedarmadji for presentation advice. This project is supported by the Tsinghua University Dushi Program.



BibTeX

@article{xue2025demogen,
  title={DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning},
  author={Xue, Zhengrong and Deng, Shuying and Chen, Zhenyang and Wang, Yixuan and Yuan, Zhecheng and Xu, Huazhe},
  journal={arXiv preprint arXiv:2502.16932},
  year={2025}
}