DM1 addresses two fundamental challenges in flow-based robotic control:
Diffusion policies require 50-100 neural function evaluations (NFEs), which prevents real-time deployment. Flow-based models reduce the number of steps but still require iterative ODE integration.
Our Solution: MeanFlow enables true 1-NFE generation by directly predicting the average velocity field over a time interval rather than the instantaneous velocity, achieving a 20-40× speedup.
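To illustrate why an average velocity field permits a single-step sampler, the sketch below uses a toy, closed-form average velocity in place of a trained network. It assumes a linear noising path z_t = (1 - t)·a + t·ε and a single hypothetical target action; `average_velocity`, `TARGET_ACTION`, and the sampler names are illustrative and not taken from the DM1 code, and observation conditioning is omitted.

```python
import random

# Hypothetical target action; a real policy would condition on observations.
TARGET_ACTION = [0.3, -0.1, 0.5]


def average_velocity(z_t, r, t):
    """Toy stand-in for a learned network u(z_t, r, t).

    For the path z_t = (1 - t) * a + t * eps with a single target a, the
    velocity eps - a is constant along each trajectory, so the average
    velocity over [r, t] equals (z_t - a) / t for any r < t.
    """
    return [(z - a) / t for z, a in zip(z_t, TARGET_ACTION)]


def sample_one_step(noise):
    """1-NFE sampling: z_0 = z_1 - (1 - 0) * u(z_1, r=0, t=1)."""
    u = average_velocity(noise, 0.0, 1.0)
    return [z - u_i for z, u_i in zip(noise, u)]


def sample_multi_step(noise, steps):
    """Same update rule applied over `steps` sub-intervals (one NFE each)."""
    z = list(noise)
    for k in range(steps):
        t = 1.0 - k / steps
        r = 1.0 - (k + 1) / steps
        u = average_velocity(z, r, t)
        z = [z_i - (t - r) * u_i for z_i, u_i in zip(z, u)]
    return z


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET_ACTION]
one_step = sample_one_step(noise)
five_step = sample_multi_step(noise, steps=5)
```

With an exact average velocity, one step and five steps land on the same action. With a learned approximation, additional steps can trade latency for accuracy.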
One-step generation methods suffer from representation collapse, in which distinct observations map to nearly identical embeddings. This degrades performance on complex tasks.
Our Solution: Dispersive regularization encourages feature diversity across multiple embedding layers without architectural modifications.
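As a rough illustration of what a dispersive penalty can look like, the sketch below implements an InfoNCE-style repulsion term with a squared-L2 distance, loss = log mean over pairs (i, j) of exp(-||z_i - z_j||² / τ). This particular form, the temperature `tau`, and the function name `dispersive_loss_l2` are assumptions made for illustration; the exact loss used by DM1 may differ.

```python
import math


def dispersive_loss_l2(embeddings, tau=0.5):
    """InfoNCE-style dispersive penalty with squared-L2 distance.

    Averages exp(-||z_i - z_j||^2 / tau) over all pairs in the batch
    (self-pairs included for simplicity) and takes the log. The value is 0
    when every embedding is identical and decreases as they spread apart.
    """
    n = len(embeddings)
    total = 0.0
    for z_i in embeddings:
        for z_j in embeddings:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
            total += math.exp(-sq_dist / tau)
    return math.log(total / (n * n))


collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # distinct inputs, same embedding
dispersed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # well-separated embeddings

loss_collapsed = dispersive_loss_l2(collapsed)
loss_dispersed = dispersive_loss_l2(dispersed)
```

Adding a term like this to the training objective, scaled by a regularization weight, penalizes batches whose embeddings have collapsed. Because it acts only on intermediate features, it needs no change to the network architecture.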
To evaluate the DM1 framework, our analysis focuses on the following four research questions:
Answer: Yes. DM1 achieves 20-40× faster inference than baseline methods while maintaining competitive performance. With only 5 denoising steps, DM1 attains:
| Task | Baseline (32-128 steps) | DM1 (5 steps) | Improvement | Speedup |
|---|---|---|---|---|
| Lift | ~85% | 99% | +14% | 6.4-25.6× |
| Can | Variable | High success | +10-20% | 20-40× |
| Square | Moderate | Improved | +15-25% | 20-40× |
| Transport | Low | Significantly improved | +20-30% | 20-40× |

Answer: Yes. Dispersive regularization significantly improves success rates across all tasks by preventing representation collapse. As shown in Figure 3 below, MeanFlow with dispersive regularization (MF+Disp) consistently outperforms vanilla MeanFlow (MF), especially on complex tasks like Transport.
Answer: Among the four dispersive regularization variants (InfoNCE-L2, InfoNCE-Cosine, Hinge, Covariance-based), InfoNCE-Cosine performs best. Figure 4 below shows the analysis across different regularization weights:
Key Findings:
Answer: Yes. We validated DM1 on a Franka Emika Panda robot with an eye-in-hand RGB camera (96×96×3 images) using an NVIDIA RTX 2080 GPU, demonstrating robust sim-to-real transfer.
Per-stage latency breakdown (ms) for the Lift task on the physical robot. MF: MeanFlow (Ours), SC: ShortCut, RF: ReFlow. Numbers in parentheses indicate denoising steps. Columns prefixed with T- give the total per-cycle latency for the corresponding policy.
| Planner | Camera | State | Prep. | MF(1) | MF(5) | SC(32) | RF(128) | Planning | Send | T-MF(1) | T-MF(5) | T-SC(32) | T-RF(128) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cartesian | 5.4 | 0.1 | 0.4 | 2.4 | 10.5 | 76.5 | 305.3 | 1.7 | 1.1 | 11.1 | 19.2 | 85.2 | 314.1 |
| BiT-RRT* | 7.6 | 0.2 | 0.5 | 2.4 | 11.4 | 77.0 | 306.9 | 89.9 | 2.5 | 103.1 | 112.1 | 177.7 | 407.6 |
| RRTConnect* | 8.4 | 0.2 | 0.4 | 2.5 | 13.8 | 79.1 | 312.8 | 152.4 | 2.8 | 166.7 | 178.0 | 243.3 | 477.0 |
| RRT* | 8.7 | 0.2 | 0.5 | 2.4 | 12.2 | 78.0 | 308.6 | 604.8 | 2.6 | 619.3 | 629.0 | 694.8 | 925.4 |
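The total columns can be cross-checked by summing the per-stage latencies. The snippet below does this for the Cartesian row and converts each total into an approximate control rate. The dictionary layout and variable names are illustrative, and small mismatches with the table (on the order of 0.1 ms) are expected because the listed stage values are rounded.

```python
# Per-stage latencies (ms) from the Cartesian row of the table above.
shared_stages = {"camera": 5.4, "state": 0.1, "prep": 0.4, "planning": 1.7, "send": 1.1}
inference_ms = {"MF(1)": 2.4, "MF(5)": 10.5, "SC(32)": 76.5, "RF(128)": 305.3}

# Stages that do not depend on the choice of policy.
overhead_ms = sum(shared_stages.values())

totals_ms = {}
rates_hz = {}
for policy, infer in inference_ms.items():
    total = overhead_ms + infer
    totals_ms[policy] = round(total, 1)  # total per-cycle latency in ms
    rates_hz[policy] = 1000.0 / total    # achievable control rate in Hz

for policy in inference_ms:
    print(f"{policy}: {totals_ms[policy]} ms per cycle, about {rates_hz[policy]:.1f} Hz")
```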
Key Observations: