DMPO: Dispersive MeanFlow Policy Optimization

One Step Is Enough
Guowei Zou, Haitao Wang, Hejun Wu, Yukun Qian, Yuhang Wang, Weibing Li
Sun Yat-sen University

TL;DR: We propose DMPO (Dispersive MeanFlow Policy Optimization), a unified framework that enables true one-step generation for real-time robotic control through three key components: MeanFlow for mathematically-derived single-step inference, dispersive regularization to prevent representation collapse, and RL fine-tuning to surpass expert demonstrations. DMPO achieves competitive or superior performance with 5-20× inference speedup, exceeding real-time requirements (>120Hz) and reaching hundreds of Hertz on high-performance GPUs.

Overview

DMPO Overview
From efficiency-performance trade-off to practical real-time control. Top: Existing methods lie on the trade-off curve: multi-step approaches (DPPO, ReinFlow) achieve strong performance but slow inference, while one-step methods (CP, MP1, 1-DP) are fast but unstable. DMPO breaks this trade-off by occupying the upper-right region. Bottom: DMPO's two-stage approach enables both fast inference and high performance.

Approach Overview

DMPO addresses three interconnected challenges in real-time robotic control:

Challenge 1: Inference Efficiency

Multi-step sampling in diffusion and flow-based policies incurs significant latency, while distillation-based one-step methods require complex training pipelines.

Our Solution: MeanFlow enables mathematically-derived single-step inference without knowledge distillation, achieving up to a 694× speedup over the 100-step DDPM diffusion policy baseline.
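To make the one-step claim concrete, here is a minimal sketch of the MeanFlow quantities, assuming the convention of the original MeanFlow paper (noise at t = 1, actions at t = 0) and leaving observation conditioning implicit; DMPO's exact notation may differ.

```latex
% Sketch of the MeanFlow quantities behind one-step sampling.
% Assumption: noise at t = 1, actions at t = 0; conditioning on the
% observation is left implicit.
\begin{align*}
u(z_t, r, t) &\triangleq \frac{1}{t - r}\int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau
  && \text{(average velocity over } [r, t])\\
u(z_t, r, t) &= v(z_t, t) - (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t)
  && \text{(MeanFlow identity, the training target)}\\
a = z_0 &= z_1 - u_\theta(z_1, 0, 1), \qquad z_1 \sim \mathcal{N}(0, I)
  && \text{(one network evaluation per action)}
\end{align*}
```

Because the network predicts the average velocity over an interval rather than the instantaneous velocity, integrating the flow from t = 1 down to r = 0 collapses to a single network call, which is where the speedup over iterative samplers comes from.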

Challenge 2: Representation Collapse

One-step generation methods risk mapping distinct observations to indistinguishable representations, degrading action quality.

Our Solution: Dispersive regularization encourages feature diversity across embeddings, preventing collapse without architectural modifications.
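As a hypothetical illustration of what such a regularizer can look like, the sketch below implements one common repulsion-only objective: the log-mean-exp of negative pairwise squared distances over a batch of encoder features, which shrinks as embeddings spread apart. The function name, temperature, and L2 normalization are assumptions for the sketch, not necessarily DMPO's exact loss.

```python
import math
import torch
import torch.nn.functional as F

def dispersive_loss(feats: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Repulsion-only regularizer over a batch of encoder features (sketch).

    feats: (B, D) observation embeddings. Returns the log-mean-exp of
    negative pairwise squared distances, which decreases as embeddings of
    different samples in the batch spread apart. The temperature `tau` and
    the normalization below are illustrative choices.
    """
    z = F.normalize(feats.flatten(1), dim=-1)   # flatten trailing dims, unit-normalize
    d2 = torch.cdist(z, z).pow(2)               # (B, B) pairwise squared L2 distances
    b = z.shape[0]
    return torch.logsumexp(-d2.flatten() / tau, dim=0) - math.log(b * b)
```

In pre-training, such a term would typically be added to the MeanFlow objective with a small weight, e.g. `total = meanflow_loss + lam * dispersive_loss(encoder(obs))`, where `encoder` and `lam` are likewise placeholder names.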

Challenge 3: Performance Ceiling

Pure imitation learning cannot surpass expert demonstrations, yet RL fine-tuning is impractical with slow multi-step inference.

Our Solution: One-step inference enables efficient PPO fine-tuning, breaking through the imitation learning ceiling.

DMPO Framework
DMPO Framework Overview. Stage 1 (Top & Middle): Pre-training with dispersive MeanFlow, in which the policy learns velocity fields that transform noise into actions; observations are encoded with a Vision Transformer, and dispersive losses prevent representation collapse. Stage 2 (Bottom): PPO fine-tuning, formulated as a two-layer policy factorization.
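To see why one-step inference makes RL fine-tuning practical, the sketch below shows a generic clipped-PPO loss on top of a one-step policy: the single MeanFlow call produces a mean action, a learnable log-std supplies exploration noise, and the resulting Gaussian gives tractable log-probabilities for the surrogate objective. This is a simplified stand-in, not DMPO's actual two-layer factorization; all names and signatures (`policy`, `value_fn`, etc.) are hypothetical.

```python
import torch
from torch.distributions import Normal

def ppo_update_loss(policy, value_fn, obs, actions, old_logp, advantages,
                    returns, clip_eps: float = 0.2, vf_coef: float = 0.5):
    """Clipped PPO surrogate for a one-step generative policy (illustrative).

    `policy(obs)` is assumed to return (mean_action, log_std), where the mean
    comes from a single MeanFlow evaluation and log_std parameterizes
    exploration noise. `value_fn(obs)` is assumed to return a (B, 1) value.
    """
    mean, log_std = policy(obs)
    dist = Normal(mean, log_std.exp())
    logp = dist.log_prob(actions).sum(-1)                  # log pi_theta(a | o)
    ratio = (logp - old_logp).exp()                        # importance weight
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (value_fn(obs).squeeze(-1) - returns).pow(2).mean()
    return policy_loss + vf_coef * value_loss
```

Because each environment step needs only one forward pass, on-policy rollout collection stays cheap, which is the practical point behind Challenge 3.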

Our Contributions

  • Framework: We introduce DMPO, a unified framework enabling stable one-step generation via principled co-design of architecture and algorithms, with 5-20× speedup over multi-step baselines.
  • Theory: We establish the first information-theoretic foundation proving dispersive regularization is necessary for stable one-step generation, and derive the first mathematical formulation for RL fine-tuning of one-step policies.
  • Validation: We achieve state-of-the-art on RoboMimic and OpenAI Gym benchmarks, and validate real-time control (>120Hz) on a Franka robot.

Stage 1: Pre-Training Results

RQ1: Can one-step generation match or exceed multi-step diffusion policies while achieving faster inference?

Answer: Yes. DMPO achieves dramatic inference efficiency gains with true one-step generation:

Efficiency vs Success Rate
Inference efficiency vs. success rate trade-off across four RoboMimic tasks. The upper-left region (fast + high success) is ideal. MF and MF+Disp lie on the Pareto frontier, achieving 6-10x speedup over ShortCut and 25-40x over ReFlow while maintaining superior success rates.

RQ2: Is dispersive regularization essential for preventing representation collapse in one-step generation?

Answer: Yes. Dispersive regularization significantly improves success rates by preventing representation collapse:

Success Rate Comparison
Success rate vs. denoising steps on four RoboMimic tasks (Lift, Can, Square, Transport). MeanFlow variants achieve near-saturated performance at 1-5 steps, while ReFlow and ShortCut require 32-128 steps. Dispersive regularization reduces variance on complex tasks.

Stage 2: Fine-Tuning Results

RQ3: Can online RL fine-tuning push beyond the performance ceiling of offline expert data?

Answer: Yes. DMPO with only 1 denoising step achieves competitive or superior performance compared to all baselines:

RoboMimic Manipulation Tasks

RoboMimic Fine-tuning Results
PPO Fine-tuning on RoboMimic tasks (Can, Square, Transport). DMPO (blue), using only 1 denoising step, achieves performance competitive with or superior to DPPO (20 steps), a Gaussian baseline, and ReinFlow variants.

OpenAI Gym Locomotion & Kitchen Tasks

Gym Fine-tuning Results
PPO Fine-tuning on OpenAI Gym locomotion (Hopper, Walker2d, Ant, Humanoid) and Kitchen manipulation tasks. DMPO with 1-step inference matches or outperforms multi-step baselines.

Comparison with One-Step Baselines

Method           NFE   Distillation   Lift   Can    Square   Transport
DP-C (Teacher)   100   -              97%    96%    82%      46%
CP               1     Yes            -      -      65%      38%
OneDP-S          1     Yes            -      -      77%      72%
MP1              1     No             95%    80%    35%      38%
DMPO (Ours)      1     No             100%   100%   83%      88%

NFE: number of function evaluations (denoising steps) per action.

Model Efficiency Comparison

Model         Vision encoder    Params   Steps   Time (RTX 4090)   Frequency   Speedup
DP (DDPM)     ResNet-18 ×2      281M     100     391.1 ms          2.6 Hz      1×
CP            ResNet-18 ×2      285M     1       5.4 ms            187 Hz      73×
MP1           PointNet          256M     1       4.1 ms            244 Hz      96×
DMPO (Ours)   Lightweight ViT   1.78M    1       0.6 ms            1770 Hz     694×

Real-World Deployment

RQ4: Does DMPO transfer to real-world robotic systems?

Answer: Yes. We validated DMPO on a Franka Emika Panda robot with an Intel RealSense D435i camera, running inference on an NVIDIA RTX 2080 GPU, and demonstrated robust sim-to-real transfer.

Real Robot Experiments
Real-world deployment on Franka Panda robot. Left: Hardware setup with Intel RealSense D435i camera. Right: Comparison between MP1 baseline (top row, fails on Lift and Can due to imprecise grasping caused by representation collapse) and DMPO (rows 2-3, succeeds on all four tasks including Square and Transport).

Key Results

  • Real-time control: 9.6ms total latency enabling >100Hz control frequency
  • Network inference: Only 2.6ms for 1-step DMPO (4.6-18x faster than baselines)
  • Robust execution: Successfully completed all 4 manipulation tasks
  • Sim-to-real transfer: Policies trained in simulation transfer effectively to physical hardware

Holistic Comparison

Radar charts comparing DMPO against baselines across eight evaluation dimensions: Inference Speed, Model Lightweight, Success Rate, Data Efficiency, Representation Quality, Distillation Free, Beyond Demos, and Training Stability. Each dimension is scored on a 1-5 scale.

Holistic Radar Comparison
Holistic radar comparison across eight dimensions. (a) RL fine-tuning methods: DMPO forms the outer envelope, achieving top scores across all dimensions. (b) Generation methods: DMPO outperforms all baselines by combining one-step inference with lightweight architecture, high data efficiency, and the ability to go beyond demonstrations through RL fine-tuning.

Key Insights

  • RL Fine-tuning Methods: While ReinFlow and DPPO share the same lightweight architecture and data efficiency as DMPO, they require multi-step inference (20+ steps). Only DMPO achieves top scores across all eight dimensions.
  • Generation Methods: Multi-step baselines (DP, FP) suffer from slow inference. Distilled one-step methods (1-DP, CP) cannot surpass demonstrations. Teacher-free MP1 suffers from representation collapse. DMPO is the only method achieving top performance across all dimensions.

Citation

@misc{zou2026stepenoughdispersivemeanflow,
  title={One Step Is Enough: Dispersive MeanFlow Policy Optimization},
  author={Guowei Zou and Haitao Wang and Hejun Wu and Yukun Qian and Yuhang Wang and Weibing Li},
  year={2026},
  eprint={2601.20701},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2601.20701},
}

Related Work

  • Diffusion Policy (RSS 2023): Pioneered diffusion models for visuomotor control
  • DPPO (ICLR 2025): RL fine-tuning for diffusion policies
  • ReinFlow (NeurIPS 2025): Flow matching with online RL fine-tuning
  • Consistency Policy (RSS 2024): Distilled one-step generation
  • OneDP (ICML 2025): One-step diffusion policy via distillation
  • MP1 (AAAI 2026): MeanFlow for robotic manipulation
  • MeanFlow (NeurIPS 2025): Mean flows for one-step generative modeling