DMPO: Dispersive MeanFlow Policy Optimization

One Step Is Enough
Guowei Zou, Haitao Wang, Hejun Wu, Yukun Qian, Yuhang Wang, Weibing Li
Sun Yat-sen University

TL;DR: We propose DMPO (Dispersive MeanFlow Policy Optimization), a unified framework that enables true one-step generation for real-time robotic control through three key components: MeanFlow for mathematically-derived single-step inference, dispersive regularization to prevent representation collapse, and RL fine-tuning to surpass expert demonstrations. DMPO achieves competitive or superior performance with 5-20× inference speedup, exceeding real-time requirements (>120Hz) and reaching hundreds of Hertz on high-performance GPUs.

Overview

DMPO Overview
From efficiency-performance trade-off to practical real-time control. Top: Existing methods lie on the trade-off curve: multi-step approaches (DPPO, ReinFlow) achieve strong performance but slow inference, while one-step methods (CP, MP1, 1-DP) are fast but unstable. DMPO breaks this trade-off by occupying the upper-right region. Bottom: DMPO's two-stage approach enables both fast inference and high performance.

Approach Overview

DMPO addresses three interconnected challenges in real-time robotic control:

Challenge 1: Inference Efficiency

Multi-step sampling in diffusion and flow-based policies incurs significant latency, while distillation-based one-step methods require complex training pipelines.

Our Solution: MeanFlow enables mathematically-derived single-step inference without knowledge distillation, achieving up to a 694× speedup over the 100-step DDPM diffusion policy baseline.
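To make the one-step claim concrete, here is a minimal sketch of the MeanFlow quantities, assuming the convention of the original MeanFlow paper (noise at t = 1, actions at t = 0) and leaving observation conditioning implicit; DMPO's exact notation may differ.

```latex
% Sketch of the MeanFlow quantities behind one-step sampling.
% Assumption: noise at t = 1, actions at t = 0; conditioning on the
% observation is left implicit.
\begin{align*}
u(z_t, r, t) &\triangleq \frac{1}{t - r}\int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau
  && \text{(average velocity over } [r, t])\\
u(z_t, r, t) &= v(z_t, t) - (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t)
  && \text{(MeanFlow identity, the training target)}\\
a = z_0 &= z_1 - u_\theta(z_1, 0, 1), \qquad z_1 \sim \mathcal{N}(0, I)
  && \text{(one network evaluation per action)}
\end{align*}
```

Because the network predicts the average velocity over an interval rather than the instantaneous velocity, integrating the flow from t = 1 down to r = 0 collapses to a single network call, which is where the speedup over iterative samplers comes from.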

Challenge 2: Representation Collapse

One-step generation methods risk mapping distinct observations to indistinguishable representations, degrading action quality.

Our Solution: Dispersive regularization encourages feature diversity across embeddings, preventing collapse without architectural modifications.
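As a hypothetical illustration of what such a regularizer can look like, the sketch below implements one common repulsion-only objective: the log-mean-exp of negative pairwise squared distances over a batch of encoder features, which shrinks as embeddings spread apart. The function name, temperature, and L2 normalization are assumptions for the sketch, not necessarily DMPO's exact loss.

```python
import math
import torch
import torch.nn.functional as F

def dispersive_loss(feats: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Repulsion-only regularizer over a batch of encoder features (sketch).

    feats: (B, D) observation embeddings. Returns the log-mean-exp of
    negative pairwise squared distances, which decreases as embeddings of
    different samples in the batch spread apart. The temperature `tau` and
    the normalization below are illustrative choices.
    """
    z = F.normalize(feats.flatten(1), dim=-1)   # flatten trailing dims, unit-normalize
    d2 = torch.cdist(z, z).pow(2)               # (B, B) pairwise squared L2 distances
    b = z.shape[0]
    return torch.logsumexp(-d2.flatten() / tau, dim=0) - math.log(b * b)
```

In pre-training, such a term would typically be added to the MeanFlow objective with a small weight, e.g. `total = meanflow_loss + lam * dispersive_loss(encoder(obs))`, where `encoder` and `lam` are likewise placeholder names.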

Challenge 3: Performance Ceiling

Pure imitation learning cannot surpass expert demonstrations, yet RL fine-tuning is impractical with slow multi-step inference.

Our Solution: One-step inference enables efficient PPO fine-tuning, breaking through the imitation learning ceiling.

DMPO Framework
DMPO Framework Overview. Stage 1 (Top & Middle): Pre-training with dispersive MeanFlow, in which the policy learns velocity fields that transform noise into actions; observations are encoded with a Vision Transformer, and dispersive losses prevent representation collapse. Stage 2 (Bottom): PPO fine-tuning, formulated as a two-layer policy factorization.
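To see why one-step inference makes RL fine-tuning practical, the sketch below shows a generic clipped-PPO loss on top of a one-step policy: the single MeanFlow call produces a mean action, a learnable log-std supplies exploration noise, and the resulting Gaussian gives tractable log-probabilities for the surrogate objective. This is a simplified stand-in, not DMPO's actual two-layer factorization; all names and signatures (`policy`, `value_fn`, etc.) are hypothetical.

```python
import torch
from torch.distributions import Normal

def ppo_update_loss(policy, value_fn, obs, actions, old_logp, advantages,
                    returns, clip_eps: float = 0.2, vf_coef: float = 0.5):
    """Clipped PPO surrogate for a one-step generative policy (illustrative).

    `policy(obs)` is assumed to return (mean_action, log_std), where the mean
    comes from a single MeanFlow evaluation and log_std parameterizes
    exploration noise. `value_fn(obs)` is assumed to return a (B, 1) value.
    """
    mean, log_std = policy(obs)
    dist = Normal(mean, log_std.exp())
    logp = dist.log_prob(actions).sum(-1)                  # log pi_theta(a | o)
    ratio = (logp - old_logp).exp()                        # importance weight
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (value_fn(obs).squeeze(-1) - returns).pow(2).mean()
    return policy_loss + vf_coef * value_loss
```

Because each environment step needs only one forward pass, on-policy rollout collection stays cheap, which is the practical point behind Challenge 3.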

Our Contributions

  • Framework: We introduce DMPO, a unified framework enabling stable one-step generation via principled co-design of architecture and algorithms, with 5-20× speedup over multi-step baselines.
  • Theory: We establish the first information-theoretic foundation proving dispersive regularization is necessary for stable one-step generation, and derive the first mathematical formulation for RL fine-tuning of one-step policies.
  • Validation: We achieve state-of-the-art on RoboMimic and OpenAI Gym benchmarks, and validate real-time control (>120Hz) on a Franka robot.

Stage 1: Pre-Training Results

RQ1: Can one-step generation match or exceed multi-step diffusion policies while achieving faster inference?

Answer: Yes. DMPO achieves dramatic inference efficiency gains with true one-step generation:

Efficiency vs Success Rate
Inference efficiency vs. success rate trade-off across four RoboMimic tasks. The upper-left region (fast + high success) is ideal. MF and MF+Disp lie on the Pareto frontier, achieving 6-10x speedup over ShortCut and 25-40x over ReFlow while maintaining superior success rates.

RQ2: Is dispersive regularization essential for preventing representation collapse in one-step generation?

Answer: Yes. Dispersive regularization significantly improves success rates by preventing representation collapse:

Success Rate Comparison
Success rate vs. denoising steps on four RoboMimic tasks (Lift, Can, Square, Transport). MeanFlow variants achieve near-saturated performance at 1-5 steps, while ReFlow and ShortCut require 32-128 steps. Dispersive regularization reduces variance on complex tasks.

Stage 2: Fine-Tuning Results

RQ3: Can online RL fine-tuning push beyond the performance ceiling of offline expert data?

Answer: Yes. DMPO with only 1 denoising step achieves competitive or superior performance compared to all baselines:

RoboMimic Manipulation Tasks

RoboMimic Fine-tuning Results
PPO Fine-tuning on RoboMimic tasks (Can, Square, Transport). DMPO (blue), using only 1 denoising step, achieves performance competitive with or superior to DPPO (20 steps), a Gaussian baseline, and ReinFlow variants.

OpenAI Gym Locomotion & Kitchen Tasks

Gym Fine-tuning Results
PPO Fine-tuning on OpenAI Gym locomotion (Hopper, Walker2d, Ant, Humanoid) and Kitchen manipulation tasks. DMPO with 1-step inference matches or outperforms multi-step baselines.

Comparison with One-Step Baselines

Method           NFE   Distillation   Lift   Can    Square   Transport
DP-C (Teacher)   100   -              97%    96%    82%      46%
CP               1     Yes            -      -      65%      38%
OneDP-S          1     Yes            -      -      77%      72%
MP1              1     No             95%    80%    35%      38%
DMPO (Ours)      1     No             100%   100%   83%      88%

NFE: number of function evaluations (denoising steps) per action.

Model Efficiency Comparison

Model         Vision encoder    Params   Steps   Time (RTX 4090)   Frequency   Speedup
DP (DDPM)     ResNet-18 ×2      281M     100     391.1 ms          2.6 Hz      1×
CP            ResNet-18 ×2      285M     1       5.4 ms            187 Hz      73×
MP1           PointNet          256M     1       4.1 ms            244 Hz      96×
DMPO (Ours)   Lightweight ViT   1.78M    1       0.6 ms            1770 Hz     694×

Real-World Deployment

RQ4: Does DMPO transfer to real-world robotic systems?

Answer: Yes. We validated DMPO on a Franka Emika Panda robot with an Intel RealSense D435i camera, running inference on an NVIDIA RTX 2080 GPU, and demonstrated robust sim-to-real transfer.

Real Robot Experiments
Real-world deployment on Franka Panda robot. Left: Hardware setup with Intel RealSense D435i camera. Right: Comparison between MP1 baseline (top row, fails on Lift and Can due to imprecise grasping caused by representation collapse) and DMPO (rows 2-3, succeeds on all four tasks including Square and Transport).

Key Results

  • Real-time control: 9.6ms total latency enabling >100Hz control frequency
  • Network inference: Only 2.6ms for 1-step DMPO (4.6-18x faster than baselines)
  • Robust execution: Successfully completed all 4 manipulation tasks
  • Sim-to-real transfer: Policies trained in simulation transfer effectively to physical hardware

Holistic Comparison

Radar charts comparing DMPO against baselines across eight evaluation dimensions: Inference Speed, Model Lightweight, Success Rate, Data Efficiency, Representation Quality, Distillation Free, Beyond Demos, and Training Stability. Each dimension is scored on a 1-5 scale.

Holistic Radar Comparison
Holistic radar comparison across eight dimensions. (a) RL fine-tuning methods: DMPO forms the outer envelope, achieving top scores across all dimensions. (b) Generation methods: DMPO outperforms all baselines by combining one-step inference with lightweight architecture, high data efficiency, and the ability to go beyond demonstrations through RL fine-tuning.

Key Insights

  • RL Fine-tuning Methods: While ReinFlow and DPPO share the same lightweight architecture and data efficiency as DMPO, they require multi-step inference (20+ steps). Only DMPO achieves top scores across all eight dimensions.
  • Generation Methods: Multi-step baselines (DP, FP) suffer from slow inference. Distilled one-step methods (1-DP, CP) cannot surpass demonstrations. Teacher-free MP1 suffers from representation collapse. DMPO is the only method achieving top performance across all dimensions.

Citation

@misc{zou2026stepenoughdispersivemeanflow,
  title={One Step Is Enough: Dispersive MeanFlow Policy Optimization},
  author={Guowei Zou and Haitao Wang and Hejun Wu and Yukun Qian and Yuhang Wang and Weibing Li},
  year={2026},
  eprint={2601.20701},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2601.20701},
}

Related Work

  • Diffusion Policy (RSS 2023): Pioneered diffusion models for visuomotor control
  • DPPO (ICLR 2025): RL fine-tuning for diffusion policies
  • ReinFlow (NeurIPS 2025): Flow matching with online RL fine-tuning
  • Consistency Policy (RSS 2024): Distilled one-step generation
  • OneDP (ICML 2025): One-step diffusion policy via distillation
  • MP1 (AAAI 2026): MeanFlow for robotic manipulation
  • MeanFlow (NeurIPS 2025): Mean flows for one-step generative modeling