D²PPO

Diffusion Policy Policy Optimization with Dispersive Loss

Guowei Zou, Weibing Li, Hejun Wu, Yukun Qian, Yuhang Wang, and Haitao Wang*

*Corresponding author

Abstract

Diffusion policies excel at robotic manipulation by naturally modeling multimodal action distributions in high-dimensional spaces. Nevertheless, they suffer from diffusion representation collapse: semantically similar observations are mapped to indistinguishable features, which impairs their ability to handle the subtle but critical variations required for complex robotic manipulation. To address this problem, we propose D²PPO (Diffusion Policy Policy Optimization with Dispersive Loss). D²PPO introduces dispersive loss regularization that combats representation collapse by treating all hidden representations within each batch as negative pairs. This compels the network to learn discriminative representations of similar observations, enabling the policy to identify the subtle yet crucial differences necessary for precise manipulation. On RoboMimic benchmarks, D²PPO achieves an average improvement of 22.7% after pre-training and 26.1% after fine-tuning, setting new state-of-the-art (SOTA) results. Real-world experiments on a Franka Emika Panda robot validate the practicality of our method.

Method Overview

D²PPO Method Overview

Figure 1: D²PPO Framework Overview. The complete two-stage training paradigm: Left: Pre-training stage with Vision Transformer (ViT) feature extraction and dispersive loss regularization to prevent representation collapse; Top-right: Action diffusion process showing iterative denoising from Gaussian noise to final actions; Bottom-right: Fine-tuning stage with policy gradient optimization using a two-layer MDP formulation for environment interaction.

RoboMimic Manipulation Tasks

Diffusion Representation Collapse

Figure 2: Diffusion Representation Collapse Problem and Solution. (a) Similar observations in robotic manipulation scenarios can lead to different outcomes - correct grasping (green) vs. incorrect grasping (red) that results in task failure. (b) Without dispersive loss: features cluster together, leading to representation collapse where similar observations are mapped to nearly identical representations. (c) With dispersive loss: features are well-distributed in the representation space, enabling the model to distinguish subtle but critical differences between similar observations.

Key Features

Dispersive Loss Regularization

"Contrastive learning without positive pairs" that spreads representations in hidden space, requiring no additional parameters.

Two-Stage Training

Pre-training with dispersive loss for representation diversity, then policy gradient fine-tuning for optimal performance.
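For concreteness, a sketch of a single pre-training step under this scheme, reusing dispersive_loss from the sketch above; the policy interface (q_sample, denoise returning an intermediate feature, num_diffusion_steps) and the weight lambda_disp are hypothetical stand-ins for the actual implementation.

import torch
import torch.nn.functional as F

def pretrain_step(policy, batch, optimizer, lambda_disp=0.5):
    # Stage 1: standard denoising (behavior-cloning) loss plus the dispersive
    # regularizer on an intermediate hidden representation.
    obs, actions = batch["obs"], batch["actions"]
    noise = torch.randn_like(actions)
    t = torch.randint(0, policy.num_diffusion_steps, (actions.shape[0],), device=actions.device)
    noisy_actions = policy.q_sample(actions, t, noise)          # forward diffusion
    pred_noise, hidden = policy.denoise(obs, noisy_actions, t)  # hypothetical interface
    loss = F.mse_loss(pred_noise, noise) + lambda_disp * dispersive_loss(hidden)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Stage 2 then fine-tunes the pre-trained policy with policy gradients, as described next.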

Policy Gradient Integration

Policy gradients computed through the iterative denoising process, enabling effective reinforcement learning fine-tuning of the diffusion policy.
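A minimal sketch of how this can work: each reverse-diffusion step is treated as a Gaussian action in the inner layer of the two-layer MDP, so the log-probability of a recorded denoising chain is a sum of per-step Gaussian log-probs that plugs directly into a clipped PPO ratio. The reverse_mean_std helper and the stored chain format are assumptions for illustration, not the exact interface used in the paper.

import torch
from torch.distributions import Normal

def chain_log_prob(policy, obs, chain):
    # chain: list of (x_k, x_km1, k) transitions recorded during rollout, where
    # x_k -> x_km1 is one reverse-diffusion (denoising) step.
    logp = 0.0
    for x_k, x_km1, k in chain:
        mean, std = policy.reverse_mean_std(obs, x_k, k)  # hypothetical helper
        logp = logp + Normal(mean, std).log_prob(x_km1).sum(dim=-1)
    return logp

# PPO then uses the ratio between new and old chain log-probabilities:
#   ratio = torch.exp(chain_log_prob(policy, obs, chain) - old_logp)
#   surrogate = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)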

D²PPO Pre-training Performance Analysis

D²PPO Pre-training Results

Figure 3: Comprehensive pre-training experimental results using D²PPO with dispersive loss across four robotic manipulation tasks. (a) Performance comparison showing baseline (DPPO) versus D²PPO success rates with error bars, demonstrating consistent improvements across all tasks. (b) Distribution of improvement rates across five dispersive loss variants. (c) Task difficulty correlation analysis showing the relationship between log-normalized task complexity and maximum improvement rates. (d) Method suitability matrix heatmap displaying improvement percentages for each dispersive loss variant across tasks.

Policy Gradient Fine-tuning Performance

D²PPO Fine-tuning Results

Figure 4: Policy gradient fine-tuning results across four robotic manipulation tasks. The learning curves show that D²PPO consistently achieves better sample efficiency and higher final performance than baseline DPPO and Gaussian policies, with the enhanced representations translating into stronger reinforcement learning results.

Simulation Results

D²PPO Performance on RoboMimic Tasks

Lift Task

Basic object manipulation task

Can Task

Cylindrical can grasping and placement task

Square Task

Precise placement of a square nut onto a peg

Transport Task

Multi-object coordination and handover task

Real Robot Validation

Franka Panda Arm Experiments

Real robot validation on a Franka Panda arm confirms the practical effectiveness of our approach.

Lift Success (with Dispersive Loss)
Square Success (with Dispersive Loss)
Square Fail (without Dispersive Loss)
Can Success (with Dispersive Loss)
Transport Success (with Dispersive Loss)
Transport Fail (without Dispersive Loss)

Experimental Results

+22.7% Pre-training Improvement
+26.1% Fine-tuning Improvement
4/4 Tasks Improved
0.94 Average Success Rate

Key Findings

Representation Collapse: Identified diffusion representation collapse as the core problem limiting diffusion policies in precise manipulation.
Dispersive Loss: Early-layer regularization works best for simple tasks, late-layer regularization for complex tasks (see the layer-selection sketch below).
SOTA Results: 94% average success rate, with 22.7% pre-training and 26.1% fine-tuning improvements.
Real-World Validation: Successful deployment on a Franka Panda robot across all benchmark tasks.
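As an illustration of the layer-selection finding, a small sketch of how the regularized layer could be chosen with a forward hook; the module path policy.encoder.blocks and the specific indices are hypothetical, not the paper's actual architecture.

import torch

def attach_dispersive_hook(policy, use_late_layer: bool):
    # Early-layer features for simple tasks, late-layer features for complex
    # tasks; the captured activation is then fed to dispersive_loss in training.
    block = policy.encoder.blocks[-1] if use_late_layer else policy.encoder.blocks[1]
    feats = {}
    def hook(_module, _inputs, output):
        feats["hidden"] = output
    handle = block.register_forward_hook(hook)
    return feats, handle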

Citation

@misc{zou2025d2ppodiffusionpolicypolicy,
      title={D2PPO: Diffusion Policy Policy Optimization with Dispersive Loss}, 
      author={Guowei Zou and Weibing Li and Hejun Wu and Yukun Qian and Yuhang Wang and Haitao Wang},
      year={2025},
      eprint={2508.02644},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.02644}, 
}