R3D: Revisiting 3D Policy Learning

¹Zhejiang University     ²ShanghaiTech University
*Equal Contribution     †Corresponding Author
TL;DR: We diagnose why scaling 3D policy learning fails — Batch Normalization and missing data augmentation — and propose a transformer-diffusion architecture that significantly outperforms SOTA baselines on simulation and real-world benchmarks.

Abstract

3D policy learning promises superior generalization and cross-embodiment transfer, but training instabilities and severe overfitting have prevented the adoption of powerful 3D models. We systematically diagnose these failures — identifying missing 3D data augmentation and Batch Normalization as primary causes — and propose a scalable transformer-based 3D encoder coupled with a Diffusion Transformer decoder. Our method significantly outperforms state-of-the-art baselines on challenging manipulation benchmarks.


Motivation

A persistent "scaling paradox" plagues 3D policy learning: more powerful backbones lead to worse results. We systematically investigate why, and identify two overlooked root causes.


Revisiting 3D Policy Learning

Overcoming the Scaling Paradox: From Batch Norm to Layer Norm

Naively replacing PointNet with a stronger Uni3D encoder causes severe performance degradation. We trace the root cause to Batch Normalization (BN), which struggles under the small batch sizes typical in imitation learning. Switching to Layer Normalization (LN) makes Uni3D not only trainable, but significantly better than PointNet. LN is a more robust default for 3D policies.
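To make the swap concrete, below is a minimal PyTorch sketch of a PointNet-style shared MLP with LayerNorm in place of BatchNorm; the layer sizes are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class PointMLP(nn.Module):
    """PointNet-style shared MLP over per-point features (sizes illustrative)."""

    def __init__(self, dims=(3, 64, 128, 256)):
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [
                nn.Linear(d_in, d_out),
                # LayerNorm normalizes each point's feature vector on its own,
                # so its statistics are independent of the (small) batch size.
                # An nn.BatchNorm1d here would instead share noisy statistics
                # across the whole batch, which is what destabilizes training.
                nn.LayerNorm(d_out),
                nn.ReLU(),
            ]
        self.mlp = nn.Sequential(*layers)

    def forward(self, points):   # points: (B, N, 3)
        return self.mlp(points)  # per-point features: (B, N, dims[-1])
```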

BN vs. LN Success Rate (%)

| Method | Beat Hammer | Move Card | Place Shoe | Avg. |
| --- | --- | --- | --- | --- |
| DP3 (PointNet+BN) | 0 | 3 | 0 | 1.0 |
| DP3 (PointNet+LN) | 79 | 57 | 43 | 59.6 |
| DP3 (Uni3D+BN) | 0 | 0 | 0 | 0.0 |
| DP3 (Uni3D+LN) | 86 | 60 | 48 | 64.7 |

Mitigating Overfitting via 3D Data Augmentation

Data Augmentation Learning Curves

Without augmentation, success rates decline as training progresses. Our pipeline stabilizes training and significantly boosts generalization.

3D policy training is prone to overfitting. We introduce a three-part augmentation pipeline (a code sketch follows the list):

1. FPS Randomization: randomize point order each step to prevent over-reliance on a deterministic sampling order.

2. Color Jitter: random brightness, contrast, and saturation variations on the RGB channels of the point cloud.

3. Noise & Dropout: Gaussian noise on coordinates and proprioception, plus random point dropout for robustness to incomplete inputs.
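A minimal sketch of the three augmentations, assuming single-cloud tensors; the jitter range, noise scale, and dropout ratio are placeholder values, and shuffling point order is one plausible implementation of FPS randomization (randomizing the FPS seed point would be another).

```python
import torch

def augment(xyz, rgb, proprio, noise_std=0.005, drop_ratio=0.1):
    """Sketch of the three-part pipeline (hyperparameters are assumptions).
    xyz: (N, 3) coordinates, rgb: (N, 3) colors in [0, 1], proprio: (D,)."""
    n = xyz.shape[0]

    # 1. Sampling randomization: shuffle point order every step so the policy
    #    cannot rely on a deterministic FPS ordering.
    perm = torch.randperm(n)
    xyz, rgb = xyz[perm], rgb[perm]

    # 2. Color jitter: random brightness, contrast, and saturation factors
    #    drawn from [0.8, 1.2] (range is a guess).
    b, c, s = (1 + (torch.rand(3) * 2 - 1) * 0.2).unbind()
    rgb = (rgb * b).clamp(0, 1)                              # brightness
    rgb = ((rgb - rgb.mean()) * c + rgb.mean()).clamp(0, 1)  # contrast
    gray = rgb.mean(dim=-1, keepdim=True)
    rgb = (gray + (rgb - gray) * s).clamp(0, 1)              # saturation

    # 3. Gaussian noise on coordinates and proprioception, plus point dropout.
    xyz = xyz + torch.randn_like(xyz) * noise_std
    proprio = proprio + torch.randn_like(proprio) * noise_std
    keep = torch.rand(n) > drop_ratio
    return xyz[keep], rgb[keep], proprio
```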


Improving 3D Policy Architecture

R3D Pipeline Architecture

Pipeline. A point cloud encoder (pre-trained on 3D segmentation) outputs dense spatial tokens, which the Diffusion Transformer decoder attends to via cross-attention.

Decoder Comparison

Decoder comparison. DP3 collapses features to a global vector and uses FiLM conditioning. Ours preserves dense spatial tokens and uses cross-attention for spatially-aware denoising.

Spatially-Aware Decoding

Cross-attention between action queries and dense geometric tokens preserves spatial resolution for precise manipulation — unlike DP3's global feature + FiLM.
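Below is a minimal PyTorch sketch of one such cross-attention block, with action tokens as queries and the encoder's dense point tokens as keys and values; the dimensions, head count, and pre-norm layout are assumptions rather than the paper's exact DiT block.

```python
import torch.nn as nn

class ActionCrossAttention(nn.Module):
    """Noisy action tokens attend to dense spatial tokens (sizes assumed)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, actions, point_tokens):
        # actions: (B, T, dim) noisy action tokens from the diffusion decoder;
        # point_tokens: (B, N, dim) dense spatial tokens from the 3D encoder.
        q, kv = self.norm_q(actions), self.norm_kv(point_tokens)
        attended, _ = self.attn(q, kv, kv)  # each action step gathers geometry
        x = actions + attended              # residual connection
        return x + self.ff(self.norm_ff(x))
```

Unlike FiLM, which modulates every action token with the same global vector, each action token here can attend to a different region of the scene.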

3D Encoder Pretraining

The encoder is pretrained on ScanNet, ARKitScenes, and PartNeXt via the PointSAM architecture, injecting geometric priors that accelerate convergence and improve robustness.

Multi-Objective Decoding

The decoder jointly predicts end-effector poses and joint angles using causal attention masking, providing proprioceptive grounding without degrading primary joint control.
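The construction of this mask is not spelled out above, so the following is only one plausible scheme: tokens are causal in time, and the primary joint-angle tokens are blocked from attending to the auxiliary end-effector tokens, so the EE objective adds grounding without feeding back into joint control.

```python
import torch

def multi_objective_mask(T, n_joint=1, n_ee=1):
    """Boolean attention mask (True = blocked), following the convention of
    nn.MultiheadAttention's attn_mask. Token layout per timestep (assumed):
    [joint tokens, EE tokens]."""
    k = n_joint + n_ee
    t_idx = torch.arange(T * k) // k                   # timestep of each token
    is_ee = (torch.arange(T * k) % k) >= n_joint       # slot type per token
    causal = t_idx[:, None] >= t_idx[None, :]          # attend to past/present
    joint_to_ee = (~is_ee)[:, None] & is_ee[None, :]   # joint query -> EE key
    return ~(causal & ~joint_to_ee)
```

The resulting mask would be passed as `attn_mask` during self-attention over the interleaved joint/EE token sequence.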


Simulation Experiments

Evaluated on RoboTwin 2.0 (bimanual, 50 demos per task, 1024-point clouds) and ManiSkill2 (single-arm, 1000 demos). We outperform 2D, 2.5D, and 3D baselines across all benchmarks.

Simulation Benchmark Tasks

Simulation settings. Point clouds are obtained by cropping the table/ground and downsampling to 1024 points.
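As a concrete illustration of this preprocessing, here is a NumPy sketch that crops an axis-aligned workspace box and then runs farthest point sampling down to 1024 points; the box bounds and the exact FPS variant are assumptions.

```python
import numpy as np

def preprocess(xyz, rgb, lo, hi, n_points=1024):
    """Crop to the workspace box [lo, hi] (removing table/ground points),
    then farthest-point-sample down to n_points. Assumes enough points
    survive the crop; a real pipeline would pad or resample otherwise."""
    keep = np.all((xyz >= lo) & (xyz <= hi), axis=1)
    xyz, rgb = xyz[keep], rgb[keep]

    # Simple O(N * n_points) farthest point sampling.
    n = xyz.shape[0]
    sel = np.empty(n_points, dtype=np.int64)
    sel[0] = np.random.randint(n)
    dist = np.linalg.norm(xyz - xyz[sel[0]], axis=1)
    for i in range(1, n_points):
        sel[i] = dist.argmax()
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[sel[i]], axis=1))
    return xyz[sel], rgb[sel]
```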

RoboTwin 2.0 — Easy Setting

Task visualizations: Beat Block Hammer, Move Card Away, Turn Switch, Open Microwave, Place Shoe.

RoboTwin 2.0 — Easy Success Rate (%)

| Method | Beat Hammer | Move Card | Turn Switch | Open Micro. | Place Shoe | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| DP + eVGGT | 17 | 4 | 29 | 69 | 9 | 25.6 |
| Spatial Forcing | 47 | 60 | 37 | 85 | 22 | 50.2 |
| DP3 | 72 | 68 | 46 | 61 | 58 | 61.0 |
| RDT | 77 | 43 | 35 | 37 | 35 | 45.4 |
| DP | 42 | 47 | 36 | 5 | 23 | 30.6 |
| ManiFlow | 94 | 74 | 43 | 98 | 56 | 73.0 |
| R3D | 91 | 85 | 56 | 100 | 87 | 83.8 |

RoboTwin 2.0 — Hard Setting

Task visualizations: Beat Block Hammer, Lift Pot, Adjust Bottle, Burger Fries, Press Stapler.

RoboTwin 2.0 — Hard Success Rate (%)

| Method | Beat Hammer | Lift Pot | Adjust Bottle | Burger Fries | Press Stapler | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| ACT + eVGGT | 7 | 2 | 20 | 10 | 14 | 10.6 |
| DP + eVGGT | 18 | 14 | 52 | 26 | 20 | 26.0 |
| DP3 | 8 | 10 | 62 | 18 | 15 | 22.6 |
| RDT | 9 | 12 | 50 | 20 | 20 | 22.2 |
| ManiFlow | 31 | 65 | 96 | 72 | 44 | 61.6 |
| R3D | 37 | 67 | 98 | 95 | 27 | 64.8 |

ManiSkill2

Task visualizations: Pick Cube, Stack Cube, Peg Insertion.

ManiSkill2 Success Rate (%)

| Method | Pick Cube | Stack Cube | PegIns. Grasp | PegIns. Align | PegIns. Insert | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| DP + SPuNet | 71 | 4 | 82 | 9 | 1 | 33.4 |
| DP + PointNet | 70 | 0 | 83 | 16 | 1 | 34.0 |
| ACT + PonderV2 | 87 | 35 | 65 | 23 | 2 | 42.4 |
| DP3 | 57 | 14 | 41 | 3 | 0 | 23.0 |
| R3D | 97 | 24 | 97 | 75 | 21 | 55.2 |

Real-World Experiments

The tasks span pick-and-place, articulated (hinged) objects, long-horizon sequences, and soft-body manipulation. Experiments are conducted on an xArm6 robot with two Intel RealSense D435 cameras (one eye-to-hand, one eye-in-hand). Point clouds are subsampled to 8,192 points, and each method is evaluated over 50 trials.

Task visualizations: Place Kettle, Open Drawer, Fold Towel, Place Cup, Stack Three Blocks, Fit Banana.

Real-World Results Success Rate (%)

| Task | DP | Pi0 | DP3 | ManiFlow | R3D |
| --- | --- | --- | --- | --- | --- |
| Place Kettle | 64 | 46 | 40 | 36 | 76 |
| Open Drawer | 46 | 48 | 28 | 46 | 64 |
| Fold Towel | 36 | 62 | 54 | 60 | 66 |
| Average | 48.7 | 52.0 | 40.7 | 47.3 | 68.7 |

Single Camera View vs. Two Camera Views

Our spatially-aware design better exploits multi-view geometry while maintaining strong single-view performance.

Single vs. Two Views Success Rate (%)

| Task | DP 1-view | DP 2-view | DP3 1-view | DP3 2-view | ManiFlow 1-view | ManiFlow 2-view | R3D 1-view | R3D 2-view |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Place Kettle | 28 | 64 | 18 | 40 | 20 | 36 | 40 | 76 |
| Open Drawer | 36 | 46 | 12 | 28 | 40 | 46 | 52 | 64 |
| Fold Towel | 28 | 36 | 48 | 54 | 54 | 60 | 62 | 66 |
| Avg. | 30.7 | 48.7 | 26.0 | 40.7 | 38.0 | 47.3 | 51.3 | 68.7 |

Robustness to Dynamic Lighting

Disco lighting changes scene colors dynamically during evaluation. Our color jitter augmentation provides strong robustness to such variations.

Normal vs. Disco Light Success Rate (%)

| Task | DP Normal | DP Disco | DP3 Normal | DP3 Disco | ManiFlow Normal | ManiFlow Disco | R3D Normal | R3D Disco |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Place Kettle | 64 | 44 | 40 | 32 | 36 | 32 | 76 | 58 |
| Open Drawer | 46 | 34 | 28 | 22 | 46 | 36 | 64 | 56 |
| Fold Towel | 36 | 32 | 54 | 38 | 60 | 54 | 66 | 62 |
| Avg. | 48.7 | 36.7 | 40.7 | 30.7 | 47.3 | 40.7 | 68.7 | 58.7 |

Scaling and Ablation Study

Network Scaling

ViT-tiny (53 MB) is optimal for 1024-point clouds, while ViT-small (115 MB) is preferred for 8192-point real-world inputs. A decoder depth of 4 attention blocks offers the best tradeoff.

Decoder Depth Ablation

Success rates peak at 4–8 attention blocks; deeper decoders overfit.

Encoder Size — 1024 Points, Success Rate (%)

| Task | tiny (53 MB) | small (115 MB) | base (366 MB) |
| --- | --- | --- | --- |
| Move Playingcard | 85 | 92 | 83 |
| Turn Switch | 56 | 55 | 50 |
| Place Shoe | 87 | 68 | 70 |
| Open Microwave | 100 | 93 | 77 |
| Beat Block Hammer | 91 | 76 | 85 |
| Average | 83.8 | 76.8 | 73.0 |

Encoder Size vs. Demonstrations, Avg. Success Rate (%) over 5 Easy Tasks

| Encoder | 50 demos | 100 demos | 200 demos |
| --- | --- | --- | --- |
| PointSAM-tiny | 83.8 | 95.3 | 97.8 |
| PointSAM-small | 76.8 | 89.0 | 92.4 |
| PointSAM-base | 73.0 | 86.5 | 88.2 |

Encoder Size — 8192 Points (Real-World Dense Setting), Success Rate (%)

| Task | tiny (53 MB) | small (115 MB) | base (366 MB) |
| --- | --- | --- | --- |
| Move Playingcard | 65 | 87 | 72 |
| Turn Switch | 47 | 56 | 64 |
| Place Shoe | 75 | 79 | 85 |
| Open Microwave | 79 | 98 | 92 |
| Beat Block Hammer | 74 | 89 | 88 |
| Average | 68.0 | 81.8 | 80.2 |

Ablation Study

Dense feature conditioning and encoder pretraining are the two primary drivers of performance. Auxiliary EE prediction provides additional gains.

Dense vs. Global Feature

(a) Dense vs. global feature conditioning — dense features yield higher and more stable success rates.

Pretrained vs. Scratch

(b) Pretrained vs. train-from-scratch encoder — pretraining accelerates convergence significantly.

Component Ablation Success Rate (%)

| Method | Place Shoe | Move Card | Beat Hammer | Turn Switch | Average |
| --- | --- | --- | --- | --- | --- |
| Pretrained PointSAM (Global Feat.) | 60 | 76 | 86 | 28 | 62.50 |
| Pretrained PointSAM (Dense Feat.) | 92 | 70 | 94 | 54 | 77.50 |
| Scratch PointSAM (Dense Feat.) | 60 | 58 | 90 | 55 | 65.75 |
| R3D (Pretrained Dense + Aux. EE) | 87 | 85 | 91 | 56 | 79.75 |

BibTeX

@article{hong2026r3d,
  title={R3D: Revisiting 3D Policy Learning},
  author={Hong, Zhengdong and Wu, Shenrui and Cui, Haozhe and Zhao, Boyi and Ji, Ran and He, Yiyang and Zhang, Hangxing and Ke, Zundong and Wang, Jun and Zhang, Guofeng and Gu, Jiayuan},
  journal={arXiv preprint arXiv:2604.15281},
  year={2026}
}