3D policy learning promises superior generalization and cross-embodiment transfer, but training instabilities and severe overfitting have prevented the adoption of powerful 3D models. We systematically diagnose these failures, identifying the absence of 3D data augmentation and the use of Batch Normalization as the primary causes, and propose a scalable transformer-based 3D encoder coupled with a Diffusion Transformer decoder. Our method significantly outperforms state-of-the-art baselines on challenging manipulation benchmarks.
A persistent "scaling paradox" plagues 3D policy learning: more powerful backbones lead to worse results. We systematically investigate why, and identify two overlooked root causes.
Naively replacing PointNet with a stronger Uni3D encoder causes severe performance degradation. We trace the root cause to Batch Normalization (BN), which struggles under the small batch sizes typical in imitation learning. Switching to Layer Normalization (LN) makes Uni3D not only trainable, but significantly better than PointNet. LN is a more robust default for 3D policies.
| Method | Beat Hammer | Move Card | Place Shoe | Avg. |
|---|---|---|---|---|
| DP3 (PointNet+BN) | 0 | 3 | 0 | 1.0 |
| DP3 (PointNet+LN) | 79 | 57 | 43 | 59.6 |
| DP3 (Uni3D+BN) | 0 | 0 | 0 | 0.0 |
| DP3 (Uni3D+LN) | 86 | 60 | 48 | 64.7 |
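The failure mode can be illustrated with a toy computation (a numpy sketch, not the paper's code): BatchNorm statistics are computed across the batch and collapse at the batch size of one common in low-data imitation learning, while LayerNorm statistics are computed per sample and are unaffected by batch composition.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize over the batch axis: statistics depend on batch composition.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis: each sample is self-contained.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(1, 64))  # a batch of one
print(np.abs(batch_norm(x)).max())  # 0.0: a lone sample normalizes to zero
print(np.abs(layer_norm(x)).max())  # O(1): per-sample statistics survive
```

With a single sample, BatchNorm's per-feature mean equals the sample itself, so every feature is driven to zero; LayerNorm leaves a well-scaled feature vector regardless of batch size.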
Without augmentation, success rates decline as training progresses. Our pipeline stabilizes training and significantly boosts generalization.
3D policy training is prone to overfitting. We introduce a three-part augmentation pipeline:
- **Point order shuffling.** Randomize the point order at each training step to prevent over-reliance on a deterministic sampling order.
- **Color jitter.** Apply random brightness, contrast, and saturation variations to the RGB channels of the point cloud.
- **Noise and dropout.** Add Gaussian noise to coordinates and proprioception, plus random point dropout for robustness to incomplete inputs.
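The three parts above can be sketched as a single training-time transform (illustrative only; the jitter ranges, noise scale, and dropout rate here are assumptions, not the paper's hyperparameters):

```python
import numpy as np

def augment(points, colors, rng):
    """One augmentation step for a colored point cloud (hypothetical sketch)."""
    # 1. Randomize point order to break any deterministic sampling order.
    perm = rng.permutation(len(points))
    points, colors = points[perm], colors[perm]
    # 2. Color jitter: random brightness/contrast on the RGB channels.
    brightness = rng.uniform(-0.1, 0.1)
    contrast = rng.uniform(0.8, 1.2)
    colors = np.clip((colors - 0.5) * contrast + 0.5 + brightness, 0.0, 1.0)
    # 3. Gaussian coordinate noise plus random point dropout.
    points = points + rng.normal(scale=0.005, size=points.shape)
    keep = rng.random(len(points)) > 0.05  # drop ~5% of points
    return points[keep], colors[keep]

rng = np.random.default_rng(0)
pts, cols = augment(np.random.rand(1024, 3), np.random.rand(1024, 3), rng)
```

Proprioception noise (the same Gaussian-perturbation idea applied to the robot-state vector) is omitted here for brevity.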
Pipeline. A point cloud encoder (pre-trained on 3D segmentation) outputs dense spatial tokens, which the Diffusion Transformer decoder attends to via cross-attention.
Decoder comparison. DP3 collapses features to a global vector and uses FiLM conditioning. Ours preserves dense spatial tokens and uses cross-attention for spatially-aware denoising.
Cross-attention between action queries and dense geometric tokens preserves spatial resolution for precise manipulation — unlike DP3's global feature + FiLM.
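The conditioning difference can be sketched in a few lines (numpy, single head, identity projections; a real decoder block learns query/key/value projections and stacks several layers):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens, d):
    """queries: (A, d) action queries; tokens: (N, d) dense geometric tokens."""
    scores = queries @ tokens.T / np.sqrt(d)  # (A, N) per-query attention map
    return softmax(scores) @ tokens           # each query reads its own
                                              # weighted region of the cloud

rng = np.random.default_rng(0)
tokens = rng.normal(size=(512, 64))   # dense spatial tokens from the encoder
queries = rng.normal(size=(16, 64))   # one query per denoised action step
out = cross_attention(queries, tokens, 64)
print(out.shape)  # (16, 64)
```

Each action query attends over all 512 spatial tokens individually, whereas a global-feature + FiLM pipeline would first pool the tokens into one vector, discarding the per-region detail before the decoder ever sees it.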
Pretrain on ScanNet, ARKitScenes, and PartNeXt via the PointSAM architecture, injecting geometric priors that accelerate convergence and improve robustness.
Jointly decode end-effector poses and joint angles using causal attention masking, providing proprioceptive grounding without degrading primary joint control.
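One way to realize this masking (our reading of the design, not the paper's exact mask): auxiliary end-effector (EE) tokens may attend to joint-angle tokens, but joint tokens never attend to EE tokens, so the auxiliary head cannot perturb primary joint control.

```python
import numpy as np

def block_causal_mask(n_joint, n_ee):
    """Hypothetical block-causal attention mask. True = attention allowed."""
    n = n_joint + n_ee
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_joint, :n_joint] = True  # joint tokens attend only among themselves
    mask[n_joint:, :] = True         # EE tokens additionally see joint tokens
    return mask

m = block_causal_mask(3, 2)  # 3 joint-angle tokens, 2 auxiliary EE tokens
```

Disallowed positions would be set to `-inf` in the attention scores before the softmax, as in a standard masked transformer.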
Evaluated on RoboTwin 2.0 (bimanual, 50 demos per task, 1024-point clouds) and ManiSkill2 (single-arm, 1000 demos). On average, we outperform 2D, 2.5D, and 3D baselines across all benchmarks.
Simulation settings. Point clouds are obtained by cropping the table/ground and downsampling to 1024 points.
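This preprocessing amounts to a workspace crop followed by downsampling to a fixed budget (a minimal sketch; the crop bounds and uniform random sampling here are illustrative assumptions, not the benchmark's exact scheme):

```python
import numpy as np

def preprocess(points, bounds, n=1024, rng=None):
    """Crop a point cloud to a workspace box, then downsample to n points."""
    if rng is None:
        rng = np.random.default_rng()
    lo, hi = bounds
    keep = np.all((points >= lo) & (points <= hi), axis=1)  # drop table/ground
    pts = points[keep]
    # Sample with replacement only when fewer than n points survive the crop.
    idx = rng.choice(len(pts), size=n, replace=len(pts) < n)
    return pts[idx]

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(20000, 3))
lo = np.array([-0.5, -0.5, 0.0])   # z >= 0 removes the ground plane
hi = np.array([0.5, 0.5, 1.0])
obs = preprocess(cloud, (lo, hi), rng=rng)
print(obs.shape)  # (1024, 3)
```

Farthest-point sampling is a common alternative to uniform sampling here; it gives better spatial coverage at extra cost.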
Tasks: Beat Block Hammer, Move Card Away, Turn Switch, Open Microwave, Place Shoe.
| Method | Beat Hammer | Move Card | Turn Switch | Open Micro. | Place Shoe | Avg. |
|---|---|---|---|---|---|---|
| DP + eVGGT | 17 | 4 | 29 | 69 | 9 | 25.6 |
| Spatial Forcing | 47 | 60 | 37 | 85 | 22 | 50.2 |
| DP3 | 72 | 68 | 46 | 61 | 58 | 61.0 |
| RDT | 77 | 43 | 35 | 37 | 35 | 45.4 |
| DP | 42 | 47 | 36 | 5 | 23 | 30.6 |
| ManiFlow | 94 | 74 | 43 | 98 | 56 | 73.0 |
| R3D | 91 | 85 | 56 | 100 | 87 | 83.8 |
Tasks: Beat Block Hammer, Lift Pot, Adjust Bottle, Burger Fries, Press Stapler.
| Method | Beat Hammer | Lift Pot | Adjust Bottle | Burger Fries | Press Stapler | Avg. |
|---|---|---|---|---|---|---|
| ACT + eVGGT | 7 | 2 | 20 | 10 | 14 | 10.6 |
| DP + eVGGT | 18 | 14 | 52 | 26 | 20 | 26.0 |
| DP3 | 8 | 10 | 62 | 18 | 15 | 22.6 |
| RDT | 9 | 12 | 50 | 20 | 20 | 22.2 |
| ManiFlow | 31 | 65 | 96 | 72 | 44 | 61.6 |
| R3D | 37 | 67 | 98 | 95 | 27 | 64.8 |
Tasks: Pick Cube, Stack Cube, Peg Insertion.
| Method | Pick Cube | Stack Cube | PegIns. Grasp | PegIns. Align | PegIns. Insert | Avg. |
|---|---|---|---|---|---|---|
| DP + SPuNet | 71 | 4 | 82 | 9 | 1 | 33.4 |
| DP + PointNet | 70 | 0 | 83 | 16 | 1 | 34.0 |
| ACT + PonderV2 | 87 | 35 | 65 | 23 | 2 | 42.4 |
| DP3 | 57 | 14 | 41 | 3 | 0 | 23.0 |
| R3D | 97 | 24 | 97 | 75 | 21 | 62.8 |
The real-world tasks span pick-and-place, articulated-object, long-horizon, and soft-body manipulation. Experiments are conducted on an xArm6 robot with two Intel RealSense D435 cameras (one eye-to-hand, one eye-in-hand). Point clouds are downsampled to 8,192 points. Each method is evaluated over 50 trials per task.
Tasks: Place Kettle, Open Drawer, Fold Towel, Place Cup, Stack Three Blocks, Fit Banana.
| Task | DP | Pi0 | DP3 | ManiFlow | R3D |
|---|---|---|---|---|---|
| Place Kettle | 64 | 46 | 40 | 36 | 76 |
| Open Drawer | 46 | 48 | 28 | 46 | 64 |
| Fold Towel | 36 | 62 | 54 | 60 | 66 |
| Average | 48.7 | 52.0 | 40.7 | 47.3 | 68.7 |
Our spatially-aware design better exploits multi-view geometry while maintaining strong single-view performance.
| Task | DP (1-view) | DP (2-view) | DP3 (1-view) | DP3 (2-view) | ManiFlow (1-view) | ManiFlow (2-view) | R3D (1-view) | R3D (2-view) |
|---|---|---|---|---|---|---|---|---|
| Place Kettle | 28 | 64 | 18 | 40 | 20 | 36 | 40 | 76 |
| Open Drawer | 36 | 46 | 12 | 28 | 40 | 46 | 52 | 64 |
| Fold Towel | 28 | 36 | 48 | 54 | 54 | 60 | 62 | 66 |
| Avg. | 30.7 | 48.7 | 26.0 | 40.7 | 38.0 | 47.3 | 51.3 | 68.7 |
Disco lighting changes scene colors dynamically during evaluation. Our color jitter augmentation provides strong robustness to such variations.
| Task | DP (Normal) | DP (Disco) | DP3 (Normal) | DP3 (Disco) | ManiFlow (Normal) | ManiFlow (Disco) | R3D (Normal) | R3D (Disco) |
|---|---|---|---|---|---|---|---|---|
| Place Kettle | 64 | 44 | 40 | 32 | 36 | 32 | 76 | 58 |
| Open Drawer | 46 | 34 | 28 | 22 | 46 | 36 | 64 | 56 |
| Fold Towel | 36 | 32 | 54 | 38 | 60 | 54 | 66 | 62 |
| Avg. | 48.7 | 36.7 | 40.7 | 30.7 | 47.3 | 40.7 | 68.7 | 58.7 |
ViT-tiny (53MB) is optimal for 1024-point clouds; ViT-small (115MB) is preferred for 8192-point real-world inputs. Decoder depth of 4 attention blocks offers the best tradeoff.
Success rates peak at 4–8 attention blocks; deeper decoders overfit.
Encoder size, 1024-point clouds:
| Task | tiny (53MB) | small (115MB) | base (366MB) |
|---|---|---|---|
| Move Playingcard | 85 | 92 | 83 |
| Turn Switch | 56 | 55 | 50 |
| Place Shoe | 87 | 68 | 70 |
| Open Microwave | 100 | 93 | 77 |
| Beat Block Hammer | 91 | 76 | 85 |
| Average | 83.8 | 76.8 | 73.0 |
Data scaling (average success rate vs. number of demos):
| Encoder | 50 demos | 100 demos | 200 demos |
|---|---|---|---|
| PointSAM-tiny | 83.8 | 95.3 | 97.8 |
| PointSAM-small | 76.8 | 89.0 | 92.4 |
| PointSAM-base | 73.0 | 86.5 | 88.2 |
Encoder size, 8192-point clouds:
| Task | tiny (53MB) | small (115MB) | base (366MB) |
|---|---|---|---|
| Move Playingcard | 65 | 87 | 72 |
| Turn Switch | 47 | 56 | 64 |
| Place Shoe | 75 | 79 | 85 |
| Open Microwave | 79 | 98 | 92 |
| Beat Block Hammer | 74 | 89 | 88 |
| Average | 68.0 | 81.8 | 80.2 |
Dense feature conditioning and encoder pretraining are the two primary drivers of performance. Auxiliary EE prediction provides additional gains.
(a) Dense vs. global feature conditioning — dense features yield higher and more stable success rates.
(b) Pretrained vs. train-from-scratch encoder — pretraining accelerates convergence significantly.
| Method | Place Shoe | Move Card | Beat Hammer | Turn Switch | Average |
|---|---|---|---|---|---|
| Pretrained PointSAM (Global Feat.) | 60 | 76 | 86 | 28 | 62.50 |
| Pretrained PointSAM (Dense Feat.) | 92 | 70 | 94 | 54 | 77.50 |
| Scratch PointSAM (Dense Feat.) | 60 | 58 | 90 | 55 | 65.75 |
| R3D (Pretrained Dense + Aux. EE) | 87 | 85 | 91 | 56 | 79.75 |
@article{hong2026r3d,
title={R3D: Revisiting 3D Policy Learning},
author={Hong, Zhengdong and Wu, Shenrui and Cui, Haozhe and Zhao, Boyi and Ji, Ran and He, Yiyang and Zhang, Hangxing and Ke, Zundong and Wang, Jun and Zhang, Guofeng and Gu, Jiayuan},
journal={arXiv preprint arXiv:2604.15281},
year={2026}
}