3D policy learning promises superior generalization and cross-embodiment transfer, but training instabilities and severe overfitting have prevented the adoption of powerful 3D models. We systematically diagnose these failures, identifying the absence of 3D data augmentation and the use of Batch Normalization as the primary causes, and propose a scalable transformer-based 3D encoder coupled with a Diffusion Transformer decoder. Our method significantly outperforms state-of-the-art baselines on challenging manipulation benchmarks.
A persistent "scaling paradox" plagues 3D policy learning: more powerful backbones lead to worse results. We systematically investigate why, and identify two overlooked root causes.
Naively replacing PointNet with a stronger Uni3D encoder causes severe performance degradation. We trace the root cause to Batch Normalization (BN), which struggles under the small batch sizes typical in imitation learning. Switching to Layer Normalization (LN) makes Uni3D not only trainable, but significantly better than PointNet. LN is a more robust default for 3D policies.
| Method | Beat Hammer | Move Card | Place Shoe | Avg. |
|---|---|---|---|---|
| DP3 (PointNet+BN) | 0 | 3 | 0 | 1.0 |
| DP3 (PointNet+LN) | 79 | 57 | 43 | 59.6 |
| DP3 (Uni3D+BN) | 0 | 0 | 0 | 0.0 |
| DP3 (Uni3D+LN) | 86 | 60 | 48 | 64.7 |
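The failure mode can be illustrated with a toy computation (a numpy sketch, not the paper's code): BatchNorm statistics are computed across the batch and collapse at the batch size of one common in low-data imitation learning, while LayerNorm statistics are computed per sample and are unaffected by batch composition.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize over the batch axis: statistics depend on batch composition.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis: each sample is self-contained.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(1, 64))  # a batch of one
print(np.abs(batch_norm(x)).max())  # 0.0: a lone sample normalizes to zero
print(np.abs(layer_norm(x)).max())  # O(1): per-sample statistics survive
```

With a single sample, BatchNorm's per-feature mean equals the sample itself, so every feature is driven to zero; LayerNorm leaves a well-scaled feature vector regardless of batch size.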
Without augmentation, success rates decline as training progresses. Our pipeline stabilizes training and significantly boosts generalization.
3D policy training is prone to overfitting. We introduce a three-part augmentation pipeline:
- **Point order shuffling.** Randomize the point order at each training step to prevent over-reliance on a deterministic sampling order.
- **Color jitter.** Apply random brightness, contrast, and saturation variations to the RGB channels of the point cloud.
- **Noise and dropout.** Add Gaussian noise to coordinates and proprioception, plus random point dropout for robustness to incomplete inputs.
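The three parts above can be sketched as a single training-time transform (illustrative only; the jitter ranges, noise scale, and dropout rate here are assumptions, not the paper's hyperparameters):

```python
import numpy as np

def augment(points, colors, rng):
    """One augmentation step for a colored point cloud (hypothetical sketch)."""
    # 1. Randomize point order to break any deterministic sampling order.
    perm = rng.permutation(len(points))
    points, colors = points[perm], colors[perm]
    # 2. Color jitter: random brightness/contrast on the RGB channels.
    brightness = rng.uniform(-0.1, 0.1)
    contrast = rng.uniform(0.8, 1.2)
    colors = np.clip((colors - 0.5) * contrast + 0.5 + brightness, 0.0, 1.0)
    # 3. Gaussian coordinate noise plus random point dropout.
    points = points + rng.normal(scale=0.005, size=points.shape)
    keep = rng.random(len(points)) > 0.05  # drop ~5% of points
    return points[keep], colors[keep]

rng = np.random.default_rng(0)
pts, cols = augment(np.random.rand(1024, 3), np.random.rand(1024, 3), rng)
```

Proprioception noise (the same Gaussian-perturbation idea applied to the robot-state vector) is omitted here for brevity.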
Pipeline. A point cloud encoder (pre-trained on 3D segmentation) outputs dense spatial tokens, which the Diffusion Transformer decoder attends to via cross-attention.
Decoder comparison. DP3 collapses features to a global vector and uses FiLM conditioning. Ours preserves dense spatial tokens and uses cross-attention for spatially-aware denoising.
Cross-attention between action queries and dense geometric tokens preserves spatial resolution for precise manipulation — unlike DP3's global feature + FiLM.
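The conditioning difference can be sketched in a few lines (numpy, single head, identity projections; a real decoder block learns query/key/value projections and stacks several layers):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens, d):
    """queries: (A, d) action queries; tokens: (N, d) dense geometric tokens."""
    scores = queries @ tokens.T / np.sqrt(d)  # (A, N) per-query attention map
    return softmax(scores) @ tokens           # each query reads its own
                                              # weighted region of the cloud

rng = np.random.default_rng(0)
tokens = rng.normal(size=(512, 64))   # dense spatial tokens from the encoder
queries = rng.normal(size=(16, 64))   # one query per denoised action step
out = cross_attention(queries, tokens, 64)
print(out.shape)  # (16, 64)
```

Each action query attends over all 512 spatial tokens individually, whereas a global-feature + FiLM pipeline would first pool the tokens into one vector, discarding the per-region detail before the decoder ever sees it.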
Pretrain on ScanNet, ARKitScenes, and PartNeXt via the PointSAM architecture, injecting geometric priors that accelerate convergence and improve robustness.
Jointly decode end-effector poses and joint angles using causal attention masking, providing proprioceptive grounding without degrading primary joint control.
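One way to realize this masking (our reading of the design, not the paper's exact mask): auxiliary end-effector (EE) tokens may attend to joint-angle tokens, but joint tokens never attend to EE tokens, so the auxiliary head cannot perturb primary joint control.

```python
import numpy as np

def block_causal_mask(n_joint, n_ee):
    """Hypothetical block-causal attention mask. True = attention allowed."""
    n = n_joint + n_ee
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_joint, :n_joint] = True  # joint tokens attend only among themselves
    mask[n_joint:, :] = True         # EE tokens additionally see joint tokens
    return mask

m = block_causal_mask(3, 2)  # 3 joint-angle tokens, 2 auxiliary EE tokens
```

Disallowed positions would be set to `-inf` in the attention scores before the softmax, as in a standard masked transformer.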
Evaluated on RoboTwin 2.0 (bimanual, 50 demos per task, 1024-point clouds) and ManiSkill2 (single-arm, 1000 demos). On average, we outperform 2D, 2.5D, and 3D baselines across all benchmarks.
Simulation settings. Point clouds are obtained by cropping the table/ground and downsampling to 1024 points.
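This preprocessing amounts to a workspace crop followed by downsampling to a fixed budget (a minimal sketch; the crop bounds and uniform random sampling here are illustrative assumptions, not the benchmark's exact scheme):

```python
import numpy as np

def preprocess(points, bounds, n=1024, rng=None):
    """Crop a point cloud to a workspace box, then downsample to n points."""
    if rng is None:
        rng = np.random.default_rng()
    lo, hi = bounds
    keep = np.all((points >= lo) & (points <= hi), axis=1)  # drop table/ground
    pts = points[keep]
    # Sample with replacement only when fewer than n points survive the crop.
    idx = rng.choice(len(pts), size=n, replace=len(pts) < n)
    return pts[idx]

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(20000, 3))
lo = np.array([-0.5, -0.5, 0.0])   # z >= 0 removes the ground plane
hi = np.array([0.5, 0.5, 1.0])
obs = preprocess(cloud, (lo, hi), rng=rng)
print(obs.shape)  # (1024, 3)
```

Farthest-point sampling is a common alternative to uniform sampling here; it gives better spatial coverage at extra cost.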
Tasks: Beat Block Hammer, Move Card Away, Turn Switch, Open Microwave, Place Shoe.
| Method | Beat Hammer | Move Card | Turn Switch | Open Micro. | Place Shoe | Avg. |
|---|---|---|---|---|---|---|
| DP + eVGGT | 17 | 4 | 29 | 69 | 9 | 25.6 |
| Spatial Forcing | 47 | 60 | 37 | 85 | 22 | 50.2 |
| DP3 | 72 | 68 | 46 | 61 | 58 | 61.0 |
| RDT | 77 | 43 | 35 | 37 | 35 | 45.4 |
| DP | 42 | 47 | 36 | 5 | 23 | 30.6 |
| ManiFlow | 94 | 74 | 43 | 98 | 56 | 73.0 |
| R3D | 91 | 85 | 56 | 100 | 87 | 83.8 |
Tasks: Beat Block Hammer, Lift Pot, Adjust Bottle, Burger Fries, Press Stapler.
| Method | Beat Hammer | Lift Pot | Adjust Bottle | Burger Fries | Press Stapler | Avg. |
|---|---|---|---|---|---|---|
| ACT + eVGGT | 7 | 2 | 20 | 10 | 14 | 10.6 |
| DP + eVGGT | 18 | 14 | 52 | 26 | 20 | 26.0 |
| DP3 | 8 | 10 | 62 | 18 | 15 | 22.6 |
| RDT | 9 | 12 | 50 | 20 | 20 | 22.2 |
| ManiFlow | 31 | 65 | 96 | 72 | 44 | 61.6 |
| R3D | 37 | 67 | 98 | 95 | 27 | 64.8 |
Tasks: Pick Cube, Stack Cube, Peg Insertion.
| Method | Pick Cube | Stack Cube | PegIns. Grasp | PegIns. Align | PegIns. Insert | Avg. |
|---|---|---|---|---|---|---|
| DP + SPuNet | 71 | 4 | 82 | 9 | 1 | 33.4 |
| DP + PointNet | 70 | 0 | 83 | 16 | 1 | 34.0 |
| ACT + PonderV2 | 87 | 35 | 65 | 23 | 2 | 42.4 |
| DP3 | 57 | 14 | 41 | 3 | 0 | 23.0 |
| R3D | 97 | 24 | 97 | 75 | 21 | 62.8 |
The real-world tasks span pick-and-place, articulated-object, long-horizon, and soft-body manipulation. Experiments are conducted on an xArm6 robot with two Intel RealSense D435 cameras (one eye-to-hand, one eye-in-hand). Point clouds are downsampled to 8,192 points. Each method is evaluated over 50 trials per task.
Tasks: Place Kettle, Open Drawer, Fold Towel, Place Cup, Stack Three Blocks, Fit Banana.
| Task | DP | Pi0 | DP3 | ManiFlow | R3D |
|---|---|---|---|---|---|
| Place Kettle | 64 | 46 | 40 | 36 | 76 |
| Open Drawer | 46 | 48 | 28 | 46 | 64 |
| Fold Towel | 36 | 62 | 54 | 60 | 66 |
| Average | 48.7 | 52.0 | 40.7 | 47.3 | 68.7 |
Our spatially-aware design better exploits multi-view geometry while maintaining strong single-view performance.
| Task | DP (1-view) | DP (2-view) | DP3 (1-view) | DP3 (2-view) | ManiFlow (1-view) | ManiFlow (2-view) | R3D (1-view) | R3D (2-view) |
|---|---|---|---|---|---|---|---|---|
| Place Kettle | 28 | 64 | 18 | 40 | 20 | 36 | 40 | 76 |
| Open Drawer | 36 | 46 | 12 | 28 | 40 | 46 | 52 | 64 |
| Fold Towel | 28 | 36 | 48 | 54 | 54 | 60 | 62 | 66 |
| Avg. | 30.7 | 48.7 | 26.0 | 40.7 | 38.0 | 47.3 | 51.3 | 68.7 |
Disco lighting changes scene colors dynamically during evaluation. Our color jitter augmentation provides strong robustness to such variations.
| Task | DP (Normal) | DP (Disco) | DP3 (Normal) | DP3 (Disco) | ManiFlow (Normal) | ManiFlow (Disco) | R3D (Normal) | R3D (Disco) |
|---|---|---|---|---|---|---|---|---|
| Place Kettle | 64 | 44 | 40 | 32 | 36 | 32 | 76 | 58 |
| Open Drawer | 46 | 34 | 28 | 22 | 46 | 36 | 64 | 56 |
| Fold Towel | 36 | 32 | 54 | 38 | 60 | 54 | 66 | 62 |
| Avg. | 48.7 | 36.7 | 40.7 | 30.7 | 47.3 | 40.7 | 68.7 | 58.7 |
ViT-tiny (53MB) is optimal for 1024-point clouds; ViT-small (115MB) is preferred for 8192-point real-world inputs. Decoder depth of 4 attention blocks offers the best tradeoff.
Success rates peak at 4–8 attention blocks; deeper decoders overfit.
Encoder size, 1024-point clouds:
| Task | tiny (53MB) | small (115MB) | base (366MB) |
|---|---|---|---|
| Move Playingcard | 85 | 92 | 83 |
| Turn Switch | 56 | 55 | 50 |
| Place Shoe | 87 | 68 | 70 |
| Open Microwave | 100 | 93 | 77 |
| Beat Block Hammer | 91 | 76 | 85 |
| Average | 83.8 | 76.8 | 73.0 |
Data scaling (average success rate vs. number of demos):
| Encoder | 50 demos | 100 demos | 200 demos |
|---|---|---|---|
| PointSAM-tiny | 83.8 | 95.3 | 97.8 |
| PointSAM-small | 76.8 | 89.0 | 92.4 |
| PointSAM-base | 73.0 | 86.5 | 88.2 |
Encoder size, 8192-point clouds:
| Task | tiny (53MB) | small (115MB) | base (366MB) |
|---|---|---|---|
| Move Playingcard | 65 | 87 | 72 |
| Turn Switch | 47 | 56 | 64 |
| Place Shoe | 75 | 79 | 85 |
| Open Microwave | 79 | 98 | 92 |
| Beat Block Hammer | 74 | 89 | 88 |
| Average | 68.0 | 81.8 | 80.2 |
Dense feature conditioning and encoder pretraining are the two primary drivers of performance. Auxiliary EE prediction provides additional gains.
(a) Dense vs. global feature conditioning — dense features yield higher and more stable success rates.
(b) Pretrained vs. train-from-scratch encoder — pretraining accelerates convergence significantly.
| Method | Place Shoe | Move Card | Beat Hammer | Turn Switch | Average |
|---|---|---|---|---|---|
| Pretrained PointSAM (Global Feat.) | 60 | 76 | 86 | 28 | 62.50 |
| Pretrained PointSAM (Dense Feat.) | 92 | 70 | 94 | 54 | 77.50 |
| Scratch PointSAM (Dense Feat.) | 60 | 58 | 90 | 55 | 65.75 |
| R3D (Pretrained Dense + Aux. EE) | 87 | 85 | 91 | 56 | 79.75 |
@article{hong2026r3d,
title={R3D: Revisiting 3D Policy Learning},
author={Hong, Zhengdong and Wu, Shenrui and Cui, Haozhe and Zhao, Boyi and Ji, Ran and He, Yiyang and Zhang, Hangxing and Ke, Zundong and Wang, Jun and Zhang, Guofeng and Gu, Jiayuan},
journal={arXiv preprint arXiv:2604.15281},
year={2026}
}