
RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

Chenxi Wang¹, Hongjie Fang², Hao-Shu Fang²*, Cewu Lu²*. ¹ Shanghai Noematrix Intelligence Technology Ltd. ² Shanghai Jiao Tong University. * Hao-Shu Fang and Cewu Lu are the corresponding authors. Author e-mails: chenxi.wang@noematrix.cn, galaxies@sjtu.edu.cn, fhaoshu@gmail.com, lucewu@sjtu.edu.cn
Abstract

Precise robot manipulations require rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, posing difficulty in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud to tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses currently representative 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change compared with previous baselines. Project website: rise-policy.github.io.

I Introduction

End-to-end policy learning frameworks play an increasingly important role in robotic manipulation. With the advancements in robotics, researchers are recognizing the significance of integrating perception, planning, and execution into a continuous learning process. Recent work has made significant strides in imitation learning in an end-to-end fashion [1, 4, 9, 10, 57]. These methods allow robots to learn directly from perceptual data, bypassing the need for laborious hand-crafted feature engineering or planning steps, which opens new possibilities for addressing complex manipulation tasks and drives research and progress in the field of manipulation [39].

Spatial information is crucial for precise manipulations. For example, to pour a cup of water, the robot is required to understand the position and orientation of the cup to adjust its movements carefully and prevent water from spilling out. Image-based imitation learning tends to learn implicit spatial representations from fixed camera views [1, 4, 9, 15, 47, 57]. Many of these approaches utilize distinct image encoders for each view and increase the number of cameras to enhance stability and precision, consequently increasing the number of network parameters and computational overhead. In addition, such methods are susceptible to variations in camera poses, introducing a significant challenge to effective model deployment in real-world scenarios.

Spatial information can also be represented using 3D point clouds, which have demonstrated remarkable advantages in robotic tasks like general grasping [8, 46]. Recently, imitation learning based on point clouds has been drawing increasing interest in our community [2, 11, 12, 14, 19, 43, 51, 54]. Most 3D-based methods learn to predict the next keyframe rather than continuous actions, and thus often struggle with tasks involving frequent contacts and abrupt environmental changes. Meanwhile, annotating keyframes at scale for real-world data requires additional manual effort. The reason these methods struggle to predict continuous actions lies in the difficulty of achieving real-time performance, which relies heavily on efficient 3D feature encoding and remains a challenge for point cloud based methods.

In this work, we propose an end-to-end imitation baseline, RISE, a method leveraging 3D perception to make real-world robot imitation simple and effective. RISE takes point clouds as input directly and outputs continuous action trajectories for the immediate future. It employs a shallow 3D encoder built with sparse convolutions [5], which effectively utilizes the advantages of conventional convolution architectures and avoids redundant computations on empty 3D space. This design allows RISE to obtain precise spatial information with a single camera, reducing additional hardware costs. The encoded point features are then mapped to the action space by a transformer. Considering the unordered nature of 3D points, we use sparse positional encoding, a function of coordinates, to help the transformer capture the relative relationships among point tokens in 3D space. Although the point tokens are not distributed over continuous positions like language or image tokens, we still observe stable results in scenes with variable object locations. With sparse positional encoding, the point features can be easily modeled by transformers and naturally embedded into multimodal inputs. To show the benefits of 3D perception, we only consider RISE with point cloud input in this work. Finally, the action features are decoded into continuous trajectories by a diffusion head [4].

We test RISE on 6 real-world tasks, including pick-and-place, 6-DoF pouring, push-to-goal (with tool), and long-horizon tasks. To verify the generalization ability of the policy over object locations, all objects are randomly arranged throughout the entire workspace. Trained on 50 demonstrations for each task, RISE significantly outperforms other representative methods and remains stable as the number of objects increases. We also find that RISE is more robust to environmental disturbances, such as changing camera views and increasing table height. Such generalization enhances the error tolerance of real-world deployment.

II Related Work

II-A Imitation Learning for Robotics

Imitation learning is a machine learning paradigm where a robot learns to operate by observing and mimicking expert demonstrations. Behavior cloning (BC) [35], as the most direct form of imitation learning, aims to identify a mapping from observations to corresponding robot actions with the supervision of the given demonstrations. Despite its simplicity, BC has shown promising potential in learning robotic manipulations [1, 4, 21, 28, 43, 47, 57].

2D Imitation Learning. 2D image data is commonly used in imitation learning. One intuitive approach is to utilize pre-trained representation models for images [24, 25, 26, 31, 38] to convert them into 1D representations and map these transformed observations to the action space either through a BC policy [55] or non-parametric nearest neighbour [34]. Unfortunately, current pre-trained representation models are not general enough to handle diverse experimental environments and thus have trouble achieving satisfactory results in real-world settings. Many researchers therefore learn such a mapping in an end-to-end manner [1, 4, 21, 28, 32, 40, 47, 57, 60] and have demonstrated impressive performance across many tasks. Specifically, ACT [57] adopts a CVAE scheme [44] with transformer backbones [48] and ResNet image encoders [17] to model the variability of human data, while Diffusion Policy [4] directly utilizes a diffusion process [18] to express multimodal action distributions generatively. Nonetheless, these policies are sensitive to camera positions and often fail to capture 3D spatial information about the objects in the environment.

3D Imitation Learning. The formulation of incorporating 3D information into the imitation learning framework is under active exploration. The most straightforward method is to apply projections to transform the 3D point cloud into several 2D image views and transfer the task to multi-view image-based policy learning [12, 14], which requires the virtual viewpoints to be carefully designed to ensure performance. Moreover, due to sparse and noisily sensed point clouds, [12] fails to grasp slim objects like marker pens in real-world experiments. [19, 43, 54] process point clouds into dense voxel grids and apply 3D convolutions. Since high-resolution 3D feature maps require expensive computation, these methods have to trade off performance against cost. [11, 51] featurize point clouds by projecting multi-view image features into the 3D world to avoid dense convolutions. However, such feature fusion techniques struggle to capture a consistent 3D representation across different views accurately. Recently, a concurrent work DP3 [53] also leverages 3D perception in robotic manipulation policies, but our real-world evaluations in §IV-F demonstrate that it cannot handle demonstrations with varied action representations, limited by its network capacity.

As mentioned before, most current 3D robotic imitation learning methods predict keyframes instead of continuous actions, which makes annotation hard and limits their capability. Besides, many of these methods only show results in simulation environments like RLBench [20] and CALVIN [29]. In this work, we aim to evaluate our method in a more challenging setting: continuous action control with a noisy single-view partial point cloud in the real world.

II-B 3D Perception

3D perception has received considerable attention from researchers in the computer vision and robotics communities. It can be roughly divided into the following three categories:

Projection-based. This approach first projects the 3D point cloud onto multiple image planes and then employs traditional multi-view image perception techniques. It is widely applied in shape recognition [16], object detection [3, 23], and robotic manipulation [12, 14] due to its simplicity. However, the projections can lead to loss of geometric information in the 3D data, and sensitivity to the choice of projection planes may result in inferior performance [56].

Figure 0: Overview of RISE architecture. The input of RISE is a noisy point cloud captured from the real world. A 3D encoder built with sparse convolution is employed to compress the point cloud into tokens. The tokens are fed into the transformer encoder after adding sparse positional encoding. A readout token is used to query the action features from the transformer decoder. Conditioned on the action features, the Gaussian samples are denoised into continuous actions iteratively using a diffusion head.

Point-based. Early researchers directly utilized 3D convolutional neural networks (CNNs) to process 3D point cloud data based on dense volumetric representations [6, 50, 59]. Still, the sparsity of 3D data makes the vanilla approaches inefficient and memory-intensive. To solve this problem, researchers have explored using octrees for memory footprint reduction [41, 49], utilizing sparse convolutions to minimize unnecessary computations in inactive regions to improve efficiency and effectiveness [5, 13], and aggregating features across point sets directly using different network architectures [33, 36, 37, 56].

NeRF-based. Neural radiance fields (NeRFs) [30] have demonstrated impressive performance on high-fidelity 3D scene synthesis and scene representation extractions. In recent years, some studies [7, 42, 52, 54] have employed features extracted from pre-trained 2D foundational models as additional supervisory signals in NeRF training for scene feature extraction and distillation. Nevertheless, NeRF training requires image data from multiple views, which poses obstacles for scaling up in real-world environments. Additionally, it does not align with our single-view setting.

III Method

Given a point cloud $\mathcal{O}^{t} = \{P_{i}^{t} = (x_{i}^{t}, y_{i}^{t}, z_{i}^{t}, r_{i}^{t}, g_{i}^{t}, b_{i}^{t})\}$ as the observation at time $t$, RISE aims to predict the next $n$-step robot actions $\mathcal{A}^{t} = \{A_{t+1}, A_{t+2}, \cdots, A_{t+n}\}$, where $A_{i}$ contains the translation, rotation, and width of the gripper. Due to the large domain gap between point clouds and robot actions, it is challenging to learn the approximation $f: \mathcal{O}^{t} \rightarrow \mathcal{A}^{t}$ directly. To model the process, RISE is decomposed into three functions: a sparse 3D encoder $h_{\mathrm{E}}: \mathcal{O}^{t} \rightarrow \mathcal{F}_{\mathrm{P}}^{t}$, a transformer $h_{\mathrm{T}}: \mathcal{F}_{\mathrm{P}}^{t} \rightarrow \mathcal{F}_{\mathrm{A}}^{t}$, and an action decoder $h_{\mathrm{D}}: \mathcal{F}_{\mathrm{A}}^{t} \rightarrow \mathcal{A}^{t}$, where $\mathcal{F}_{\mathrm{P}}^{t}$ and $\mathcal{F}_{\mathrm{A}}^{t}$ denote the features of point clouds and actions respectively.
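A minimal sketch of this decomposition is shown below; the function and attribute names are ours for illustration, not the released implementation.

```python
def rise_forward(obs_point_cloud, policy):
    """Approximate f: O^t -> A^t as the composition h_D o h_T o h_E."""
    point_feats = policy.sparse_encoder(obs_point_cloud)   # h_E: O^t  -> F_P^t
    action_feats = policy.transformer(point_feats)         # h_T: F_P^t -> F_A^t
    actions = policy.action_decoder(action_feats)          # h_D: F_A^t -> A^t
    return actions                                         # {A_{t+1}, ..., A_{t+n}}
```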

III-A Modeling Point Clouds using Sparse 3D Encoder

The most significant difference between point cloud data and images is that point clouds are sparse and unorganized, which makes standard CNNs unsuitable for processing them directly. For inputs at different scales, the computational efficiency and flexibility of a model should be taken into consideration. We employ a 3D encoder built on sparse convolution [5]. It retains most of the standard convolution operation while computing outputs only at predefined coordinates. Such an operator saves computation and inherits the core advantages of conventional convolution.

The sparse 3D encoder $h_{\mathrm{E}}$ adopts a shallow ResNet architecture [17]. It is composed of one initial convolution layer, four residual blocks, and one final convolution layer, with five $2\times$ sparse pooling layers between every two components. The number of layers can be freely increased, while the evaluation results demonstrate that a shallow encoder is sufficient for our experiments.

Through $h_{\mathrm{E}}$, the voxelized point cloud $\mathcal{O}^{t}$ is encoded into sparse point features $\mathcal{F}_{\mathrm{P}}^{t}$ efficiently, avoiding redundant computation over large empty regions. $\mathcal{F}_{\mathrm{P}}^{t}$ is then fed into the transformer $h_{\mathrm{T}}$ as sparse tokens. For $\mathcal{O}^{t}$ cropped to a $1 \times 1 \times 1\,\mathrm{m}^{3}$ space, $\mathcal{F}_{\mathrm{P}}^{t}$ contains only 60 to 80 tokens. Although this is fewer tokens than in ACT [57] (300 per image), experiments in §IV-F show that point cloud based ACT still outperforms the original implementation.
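For illustration, a minimal sketch of such a shallow sparse encoder using MinkowskiEngine [5] (the library noted in §V-B) is given below. The channel widths, normalization, pooling type, and block internals are assumptions made for the sketch rather than the exact RISE configuration.

```python
import torch.nn as nn
import MinkowskiEngine as ME

class SparseResBlock(nn.Module):
    """Simplified sparse residual block; the channel count is an illustrative choice."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ME.MinkowskiConvolution(channels, channels, kernel_size=3, dimension=3)
        self.norm1 = ME.MinkowskiBatchNorm(channels)
        self.conv2 = ME.MinkowskiConvolution(channels, channels, kernel_size=3, dimension=3)
        self.norm2 = ME.MinkowskiBatchNorm(channels)
        self.relu = ME.MinkowskiReLU()

    def forward(self, x):
        out = self.relu(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.relu(out + x)   # residual connection on sparse tensors

class SparseEncoder(nn.Module):
    """Initial conv, four residual blocks, final conv, with five 2x sparse poolings in between."""
    def __init__(self, in_channels=3, channels=128, out_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            ME.MinkowskiConvolution(in_channels, channels, kernel_size=3, dimension=3),
            ME.MinkowskiMaxPooling(kernel_size=2, stride=2, dimension=3),
            SparseResBlock(channels),
            ME.MinkowskiMaxPooling(kernel_size=2, stride=2, dimension=3),
            SparseResBlock(channels),
            ME.MinkowskiMaxPooling(kernel_size=2, stride=2, dimension=3),
            SparseResBlock(channels),
            ME.MinkowskiMaxPooling(kernel_size=2, stride=2, dimension=3),
            SparseResBlock(channels),
            ME.MinkowskiMaxPooling(kernel_size=2, stride=2, dimension=3),
            ME.MinkowskiConvolution(channels, out_channels, kernel_size=1, dimension=3),
        )

    def forward(self, x: ME.SparseTensor) -> ME.SparseTensor:
        # x is built from voxelized coordinates and RGB features, e.g.
        #   ME.SparseTensor(features=colors, coordinates=ME.utils.batched_coordinates([voxels]))
        return self.net(x)
```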

III-B Transformer with Sparse Point Tokens

Figure 1: Definition of the tasks in the experiments. During evaluation, each task is randomly initialized within the robot workspace. For each task, only 3 to 5 setups from the evaluations are depicted in the figure for clarity.

We adopt a transformer [48] to implement the mapping from point features $\mathcal{F}_{\mathrm{P}}^{t}$ to action features $\mathcal{F}_{\mathrm{A}}^{t}$. While the positional encoding for image tokens is dense and natural, sparse point tokens cannot be processed in the same manner. We instead introduce a sparse positional encoding for point tokens.

Let $(x, y, z)$ be the coordinate of a $d$-dimensional point token $P$. The position of $P$ is defined as

\begin{equation}
\begin{split}
pos_{k} &= \frac{k}{v} + c, \quad k \in \{x, y, z\}, \\
pos &= [pos_{x}, pos_{y}, pos_{z}],
\end{split}
\tag{1}
\end{equation}

where $c$ and $v$ are fixed offsets, and $[\cdot]$ stands for vector concatenation. The encoding dimension along each axis is $d_{x} = d_{y} = \lfloor d/3 \rfloor$ and $d_{z} = d - d_{x} - d_{y}$. The positional encoding of $P$ is computed as $SPE = [SPE^{x}, SPE^{y}, SPE^{z}]$, where

\begin{equation}
\left\{
\begin{aligned}
SPE^{k}_{(pos,\,2i)} &= \sin\frac{pos_{k}}{10000^{2i/d_{k}}} \\
SPE^{k}_{(pos,\,2i+1)} &= \cos\frac{pos_{k}}{10000^{2i/d_{k}}}
\end{aligned}
\right.
, \quad k \in \{x, y, z\}
\tag{2}
\end{equation}

With the help of sparse positional encoding, we effectively capture intricate 3D spatial relationships among unordered points, which enables seamless embedding of the 3D features into conventional transformers.
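For concreteness, a minimal NumPy sketch of Eq. (1)-(2) is shown below; it assumes the values $v = 5\,\mathrm{mm}$ and $c = 400$ reported in §V-B, and the function name and array shapes are illustrative rather than taken from the released code.

```python
import numpy as np

def sparse_positional_encoding(coords, d=512, v=0.005, c=400.0):
    """Sinusoidal encoding of sparse 3D point-token coordinates (Eq. 1-2).

    coords: (N, 3) array of (x, y, z) positions in meters.
    Returns an (N, d) array concatenating the per-axis encodings [SPE^x, SPE^y, SPE^z].
    Assumes each per-axis dimension d_k is even (true for d = 512).
    """
    d_x = d_y = d // 3
    d_z = d - d_x - d_y

    encodings = []
    for axis, d_k in enumerate((d_x, d_y, d_z)):
        pos_k = coords[:, axis] / v + c                    # Eq. (1): pos_k = k / v + c
        i = np.arange(d_k // 2)
        freq = 1.0 / (10000.0 ** (2.0 * i / d_k))          # shape (d_k // 2,)
        angles = pos_k[:, None] * freq[None, :]            # shape (N, d_k // 2)
        enc = np.empty((coords.shape[0], d_k))
        enc[:, 0::2] = np.sin(angles)                      # even dims: sin
        enc[:, 1::2] = np.cos(angles)                      # odd dims:  cos
        encodings.append(enc)
    return np.concatenate(encodings, axis=1)               # SPE = [SPE^x, SPE^y, SPE^z]
```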

The transformer $h_{\mathrm{T}}$ utilizes an encoder-decoder architecture, taking point features $\mathcal{F}_{\mathrm{P}}^{t}$ as input tokens without other proprioceptive signals. In the transformer decoding step, we use one readout token to query the action features $\mathcal{F}_{\mathrm{A}}^{t}$.
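A minimal PyTorch sketch of this encoder-decoder with a learned readout token is given below, using the block counts and dimensions reported in §V-B; the number of attention heads and the overall module are simplified assumptions, and the sparse positional encoding is assumed to have been added to the point tokens already.

```python
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    """Encoder-decoder transformer: point tokens in, one readout token queries action features."""
    def __init__(self, d_model=512, nhead=8, dim_ff=2048, num_enc=4, num_dec=1):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_ff, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_enc)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_dec)
        self.readout = nn.Parameter(torch.randn(1, 1, d_model))  # learned query token

    def forward(self, point_tokens, padding_mask=None):
        # point_tokens: (B, N, d_model), already summed with the sparse positional encoding.
        memory = self.encoder(point_tokens, src_key_padding_mask=padding_mask)
        query = self.readout.expand(point_tokens.size(0), -1, -1)        # (B, 1, d_model)
        action_feat = self.decoder(query, memory,
                                   memory_key_padding_mask=padding_mask)
        return action_feat.squeeze(1)                                    # (B, d_model) = F_A^t
```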

III-C Diffusion as Action Decoder

The action decoder $h_{\mathrm{D}}$ is implemented as a denoising process via diffusion [4, 18, 22]. Conditioned on $\mathcal{F}_{\mathrm{A}}^{t}$, $h_{\mathrm{D}}$ iteratively denoises Gaussian noise $\mathcal{N}(0, \sigma^{2} I)$ into actions $\mathcal{A}^{t}$. The denoising process at step $k$ is

𝒜k1t=α(𝒜ktγϵθ(𝒪t,𝒜kt,k)+𝒩(0,σ2I)),superscriptsubscript𝒜𝑘1𝑡𝛼superscriptsubscript𝒜𝑘𝑡𝛾subscriptitalic-ϵ𝜃superscript𝒪𝑡superscriptsubscript𝒜𝑘𝑡𝑘𝒩0superscript𝜎2𝐼\mathcal{A}_{k-1}^{t}=\alpha(\mathcal{A}_{k}^{t}-\gamma\epsilon_{\theta}(% \mathcal{O}^{t},\mathcal{A}_{k}^{t},k)+\mathcal{N}(0,\sigma^{2}I)),caligraphic_A start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = italic_α ( caligraphic_A start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_γ italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( caligraphic_O start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , caligraphic_A start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_k ) + caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ) ) , (3)

where $\epsilon_{\theta}$ is a noise-prediction network with parameters $\theta$, and $\alpha$, $\gamma$, and $\sigma$ are hyperparameters determined by the noise schedule at step $k$. The objective function is the simplified objective in [18]. We use the DDIM scheduler [45] to accelerate inference in real-world experiments.
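For illustration, a minimal inference-time sketch of this denoising loop using the DDIMScheduler from the diffusers library is given below; `noise_pred_net` stands in for $\epsilon_{\theta}$ conditioned on the action features, the step counts follow §V-B (100 training steps, 20 inference steps), and the remaining details (the 3 + 6 + 1 action dimension, the conditioning interface) are assumptions.

```python
import torch
from diffusers import DDIMScheduler

@torch.no_grad()
def decode_actions(noise_pred_net, action_feat, horizon=20, action_dim=10,
                   num_train_steps=100, num_inference_steps=20):
    """Iteratively denoise Gaussian samples into an action trajectory (cf. Eq. 3)."""
    scheduler = DDIMScheduler(num_train_timesteps=num_train_steps)
    scheduler.set_timesteps(num_inference_steps)

    batch = action_feat.size(0)
    # A_K ~ N(0, I): translation (3) + 6D rotation (6) + gripper width (1) per step.
    actions = torch.randn(batch, horizon, action_dim, device=action_feat.device)
    for k in scheduler.timesteps:
        noise = noise_pred_net(actions, k, cond=action_feat)    # epsilon_theta(., A_k, k)
        actions = scheduler.step(noise, k, actions).prev_sample  # one denoising step
    return actions                                               # continuous actions A^t
```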

The regression head is also frequently employed due to its simplicity [11, 14, 21, 57], whereas the diffusion head excels in handling scenes with multiple targets. Moreover, diffusion produces diverse trajectories to the same target, as opposed to averaging learned trajectories [4].

For all tasks in our experiments, RISE adopts a unified action representation in the camera coordinate system, composed of translations, rotations, and gripper widths. We opt for absolute positions for translation and the 6D representation [58] for rotation, for continuity considerations.
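As a brief, hedged illustration of the 6D rotation representation [58] (using the column convention; the helper names are ours):

```python
import torch
import torch.nn.functional as F

def rotmat_to_6d(R):
    """6D representation: concatenate the first two columns of the rotation matrix."""
    return torch.cat([R[..., :, 0], R[..., :, 1]], dim=-1)

def rot6d_to_rotmat(d6):
    """Recover a valid rotation matrix from 6D via Gram-Schmidt orthogonalization."""
    a1, a2 = d6[..., 0:3], d6[..., 3:6]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)   # columns [b1 | b2 | b3]
```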

IV Experiments

IV-A Setup

In the experiments, we use a Flexiv Rizon robotic arm (https://www.flexiv.com/product/rizon) equipped with a Dahuan AG-95 gripper (https://en.dh-robotics.com/product/ag) for interacting with objects. Two Intel RealSense D435 RGB-D cameras (https://www.intelrealsense.com/depth-camera-d435) are installed for scene perception. One global camera is positioned in front of the robot, while the other in-hand camera is mounted on the end-effector of the arm. For 3D perception, only the global camera is used to generate a noisy single-view partial point cloud; for image-based policies, both cameras are used for a better understanding of spatial geometry. All devices are linked to a workstation with an Intel Core i9-10900K CPU and an NVIDIA RTX 3090 GPU for both data collection and evaluation.

We carefully designed 6 tasks of 4 types for the experiments, as illustrated in Fig. 1: pick-and-place tasks (Collect Cups and Collect Pens), a 6-DoF task (Pour Balls), push-to-goal tasks (Push Block and Push Ball), and a long-horizon task (Stack Blocks). The data collection setup and process are the same as in [9], i.e., end-effector tele-operation using a haptic device. Unless specifically stated, we gathered 50 expert demonstrations for each task as the training data for the policy, and each policy was tested for 20 consecutive trials to evaluate its performance. During evaluations, objects in the task are randomly initialized within the robot workspace of approximately 50 cm × 70 cm.

IV-B Pick-and-Place Tasks

As the cornerstone of robotic manipulation, pick-and-place tasks emphasize precise manipulation of objects and efficient generalization of the robot policy within the workspace. As shown in Fig. 1, the Collect Cups task is designed to evaluate the policy performance in predicting the translation part of the action, whereas the Collect Pens task further assesses the ability of the policy to predict the planar rotation part of the action.

We employ two representative image-based policies as our baselines: ACT [57] and Diffusion Policy [4]. We also evaluate a keyframe-based 3D policy, Act3D [11], the current state-of-the-art policy on RLBench [20], in our experiments. For Act3D, we utilize a simple action planner to execute the predicted keyposes, preventing collisions with other objects in the workspace.

For each of the five scenarios where there are 1, 2, 3, 4, and 5 objects in the workspace, we collected 10 demonstrations, resulting in a total of 50 demonstrations used for the policy training. During evaluations, we conducted 10 trials for each of the five scenarios, tallying the number of cups placed into the large metal cup or the number of pens placed into the bowl, and calculating the completion rate. For each object, we imposed a runtime limit of 20 keyframes for keyframe-based policies and 300 steps for continuous-control policies.

(a) Collect Cups. (b) Collect Pens.
Figure 2: Experimental results of the pick-and-place tasks.

The evaluation results are depicted in Fig. 2. In the Collect Cups task, RISE achieves a completion rate of over 90% when the number of cups is less than 3. Even in complex environments with 4 or 5 cups, RISE maintains a completion rate of over 65%, surpassing all baselines. Similar trends are observed in the Collect Pens task: RISE consistently outperforms all baselines, demonstrating its ability to predict not only the translation part but also the planar rotation accurately. We also discover that Act3D performs comparably to the image-based baselines. Moreover, given that Act3D requires specially designed motion planners for more complicated actions and cannot provide immediate responses to sudden changes in the environment, we only employ ACT and Diffusion Policy as baselines in our subsequent experiments.

IV-C 6-DoF Tasks

The 6-DoF Pour Balls task is designed to test the ability of robot policies to forecast actions with complex spatial rotations, instead of the simple planar rotation in the pick-and-place tasks. As shown in Fig. 1, the robotic arm needs to undergo complex spatial rotations to complete the task, at times approaching its kinematic limits. During evaluations, 10 balls are initialized in the source cup. Besides the action success rates of the policy, the number of balls poured into the target cup is also recorded to calculate the completion rates. The runtime limit for this task is set to 1200 steps.

The experimental results are shown in Tab. I. From the action success rates, it is evident that RISE can learn actions with complex spatial rotations more effectively compared to image-based policies. Additionally, its execution of the pouring action is more precise regarding pouring positions, resulting in higher task completion rates. This also highlights the effectiveness of 3D perception, which can capture more accurate spatial relationships between objects.

Method | Success Rate (%): Grasp / Pour / Place | Completion Rate (%): Overall / If Poured
ACT [57] | 30 / 30 / 0 | 13.0 / 43.3
Diffusion Policy [4] | 55 / 55 / 35 | 30.5 / 55.5
RISE (ours) | 80 / 80 / 70 | 49.0 / 61.3
TABLE I: Experimental results of the 6-DoF task Pour Balls.

IV-D Push-to-Goal Tasks

Robot policies should generate immediate feedback to environmental dynamics, enabling adaptation to object movements within the environment to accomplish tasks. To this end, we designed two push-to-goal tasks, Push Block and Push Ball, as shown in Fig. 1. Furthermore, the Push Ball task requires the robot to use a tool (a marker pen) to complete the task, thereby testing the ability of the policy to utilize tools. In the evaluation process, we compute the distance $d$ from the object center to the goal area. If the object center is within the goal area ($d = 0$), the trial is considered a success, as shown in Tab. II (left). We then calculate the task success rate and the average distance as metrics. The runtime limit for this task is set to 1200 steps.

The evaluation results are shown in Tab. II (right). In the Push Block task, RISE slightly surpasses Diffusion Policy in terms of success rate, while pushing the block closer to the goal area. In the Push Ball task, however, RISE outperforms Diffusion Policy by a significant margin, demonstrating its effective 3D perception of object position changes and its ability to rapidly adjust action outputs accordingly. We also observe that ACT struggles in both tasks and frequently causes the robot to make hard contact with the objects, triggering the mechanical emergency stop, which might be attributed to its imprecise scene perception.

Task | Method | Success Rate (%) | Average Distance d (cm) ↓
Push Block | Diffusion Policy [4] | 50 | 5.67
Push Block | RISE (ours) | 55 | 3.51
Push Ball | Diffusion Policy [4] | 30 | 6.05
Push Ball | RISE (ours) | 60 | 4.89
TABLE II: Evaluation metrics illustration (left) and experimental results (right) of the push-to-goal tasks Push Block and Push Ball.

IV-E Long-Horizon Tasks

Long-horizon tasks are crucial in robotic manipulations as they highlight the impact of error accumulation in actions over extended horizons, providing insights into the robustness and adaptability of the policy. Therefore, we designed the long-horizon task Stack Blocks to assess the policy’s ability in this regard, given that blocks are more likely to topple as the stack grows. We emphasize that this task is more challenging than previous pick-and-place tasks because (a) some blocks are only slightly smaller than gripper width, requiring precise controls in grasping; (b) the policy needs to recognize the sizes of the blocks and select a suitable stacking order to ensure the stability of the resulting stack; and (c) as the stack grows, the policy needs to dynamically adjust the placement height of the blocks to ensure that they do not collide with existing blocks and can be smoothly placed on top of them.

For three cases where there are 2, 3, and 4 blocks in the workspace, we collected 10, 20, and 20 demonstrations respectively, totaling 50 demonstrations for the policy training. During evaluations, 10 trials are conducted for each case, and the average number of successfully stacked blocks is reported to measure the policy performance. We set the runtime limit for the task as 600, 1200, and 1800 steps for each case respectively.

Method | 2 Blocks | 3 Blocks | 4 Blocks
ACT [57] | 0.6 / 1 | 0.5 / 2 | 0.3 / 3
Diffusion Policy [4] | 0.7 / 1 | 0.5 / 2 | 0.5 / 3
RISE (ours) | 0.8 / 1 | 1.5 / 2 | 0.9 / 3
TABLE III: Experimental results (average stacked blocks) of the long-horizon task Stack Blocks.

The experimental results are shown in Tab. III. In the simple scenario with only two blocks, all policies yield similar results; however, as the number of blocks increases, RISE surpasses the baselines by an increasingly large margin, showcasing its ability to adapt well to long-horizon tasks and control accumulated errors. Moreover, we observe that the baselines exhibit a higher frequency of the aforementioned issues (b) and (c) compared to RISE, implying that, powered by 3D perception, RISE understands the scene more deeply and predicts actions more precisely.

IV-F Effectiveness of 3D Perception

In this section, we explore how 3D perception enhances the performance of robot manipulation policies on the Collect Cups task with 5 cups. We replace the image encoder of the image-based policies ACT and Diffusion Policy with the sparse 3D encoder used in RISE. The experiment results are shown in Tab. IV. We observe a significant improvement in the performance of ACT and Diffusion Policy after applying 3D perception even with fewer camera views, surpassing the 3D policy Act3D, which reflects the effectiveness of our 3D perception module in manipulation policies.

Method | 3D | # Cameras | Completion Rate (%)
ACT [57] | ✗ | 2 | 12
ACT [57] (3D) | ✓ | 1 | 32 (↑20)
Diffusion Policy [4] | ✗ | 2 | 24
Diffusion Policy [4] (3D) | ✓ | 1 | 36 (↑12)
DP3* [53] | ✓ | 1 | -
Act3D [11] | ✓ | 1 | 28
RISE (ours) | ✓ | 1 | 66
TABLE IV: Effectiveness test of 3D perception on the Collect Cups task with 5 cups (10 trials). The 2D versions of the policies take images from both the global and in-hand cameras as input. * DP3 fails to learn in our setting; see the text for a more detailed analysis.
Method | Axis-wise | Natural
DP3, hor. 4 | 0 | 0
DP3, hor. 8 | 0 | 0
DP3, hor. 16 | 20 | 0
DP3, hor. 24 | 40 | 0
RISE | 80 | 100
TABLE V: Analysis of the failures of DP3 in our experiments. (left) Illustrations of the axis-wise and natural actions. (right) Completion rates (%) on the Collect Cups task with 1 cup when using demonstrations with different action representations for training (10 trials).
Method | Original | Bowl | Light | Height | CamView
ACT [57] | 80 | 70 (↓10) | 40 (↓40) | 0 (↓80) | 0 (↓80)
Diffusion Policy [4] | 70 | 50 (↓20) | 30 (↓40) | 0 (↓70) | 0 (↓70)
Act3D [11] | 70 | 40 (↓30) | 60 (↓10) | 50 (↓20) | 10 (↓60)
RISE (ours) | 90 | 80 (↓10) | 80 (↓10) | 80 (↓10) | 50 (↓40)
TABLE VI: Generalization test setup and experimental results (completion rate, %) of the Collect Pens task with 1 pen (10 trials).

We also evaluate the recently proposed DP3 [53] in this experimental setting. However, DP3 appears to struggle to learn meaningful actions from our demonstration data. After communicating with the authors, one potential reason is that they use a RealSense L515 in their original experiments, while we adopt a RealSense D435 in ours. The point cloud from the D435 is noisier, making it more challenging for networks to learn. By using sparse convolution, RISE is more robust to the noise in the point cloud. Besides, after delving into their real robot experiments, we found that instead of natural actions, axis-wise actions are used in their demonstration data, as illustrated in Tab. V (left). Hence, we collect 50 demonstrations on the Collect Cups task with 1 cup using axis-wise and natural action representations respectively. These demonstrations are then used for DP3 policy training. After carefully tuning hyper-parameters such as horizons and color utilization, we report the evaluation results in Tab. V, with a best completion rate of 40%. We suspect that the limited network capacity of the 3D encoder of DP3 prevents it from modeling the diverse state-action pairs present in real-world human-teleoperated demonstrations, leading it to handle only a smaller set of state-action pairs under the axis-wise action representation. In contrast, RISE can handle real-world demonstrations with various action representations and maintain satisfactory performance. Lastly, compared to the evaluation setup in the DP3 paper, we allow objects to be placed anywhere in the entire workspace. This results in greater variation of object locations, making the task more challenging.

IV-G Generalization Test

We evaluate the generalization ability of different methods on the Collect Pens task with 1 pen under different levels of environmental disturbances as follows.

  • L1. Generalize to objects with similar shapes but different colors. In this task, we replace the original green bowl with a pink one, denoted as Bowl in Tab. VI.
  • L2. Generalize to different light conditions in the environment, denoted as Light in Tab. VI.
  • L3. Generalize to new workspace configurations in the environment. In this task, we elevate the workspace by 13 cm to form a new workspace configuration, denoted as Height in Tab. VI.
  • L4. Generalize to new camera viewpoints, denoted as CamView in Tab. VI.

The graphical illustration of the environmental disturbances and the evaluation results are shown in Tab. VI. We can observe that the image-based policies can achieve a decent L1-level and some L2-level generalizations, but they cannot reach the L3-level and L4-level generalizations involving spatial transformations. Act3D, as a 3D policy, demonstrates good generalization up to L3-level disturbances; however, it nearly completely fails in the L4-level generalization test. RISE exhibits strong generalization abilities across all levels of testing, even in the most challenging L4-level tests involving changes in the camera view.

V Conclusion

In this paper, we present RISE, an efficient end-to-end policy utilizing 3D perception for real-world robot manipulation. RISE compresses point clouds with a sparse 3D encoder, followed by sparse positional encoding and a transformer to obtain action features. The features are decoded into continuous actions by a diffusion head. RISE significantly outperforms currently representative 2D and 3D policies in multiple tasks, demonstrating great advantages in both accuracy and efficiency. Our ablations verify the effectiveness of 3D perception and the generalization of RISE under different levels of environmental disturbances. We hope our baseline inspires the integration of 3D perception into real-world policy learning.

References

  • [1] Anthony Brohan et al. “RT-1: Robotics Transformer for Real-World Control at Scale” In Robotics: Science and Systems, 2023
  • [2] Shizhe Chen, Ricardo Garcia Pinel, Cordelia Schmid and Ivan Laptev “PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation” In Conference on Robot Learning, 2023, pp. 1761–1781 PMLR
  • [3] Xiaozhi Chen et al. “Multi-View 3D Object Detection Network for Autonomous Driving” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915
  • [4] Cheng Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion” In Robotics: Science and Systems, 2023
  • [5] Christopher Choy, JunYoung Gwak and Silvio Savarese “4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3075–3084
  • [6] Angela Dai et al. “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839
  • [7] Danny Driess et al. “Reinforcement Learning with Neural Radiance Fields” In Advances in Neural Information Processing Systems 35, 2022, pp. 16931–16945
  • [8] Hao-Shu Fang et al. “AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains” In IEEE Transactions on Robotics IEEE, 2023
  • [9] Hao-Shu Fang et al. “RH20T: A Robotic Dataset for Learning Diverse Skills in One-Shot” In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023
  • [10] Hongjie Fang et al. “Low-cost exoskeletons for learning whole-arm manipulation in the wild” In IEEE International Conference on Robotics and Automation, 2024
  • [11] Théophile Gervet, Zhou Xian, Nikolaos Gkanatsios and Katerina Fragkiadaki “Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation” In Conference on Robot Learning, 2023, pp. 3949–3965 PMLR
  • [12] Ankit Goyal et al. “RVT: Robotic View Transformer for 3D Object Manipulation” In Conference on Robot Learning, 2023, pp. 694–710 PMLR
  • [13] Benjamin Graham, Martin Engelcke and Laurens Van Der Maaten “3D Semantic Segmentation with Submanifold Sparse Convolutional Networks” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9224–9232
  • [14] Pierre-Louis Guhur et al. “Instruction-Driven History-Aware Policies for Robotic Manipulations” In Conference on Robot Learning, 2022, pp. 175–187 PMLR
  • [15] Huy Ha, Pete Florence and Shuran Song “Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition” In Conference on Robot Learning, 2023, pp. 3766–3777 PMLR
  • [16] Abdullah Hamdi, Silvio Giancola and Bernard Ghanem “MVTN: Multi-View Transformation Network for 3D Shape Recognition” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1–11
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep Residual Learning for Image Recognition” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778
  • [18] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising Diffusion Probabilistic Models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
  • [19] Stephen James, Kentaro Wada, Tristan Laidlow and Andrew J Davison “Coarse-to-Fine Q-Attention: Efficient Learning for Visual Robotic Manipulation via Discretisation” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13739–13748
  • [20] Stephen James, Zicong Ma, David Rovick Arrojo and Andrew J Davison “RLBench: The Robot Learning Benchmark & Learning Environment” In IEEE Robotics and Automation Letters 5.2 IEEE, 2020, pp. 3019–3026
  • [21] Eric Jang et al. “BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning” In Conference on Robot Learning PMLR, 2021, pp. 991–1002
  • [22] Michael Janner, Yilun Du, Joshua Tenenbaum and Sergey Levine “Planning with Diffusion for Flexible Behavior Synthesis” In International Conference on Machine Learning, 2022, pp. 9902–9915 PMLR
  • [23] Bo Li, Tianlei Zhang and Tian Xia “Vehicle Detection from 3D Lidar Using Fully Convolutional Network” In Robotics: Science and Systems, 2016
  • [24] Yecheng Jason Ma et al. “LIV: Language-Image Representations and Rewards for Robotic Control” In International Conference on Machine Learning, 2023, pp. 23301–23320 PMLR
  • [25] Yecheng Jason Ma et al. “VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training” In International Conference on Learning Representations, 2023
  • [26] Arjun Majumdar et al. “Where are We in the Search for an Artificial Visual Cortex for Embodied Intelligence?” In ICRA 2023 Workshop on Pretraining for Robotics, 2023
  • [27] Ajay Mandlekar et al. “RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation” In Conference on Robot Learning, 2018, pp. 879–893 PMLR
  • [28] Ajay Mandlekar et al. “What Matters in Learning from Offline Human Demonstrations for Robot Manipulation” In Conference on Robot Learning, 2021, pp. 1678–1690 PMLR
  • [29] Oier Mees, Lukas Hermann, Erick Rosete-Beas and Wolfram Burgard “CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks” In IEEE Robotics and Automation Letters 7.3 IEEE, 2022, pp. 7327–7334
  • [30] Ben Mildenhall et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” In Communications of the ACM 65.1 ACM New York, NY, USA, 2021, pp. 99–106
  • [31] Suraj Nair et al. “R3M: A Universal Visual Representation for Robot Manipulation” In Conference on Robot Learning, 2022, pp. 892–909 PMLR
  • [32] Abhishek Padalkar et al. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models” In arXiv preprint arXiv:2310.08864, 2023
  • [33] Xuran Pan et al. “3D Object Detection with PointFormer” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7463–7472
  • [34] Jyothish Pari, Nur Muhammad (Mahi) Shafiullah, Sridhar Pandian Arunachalam and Lerrel Pinto “The Surprising Effectiveness of Representation Learning for Visual Imitation” In Robotics: Science and Systems, 2022
  • [35] Dean A Pomerleau “ALVINN: An Autonomous Land Vehicle in a Neural Network” In Advances in Neural Information Processing Systems 1, 1988
  • [36] Charles R Qi, Hao Su, Kaichun Mo and Leonidas J Guibas “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660
  • [37] Guocheng Qian et al. “PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies” In Advances in Neural Information Processing Systems 35, 2022, pp. 23192–23204
  • [38] Ilija Radosavovic et al. “Real-World Robot Learning with Masked Visual Pre-Training” In Conference on Robot Learning, 2022, pp. 416–426 PMLR
  • [39] Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni and Sergey Levine “Vision-Based Multi-Task Manipulation for Inexpensive Robots using End-to-End Learning from Demonstration” In IEEE International Conference on Robotics and Automation, 2018, pp. 3758–3765 IEEE
  • [40] Scott E. Reed et al. “A Generalist Agent” In Transactions on Machine Learning Research, 2022
  • [41] Gernot Riegler, Ali Osman Ulusoy and Andreas Geiger “OctNet: Learning Deep 3D Representations at High Resolutions” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3577–3586
  • [42] William Shen et al. “Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation” In Conference on Robot Learning, 2023, pp. 405–424 PMLR
  • [43] Mohit Shridhar, Lucas Manuelli and Dieter Fox “Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation” In Conference on Robot Learning, 2022, pp. 785–799 PMLR
  • [44] Kihyuk Sohn, Honglak Lee and Xinchen Yan “Learning Structured Output Representation using Deep Conditional Generative Models” In Advances in Neural Information Processing Systems 28, 2015
  • [45] Jiaming Song, Chenlin Meng and Stefano Ermon “Denoising Diffusion Implicit Models” In The International Conference on Learning Representations, 2021
  • [46] Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel and Dieter Fox “Contact-Graspnet: Efficient 6-DoF Grasp Generation in Cluttered Scenes” In IEEE International Conference on Robotics and Automation, 2021, pp. 13438–13444 IEEE
  • [47] Octo Model Team et al. “Octo: An Open-Source Generalist Robot Policy”, 2023
  • [48] Ashish Vaswani et al. “Attention is All You Need” In Advances in Neural Information Processing Systems 30, 2017
  • [49] Peng-Shuai Wang et al. “O-CNN: Octree-Based Convolutional Neural Networks for 3D Shape Analysis” In ACM Transactions On Graphics 36.4 ACM New York, NY, USA, 2017, pp. 1–11
  • [50] Zhirong Wu et al. “3D ShapeNets: A Deep Representation for Volumetric Shapes” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920
  • [51] Zhou Xian et al. “ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Prediction for Robotic Manipulation” In Conference on Robot Learning, 2023, pp. 2323–2339 PMLR
  • [52] Jianglong Ye, Naiyan Wang and Xiaolong Wang “FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8962–8973
  • [53] Yanjie Ze et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations” In arXiv preprint arXiv:2403.03954, 2024
  • [54] Yanjie Ze et al. “GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields” In Conference on Robot Learning, 2023, pp. 284–301 PMLR
  • [55] Tianhao Zhang et al. “Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation” In IEEE International Conference on Robotics and Automation, 2018, pp. 5628–5635 IEEE
  • [56] Hengshuang Zhao et al. “Point Transformer” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16259–16268
  • [57] Tony Z Zhao, Vikash Kumar, Sergey Levine and Chelsea Finn “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” In Robotics: Science and Systems, 2023
  • [58] Yi Zhou et al. “On the Continuity of Rotation Representations in Neural Networks” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5745–5753
  • [59] Yin Zhou and Oncel Tuzel “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499
  • [60] Brianna Zitkovich et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control” In Conference on Robot Learning, 2023, pp. 2165–2183

APPENDIX

V-A Tasks Parameters

We list the parameters of the demonstrations for different tasks in this paper in Tab. VII. The axis-wise action representation is implemented with keyboard teleoperation (one key controls movement, rotation, or gripper action in each direction). We observe that although the axis-wise action representation results in fewer steps per demonstration, its teleoperation time is approximately 3× as long as that of the natural teleoperation, aligning with the findings in [27].

Task Name | Notes | # Demos | Avg. Steps | Avg. Teleop. Time (s)
Collect Cups | 1 cup | 10 | 117.4 | 19.37
Collect Cups | 2 cups | 10 | 225.0 | 34.73
Collect Cups | 3 cups | 10 | 345.3 | 54.84
Collect Cups | 4 cups | 10 | 451.4 | 71.07
Collect Cups | 5 cups | 10 | 520.0 | 76.02
Collect Cups | 1 cup, natural action* | 50 | 102.7 | 17.06
Collect Cups | 1 cup, axis-wise action* | 50 | 30.2 | 45.93
Collect Pens | 1 pen | 10 | 179.4 | 52.47
Collect Pens | 2 pens | 10 | 278.2 | 62.71
Collect Pens | 3 pens | 10 | 411.5 | 91.88
Collect Pens | 4 pens | 10 | 556.1 | 124.22
Collect Pens | 5 pens | 10 | 694.1 | 157.15
Pour Balls | - | 50 | 185.4 | 50.69
Push Block | - | 50 | 204.3 | 51.72
Push Ball | - | 50 | 223.1 | 46.00
Stack Blocks | 2 blocks | 10 | 148.6 | 32.06
Stack Blocks | 3 blocks | 20 | 286.2 | 59.22
Stack Blocks | 4 blocks | 20 | 401.8 | 79.83
TABLE VII: Parameters of the collected demonstrations for different tasks. "Avg. Teleop. Time" stands for the average teleoperation time for collecting one demonstration. * denotes that these data are only used for the comparison experiments with DP3.

The evaluation settings for different tasks are summarized in Tab. VIII. Compared to the average steps in the demonstrations, the maximum steps in evaluations prove to be sufficient.

Task Name | Notes | # Trials | Max. Steps | Max. Keyframes
Collect Cups | 1 cup | 10 | 300 | 20
Collect Cups | 2 cups | 10 | 600 | 40
Collect Cups | 3 cups | 10 | 900 | 60
Collect Cups | 4 cups | 10 | 1200 | 80
Collect Cups | 5 cups | 10 | 1500 | 100
Collect Pens | 1 pen | 10 | 300 | 20
Collect Pens | 2 pens | 10 | 600 | 40
Collect Pens | 3 pens | 10 | 900 | 60
Collect Pens | 4 pens | 10 | 1200 | 80
Collect Pens | 5 pens | 10 | 1500 | 100
Pour Balls | - | 20 | 1200 | N/A
Push Block | - | 20 | 1200 | N/A
Push Ball | - | 20 | 1200 | N/A
Stack Blocks | 2 blocks | 10 | 600 | N/A
Stack Blocks | 3 blocks | 10 | 1200 | N/A
Stack Blocks | 4 blocks | 10 | 1800 | N/A
TABLE VIII: Evaluation settings for different tasks.

V-B Implementation Details

Data Processing. The point cloud is created from a single-view RGB-D image. Both input point clouds and output actions are in the camera coordinate system. We crop the point clouds to the range $x, y \in [-0.5\,\mathrm{m}, 0.5\,\mathrm{m}]$, $z \in [0\,\mathrm{m}, 1\,\mathrm{m}]$, and normalize the translation values to $[-1, 1]$ using the range $x, y \in [-0.35\,\mathrm{m}, 0.35\,\mathrm{m}]$, $z \in [0\,\mathrm{m}, 0.7\,\mathrm{m}]$. The gripper width is normalized to $[-1, 1]$ using the range $[0\,\mathrm{m}, 0.11\,\mathrm{m}]$.
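A minimal NumPy sketch of this cropping and normalization (constant and function names are ours) might look like:

```python
import numpy as np

CROP_MIN,  CROP_MAX  = np.array([-0.5, -0.5, 0.0]), np.array([0.5, 0.5, 1.0])      # crop range (m)
TRANS_MIN, TRANS_MAX = np.array([-0.35, -0.35, 0.0]), np.array([0.35, 0.35, 0.7])  # normalization range (m)
WIDTH_MIN, WIDTH_MAX = 0.0, 0.11                                                   # gripper width range (m)

def crop_point_cloud(points):
    """Keep points whose xyz coordinates fall inside the crop box; points: (N, 6) xyz + rgb."""
    xyz = points[:, :3]
    mask = np.all((xyz >= CROP_MIN) & (xyz <= CROP_MAX), axis=1)
    return points[mask]

def normalize_action(translation, width):
    """Map the action translation and gripper width to [-1, 1]."""
    t = 2.0 * (translation - TRANS_MIN) / (TRANS_MAX - TRANS_MIN) - 1.0
    w = 2.0 * (width - WIDTH_MIN) / (WIDTH_MAX - WIDTH_MIN) - 1.0
    return t, w
```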

Network. The sparse 3D encoder is implemented based on MinkowskiEngine [5] with a voxel size of 5 mm, and outputs a set of point feature vectors of dimension 512. For sparse positional encoding, we set $v = 5\,\mathrm{mm}$ and $c = 400$. The transformer contains 4 encoder blocks and 1 decoder block, with $d_{\text{model}} = 512$ and $d_{\text{ff}} = 2048$. The dimension of the readout token is 512. We employ a CNN-based diffusion head [4] with 100 denoising iterations for training and 20 iterations for inference. The output action horizon is 20.

Training. RISE is trained on 2 NVIDIA A100 GPUs with a batch size of 240, an initial learning rate of 3e-4, and 2000 warmup steps. The learning rate is decayed by a cosine scheduler. During training, the point clouds are randomly translated by $[-0.2\,\mathrm{m}, 0.2\,\mathrm{m}]$ along the X/Y/Z-axes and randomly rotated by $[-30^{\circ}, 30^{\circ}]$ around the X/Y/Z-axes.
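A sketch of this point cloud augmentation (the helper is an assumption; the same SE(3) transform should also be applied to the demonstrated gripper poses) could be:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_se3_augmentation(xyz, max_trans=0.2, max_rot_deg=30.0):
    """Sample a random translation/rotation and apply it to (N, 3) point coordinates.

    Returns the transformed points together with (R, t) so that the same transform
    can be applied to the action translations and rotations.
    """
    t = np.random.uniform(-max_trans, max_trans, size=3)            # meters, per axis
    angles = np.random.uniform(-max_rot_deg, max_rot_deg, size=3)   # degrees, per axis
    R = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    return xyz @ R.T + t, (R, t)
```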

Baselines. ACT [57], Diffusion Policy [4], Act3D [11] and DP3 [53] are trained based on their official implementations. The Diffusion Policy baseline takes ResNet18 as the visual encoder and adopts a CNN-based backbone. For the Act3D baseline, we implement a simple planner for the pick-and-place tasks to avoid collisions, which decouples an action into a horizontal motion and a vertical one, following a heuristic rule: the horizontal motion is executed before a downward motion and after an upward motion. For ACT (3D), we replace the image tokens with point tokens. For Diffusion Policy (3D), we employ an AvgPooling layer to obtain the observation embedding from the point features.

V-C Discussions about Action Representations

Figure 3: Failure cases of the Collect Pens task in the experiments.

Axis-wise. (Tab. V (blue)) The axis-wise action representation assumes that only one axis-wise movement is conducted in each step (typically one of the translations along the X/Y/Z-axis, the rotations around the X/Y/Z-axis, and the gripper opening/closing). Demonstrations with axis-wise action representations are usually collected via low-frequency teleoperation, such as keyboard teleoperation.

Natural. (Tab. V (red)) The natural action representation allows composite movement patterns in each step (that is, the robot can simultaneously translate, rotate, and open or close the gripper in one step). Demonstrations with natural action representations are usually collected via high-frequency teleoperation, such as teleoperation with haptic devices.

Discussions. Because each action has only one non-zero component at any step, axis-wise action representations are easy to learn. However, this ease of learning can introduce noticeable inductive biases in the learned policy, resulting in a lack of action diversity. Moreover, the axis-wise action representation increases the difficulty of representing complex trajectories, limiting the generalization capability of the learned policy. In contrast, although the natural action representation is more challenging to learn than the axis-wise one, the learned policy can exhibit more natural action trajectories. Furthermore, the natural action representation aligns more closely with the patterns of human action execution, so adopting natural actions can enhance data collection efficiency, as illustrated in Tab. VII. Therefore, we adopt the natural action representation in our collected real-world demonstrations.

V-D Failure Cases and Recovery

In this section, we take the Collect Pens task as an example and illustrate the failure cases of RISE during experiments in Fig. 3. We observe that failures are mainly caused by inaccurate positions during picking (Failure #1) and placing (Failure #2). We find that RISE can automatically correct some failure scenarios, such as instances where the pen is inadvertently moved due to imprecise positioning during grasping (Failure #1). In contrast, many keyframe-based methods [11, 43, 51] lack the ability to offer immediate recovery actions for failures, potentially leading to the exacerbation of errors.