1 Introduction
For autonomous vehicles and robots, estimating the 7 Degrees-of-Freedom state (location, dimension, and orientation) of surrounding objects in complicated real-world environments is a vital task. Recently, LiDAR-based 3D object detection has received increasing attention
[pvrcnn, votenet, clocs] due to its ability of direct 3D measurement. However, compared with well-developed 2D image detection, LiDAR-based 3D detection still suffers from the intrinsic difficulties of point sparsity and the large search space in 3D. Given that point clouds from LiDAR are irregular, most 3D detection methods transform such data into regular 3D voxel grids [song2016deep, voxelnet, second, pointpillars] or collections of projected 2D views [joint3d, mv3d, pixor, conticonv, hdnet, mmf, rangercnn]. While these methods can conveniently exploit the ordered data representation with regular 2D or 3D convolutions for feature extraction, the quantization error introduced when constructing voxel or multi-view features limits their performance. In contrast, point-based methods [pointnet++, pointrcnn, frustumconv, frustum, std] can learn features from the raw point cloud, but they usually need complicated and inefficient operations [pointnet++] to aggregate local information. Consequently, point-based methods are mostly fused with other representations rather than used individually for 3D object detection.
In this paper, we are interested in another setting, akin to R-CNN [rcnn]: we already have a set of proposals in 3D space and seek to refine them. We therefore name our method LiDAR R-CNN. In this setting, PointNet becomes a natural choice, since the network only needs to process the points of a single object in a small region. Compared with DNN features rich in semantic information, the original point cloud contains the most accurate location information. Intuitively, applying PointNet [pointnet] on the raw point cloud for detection is straightforward. However, we find that the results of such a naive implementation are unsatisfactory. Through careful analysis, we identify an intriguing size ambiguity problem: unlike voxel-based methods that divide space into equally sized cells, PointNet only aggregates features from points while ignoring the spacing in 3D space. Yet the spacing encodes essential information, such as the scale of the object. To remedy this issue, we propose a series of solutions and prove their effectiveness through detailed experiments. Comprehensive evaluation on the Waymo Open Dataset (WOD) [wod] shows that our proposed LiDAR R-CNN improves the performance of various off-the-shelf detectors significantly and consistently. With a well-tuned PointPillars model, we even outperform the state-of-the-art models.
To summarize, our contributions are as follows:

We propose an R-CNN style second-stage detector for 3D object detection based on PointNet. Our method is plug-and-play with any existing 3D detector and needs no retraining of the base detectors.

We reveal the size ambiguity problem that arises when using a point-based R-CNN detector. Through careful analysis, we propose several different ways to make the detector aware of the size of the proposal box. Despite the simple design, they achieve significant performance improvements.

We test our proposed method on the WOD [wod] and KITTI [kitti] datasets with various base detectors. Our method consistently improves the base detectors while running at 200 fps for 128 proposals on a 2080Ti GPU.
2 Related Work
This section briefly reviews closely related work on LiDAR 3D detection, organized by data representation.
2.1 Voxel-based Methods
Researchers have long struggled with the irregular organization of point clouds, which makes it infeasible to apply traditional convolution defined on regular grids. Voxelization [voxelfpn, class] is one of the most common treatments: [wang2015voting, vote3deep] perform real-time 3D detection by converting point cloud data into voxels with hand-crafted features. Nevertheless, the limited generalization of hand-crafted features hurts their performance in complicated real-world environments. Subsequent work improves them in various ways: [3dconv] introduces a 3D fully convolutional network on a binary 3D voxel representation; VoxelNet [voxelnet] applies a mini PointNet for discriminative voxel feature extraction; SECOND [second] further overcomes the computational barrier of dense 3D convolution by applying sparse convolution; [hansong3d] proposes 3D neural architecture search (3D-NAS) to find efficient 3D model structures. Although voxel-based methods are widely used, their performance upper bound is still limited by the quantization error introduced when dividing the voxels. Thus voxel-based methods are usually fused with other representations, which are introduced later.
2.2 View-based Methods
CNN-based networks, such as VGGNet [vgg] and ResNet [resnet], have demonstrated excellent feature extraction ability on 2D computer vision tasks. An intuitive idea is therefore to transform the point cloud data into some kind of 2D view that can be efficiently processed by a 2D CNN. Among them, the bird's-eye view (BEV) [yolo3d, joint3d, afdet, polarnet, ssn] is one of the most common representations: PointPillars [pointpillars] collapses the points into vertical columns (pillars), and MV3D [mv3d] combines features from the BEV and the frontal view. While BEV encodes the scene in 2D while maintaining scale invariance, the range view [lasernet, dmlo, baidurange, squeezeseg, rangenet++, squeezesegv3, cylinder3d] is another compact representation based on the intrinsic 2.5D LiDAR scan. The range view avoids the sparsity of the point cloud but introduces scale variation for objects at different ranges. Recently, RCD [rcd, fan2021rangedet] tries to provide a uniform feature in 3D space with a range-conditioned dilated convolution, and uses Soft Range Gating to avoid the influence of significant changes in disparity. Benefiting from mature 2D CNNs, view-based methods are usually faster, but as mentioned, they introduce information loss when projecting 3D space onto 2D.

2.3 Point-based Methods
PointNet and its variants [pointnet, pointnet++, dgcnn, rscnn, pseudo] can directly take raw points as input and use symmetric operators to address the main issue of point clouds: their unordered nature. Based on these powerful tools, PointRCNN [pointrcnn] proposes a bottom-up 3D region proposal network: it first generates proposals from segmented foreground objects and then learns a more accurate bounding box regression branch to predict the box coordinates. [parta2] extends PointRCNN by introducing intra-object part supervision. PV-RCNN [pvrcnn] aggregates point features into a voxel-based framework, which improves performance remarkably. From another perspective, VoteNet [votenet, imvotenet] proposes a new proposal generation mechanism with deep Hough voting, based on the observation that LiDAR points are all distributed on the surfaces of objects. StarNet [starnet] uses sampled centers with anchors as proposals to achieve a more computationally efficient framework, yet the low quality of the proposals limits its performance.
A widely used Velodyne LiDAR HDL-64E collects more than 100K points in one scan, which is a considerable challenge for PointNet. So almost all point-based methods need to downsample when dealing with large scenes; PointRCNN, for instance, keeps 16,384 points. Besides, to effectively extract features from these individual points, hierarchical grouping methods as in [pointnet++] are usually essential, yet they are costly compared with pure PointNet. In summary, these two issues limit the performance and efficiency of point-based methods in general 3D detection tasks.
3 Methods
In this section, we first introduce the framework of our point-based RCNN model. We then describe the size ambiguity problem, the core issue that limits its performance, and finally propose several simple yet effective solutions to it.
3.1 Pointbased RCNN
Assuming that we already have a 3D LiDAR object detector (whether deep learning based or not) that generates many 3D bounding box proposals, our goal is to refine the 7-DoF bounding box parameters (including size, position, and heading) and the scores of the proposals simultaneously, i.e., in an R-CNN manner. To make our model plug-and-play and avoid further quantization error or interpolation, our input data is the original point cloud without any high-level DNN features.
Input Features. For each 3D proposal, we enlarge its width and length to include more contextual points around it. All the points in the enlarged box are then taken out as the input data for our RCNN model.
To improve the generalization of our RCNN model, the points are then normalized into the 3D bounding box coordinate system. The origin is set at the center of the box; the heading direction is set as the x-axis, its horizontally orthogonal direction as the y-axis, and the vertical up direction as the z-axis (Fig. 1).
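In code, this canonicalization might look like the following sketch (numpy; the helper name `to_box_frame` and the 7-DoF box layout `(cx, cy, cz, l, w, h, heading)` are our own illustrative conventions, not the paper's exact implementation):

```python
import numpy as np

def to_box_frame(points, box):
    """Transform LiDAR points into a proposal's canonical frame.

    points: (N, 3) array of x, y, z in the global frame.
    box:    (cx, cy, cz, l, w, h, heading) -- a 7-DoF proposal.
    The box center becomes the origin; the heading direction becomes
    the x-axis, its horizontal orthogonal the y-axis, up the z-axis.
    """
    cx, cy, cz, l, w, h, heading = box
    shifted = points - np.array([cx, cy, cz])
    # Rotate by -heading so the heading axis aligns with +x.
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return shifted @ rot.T
```

A point one meter ahead of the box center along the heading direction maps to (1, 0, 0) in the box frame, regardless of the box's global pose.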
Backbone. The normalized point coordinates, together with other metadata described later, are fed into a point-based network. For fast inference, we choose PointNet [pointnet] as our backbone. It consists of a Multi-Layer Perceptron (MLP) module with three fully connected layers and a max-pooling operator for feature aggregation. The aggregated feature is then fed into two branches: one for classification and the other for regression. The whole network structure is illustrated in Fig. 2. This backbone is extremely lightweight and fast, which makes it well suited to real-time applications.

Loss Function.
Similar to 2D R-CNN, the loss function for the classification branch is a softmax cross-entropy loss over the C categories,

  L_{cls} = -\frac{1}{B} \sum_{i=1}^{B} \log p_i^{c_i},   (1)

where C is the class number, B is the batch size, c_i is the assigned label for the i-th sample, and p_i^{c_i} is the softmax probability on class c_i. The IoU threshold for positive and negative samples depends on the specific category, e.g., 0.7 for vehicles and 0.5 for pedestrians on WOD.

The regression loss aims at refining the box parameters, so it is only applied to the positive samples. Since the boxes are already transformed to the bounding box coordinate system, the proposal p in the global reference frame is transformed to
  p = (0,\, 0,\, 0,\, l_p,\, w_p,\, h_p,\, 0).   (2)
Meanwhile, the 3D ground-truth bounding box is transformed accordingly to
  g = (x_g,\, y_g,\, z_g,\, l_g,\, w_g,\, h_g,\, \theta_g).   (3)
Also similar to 2D R-CNN, the regression targets for center and size are defined as
  t_x = x_g / l_p,\quad t_y = y_g / w_p,\quad t_z = z_g / h_p,   (4)
  t_l = \log(l_g / l_p),\quad t_w = \log(w_g / w_p),\quad t_h = \log(h_g / h_p).   (5)
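A minimal sketch of this target encoding, assuming the R-CNN-style parameterization written above (the helper name and argument layout are hypothetical; the ground-truth center is already expressed in the proposal's frame, so it is relative to the proposal center):

```python
import numpy as np

def encode_targets(gt, proposal_size):
    """Encode center and size regression targets in the proposal frame.

    gt:            (x_g, y_g, z_g, l_g, w_g, h_g) of the ground truth,
                   with the center given relative to the proposal center.
    proposal_size: (l_p, w_p, h_p) of the proposal box.
    """
    xg, yg, zg, lg, wg, hg = gt
    lp, wp, hp = proposal_size
    # Center residuals, normalized by the proposal dimensions.
    t_center = (xg / lp, yg / wp, zg / hp)
    # Log-ratio size residuals, as in 2D R-CNN.
    t_size = (np.log(lg / lp), np.log(wg / wp), np.log(hg / hp))
    return t_center + t_size
```

When the ground truth coincides with the proposal, every target is zero, so a perfect proposal needs no correction.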
For the orientation regression, which is specific to 3D detection, we cannot directly take \theta_g as the target. Due to the sparsity of point clouds and the appearance ambiguity of some objects, the predicted orientation may be flipped by \pi. If we simply used \theta_g as the target, the gradient of flipped samples would be too large and overwhelm the normal samples. To avoid training being contaminated by these outlier samples, we set the orientation target to be the minimum residual between the original orientation and the flipped orientation:
  t_\theta^1 = \theta_g,\quad t_\theta^2 = \theta_g - \mathrm{sign}(\theta_g)\,\pi,   (6)

  t_\theta = \arg\min_{t \in \{t_\theta^1,\, t_\theta^2\}} |t|.   (7)
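The flip-aware orientation target can be sketched as follows (an equivalent formulation that simply picks the smallest-magnitude residual among θ_g and θ_g ± π; the helper name is ours):

```python
import numpy as np

def orientation_target(theta_g):
    """Pick the smaller residual between the original and the pi-flipped
    orientation, so a flipped prediction does not produce a huge gradient."""
    candidates = np.array([theta_g, theta_g - np.pi, theta_g + np.pi])
    return candidates[np.argmin(np.abs(candidates))]
```

For a small residual such as 0.1 rad the target is unchanged, while a near-flipped residual of 3.0 rad is mapped to the much smaller 3.0 - π ≈ -0.14 rad.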
Having all the targets t = (t_x, t_y, t_z, t_l, t_w, t_h, t_\theta), our regression loss is defined as
  L_{reg} = \frac{1}{N_{pos}} \sum_{i=1}^{N_{pos}} \mathrm{SmoothL1}(\hat{t}_i - t_i),   (8)
where \hat{t} is the output of the model's regression branch and N_{pos} is the number of positive samples. Finally, the overall loss is formulated as
  L = L_{cls} + \lambda L_{reg},   (9)
where \lambda is a balance factor, set to 20 in practice.
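Putting the two terms together, a minimal numpy sketch of the training loss (assuming a smooth-L1 regression penalty and the balance factor above; all names are illustrative, not the paper's code):

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth-L1 (Huber) penalty with delta = 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def total_loss(cls_logits, labels, reg_pred, reg_targets, pos_mask, lam=20.0):
    """L = L_cls + lam * L_reg; regression is applied to positives only.

    cls_logits: (B, C) raw scores; labels: (B,) class indices;
    reg_pred/reg_targets: (B, 7) box residuals; pos_mask: (B,) bool.
    """
    # Numerically stable softmax cross entropy, averaged over the batch.
    z = cls_logits - cls_logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_cls = -logp[np.arange(len(labels)), labels].mean()
    # Regression loss normalized by the number of positive samples.
    n_pos = max(int(pos_mask.sum()), 1)
    l_reg = smooth_l1(reg_pred - reg_targets)[pos_mask].sum() / n_pos
    return l_cls + lam * l_reg
```

With confident correct classifications and exact regression on the positives, the loss approaches zero.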
3D AP (IoU=0.7)  3D APH (IoU=0.7)  BEV AP (IoU=0.7)  BEV APH (IoU=0.7)  
Difficulty  Method  Overall  0-30m  30-50m  50m-Inf  Overall  0-30m  30-50m  50m-Inf  Overall  0-30m  30-50m  50m-Inf  Overall  0-30m  30-50m  50m-Inf 
LEVEL_1  StarNet [starnet]  53.7                               
PointPillar [pointpillars]  56.6  81.0  51.8  27.9          75.6  92.1  74.1  55.5          
MVF [mvf]  62.9  86.3  60.0  36.0          80.4  93.6  79.2  63.1          
Pillarod [pillarod]  69.8  88.5  66.5  42.9          87.1  95.8  84.7  72.1          
PVRCNN [pvrcnn]  70.3  91.9  69.2  42.2  69.7  91.3  68.5  41.3  83.0  97.4  83.0  65.0  82.1  96.7  82.0  63.2  
RCD [rcd]  66.4  86.6  65.6  40.0  66.1  86.3  65.3  39.9  82.1  93.3  80.9  67.2  81.4  92.8  80.2  66.2  
RV first stage  53.4  73.0  49.0  28.1  52.8  72.3  48.4  27.8  70.5  86.2  68.5  51.9  69.5  85.2  67.6  51.0  
LiDAR RCNN (rv)  68.7  84.8  67.6  47.3  68.2  84.3  67.1  46.6  81.1  90.5  80.5  68.9  80.4  89.9  79.8  67.8  
PointPillars*  72.1  88.3  69.9  48.0  71.5  87.8  69.3  47.3  87.9  96.6  87.1  78.1  87.1  96.0  86.2  76.5  
LiDAR RCNN (pp)  75.6  92.1  74.3  53.3  75.1  91.6  73.8  52.6  88.2  97.1  87.6  78.3  87.5  96.6  86.9  76.8  
LiDAR RCNN (2x)  75.6  91.9  74.2  53.5  75.1  91.5  73.6  52.7  90.0  97.0  89.4  78.5  89.2  96.6  88.6  77.0  
LiDAR RCNN (2xc)  76.0  92.1  74.6  54.5  75.5  91.6  74.1  53.4  90.1  97.0  89.5  78.9  89.3  96.5  88.6  77.4  
LEVEL_2  PVRCNN [pvrcnn]  65.4  91.6  65.1  36.5  64.8  91.0  64.5  35.7  77.5  94.6  80.4  55.4  76.6  94.0  79.4  53.8 
RV first stage  46.0  70.6  44.1  21.0  45.5  69.8  43.6  20.7  63.5  85.5  62.7  41.1  62.7  84.5  61.9  40.3  
LiDAR RCNN (rv)  60.1  84.1  61.8  35.7  59.7  83.6  61.3  35.2  74.2  89.8  74.7  54.8  73.5  89.2  74.0  53.9  
PointPillars*  63.6  87.4  62.9  37.2  63.1  86.9  62.3  36.7  81.3  94.0  81.7  65.5  80.4  93.5  80.8  64.1  
LiDAR RCNN (pp)  66.5  89.1  68.3  40.8  66.1  88.7  67.7  40.3  81.2  94.2  81.8  63.6  80.5  93.7  81.0  62.3  
LiDAR RCNN (2x)  68.0  91.1  68.1  41.1  67.6  90.7  67.6  40.5  81.6  94.3  82.2  65.6  80.9  93.9  81.4  64.2  
LiDAR RCNN (2xc)  68.3  91.3  68.5  42.4  67.9  90.9  68.0  41.8  81.7  94.3  82.3  65.8  81.0  93.9  81.5  64.5 
PointPillars*: the PointPillars model implemented in MMDetection3D. RV first stage: our re-implementation of the first stage of RCD [rcd], but without the Range Conditioned Convolution operator because it is not open-sourced.
3.2 Size Ambiguity Problem
The point-based RCNN is quite straightforward, and similar models have already been exploited in previous works [pointrcnn, fastpointrcnn, ipod]. However, when we directly apply such an algorithm, we observe a clear performance drop, most significant when a class has mixed sizes or multiple classes are trained together. After careful analysis, we find an intriguing size ambiguity problem in point-based RCNN models.
The problem stems from the nature of point clouds. Unlike 2D images, in which every position is filled with RGB values, a point cloud is sparse and irregular. If we directly use a point-based method that only considers the existing points, we ignore the scale information carried by the spacing. Taking Fig. 3(a) as an example, the dashed blue and green bounding boxes have the same feature representation if we only consider the red points inside them, yet their classification and regression targets may differ considerably. Different from 2D RCNN, we should equip our LiDAR R-CNN with the ability to perceive the spacing and the size of proposals. In subsec:verify, we show some statistics from real data. We therefore propose the following size-aware methods to enhance the features.
3.3 Size-Aware Point Features
In this section, we elaborate on several methods to fix the issue mentioned above. These methods are discussed and compared in detail in the experiments.
Normalization. The simplest solution to the ambiguity problem is to normalize the point coordinates by the proposal size. The proposal's boundaries are then aligned to a unit cube. If the proposal is enlarged, the point coordinates become smaller and the size target becomes larger; consequently, the model can be aware of the size of the proposal. However, size normalization turns every proposal into a unit cube and distorts the object shape (Fig. 3(b)). When the RCNN model is applied to multiple categories, it totally ignores the scale difference between categories, which makes it more difficult for the model to distinguish them.
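A one-line sketch of this size normalization (the helper is hypothetical; it assumes box-frame coordinates, and dividing by the half-size maps the proposal boundaries to ±1):

```python
import numpy as np

def size_normalize(points, size):
    """Divide box-frame coordinates by the proposal half-size, mapping the
    box boundaries to +/-1; note this distorts shape across categories."""
    return points / (np.asarray(size) / 2.0)
```

A point lying exactly on the corner of a (4, 2, 1.5) box is mapped to (1, 1, 1), the same as the corner of any other box, which is exactly why this variant confuses multi-class models.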
Anchor. One solution proposed in previous works [ipod, frustum] is to set an anchor for each category. The regression targets are then based on a fixed anchor, which eliminates the ambiguity of the size target. However, our goal is to judge the quality of the proposal (whether its IoU w.r.t. the ground truth exceeds a threshold) and to refine the proposal, not the anchor. Since the network is still unaware of the proposal's boundary, this method does not solve the classification ambiguity. Furthermore, when there are few points in the boxes, objects from different categories will have similar features; in this case there is also a regression ambiguity, because different categories have different anchors and thus different regression targets.
Voxelization. Several second-stage detectors voxelize the points in the proposal boxes [parta2, rcd]. The empty voxels and the point distribution within voxels may indicate the size and spacing of the proposal, as shown in Fig. 3(d). However, in this solution the voxel actually acts as a "sub-anchor": the points in it are still unaware of the voxel boundary. The model only has coarse information about the proposal size at the voxel level, not the point level. As a result, this solution alleviates but does not fully solve the ambiguity problem.
Revisiting the previous solutions, we conclude that the key is to provide the size information to the network while preserving the shape of the object. To achieve this, we propose two candidate solutions.
Boundary offset. To provide the proposal boundary information, a simple way is to append the boundary offset to the point features. From the offset, the network knows how far each point is from the proposal's boundary, which resolves the ambiguity problem.
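A minimal sketch of the boundary-offset feature (a hypothetical helper; it assumes points already expressed in the box frame and appends per-point distances to the six faces of the proposal):

```python
import numpy as np

def boundary_offsets(points, size):
    """Append per-point distances to the six proposal faces.

    points: (N, 3) coordinates in the box frame (box centered at origin).
    size:   (l, w, h) of the proposal.
    Returns (N, 9): xyz plus distances to the +faces and -faces.
    """
    half = np.asarray(size) / 2.0
    to_max = half - points   # distance to the +x/+y/+z faces
    to_min = points + half   # distance to the -x/-y/-z faces
    return np.concatenate([points, to_max, to_min], axis=1)
```

If the proposal is enlarged, these offsets change even when the contained points do not, which is exactly the signal the vanilla PointNet input lacks.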
Virtual points. Since the RCNN model ignores the spacing, another natural idea is to populate the spacing with "virtual points" that indicate its existence. We generate grid points evenly distributed in the proposal. The number of virtual points is equal for all proposals, and the coordinates of virtual points are normalized in the same way as the real points. We append a binary feature to all points indicating whether each point is real or virtual. Through the virtual points, the RCNN module can perceive the size of the proposal, since the spacing is no longer under-represented.
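A sketch of the virtual-point augmentation (the helper name and the grid resolution are illustrative choices, not the paper's setting):

```python
import numpy as np

def add_virtual_points(points, size, grid=(4, 4, 2)):
    """Augment real points with an even grid of virtual points spanning the
    proposal, plus a binary real/virtual flag as the last feature.

    points: (N, 3) in the box frame; size: (l, w, h); grid: points per axis.
    Returns (N + prod(grid), 4): xyz plus a flag (1 = real, 0 = virtual).
    """
    half = np.asarray(size) / 2.0
    axes = [np.linspace(-h, h, n) for h, n in zip(half, grid)]
    vp = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    real = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    virtual = np.concatenate([vp, np.zeros((len(vp), 1))], axis=1)
    return np.concatenate([real, virtual], axis=0)
```

Because the virtual grid stretches with the proposal, the max-pooled PointNet feature now changes whenever the proposal is enlarged, even if the real points stay the same.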
There may well be other alternatives; we believe they will also work as long as they provide the proposal size information to the RCNN model. The main contribution of this paper is to clarify the size ambiguity problem; the solution is not limited to the ones proposed here.
vehicle (IoU=0.7)  pedestrian (IoU=0.5)  cyclist (IoU=0.5)  
Difficulty  Method  3D AP  3D APH  3D AP  3D APH  3D AP  3D APH 
LEVEL_1  SECOND [second]  58.5  57.9  63.9  54.9  48.6  47.6 
LiDAR RCNN (sec)  62.6  62.1  68.2  59.5  52.8  51.6  
PointPillars [pointpillars]  71.6  71.0  70.6  56.7  64.4  62.3  
LiDAR RCNN (pp)  73.4  72.9  70.6  57.8  66.8  64.8  

LiDAR RCNN (2x)  73.5  73.0  71.2  58.7  68.6  66.9 
LEVEL_2  SECOND [second]  51.6  51.1  56.0  48.0  46.8  45.8 
LiDAR RCNN (sec)  54.5  54.0  59.3  51.7  50.9  49.7  
PointPillars [pointpillars]  63.1  62.5  62.9  50.2  61.9  59.9  
LiDAR RCNN (pp)  64.6  64.1  62.5  50.9  64.3  62.4  

LiDAR RCNN (2x)  64.7  64.2  63.1  51.7  66.1  64.4 
4 Experiments
In this section, we first briefly introduce the datasets we use in subsec:datasets; then, in subsec:verify, we show statistics to support our claims in subsec:Problem. The implementation details of our LiDAR R-CNN are provided in subsec:Implementation Details. subsec:Waymo Open Dataset shows that LiDAR R-CNN generally improves the performance of all baseline models and surpasses previous state-of-the-art methods. In subsec:Ablation, we conduct extensive ablation studies to show that the proposed methods effectively solve the problem described in subsec:Problem.
4.1 Datasets
Waymo Open Dataset (WOD) [wod] is a recently released public dataset for autonomous driving research. It contains RGB images from five high-resolution cameras and 3D point clouds from five LiDAR sensors. The whole dataset consists of 1000 scenes (20s fragments) for training and validation and 150 scenes for testing. For the 3D detection task, the dataset has 12M labeled 3D LiDAR objects annotated over the full 360-degree field, including 4.81M vehicles and 2.22M pedestrians for training and 1.25M vehicles and 539K pedestrians for validation. It is one of the largest and most diverse autonomous driving datasets. For evaluation, average precision (AP) and APH are used as metrics; [wod] proposes APH to incorporate heading information into the detection metric. The IoU threshold is 0.7 for vehicle detection and 0.5 for pedestrians and cyclists.
The KITTI dataset [kitti] is one of the most widely used datasets for 3D detection. It is collected with sensors such as GPS/IMU with RTK, stereo cameras, and a Velodyne HDL-64E. There are 7481 training samples, commonly divided into 3712 samples for training and 3769 for validation, plus 7518 samples for testing.
4.2 Empirical Study of the Size Ambiguity Problem
As explained in subsec:Problem, the Size Ambiguity Problem means that when proposals are enlarged, the point clouds (and thus the features from PointNet) falling in them may not change. We verify this hypothesis on the WOD with a typical 3D detector, PointPillars [pointpillars], by enlarging the width and length of every output bounding box on the WOD testing set: most boxes keep exactly the same set of points, and the few boxes that do gain points mostly gain only ground points.
The output features of these bounding boxes and their enlarged versions will therefore be very similar, while their classification and regression targets differ significantly. These statistics reflect how ubiquitous the Size Ambiguity Problem is.
4.3 Implementation Details
For all experiments except voxelization, we use the network architecture shown in fig:pointnet_backbone. We shrink the number of embedding channels in PointNet to achieve fast inference while maintaining accuracy. 512 points are sampled for each proposal. If the number of points in a bounding box is fewer than 512, we randomly repeat points to keep the input dimension fixed.
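The fixed-size sampling described above can be sketched as follows (the helper name is ours; with fewer than `num` points, indices are repeated at random so the input tensor keeps a fixed shape):

```python
import numpy as np

def sample_points(points, num=512, rng=None):
    """Sample exactly `num` points per proposal, repeating at random when
    the box holds fewer points than requested."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(points)
    if n >= num:
        idx = rng.choice(n, num, replace=False)   # subsample without repeats
    else:
        # Keep every real point once, then pad by repeating random points.
        idx = np.concatenate([np.arange(n), rng.choice(n, num - n, replace=True)])
    return points[idx]
```

Either way the output has shape `(num, 3)`, so proposals with very different point counts can be batched together.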
For the voxelization experiments, we use two fully connected layers to extract features in each voxel. The features are then max-pooled within voxels to form a 3D voxel map. Finally, five 3D convolution layers and three fully connected layers are utilized to extract the RCNN features.
To ensure the reproducibility of our algorithm, we use the mmdetection3d code base (https://github.com/openmmlab/mmdetection3d) to extract proposals from the Waymo Open Dataset. We directly save the outputs of PointPillars [pointpillars] and SECOND [second] with the default testing settings in mmdetection3d as our inputs. There are about 4M vehicle proposals, 1.8M pedestrian proposals, and 72K cyclist proposals for each model. Following PointRCNN [pointrcnn], we randomly jitter the vehicle proposals during training as augmentation.
Besides the models from MMDetection3D, we also apply our proposed algorithm on the recently proposed RCD model [rcd]. It is a twostage detector with a range view based first stage and a voxelbased second stage. Since it is not opensourced, we only reimplemented its baseline model without the Range Conditioned Convolution operator and the Soft Range Gating. The results of both stages match with the original paper.
In our experiments, all models are trained using SGD with a momentum of 0.9 for 60 epochs. We use a poly learning rate schedule starting from 0.02, with weight decay. Training the model on WOD takes 7 hours on 4 Nvidia GeForce GTX 2080Ti GPUs with a batch size of 256 per GPU.

4.4 3D Detection on Open Datasets
Waymo Open Dataset. We apply the proposed LiDAR R-CNN to three different kinds of baseline models: PointPillars [pointpillars], SECOND [second], and RV [rcd], which are bird's-eye-view based, 3D voxel based, and range view based, respectively. All methods are evaluated with 3D AP at the two difficulty levels defined in the official evaluation, where LEVEL_1 objects contain at least 5 points in each ground-truth bounding box while LEVEL_2 objects only need a single point.
tab:waymovehicle reports the results of vehicle detection with 3D and BEV AP on the validation sequences. Note that after hyperparameter fine-tuning, the PointPillars [pointpillars] implemented by MMDetection3D already surpasses the previous state-of-the-art model PV-RCNN [pvrcnn]. As shown in tab:waymovehicle, our method improves all the baseline models by a significant margin. Benchmarking over various distance ranges shows that LiDAR R-CNN improves the baselines consistently at all distances. Furthermore, with the strong PointPillars baseline, LiDAR R-CNN outperforms PV-RCNN on the 3D vehicle detection task. We also experiment with a 2x model with double channels, which enhances performance on the more challenging LEVEL_2 cases. On top of this, we cascade another stage to show the capability of our models, which further increases the 3D AP on vehicle detection. For the range view model, without bells and whistles, we outperform RCD [rcd].

Additionally, we show multi-class detection results for VEHICLE, PEDESTRIAN, and CYCLIST in tab:waymo3d. Because we could only obtain multi-class results of PointPillars [pointpillars] and SECOND [second] from publicly available code, we test our method with these two baselines only. We use a multi-class classification loss and a class-agnostic regression loss for three-class detection, without any other tuning. tab:waymo3d demonstrates that our model also performs well in detecting small objects besides vehicles.
Methods  Easy  Moderate  Hard 

PointPillars(T)  82.17  72.81  67.57 
PointPillars(TV)  83.12  74.11  69.14 
Ours(TV)  85.97  74.21  69.18 
KITTI Dataset. We also show the effectiveness of the proposed method on the KITTI dataset [kitti]. LiDAR R-CNN works by fitting residuals upon first-stage models. However, models can easily achieve high AP (easy set) on KITTI's small training split, so the limited training data reduces the effectiveness of our LiDAR R-CNN. As shown in Tab. 3, we use PointPillars(T) as the first stage and train LiDAR R-CNN with the trainval data to obtain more effective data; we also show PointPillars(TV) for a fair comparison. Though the improvement is smaller than on WOD, we still highlight the 2.8 AP improvement on the easy set.
4.5 Ablation Studies
In this section, we first conduct an ablation study to support the claims in subsec:Problem. We then show the generalization and inference time of the proposed method. Both single-class and multi-class results on the Waymo Open Dataset are reported.
The size ambiguity problem.
Methods  3D AP@70  BEV AP@70 

PointPillars [pointpillars]  71.6  87.1 
PointNet refinement  74.1  87.9 
voxel  72.9  87.2 
anchor  75.2  88.2 
size normalization  75.4  88.1 
virtual point  75.4  88.1 
boundary offset  75.6  88.3 
tab:ablationsingle compares the different methods proposed in subsec:Problem. From the table, we conclude that all point-based RCNN models outperform the baseline PointPillars, and the PointNet models perform even better than the voxel-based one. All other solutions achieve similar improvements over the baseline in the one-class setting.
Methods  vehicle  pedestrian  cyclist 

PointPillars [pointpillars]  71.6  70.6  64.4 
PointNet refinement  72.1  69.2  62.2 
voxel  72.1  69.8  64.5 
anchor  72.5  70.2  63.5 
size normalization  72.7  69.9  64.4 
virtual point  73.3  70.4  66.2 
boundary offset  73.4  70.6  66.8 
Nevertheless, in the multi-class setting, the size normalization and anchor solutions suffer from difficulty in distinguishing different categories, as shown in tab:ablationthree. Normalizing by the proposals' size gives proposals from different classes a similar scale, which confuses the RCNN model. When the network is unsure about the class of an input object, it is also confused about which anchor it should regress from. The virtual point and boundary offset methods, which fundamentally solve the ambiguity problem, reach the best results in the multi-class setting.
Methods  3D AP@70  BEV AP@70 

RV baseline  53.4  70.5 
RV + voxel  64.1  76.7 
RV+ LiDAR RCNN  68.7  81.1 
In tab:ablationvoxel, we show the performance improvement of our LiDAR R-CNN over a range-view-based model, RV, the baseline model in RCD [rcd]. It is built with a voxel-based second stage whose features come not only from the raw points but also from high-level CNN feature maps. Similar to the voxel experiments in tab:ablationsingle, the voxelization method is inferior to our purely point-based RCNN model.
Network volume. A trade-off between speed and accuracy can be achieved by modifying the number of channels in the point encoding. To quantify the effect of channel numbers, we show a LiDAR RCNN (2x) model in tab:waymo3d, which obtains further improvements in pedestrian and cyclist detection with twice the channels. In addition, we show the adaptive inference time as a function of the number of sampled points per object in fig:number_points.
Cross-model generalization
Methods  vehicle  pedestrian  cyclist 

SECOND [second]  58.5  63.9  48.6 
LiDAR RCNN (sec)  62.6  68.2  52.8 
LiDAR RCNN (pp)  62.1  67.2  52.9 
tab:ablationgeneralization demonstrates that our approach generalizes across models: without fine-tuning, a LiDAR RCNN trained on PointPillars [pointpillars] proposals works well on the SECOND [second] model.
Subtask ablation
Method  cls  center  size  orient.  AP@L1  AP@L2 
PointPillars [pointpillars]          72.1  63.6 
PointNet refinement  ✓        72.9  64.2 
  ✓      72.9  64.3 
    ✓    72.2  63.7 
      ✓  72.1  63.6 
  ✓  ✓  ✓  ✓  74.1  65.3 
LiDAR RCNN (boundary offset)  ✓        73.3  64.5 
  ✓      73.7  65.0 
    ✓    72.6  64.1 
      ✓  72.2  63.6 
  ✓  ✓  ✓  ✓  75.6  66.5 
An RCNN model contains two tasks, namely classification and regression. The regression task can be further split into three subtasks: center regression, size regression, and orientation regression. In tab:ablationreal, we successively add these subtask refinements to the baseline model to see the effect of each. From the table, we can infer that vanilla PointNet refinement cannot improve the size subtask on top of the PointPillars [pointpillars] model. With the boundary offset, our LiDAR R-CNN has no ambiguity problem and is able to refine the size.
Methods  3D AP  time  Params 

voxel  72.9  19ms  3.5M 
ours  75.6  4.5ms  0.5M 
Finally, we report inference time and model size in tab:ablationspeed. The memory and computational demands of LiDAR RCNN are marginal.
5 Conclusions and Future Work
This paper presents LiDAR R-CNN, a fast and accurate second-stage 3D detector. Through a detailed analysis of the size ambiguity problem and deliberate experiments, we arrive at practical solutions. Comprehensive experiments on the Waymo Open Dataset demonstrate that our method steadily improves all baseline models and achieves state-of-the-art performance.
A multi-sensor fusion perception system is necessary for robotics and autonomous driving. Beyond the encouraging performance on single-frame LiDAR detection, our LiDAR R-CNN easily generalizes to other kinds of inputs such as multi-frame LiDAR and RGB+LiDAR. As a second-stage framework, our method readily handles various aggregated inputs. We will investigate how to develop our method into a multi-modal fusion framework.