Our Findings on Localization Accuracy of Vehicle Detection Models (Part 2)

Editor’s note: This is the second in a series of three posts outlining the findings of research our in-house computer vision team conducted on the accuracy of popular open-source object detection models for detecting vehicles, as measured by pixel-level accuracy. Before diving in, be sure to check out part 1 to understand the scope of our experiment.


Model outputs

Let’s dive into what we learned. We tested the five detectors on our test dataset, limiting the class list to car, truck, and bus. We also compared the model outputs to crowd-sourced annotations that Mighty AI’s platform generated using a standard workflow for box annotations.

Following are examples of the model outputs (red) and the ground truth (green) for Faster R-CNN NAS, Mask R-CNN ResNet V2, and SSD Mobilenet.


Faster R-CNN NAS

Mask R-CNN ResNet V2

SSD Mobilenet V1


Measuring localization accuracy

We measured the detectors’ localization accuracy by the Intersection over Union (IoU), also referred to as the Jaccard index, and by the pixel deviation, or pixel error. The pixel deviation is defined as the maximum deviation in the x- and y-directions between the detected box and the ground truth box.

Figure: Pixel deviation between ground truth box (pink) and detected box (gray) is defined as the maximum of delta_x and delta_y.

For a given detection to be counted as a true positive (TP), its IoU with an identically labeled ground truth box has to exceed a given threshold. When using pixel deviation instead of IoU, a TP has to have a deviation from the ground truth box that falls below a given threshold. Any detection that does not fulfill these requirements is counted as a false positive (FP).
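These two measures can be sketched in a few lines. The following is a minimal illustration (not the team’s actual evaluation code) for axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples in pixels; the function names are our own:

```python
def iou(a, b):
    """Intersection over Union (Jaccard index) of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pixel_deviation(a, b):
    """Maximum deviation in x and y between corresponding box edges."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]),
               abs(a[2] - b[2]), abs(a[3] - b[3]))

def is_true_positive(det, gt, iou_thresh=0.5):
    """TP if the detection's IoU with an identically labeled GT box exceeds the threshold."""
    return iou(det, gt) > iou_thresh
```

For example, a detection shifted 5 pixels to the right of a 10×10 ground truth box has an IoU of 1/3 and a pixel deviation of 5.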


Precision-recall curves

In our first experiment, we computed the micro-averaged precision-recall (PR) curves for the five models at IoU thresholds of 0.5 and 0.7, the same values used in the KITTI benchmark protocol.

For comparison, we provide the precision and recall for Mighty AI’s global community of human annotators for an IoU threshold of 0.5. Since the human annotations do not include a real-valued estimate of the class probability, they will generate only a single point on the PR chart.
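A micro-averaged PR curve of this kind can be computed by pooling detections from all images and classes, sorting them by confidence, and accumulating TP/FP counts as the score threshold sweeps downward. The sketch below is our own simplification (not the team’s code): `detections` is a list of `(score, is_tp)` pairs produced by matching each detection against ground truth (e.g. IoU > 0.5), and `n_gt` is the total number of ground truth boxes.

```python
def pr_curve(detections, n_gt):
    """One (recall, precision) point per detection, in descending score order."""
    points = []
    tp = fp = 0
    for score, is_tp in sorted(detections, key=lambda d: -d[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_gt, tp / (tp + fp)))  # (recall, precision)
    return points
```

For instance, `pr_curve([(0.9, True), (0.8, False), (0.7, True)], n_gt=4)` yields `[(0.25, 1.0), (0.25, 0.5), (0.5, 2/3)]`: each false positive lowers precision at constant recall, while each true positive raises recall. A single precision/recall pair, like the human-annotator point above, corresponds to evaluating at one fixed threshold.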

Figure: PR curves for IoU thresholds of 0.5 (solid) and 0.7 (dashed)

As you can see, Mask R-CNN performs slightly better than Faster R-CNN NAS, followed by Faster R-CNN ResNet and ResNeXt. SSD falls short of all the other models.

For an IoU threshold of 0.5 at 90% precision, Mask R-CNN and Faster R-CNN NAS reached 75% recall, Faster R-CNN ResNeXt reached 65% recall, and ResNet reached 70% recall. For reference, Mighty AI’s global community of human annotators achieved 95% precision at 92% recall.

When we increased the IoU threshold to 0.7, we noticed a significant decrease in the performance across all systems. At 90% precision, Mask R-CNN and Faster R-CNN NAS dropped to 63% and 60% recall, respectively. Faster R-CNN ResNeXt and ResNet dropped to 45% and 50% recall, respectively.

We then used pixel deviation instead of the IoU to compute the micro-averaged PR curves.

Figure: PR curves for max pixel deviation thresholds of 25 (solid) and 10 (dashed)

The ranking of the models was similar to the ranking based on IoU thresholds, but the gap between the top four models narrowed. At a threshold of 25 pixels and 90% precision, Mask R-CNN and Faster R-CNN NAS reached 70% recall. For a threshold of 10 pixels, there was a significant 10-15% drop in the precision of all models across large parts of the curves.


Localization accuracy

To get a better understanding of the detectors’ localization accuracy, we computed the cumulative histogram of the IoU values of the TPs at an IoU threshold of 0.5 and a minimum class probability of 0.5.
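Reading such a cumulative histogram amounts to asking, for each threshold, what fraction of TPs exceed it. A minimal sketch of that computation (our own illustration, not the team’s code), where `tp_ious` holds the IoU values of the true positives:

```python
def fraction_above(tp_ious, thresholds=(0.7, 0.8, 0.9)):
    """Fraction of TPs whose IoU exceeds each threshold
    (the complementary cumulative distribution)."""
    n = len(tp_ious)
    return {t: sum(v > t for v in tp_ious) / n for t in thresholds}
```

For example, `fraction_above([0.95, 0.85, 0.75, 0.65])` gives `{0.7: 0.75, 0.8: 0.5, 0.9: 0.25}`. The same computation over pixel deviations (counting values *below* each threshold) produces the second cumulative histogram discussed below.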

Mask R-CNN and Faster R-CNN NAS performed the best. For Mask R-CNN, 50% of the TPs had an IoU > 0.9, 80% had an IoU > 0.8, and 90% had an IoU > 0.7. The human annotations from Mighty AI’s global community of annotators resulted in 86% of TPs with an IoU > 0.9, 98% with an IoU > 0.8, and 99% with an IoU > 0.7.

Figure: Cumulative histogram of number of TPs over IoU

The following graph shows the cumulative distribution of the pixel deviation for the TPs. Faster R-CNN NAS performed best, with 25% of TPs having a deviation of < 3 pixels, 50% a deviation of < 5 pixels, and 90% a deviation of < 13 pixels. The human annotations had 24% of TPs within 1 pixel of deviation, 73% within 3 pixels, 86% within 5 pixels, and 94% within 10 pixels.

Figure: Cumulative histogram of number of TPs over pixel deviation

What we learned

Among the five models tested, Mask R-CNN ResNet V2 and Faster R-CNN NAS performed best across all experiments, but neither achieved the quality levels of Mighty AI’s global community of human annotators.

We did an in-depth evaluation of the localization accuracy of the models using cumulative histograms computed across two types of accuracy measures: the IoU and the pixel deviation. Faster R-CNN NAS performed best, with 50% of its detections having an IoU above 0.9. For the pixel deviation measure, Faster R-CNN NAS hit the 50% mark at 5 pixels.


Next up: Check out part three of this series to see what we found when we evaluated the models’ localization accuracy based on the pixel deviation between the detections and the ground truth boxes, and what we discovered when we tested the models’ robustness against image noise and conversion to grayscale.