版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Empirical Upper-bound in Object Detection and MoreSeyed MehdiIranmaneshAli BorjiWest Virginia University & MarkableAIAbstractObject detection remains as one of the most notorious open problems in computer vision. Despite large strides in accuracy in recent yea
2、rs, modern object detectors have started to saturate on popular benchmarks raising the ques- tion of how far we can reach with deep learning tools and tricks. Here, by employing 2 state-of-the-art object detec- tion benchmarks, and analyzing more than 15 models over 4 large scale datasets, we I) car
3、efully determine the upper- bound in AP, which is 91.6% on VOC (test2007), 78.2% on COCO (val2017), and 58.9% on OpenImages V4 (val- idation), regardless of the IOU. These numbers are much better than the AP of the best model1 (47.9% on VOC, and 46.9% on COCO; IOUs=.5:.95), II) characterize the sour
4、ces of errors in object detectors, in a novel and intu- itive way, and find that classification error (confusion with other classes and misses) explains the largest fraction of errors and weighs more than localization and duplicate er- rors, and III) analyze the invariance properties of models when
5、surrounding context of an object is removed, when an object is placed in an incongruent background, and when images are blurred or flipped vertically. We find that mod- els generate boxes on empty regions and that context is more important for detecting small objects than larger ones. Our work taps
6、into the tight relationship between recognition and detection and offers insights to build better models2.1. Introduction and MotivationSeveral years of extensive research on object detection has resulted in accumulation of an overwhelming amount of knowledge regarding model backbones, tricks for mo
7、del training and optimization, data collection and annotation, and model evaluation and comparison 68, to a point that separating wheat from chaff is very difficult. As an exam- ple, truly understanding and implementing Average Preci- sion (AP) is frustratingly difficult. A quick Google search retur
8、ns numerous blogs and codes with discrepant expla- nations of AP. To make matters even worse, it is not quite clear whether AP has started to saturate, whether progress isFASHION94.191.691.691.671.271.271.261.471.070.747.971.145.764.762.759.751.437.6PASCAL VOC MaskRCNNMS COCO
9、84.6FasterRCNN78.278.278.281.663.566.062.1 46.951.250.728.0Figure 1. Upper-bound AP (red circles and numbers) and scores of the best models (blue; FCOS 59 on VOC and FASHION, and Hybrid Task Cascade 19 on COCO; numbers from the same model).significant, and more importantly how far we can improve fol
10、lowing the current path, making one wonder maybe we have reached the peak of performance using deep learning. Further, we do not know what is holding us back from mak- ing progress in object detection, compared to human-level (although debatable) accuracy on object recognition.To shed light on the a
11、bove matters, first we systemati- cally and carefully approximate the empirical upper-bound in AP. We hypothesize that the upper-bound AP (UAP) is the score of the best recognition model that is trained on the training target bounding boxes and is then used to label the testing target boxes. We also
12、 investigate whether visual context surrounding a target object or itsoverlapping boxes can improve the upper-bound AP. Second, we identify bot- tlenecks by characterising the type of errors that object de- tectors make and measure the impact of each one on per- formance. Third, we study the invaria
13、nce properties of var- ious object detectors on different types of transformations including incongruent context, scale, blur, vertical flip, etc. In a nutshell, we find that there is a large gap between the performance of the best detection models andthe empirical upper-bound as shown in Fig. 1. Th
14、is entails that there is a hope to reach this peak with the current tools, if we can find smarter ways to adopt object recognition models for object detection. We also find that classification remains as the major bottleneck in object detection and is more criti- cal over small objects. Specifically
15、, object detection models inherit the main limitations of CNNs which is the lack of in-1The best published mAP on COCO test-dev is 51.0 by Efficient- Det 58. See 3 for the latest results on COCO dataset.2Our code is publicly available at 7.4321arXiv:1911.12451v1 cs.CV 27 Nov 20193. Experimental Setu
16、p3.1. BenchmarksWe base our analysis on two recent large-scale object detection benchmarks: MMDetection 6, 20 and Detec- tron2 4. The former evaluates more than 25 models. The latter includes several variants of FastRCNN 28. In both benchmarks, all COCO models have been trained on train2017 and eval
17、uated on val2017. Here, we use MMDe- tection to train and test additional models on a newdataset.3.2. ModelsWe consider the latest models published in the ma- jor vision conferences and the ones included in the above benchmarks. Several variants of the RCNN model includ- ing FasterRCNN 52, MaskRCNN
18、30, RetinaNet 39,GridRCNN 42, LibraRCNN 48, CascadeRCNN 18,MaskScoringRCNN 36, GAFasterRCNN 66, and Hy- brid Task Cascade 19 are considered. We also include SSD 41, FCOS 59, and CenterNet 65. Different back- bones for each model are also taken into account.3.3. DatasetsWe employ 4 datasets including
19、 PASCAL VOC 25, our home-brewed FASHION dataset, MSCOCO 40, and OpenImages 37. Over VOC, we use trainval0712 for training (16,551 images, 47,223 boxes) and test2007 (4,952 images, 14,976 boxes) for testing. This dataset has 20 cat- egories. Our FASHION dataset covers 40 categories of clothing items
20、(39 + humans). Trainval, and test sets for this dataset contain 206,530 images (776,172 boxes) and 51,650 images (193,689 boxes), respectively. Fig. 5 dis- plays samples from this dataset (see Supplement for stats). This is a challenging dataset since clothing items are non- rigid as opposed to COCO
21、 or VOC objects. MSCOCO has 80 categories. It has carried the torch for benchmarking advances in object detection for the past 6 years. We use train2017 for training (118,287 images, 860,001 boxes) and val2017 (5,000 images, 36,781 boxes) for testing. Finally, we use the OpenImages V4 dataset, used
22、in Kaggle compe- tition 10. It has 500 classes and contains 1,743,042 images (12,195,144 boxes) for training and 41,620 images (226,811 boxes) for validation (used here for testing).3.4. MetricsWe use COCO evaluation code 2 to measure AP over IOU thresholds of 0.5 and 0.75 as well as the average AP
23、over IOUs in the range 0.5:.05:0.95. APs are calculated per class and are then averaged. We also report breakdown APs over small (area 322), medium (322 area 962) objects. Please see 2, 8, 5 for details.4. Characterizing the Empirical Upper-boundWe hypothesize that the empirical upper-bound in AP is
24、 the score of a detector with ground truth bounding boxes la-variance. Example failure cases include generating a lot of bounding boxes on a white background containing a single object, and failing to detect objects in incongruent contexts, vertically flipped images or blurred ones. It seems that hu
25、- mans can still manage to solve these tasks, although with higher effort and lower performance than intact images.2. Related WorkWe discuss three lines of related works. The first one includes works that strive to understand detection ap- proaches, identify their shortcomings, and pinpoint where mo
26、re research is needed. Parikh et al. 49 aimed to find the weakest links in person detectors by replacing different components in a pipeline (e.g. part detection, non-maxima- suppression) with human annotations. Mottaghi et al. 46 proposed human-machine CRFs for identifying bottlenecks in scene under
27、standing. Hoeim et al. 33 inspected de- tection models in terms of their localization errors, confu- sion with other classes, and confusion with the background on PASCAL dataset. They also conducted a meta-analysis to measure the impact of object properties such as color, texture, and real-world siz
28、e on detection performance. We replicate, simplify and extend this work on the larger COCO dataset and on image transformations. Russakovsky et al. 55 analyzed the ImageNet localization task and em- phasized on fine-grained recognition. Zhang et al. 64 measured how far we are from solving pedestrian
29、 detection. Vondrick et al. 61 proposed a method for visualizing ob- ject detection features to gain insights into their functioning. Some other related works in this line include 38, 67, 63.The second line regards research in comparing object detection models. Some works have analyzed and reported
30、statistics and performances over benchmark datasets such PASCAL VOC 26, 25, MSCOCO 40, CityScapes 22,and open images 37. Recently, Huang et al. 35 per- formed a speed/accuracy trade-off analysis of modern ob- ject detectors. Dollar et al. 24 and Borji et al. 15, 17, 16 compared models for person det
31、ection, and salient object detection, respectively. In 44, Michaelis et al. assessed detection models on degraded images and observed about 3060% performance drop, which could be mitigated by data augmentation. In order to resolve the shortcomings of the AP score, some works have attempted to introd
32、uce alterna- tive 29 or complementary evaluation measures 47, 53. A large number of works have also assessed object recognition models and their robustness (e.g. 56, 12, 51, 45).Works in the third line study the role of context in ob- ject detection and recognition (e.g. 13, 62, 43, 32, 60, 50, 54,
33、27. Heitz et al. 32 proposed a probabilistic frame- work to capture contextual information between stuff and things to improve detection. Barnea et al. 14 utilized co- occurrence relations among objects to improve the detection scores. Divvala et al. 23 explored different types of con- text in recog
34、nition. See also 32, 21, 57, 34, 43, 11.4322object onlyobject + contextcontext onlyABGT box Candidate boxesOZPZQuQPQPR1 v *ZSZRVR2Figure 2. Illustration of visual context surrounding an object.SRU =|PQ|SR Figure 3. A) Illustrationofoursetupforfindingboxeswith IOU with the target box (corresponding t
35、o = 2/(1 + ); = 2/3 for IOU = 0.5), B) The solutions are 4 curves represented by Eqs. 4 to 8. Four sample rectangles are shown with dashed o details lets recap how AP is calculated.AP calculation. For each category, detections over all im- ages are sorted according to their confidences. Sta
36、rting from the top of this list, the target with the highest IOU with each detection is considered. We have a true positive (TP; hit) iftheir IOU is thresh, and if that target has not been as- signed yet. We have a false positive (FP) if IOU thresh(i.e. localization error) or if the target has been
37、assigned (i.e. duplicate; two predictions on the same target). A target box can be matched with only one detection (the one withthe highest confidence score and IOU thresh). If a de- tection has IOU thresh with two targets, it is assigned to the one with the highest IOU which is not assigned already
38、.Scanning the sorted detection list again, a precision for each recall is obtained and is used to draw the Recall-Precision (RP) curve and to compute the AP. See 8, 5 for details.We explore two strategies in pursuit of the upper-bound AP. In the first strategy, we apply the best classifier from the
39、previous section to the target boxes. The detector built in this fashion gives the same AP regardless of the IOU threshold, since our detections are target boxes. As we ar- gued above, it is not possible to improve this detector at IOU=1. However, if we are interested in upper-bound for a lower IOU
40、(say ), then it might be possible to do better by searching among the candidate boxes near a target box and choose the one that can be classified better than the tar- get box, or aggregate information from nearby boxes. Thus, in our second strategy, we sample boxes around an object and either apply
41、the original classifier (trained on canonical object size) or train and test new classifiers on the surround- ing boxes. In any case, we always keep the target box but change its label and/or its confidence. First, lets take a look at our box sampling approach, which is illustrated in Fig. 3. Sampli
42、ng boxes with IOU above a threshold. Here, weare interested in finding the coordinates of the top-left cor-ner of all rectangles with IOU 1 withthe ground-truth bounding box. We use the coordinate systemcentered at the top-left corner of the target box P ; which can be easily converted to the image
43、level coordinate frame. The bottom-right coordinates of the desired rectangles that inter- sect with the target box from the top-left follow the equation = 2/(1 + ), where and are width and height ofTable 1. Recognition accuracy using object and/or its context.beled by the best object classifier. Th
44、e classification score is considered as the detection score. This way we essen- tially assume that the localization problem is solved and what remains is only object recognition. However, it might be possible to improve upon this detector in at least two ways: a) by exploiting local context around a
45、n object to im- prove classification accuracy and hence better UAP, and b) by searching over the scene and finding boxes that are eas- ier to classify (compared to the target box) and have enough overlap with the target box. This does not matter for the per- fect IOU but may affect IOUs lower than o
46、ne. We carefully investigate these possibilities in the following.4.1. Utility of the surrounding contextWe trained ResNet152 31 (see supp.) on target boxes in three settings as shown in Fig. 2: I) object only, II) object+ context, and III) context only. Standard data augmenta- tion techniques inclu
47、ding mean pixel subtraction, color jit- tering, random horizontal flip and random rotation (10 de-grees) were applied. Boxes were resized to 224 224 pix- els and models were trained for 15 epochs. Trained modelswere tested on the original object box. Results (top-1 ac- curacy) shown in Table 1 revea
48、l that the canonical object size contains the most information regarding the category of an object over all four datasets. Increasing or decreasing object box lowers the performance. Context-only scenario leads to high classification score but still does below other cases. Stretching the context to
49、the whole scene drops the performance significantly. Training and testing models on the same condition (i.e. both on object+context) results in higher accuracy on that specific condition but does not lead to better overall recognition accuracy.4.2. Searching for the best labelEssentially the problem
50、 definition here is how we can get the best classification accuracy for recognition of objects in the scene by utilizing all the information in the scene. This is different than recognition approaches that treat objects in isolation. Note that recognition accuracy is not the same as AP, since detect
51、ion scores also matter in AP calculation.Having the best classifier at hand, we are ready to ap- proximate the empirical upper-bound in AP. Beforedelving4323R2 = UVR1 = uv u = aU v = bVIOU = gab = 2g1 + gDatasetobject onlyobject + contextcontext only0.2 0.4 0.6 0.8 11.2 1.4 1.6 1.8 21.2 2 all imgVOC
52、 FASHION COCOOpenImg.39.3 68.0 82.6 92.5 94.8- 52.9 66.4 71.7 88.8- 67.1 79.8 86.7- 69.093.0 91.6 90.6 88.6 87.082.3 77.2 71.8 67.9 64.882.9 78.3 72.5 67.4 63.065.1 62.7 -63.6 64.9 35.329.0 32.2 12.043.7 48.9 11.0- 1001005050Using VOC evaluation codeUsing COCO evaluation code10010091.689.58080.08060
53、Table 2. Results of our second strategy for estimating the upper- bound AP (i.e. searching for the best bounding box or object label near a target box; among boxes with IOU 0.5). Notice that upper-bound for AP, AP0.5 and AP0.75 are all the same. Under- lined numbers show where we could improve over
54、the 1st strategy.6047.940VOCVOC040Figure 4. Model scores and upper-bound AP over PASCAL VOC dataset using VOC (left) and COCO AP evaluation codes (right). Categories are sorted based on the average model AP. Bar charts show classification scores. Solid red and dashed black lines repre- sent upper-bo
55、und AP, and the best model AP, respectively.rectangles, respectively (we assume all boxes have the same width and height as the target box). According to the illus- tration in Fig. 3(A), we have:R1R1 = uv, R2 = UV, IOU = , IOU =(1)1005002R2 R1From these equations and assuming u = U , and v = V , it
56、is easy to derive the following equations:FASHION71.22 1 + R1 = UV , R1 =R2(2)59.7and also:2 1 + 2 =, =for = 0.5(3)3The same equation governs the coordinates of the bottom- left, top-left, and top-right corners of the rectangles inter- secting with the target box at points Q, R, and S, respec- tivel
57、y (in the coordinate frames centered as these points, in order). Calculating the top-left corner of these rectangles (in their corresponding coordinate frames) and representing them in the coordinate frame of point P , we arrive at the following four equations (note that these are not lines):Figure 5. Upper-bound and model APs over the Fashion dataset.4.3. Upper-bound resultsHere, we report classification scores, upper-boun
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 健康晚餐搭配与营养分析方案
- 客户接待引导服务标准操作流程
- 环保安全隐患排查整改规范
- 拔罐排毒疗法禁忌症管理手册
- 辣椒育苗移栽田间管理规程
- 蔬菜产品农残快速检测方案
- 脉诊检查评估操作流程
- 牦牛夏季放牧补饲技术指引
- 风电场绝缘测试方案
- 排污单位环境风险管理指南
- 房屋安全鉴定服务投标方案
- 红木鉴赏与收藏知到智慧树章节测试课后答案2024年秋海南热带海洋学院
- 《新能源乘用车二手车鉴定评估技术规范 第1部分:纯电动》
- 工程造价咨询服务投标方案(技术方案)
- 修建祠堂合同模板
- 《交通监控系统》课件
- 2024年04月国家艺术基金管理中心应届毕业生招考聘用笔试历年典型考题及考点研判与答案解析
- 2024河北出版传媒集团招聘91人公开引进高层次人才和急需紧缺人才笔试参考题库(共500题)答案详解版
- 小升初英语词汇表(含1600个必备单词)+英语冲刺专项训练.情景对话+155个必考短语(必背)
- 等静压石墨行业分析
- 27.2.2相似三角形的性质教学设计人教版九年级数学下册
评论
0/150
提交评论