[1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[4] G. Goswami, S. Bharadwaj, M. Vatsa, and R. Singh, “On RGB-D face recognition using Kinect,” in 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS). IEEE, 2013, pp. 1–6.
[5] B. Y. Li, A. S. Mian, W. Liu, and A. Krishna, “Using Kinect for face recognition under varying poses, expressions, illumination and disguise,” in 2013 IEEE Workshop on Applications of Computer Vision (WACV). IEEE, 2013, pp. 186–192.
[6] H. Zhang, H. Han, J. Cui, S. Shan, and X. Chen, “RGB-D face recognition via deep complementary and common feature learning,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 8–15.
[7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[8] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for MobileNetV3,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1314–1324.
[9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[10] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in European conference on computer vision. Springer, 2014, pp. 345–360.
[12] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distillation for supervision transfer,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2827–2836.
[13] G. Li, Y. Gan, H. Wu, N. Xiao, and L. Lin, “Cross-modal attentional context learning for RGB-D object detection,” IEEE Transactions on Image Processing, vol. 28, no. 4, p. 1591, 2019.
[14] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the seventh IEEE international conference on computer vision, vol. 2. IEEE, 1999, pp. 1150–1157.
[15] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 886–893.
[17] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[18] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[19] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
[20] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4203–4212.
[21] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[23] M. Najibi, M. Rastegari, and L. S. Davis, “G-CNN: An iterative grid based object detector,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2369–2377.
[24] Y.-C. Lee, J. Chen, C. W. Tseng, and S.-H. Lai, “Accurate and robust face recognition from RGB-D images with a deep learning approach,” in BMVC, vol. 1, no. 2, 2016, p. 3.
[25] A. Chowdhury, S. Ghosh, R. Singh, and M. Vatsa, “RGB-D face recognition via learning-based reconstruction,” in 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2016, pp. 1–7.
[26] S.-Z. Li, B. Yu, W. Wu, S.-Z. Su, and R.-R. Ji, “Feature learning based on SAE-PCA network for human gesture recognition in RGBD images,” Neurocomputing, vol. 151, pp. 565–573, 2015.
[27] K. O. Rodriguez and G. C. Chavez, “Finger spelling recognition from RGB-D information using kernel descriptor,” in 2013 XXVI Conference on Graphics, Patterns and Images. IEEE, 2013, pp. 1–7.
[28] M. Ma, X. Xu, J. Wu, and M. Guo, “Design and analyze the structure based on deep belief network for gesture recognition,” in 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI). IEEE, 2018, pp. 40–44.
[29] M. Schwarz, A. Milan, A. S. Periyasamy, and S. Behnke, “RGB-D object detection and semantic segmentation for autonomous manipulation in clutter,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 437–451, 2018.
[30] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 918–927.
[31] S. Song and J. Xiao, “Deep sliding shapes for amodal 3D object detection in RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
[32] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[33] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture,” in Asian conference on computer vision. Springer, 2016, pp. 213–228.
[34] S.-J. Park, K.-S. Hong, and S. Lee, “RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4980–4989.
[35] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, 2020.
[36] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
[37] E. Chen, X. Bai, L. Gao, H. C. Tinega, and Y. Ding, “A spatiotemporal heterogeneous two-stream network for action recognition,” IEEE Access, vol. 7, pp. 57267–57275, 2019.
[38] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision. Springer, 2016, pp. 20–36.
[39] Q. Gao, J. Liu, Z. Ju, Y. Li, T. Zhang, and L. Zhang, “Static hand gesture recognition with parallel CNNs for space human-robot interaction,” in International Conference on Intelligent Robotics and Applications. Springer, 2017, pp. 462–473.
[40] T. Weng, A. Pallankize, Y. Tang, O. Kroemer, and D. Held, “Multimodal transfer learning for grasping transparent and specular objects,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 3791–3798, 2020.
[41] K.-H. Shih, C.-T. Chiu, J.-A. Lin, and Y.-Y. Bu, “Real-time object detection with reduced region proposal network via multi-feature concatenation,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
[42] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” in International conference on machine learning, 2013, pp. 1319–1327.
[43] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations (ICLR), 2016.
[44] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 2018, pp. 1451–1460.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675–678.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[47] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
[48] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[49] ——, “The PASCAL visual object classes challenge 2007 (VOC2007) results,” 2007.
[50] M. Everingham and J. Winn, “The PASCAL visual object classes challenge 2012 (VOC2012) development kit,” Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep., vol. 8, 2011.
[51] S. Hou, Z. Wang, and F. Wu, “Object detection via deeply exploiting depth information,” Neurocomputing, vol. 286, pp. 58–66, 2018.
[52] J.-A. Lin, C.-T. Chiu, and Y.-Y. Cheng, “Object detection with color and depth images with multi-reduced region proposal network and multi-pooling,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1618–1622.