Detailed Record

Author (Chinese): 張家盛
Author (English): Chang, Chia-Sheng
Thesis title (Chinese): 單眼影片物體之擬真編輯與動態三維標注
Thesis title (English): Plausible Editing and Dynamic 3D Annotation of Monocular Video Entities
Advisor (Chinese): 朱宏國
Advisor (English): Chu, Hung-Kuo
Oral defense committee (Chinese): 李潤容、廖弘源、陳炳宇、姚智原
Oral defense committee (English): Lee, Ruen-Rone; Liao, Hong-Yuan; Chen, Bing-Yu; Yao, Chih-Yuan
Degree: Doctoral
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 101062562
Year of publication (ROC calendar): 107 (2018)
Graduating academic year: 107
Language: English
Number of pages: 64
Keywords (Chinese): 影片合成、物件代理、二維三維標注工具、時間同調性
Keywords (English): Video synthesis; Entity proxies; 2D/3D annotator; Temporal coherence
Abstract:
Video remains the method of choice for capturing temporal events. However, with only raw monocular video footage, it remains difficult to make object-level edits within a single video or across multiple videos. Worse still, when the videos contain dynamic objects, seemingly simple tasks such as estimating or annotating 3D bounding box sequences become non-trivial. Although it may be possible to explicitly reconstruct the 3D geometry of the objects using extensive rigging of the environment or high-resolution 3D sensors, such a workflow is cumbersome, expensive, and tedious. To address these problems, we exploit the concept of proxy geometries to represent video entities and present two object-level applications in this thesis: plausible editing of monocular video entities and dynamic 3D annotation for dashcam videos. The former is a much simpler workflow that produces plausible editing and mixing of raw video footage using recovered sparse structure points. Specifically, we present a structure-preserving image warping applied to multiple input frames adaptively selected from the object video, followed by a spatio-temporally coherent image stitching that composes the final object image. The latter is an intuitive annotation system for dynamic dashcam videos that estimates temporally smooth 3D bounding box sequences of entities and, in alternation, improves the accuracy and precision of the noisy 2D bounding box tracks recovered from the video. In addition, an intuitive user interface ensures the correctness of the 2D/3D bounding box sequences during annotation. Technically, the system performs the computationally demanding tasks while the user provides high-level supervision. We extensively evaluate both systems on a variety of input videos and on the KITTI tracking benchmark, demonstrating the capabilities of our methods.
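To make the notion of "temporally smooth 3D bounding box sequences" concrete, the following is a minimal, hypothetical sketch, not the optimization used in this thesis: it smooths a noisy per-frame sequence of 3D box centers with a least-squares second-difference regularizer. The function name `smooth_centers` and the weight `lambda_smooth` are illustrative assumptions only.

```python
# Illustrative sketch: temporal smoothing of noisy per-frame 3D bounding-box
# centers via least squares. This is NOT the thesis's actual system; it only
# demonstrates the kind of temporal-coherence constraint the abstract refers to.
import numpy as np

def smooth_centers(noisy_centers: np.ndarray, lambda_smooth: float = 10.0) -> np.ndarray:
    """Solve argmin_X ||X - C||^2 + lambda * ||D2 X||^2 per coordinate,
    where D2 is the second-order finite-difference operator over frames."""
    n = noisy_centers.shape[0]
    if n < 3:
        return noisy_centers.copy()
    # Second-difference matrix D2 of shape (n-2, n): rows of [1, -2, 1].
    D2 = np.zeros((n - 2, n))
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    # Normal equations: (I + lambda * D2^T D2) X = C, solved per (x, y, z) column.
    A = np.eye(n) + lambda_smooth * (D2.T @ D2)
    return np.linalg.solve(A, noisy_centers)

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 60)
    truth = np.stack([5.0 * t, np.zeros_like(t), 1.5 + 0.0 * t], axis=1)  # straight path
    noisy = truth + 0.2 * np.random.randn(*truth.shape)                    # per-frame jitter
    smooth = smooth_centers(noisy)
    print("mean error (noisy) :", np.abs(noisy - truth).mean())
    print("mean error (smooth):", np.abs(smooth - truth).mean())
```

Increasing `lambda_smooth` trades per-frame fidelity for a smoother trajectory; a full system would additionally constrain box dimensions and orientation, which this sketch deliberately omits.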
Contents
1 Introduction 1
1.1 Plausible editing of monocular video entities 2
1.2 Dynamic 3D annotation for dashcam videos 2
1.3 Contributions 3
2 Related Work 5
2.1 Video editing 5
2.2 Novel view video rendering 6
2.3 Video-based modeling 6
2.4 Annotation tools 8
2.5 Dashcam video datasets 8
2.6 Object detection and tracking 9
3 Plausible Editing of Monocular Video Entities 10
3.1 Introduction 10
3.2 Overview 11
3.3 Scene modeling 14
3.4 Image-based rendering using SSP 16
3.4.1 Proxy geometry decomposition 17
3.4.2 Object frame retrieval 17
3.4.3 Structure-preserving image warping 20
3.4.4 Spatio-temporally coherent image stitching 22
3.4.5 Layer composition 24
3.5 Evaluation 25
3.5.1 Performance 27
3.5.2 Timing performance 29
3.5.3 Limitations 29
4 Dynamic 3D Annotation for Dashcam Videos 31
4.1 Introduction 31
4.2 Overview 32
4.3 Robust 3DBB estimation 35
4.4 Track correction 37
4.5 Track proposal 39
4.6 User annotation 41
4.6.1 Entity identification 41
4.6.2 Primitive refinements 43
4.7 Experiments and evaluation 44
4.7.1 Comparison: naïve annotation tool 45
4.7.1.1 Simulated annotation 46
4.7.1.2 User study 48
4.7.2 Comparison: baseline methods 49
5 Conclusions 53
5.1 Summary 53
5.2 Future work 54
A Detailed Results of Baseline Comparison 56
B List of Publications 57
Bibliography 58
(Full text not authorized for public access)