
Detailed Record

Author (Chinese): 詹振宏
Author (English): Chan, Cheng-Hung
Title (Chinese): 物換形移:基於神經網路與乘法殘差之影片編輯技術
Title (English): Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time
Advisor (Chinese): 陳煥宗
Advisor (English): Chen, Hwann-Tzong
Committee Members: 賴尚宏, 劉庭祿
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 110062513
Year of Publication (ROC calendar): 112 (2023)
Academic Year of Graduation: 111
Language: English
Number of Pages: 47
Keywords (Chinese): 機器學習、影片編輯
Keywords (English): Machine Learning, Video Editing
Statistics:
  • Recommendations: 0
  • Views: 25
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese): We propose a video-decomposition-based method for videos with spatiotemporally varying lighting and motion, which makes layer-based video editing convenient. Our neural network model decomposes the input video into a multi-layer object representation, where each object comprises a 2D texture map, a mask over the original video, and a multiplicative-residual estimator that characterizes lighting changes. A single edit on a texture map propagates to the corresponding locations in every frame of the video while keeping the remaining content consistent. By hash-encoding the coordinates, our method efficiently learns the layered neural representation of a 1080p video in 25 seconds per frame and renders the edited result in real time at 71 fps on a single GPU. Qualitatively, we apply our method to a wide variety of videos to demonstrate its effectiveness in producing high-quality editing results. Quantitatively, we adopt a feature-tracking evaluation protocol to objectively assess the consistency of the video edits.
Abstract (English): We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on the texture maps can be propagated to the corresponding locations in the entire video frames while preserving the consistency of the other contents. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing.
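Both abstracts describe the same per-pixel compositing: each layer's texture color is modulated by its multiplicative residual and blended by its alpha mask. Below is a minimal sketch of that blend, assuming illustrative tensor shapes and a hypothetical composite_layers helper; it is not code from the thesis.

    import torch

    def composite_layers(textures, residuals, alphas):
        # textures:  (L, N, 3) RGB values sampled from each layer's 2D texture map
        # residuals: (L, N, 3) multiplicative residuals modeling lighting variation
        # alphas:    (L, N, 1) per-layer opacities, assumed to sum to 1 per pixel
        layer_rgb = residuals * textures         # lighting-modulated layer colors
        return (alphas * layer_rgb).sum(dim=0)   # alpha-weighted blend -> (N, 3)

    # Example: blend two layers over four pixels.
    L, N = 2, 4
    tex = torch.rand(L, N, 3)
    res = torch.ones(L, N, 3)                          # identity residual: lighting unchanged
    alpha = torch.softmax(torch.rand(L, N, 1), dim=0)  # normalize opacities across layers
    rgb = composite_layers(tex, res, alpha)            # -> (4, 3) reconstructed pixel colors

Because every frame samples the same shared texture map, a single edit to that texture would, under this scheme, propagate to all frames, which is the consistency property the abstracts highlight.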
Table of Contents:
List of Tables 3
List of Figures 4
Abstract (Chinese) 7
Abstract 8
1 Introduction 9
2 Related Work 12
3 Approach 14
3.1 Layer-Based Video Decomposition 14
3.1.1 Layer hierarchy 14
3.2 Texture Mapping 15
3.3 Multiplicative-Residual Estimator 16
3.4 Network Architecture 17
3.4.1 Mapping network 17
3.4.2 Texture network and multiplicative-residual estimator 17
3.4.3 Alpha network 18
3.4.4 Pixel color reconstruction 18
3.5 Loss Terms 18
3.5.1 Reconstruction loss 18
3.5.2 Sparsity loss 19
3.5.3 Optical flow loss 19
3.5.4 Alpha bootstrapping loss 20
3.5.5 Residual consistency loss 20
3.5.6 Residual regularization 21
3.5.7 Alpha regularization 21
3.6 Hash Grid Encoding 22
4 Experiments 23
4.1 Qualitative Results 23
4.2 Comparison with Previous Work 25
4.3 Representation of Multiple Foreground Objects 27
4.4 Consistent Video Editing 29
4.5 Quantitative Results for Editing Consistency 29
4.6 Manipulating Camera Motion 30
4.7 Ablations 32
4.7.1 Choice of residual type 32
4.7.2 Sparsity loss 32
4.7.3 Optical flow loss 34
4.7.4 Residual consistency loss 34
5 Conclusion and Future Work 37
A More Experimental Results 38
A.1 Ablation of Hash Grid 38
A.2 Ablation of Residual Type 38
B Implementation Details 40
Bibliography 42