Author (Chinese): 李侑霖
Author (English): Li, Yu-Lin
Title (Chinese): 透過採樣重要點及再利用淺層特徵圖增進影片物件分割
Title (English): Enhancing Video Object Segmentation by Sampling Important Points and Reusing Fine-grained Feature
Advisor (Chinese): 胡敏君
Advisor (English): Hu, Min-Chun
Committee Members (Chinese): 朱宏國, 姚智原
Committee Members (English): Chu, Hung-Kuo; Yao, Chih-Yuan
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 108062610
Publication Year (ROC): 110 (2021)
Graduation Academic Year: 109
Language: English
Number of Pages: 30
Keywords (Chinese): 影片物件分割, 多重前景, 半監督式
Keywords (English): video object segmentation, multi-foreground, semi-supervised
Statistics:
  • Recommendations: 0
  • Views: 124
  • Rating: *****
  • Downloads: 0
  • Favorites: 0
This thesis addresses semi-supervised video object segmentation (VOS): given the segmentation mask(s) of the target object(s) in the first frame of a video, the goal is to produce segmentation masks of the specified target object(s) for all remaining frames. Most recent neural-network-based VOS methods concentrate on improving the encoder design, while their decoders are often too coarse to make good use of the information collected on the encoder side. We therefore adopt the PointRend decoder architecture, originally designed for semantic image segmentation, which during upsampling selects specific points to be re-predicted; unlike the uniform sampling of ordinary convolution, this strategy focuses on points with higher uncertainty. In addition, we refer back to the readily available feature maps in the encoder to recover the color information carried by the shallow layers of the network, thereby avoiding the information loss caused by deep networks. For the encoder we use CFBI (Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration), an architecture that requires no pre-training on simulated data and focuses on distinguishing foreground from background features. Because CFBI requires no pre-training, it better reflects our research goal of showing that the added color information helps preserve mask integrity. By combining the CFBI architecture with the PointRend decoder, we alleviate the problem of fragmented target masks. Evaluated on the DAVIS2017-val dataset without pre-training, our method obtains a J&F score of about 82.5%, outperforming existing state-of-the-art methods; on the DAVIS2017-test and YouTube-VOS datasets it obtains scores of 74.2% and 79.4%, respectively, close to the performance of current state-of-the-art methods.
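
To make the point-sampling idea above concrete, the following is a minimal PyTorch-style sketch of uncertainty-based point selection and per-point re-prediction over a reused shallow feature map, in the spirit of PointRend. The names coarse_logits, fine_features, and point_head are illustrative assumptions (point_head would be a small per-point MLP, e.g. 1x1 convolutions applied along the point dimension), not the thesis's actual implementation.

import torch
import torch.nn.functional as F

def sample_uncertain_points(coarse_logits, num_points):
    # coarse_logits: (B, 1, H, W) low-resolution foreground logits.
    # For a binary mask, logits near 0 (probability near 0.5) are the
    # most uncertain, so points are ranked by -|logit| instead of being
    # visited uniformly as in ordinary convolutional upsampling.
    B, _, H, W = coarse_logits.shape
    scores = -coarse_logits.abs().view(B, H * W)
    _, idx = scores.topk(num_points, dim=1)
    # Convert flat indices to (x, y) coordinates normalized to [0, 1].
    xs = (idx % W).float() / max(W - 1, 1)
    ys = torch.div(idx, W, rounding_mode="floor").float() / max(H - 1, 1)
    return torch.stack([xs, ys], dim=2)           # (B, num_points, 2)

def repredict_points(coarse_logits, fine_features, coords, point_head):
    # grid_sample expects coordinates in [-1, 1].
    grid = coords.unsqueeze(2) * 2.0 - 1.0        # (B, N, 1, 2)
    # Reuse the shallow (fine-grained) encoder feature map so that the
    # point head sees the low-level color/edge cues that deep layers lose.
    fine = F.grid_sample(fine_features, grid, align_corners=True)
    coarse = F.grid_sample(coarse_logits, grid, align_corners=True)
    point_feats = torch.cat([fine, coarse], dim=1).squeeze(3)  # (B, C+1, N)
    return point_head(point_feats)                # refined per-point logits

At inference time, the refined logits replace the values at the sampled locations of the bilinearly upsampled coarse mask, and the procedure can be repeated at successively finer resolutions.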
This thesis targets the problem of semi-supervised video object segmentation. To be more precise, given the ground-truth mask(s) of the target object(s) in the first frame, the model outputs the segmentation masks of the target object(s) for the remaining frames. Most existing methods focus on modifying the encoding architecture but use a simple decoder that does not make good use of the information aggregated by the encoder. This work therefore applies PointRend to improve the generated object masks. Unlike a common deconvolution module, PointRend samples the specific points that should be re-predicted during upsampling and can maintain mask integrity by reusing color information. Moreover, the proposed decoder refers to the feature maps in the encoder so that the RGB information in the shallow feature maps helps to avoid the information loss that occurs in deep neural networks. We choose CFBI (Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration) as the encoder. CFBI focuses on learning foreground and background differences and does not rely on an additional pre-training process on simulated data. By combining CFBI and the PointRend decoder, we aim to produce better results that avoid fragmented segmentation. The evaluation is conducted on three public datasets, and no additional datasets are used for pre-training. On the DAVIS2017-val dataset, our method outperforms the state-of-the-art methods and achieves a score of 82.5% in terms of J&F. On the DAVIS2017-test and YouTube-VOS datasets, our method obtains scores of 74.2% and 79.4%, respectively, which are comparable to the state-of-the-art methods.
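
For reference, the J&F score reported above is the DAVIS-style average of the region similarity J (the Jaccard index, i.e. the IoU of the predicted and ground-truth masks) and the boundary F-measure F. A minimal sketch of the J term and of the final averaging is given below; computing F additionally requires contour extraction and matching, which is omitted here.

import numpy as np

def jaccard(pred, gt):
    # pred, gt: boolean (H, W) masks for one frame and one object.
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def j_and_f(mean_j, mean_f):
    # Overall benchmark score: the average of the J mean and the F mean
    # taken over all annotated objects and sequences.
    return (mean_j + mean_f) / 2.0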
Abstract (in Chinese) i
Abstract ii
Table of Contents iii
1 Introduction 1
1.1 Research Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related Work 5
2.1 Semi-Supervised Video Object Segmentation . . . . . . . . . . . . . . . . . . 5
2.1.1 Online fine-tuning based learning . . . . . . . . . . . . . . . . . . . 5
2.1.2 Matching based learning . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 PointRend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Method 10
3.1 Overview of the Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Encoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Decoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Training/Inference Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Experiment 15
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Training and Implementation Details . . . . . . . . . . . . . . . . . . . . . . . 18
4.4 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4.1 Comparison with SOTA VOS methods . . . . . . . . . . . . . . . . . 19
4.4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5.1 The Effect of Instance-Level Attention Map . . . . . . . . . . . . . . . 23
4.5.2 Influence of the point number sampled in PointRend . . . . . . . . . . 24
5 Conclusion 26
(The full text will be available to external users after 2026/08/16.)