
Detailed Record

Author (Chinese): 陳嘉臨
Author (English): Chen, Jia Lin
Title (Chinese): 基於人臉片段學習特徵表示之影片人臉分群
Title (English): Learning Representations from Face Tracklets for Video Face Clustering
Advisor (Chinese): 林嘉文
Advisor (English): Lin, Chia Wen
Committee Members: 張正尚, 莊永裕, 孫民
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 103061620
Year of Publication (ROC calendar): 106 (2017)
Graduating Academic Year: 105
Language: English
Pages: 36
Keywords (Chinese): 學習特徵表示、深度學習、人臉分群
Keywords (English): learning representations, deep learning, face clustering
Good feature representations play a vital role in many computer vision problems, especially those that require distinguishing subtle differences between data, such as fine-grained categorization and face clustering. In recent years, deep learning methods have demonstrated their effectiveness at learning good image representations, but they require very large amounts of annotated data to train the network model. However, collecting massive amounts of data and annotating them with class labels is costly. We therefore aim to exploit constraints that can be obtained directly from video to jointly learn a convolutional neural network (CNN) model and cluster faces.
In this thesis, we propose an unsupervised deep learning method for face clustering in movies that requires only face tracklets and co-occurrences as the pairwise labels needed to pre-train a Siamese network. After pre-training, face clusters are merged and the network model is fine-tuned iteratively. The core idea of our method is that good representations produce good clustering results, the clustering results provide new information for fine-tuning the CNN model, and the fine-tuned CNN model in turn extracts better representations. Following the proposed framework, we not only learn a CNN model that produces good representations without any manually annotated labels, but also improve the face clustering results at the same time. Although this solves the problem of obtaining good representations, face clustering still faces another challenge: data imbalance. In a movie, the lead actor and actress appear far more often than the other characters. To address this problem, we set a stop-loss point for training the model.
Experimental results confirm that the proposed framework can be used to learn a CNN model and obtain better deep representations.

Keywords: learning representations, deep learning, face clustering
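The pairwise supervision described in the abstract — faces within one tracklet share an identity, while faces from co-occurring tracklets must belong to different people — can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the function name and data layout are hypothetical:

```python
from itertools import combinations

def pairwise_labels(tracklets, co_occurring):
    """Derive (face_a, face_b, label) training pairs from face tracklets.

    tracklets: dict mapping tracklet id -> list of face ids.
    co_occurring: list of (t1, t2) tracklet-id pairs that appear in the
        same frames, and hence must show different people.
    Label 1 = same identity, 0 = different identity.
    """
    pairs = []
    # Faces within one tracklet share an identity: positive pairs.
    for faces in tracklets.values():
        pairs.extend((a, b, 1) for a, b in combinations(faces, 2))
    # Faces from co-occurring tracklets differ: negative pairs.
    for t1, t2 in co_occurring:
        pairs.extend((a, b, 0) for a in tracklets[t1] for b in tracklets[t2])
    return pairs
```

Such pairs would then drive a contrastive objective when pre-training the Siamese network, with no manual annotation involved.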
Good representations play an important role in many computer vision tasks, especially when we must distinguish subtle differences between data, as in fine-grained categorization and face clustering. In recent years, deep learning has demonstrated its effectiveness at learning good image representations, but it needs a dataset that is large enough and annotated with corresponding labels to train the network. However, collecting massive labeled data is costly, so we seek to make proper use of constraints obtainable from video to train a convolutional neural network (CNN) model and cluster faces jointly.
In this thesis, we propose an unsupervised deep learning approach for face clustering in movies, using only face tracklets and co-occurrences as pairwise labels to pre-train the CNN model with a Siamese network structure. After that, we merge the clusters and fine-tune the model iteratively. The core idea of our framework is that good representations lead to good clustering results, the clustering results provide extra information to fine-tune the CNN model, and the fine-tuned CNN model can then generate better representations. Following our framework, we not only learn good representations with a CNN model without any manually annotated labels, but also refine the face clustering results. Although our approach solves the problem of obtaining good features, one problem remains in movie face clustering: the imbalance of the dataset. The number of faces of the lead actor and actress is far greater than that of the other characters. To address this issue, we set a stop-loss point for training the model.
Our experiments demonstrate that we can obtain better deep representations with our framework.
Keywords: learning representations, deep learning, face clustering
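The alternation between cluster merging and representation refinement, bounded by the stop-loss point, might be sketched like this. It is a simplified stand-in under stated assumptions: `fine_tune` is a hypothetical callable representing re-extraction of features from the fine-tuned CNN, and plain Euclidean centroids replace deep features:

```python
def centroid(cluster, features):
    """Mean feature vector of the faces in one cluster."""
    dim = len(next(iter(features.values())))
    return [sum(features[f][k] for f in cluster) / len(cluster)
            for k in range(dim)]

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def iterative_face_clustering(features, fine_tune, n_target, max_rounds):
    """Alternate merging the closest cluster pair with model fine-tuning.

    features: dict face_id -> feature vector (list of floats).
    fine_tune: callable(clusters, features) -> updated features, standing
        in for re-extracting CNN features after a fine-tuning pass.
    max_rounds plays the role of the stop-loss point: iteration halts
    after this many rounds regardless of the remaining cluster count.
    """
    clusters = [[face] for face in features]   # start: one face per cluster
    for _ in range(max_rounds):
        if len(clusters) <= n_target:
            break
        # Find the two clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i], features),
                             centroid(clusters[j], features))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)            # merge the closest pair
        features = fine_tune(clusters, features)  # refine representations
    return clusters
```

With an identity `fine_tune`, this reduces to ordinary agglomerative clustering; the thesis's contribution lies in making that callable actually update the CNN from the current clusters.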
Chinese Abstract ii
Abstract iii
Content iv
Chapter 1 Introduction 6
1.1 Background and Motivation 6
1.2 Research Objective 7
1.3 Thesis Organization 8
Chapter 2 Related Work 9
2.1 Purely Data-driven Face Clustering 9
2.2 Face Clustering with Prior Knowledge 10
2.3 Clustering with Deep Features 11
Chapter 3 Proposed Method 13
3.1 Overview 13
3.2 Pre-training the Convolutional Neural Network 13
3.3 Merging Process 15
3.4 Sampling Data 16
3.5 Fine-tuning the Convolutional Neural Network 18
3.6 Stop-loss Point 18
Chapter 4 Experiments and Discussion 20
4.1 Dataset and Settings 20
4.2 Experimental Results 21
4.2.1 The Numerical Result of Our Approach 22
4.2.2 Evaluation of Fine-tuning Stage 24
4.2.3 Evaluation of the Sampling Procedure 27
4.2.4 Clustering Results 29
4.2.5 The Analysis of Validation Set 31
Chapter 5 Conclusion 33
References 34
(Full text of this thesis is not authorized for public access)