
Detailed Record

Author (Chinese): 林孟翰
Author (English): Lin, Meng-Han
Title (Chinese): 利用多聲學專家模型改進多模態電影場景分割
Title (English): Improving Multimodal Movie Scene Segmentation Using Mixture of Acoustic Experts
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee Members (Chinese): 林彥宇、陳奕廷
Committee Members (English): Lin, Yen-Yu; Chen, Yi-Ting
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Electrical Engineering (電機工程學系)
Student ID: 108061529
Publication Year (ROC): 111 (2022)
Graduation Academic Year: 110
Language: English
Pages: 38
Keywords (Chinese): 電影場景分割、多專家模型、多模態注意力機制、聲訊
Keywords (English): Movie Scene Segmentation, Mixture of Experts, Multimodal Attention, Audio
Statistics:
  • Recommendations: 0
  • Views: 571
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese, translated): The rapid growth of multimedia audio-visual content in recent years has driven the development of a wide range of multimedia technologies, all of which require segmenting long videos into shorter, semantically meaningful clips as a pre-processing step. In a movie, the scene is the most basic semantic unit, and scene segmentation is therefore a key pre-processing step for these technologies. Earlier work on scene segmentation extracted low-level visual features from frames or shots, clustered them, and imposed temporal constraints so that the clustering results respected the temporal structure of scenes. More recent studies instead train scene-recognition models on multimodal semantic features. Among these features, audio representations are usually extracted with general-purpose pretrained models that do not account for the diverse semantic aspects of movie audio. In addition, the complementary properties and complex interactions between audio and video for scene segmentation are often overlooked. In this work, we propose a mixture of acoustic experts (MOAE) framework that integrates acoustic and multimodal expert models to improve scene segmentation. The acoustic experts are trained on different semantic aspects of audio, including features of speakers, environmental sounds, and other audio events. MOAE optimizes the weights assigned to each expert model and achieves a 61.89% F1-score on a movie database with 1,110 scene boundaries. We visualize the weights that MOAE assigns to the experts and illustrate with examples how the complementary properties among the experts improve scene segmentation.
Abstract (English): The recent growth of multimedia content has accelerated the development of multimedia computing technology. Such technology requires segmenting long videos, such as movies, into semantically meaningful clips as a pre-processing step. Scenes are the basic semantic units of a movie. Past studies clustered shots into scenes based on low-level visual representations, with additional temporal constraints and alignment mechanisms. Recently, researchers have used multimodal semantic features to represent the complex semantic aspects of a movie. However, the semantically meaningful aspects of audio are often ignored, because acoustic representations are usually extracted with a universally pretrained model. Moreover, the interactions between the audio and visual modalities remain under-explored. In this work, we introduce a mixture of acoustic experts (MOAE) framework that integrates acoustic and multimodal expert models learned from different acoustic semantics, including speaker and environmental-sound characteristics, alongside visual representations. The mixer network in the MOAE determines the importance of the different modalities and aspects by optimizing the weights assigned to each expert model. Our framework achieves a state-of-the-art 61.89% F1-score for scene segmentation on a database with 1,110 scenes. We also visualize the assigned weights together with corresponding sampled images and clips; these visualizations reveal complementary properties between the visual and acoustic modalities that lead to improved scene segmentation.
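The abstract describes a mixer network that combines the boundary predictions of several expert models by assigning each expert a weight. As a rough illustration only — not the authors' implementation; the class name, dimensions, and parameters below are all hypothetical — the weighting step of such a mixture-of-experts can be sketched as:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

class MixtureOfExperts:
    """Illustrative mixer: weights per-expert scene-boundary scores."""

    def __init__(self, n_experts, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical mixer parameters mapping shot features to expert logits;
        # in a trained system these would be learned, not random.
        self.W = rng.normal(scale=0.1, size=(feat_dim, n_experts))

    def forward(self, shot_features, expert_scores):
        # shot_features: (feat_dim,) representation of the current shot.
        # expert_scores: (n_experts,) each expert's probability that the
        #                shot ends a scene (a boundary prediction).
        weights = softmax(shot_features @ self.W)  # importance of each expert
        return float(weights @ expert_scores)      # weighted boundary score

# Example: combine a speaker expert, an environmental-sound expert, and a
# visual expert for one shot.
moe = MixtureOfExperts(n_experts=3, feat_dim=8)
boundary_score = moe.forward(np.zeros(8), np.array([0.2, 0.8, 0.5]))
# With an all-zero feature vector the mixer logits are zero, so the weights
# are uniform and the combined score is the mean of the expert scores (0.5).
```

Because the weights are a softmax, the combined score always lies between the lowest and highest expert scores; the mixer can only re-balance the experts, which is what makes the per-shot weight visualizations described in the abstract interpretable.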
Table of Contents
摘要 (Abstract in Chinese)
Abstract
誌謝 (Acknowledgements)
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation and Proposal
Chapter 2 Database and Methodology
  2.1 Database and Scene Annotation
  2.2 Multimodal Shot Representations
    2.2.1 Visual Shot Representations
    2.2.2 Acoustic Shot Representations
    2.2.3 Feature Encoding
  2.3 Framework
    2.3.1 Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)
    2.3.2 Acoustic Expert and Segment Prediction
    2.3.3 Multimodal Expert and Segment Prediction
    2.3.4 Mixer Network and Constraints
Chapter 3 Experiment
  3.1 Experimental Setup
    3.1.1 Network Configurations
Chapter 4 Results and Analysis
  4.1 Comparison of Unimodal and Multimodal Experts
  4.2 Comparison of Ensemble Approaches
  4.3 Weight Analysis
Chapter 5 Discussions
Chapter 6 Conclusions
References
Appendix
  Notations in Demo Clips
  Clips Analysis

Related Theses

1. An Automated Scoring System for Couple Interaction Behavior Scales in Marital Therapy Based on Stacked Sparse Autoencoders Using Speech Features
2. Stroke Prediction Based on National Health Insurance Data, Using Hadoop as a Fast Feature-Extraction Tool
3. A New Framework for Full-Time Emotion Recognition Models Built on Human Thin-Slice Emotion Perception
4. Constructing an Automated Speech Scoring System for Principal Candidates Using Multi-Task and Multimodal Fusion Techniques
5. Building an Automated Scoring System for Principal-Candidate Evaluation via Multimodal Active Learning and Analysis of Sample-Label Relations
6. Improving Speech Emotion Recognition by Incorporating fMRI BOLD Signals
7. Improving Speech Emotion Recognition Using Multi-Level Convolutional Neural Network Features Combined with fMRI
8. Developing a Behavior-Based Assessment System for Children with Autism Using an Embodied Conversational Interface
9. A Multimodal Continuous Emotion Recognition System and Its Application to Global Affect Recognition
10. Integrating Multi-Level Text Representations and Speech-Attribute Embeddings for Robust Automated Scoring of Principal-Candidate Speeches
11. Using Joint Factor Analysis to Study Temporal Effects in Brain MRI and Improve Emotion Recognition
12. An LSTM-Based Assessment System for Identifying Children with Autism from ADOS Interviews
13. Automated Pain-Level Detection for Emergency-Department Patients Using a Multimodal Model Combining CNN and LSTM Audio-Visual Features
14. Improving Automated Behavior Scoring in Marital Therapy with a Bidirectional LSTM Combining Multiple Temporal Granularities of Text
15. Improving Emotion Recognition on a Chinese Theater Performance Corpus Using Interaction Features from Performance Transcripts
 