
Detailed Record

Author (Chinese): 林孟翰
Author (English): Lin, Meng-Han
Title (Chinese): 利用多聲學專家模型改進多模態電影場景分割
Title (English): Improving Multimodal Movie Scene Segmentation Using Mixture of Acoustic Experts
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee Members (Chinese): 林彥宇、陳奕廷
Committee Members (English): Lin, Yen-Yu; Chen, Yi-Ting
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Electrical Engineering (電機工程學系)
Student ID: 108061529
Publication Year (ROC): 111 (2022)
Graduation Academic Year: 110
Language: English
Pages: 38
Keywords (Chinese): 電影場景分割、多專家模型、多模態注意力機制、聲訊
Keywords (English): Movie Scene Segmentation, Mixture of Experts, Multimodal Attention, Audio
Statistics:
  • Recommendations: 0
  • Views: 571
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese, translated): The rapid growth of multimedia audio-visual content in recent years has driven the development of a wide range of multimedia technologies, all of which require segmenting long videos into shorter, semantically meaningful clips as a pre-processing step. In a movie, the scene is the most basic semantic unit, and scene segmentation is therefore a key pre-processing step for these technologies. Earlier work on scene segmentation extracted low-level visual features from frames or shots, clustered them, and imposed temporal constraints so that the clustering results respected the temporal structure of scenes. More recent studies instead train scene-recognition models on multimodal semantic features. Among these features, audio representations are usually extracted with general-purpose pretrained models that do not account for the diverse semantic aspects of movie audio. In addition, the complementary properties and complex interactions between audio and video for scene segmentation are often overlooked. In this work, we propose a mixture of acoustic experts (MOAE) framework that integrates acoustic and multimodal expert models to improve scene segmentation. The acoustic experts are trained on different semantic aspects of audio, including features of speakers, environmental sounds, and other audio events. MOAE optimizes the weights assigned to each expert model and achieves a 61.89% F1-score on a movie database with 1,110 scene boundaries. We visualize the weights that MOAE assigns to the experts and illustrate with examples how the complementary properties among the experts improve scene segmentation.
Abstract (English): The recent growth of multimedia content has accelerated the development of multimedia computing technology. Such technology requires segmenting long videos, such as movies, into semantically meaningful clips as a pre-processing step. Scenes are the basic semantic units of a movie. Past studies clustered shots into scenes based on low-level visual representations, with additional temporal constraints and alignment mechanisms. Recently, researchers have used multimodal semantic features to represent the complex semantic aspects of a movie. However, the semantically meaningful aspects of audio are often ignored, because acoustic representations are usually extracted with a universally pretrained model. Moreover, the interactions between the audio and visual modalities remain under-explored. In this work, we introduce a mixture of acoustic experts (MOAE) framework that integrates acoustic and multimodal expert models learned from different acoustic semantics, including speaker and environmental-sound characteristics, alongside visual representations. The mixer network in the MOAE determines the importance of the different modalities and aspects by optimizing the weights assigned to each expert model. Our framework achieves a state-of-the-art 61.89% F1-score for scene segmentation on a database with 1,110 scenes. We also visualize the assigned weights together with corresponding sampled images and clips; these visualizations reveal complementary properties between the visual and acoustic modalities that lead to improved scene segmentation.
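The abstract describes a mixer network that combines the boundary predictions of several expert models by assigning each expert a weight. As a rough illustration only — not the authors' implementation; the class name, dimensions, and parameters below are all hypothetical — the weighting step of such a mixture-of-experts can be sketched as:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

class MixtureOfExperts:
    """Illustrative mixer: weights per-expert scene-boundary scores."""

    def __init__(self, n_experts, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical mixer parameters mapping shot features to expert logits;
        # in a trained system these would be learned, not random.
        self.W = rng.normal(scale=0.1, size=(feat_dim, n_experts))

    def forward(self, shot_features, expert_scores):
        # shot_features: (feat_dim,) representation of the current shot.
        # expert_scores: (n_experts,) each expert's probability that the
        #                shot ends a scene (a boundary prediction).
        weights = softmax(shot_features @ self.W)  # importance of each expert
        return float(weights @ expert_scores)      # weighted boundary score

# Example: combine a speaker expert, an environmental-sound expert, and a
# visual expert for one shot.
moe = MixtureOfExperts(n_experts=3, feat_dim=8)
boundary_score = moe.forward(np.zeros(8), np.array([0.2, 0.8, 0.5]))
# With an all-zero feature vector the mixer logits are zero, so the weights
# are uniform and the combined score is the mean of the expert scores (0.5).
```

Because the weights are a softmax, the combined score always lies between the lowest and highest expert scores; the mixer can only re-balance the experts, which is what makes the per-shot weight visualizations described in the abstract interpretable.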
Table of Contents
摘要 (Abstract in Chinese)
Abstract
誌謝 (Acknowledgements)
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation and Proposal
Chapter 2 Database and Methodology
  2.1 Database and Scene Annotation
  2.2 Multimodal Shot Representations
    2.2.1 Visual Shot Representations
    2.2.2 Acoustic Shot Representations
    2.2.3 Feature Encoding
  2.3 Framework
    2.3.1 Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)
    2.3.2 Acoustic Expert and Segment Prediction
    2.3.3 Multimodal Expert and Segment Prediction
    2.3.4 Mixer Network and Constraints
Chapter 3 Experiment
  3.1 Experimental Setup
    3.1.1 Network Configurations
Chapter 4 Results and Analysis
  4.1 Comparison of Unimodal and Multimodal Experts
  4.2 Comparison of Ensemble Approaches
  4.3 Weight Analysis
Chapter 5 Discussions
Chapter 6 Conclusions
References
Appendix
  Notations in Demo Clips
  Clips Analysis

Related Theses

1. An Automated Scoring System for Couple Interaction Behavior Scales in Marital Therapy Based on Stacked Sparse Autoencoders Using Speech Features
2. Stroke Prediction Based on National Health Insurance Data, Using Hadoop as a Fast Feature-Extraction Tool
3. A New Framework for Full-Time Emotion Recognition Models Built on Human Thin-Slice Emotion Perception
4. Constructing an Automated Speech Scoring System for Principal Candidates Using Multi-Task and Multimodal Fusion Techniques
5. Building an Automated Scoring System for Principal-Candidate Evaluation via Multimodal Active Learning and Analysis of Sample-Label Relations
6. Improving Speech Emotion Recognition by Incorporating fMRI BOLD Signals
7. Improving Speech Emotion Recognition Using Multi-Level Convolutional Neural Network Features Combined with fMRI
8. Developing a Behavior-Based Assessment System for Children with Autism Using an Embodied Conversational Interface
9. A Multimodal Continuous Emotion Recognition System and Its Application to Global Affect Recognition
10. Integrating Multi-Level Text Representations and Speech-Attribute Embeddings for Robust Automated Scoring of Principal-Candidate Speeches
11. Using Joint Factor Analysis to Study Temporal Effects in Brain MRI and Improve Emotion Recognition
12. An LSTM-Based Assessment System for Identifying Children with Autism from ADOS Interviews
13. Automated Pain-Level Detection for Emergency-Department Patients Using a Multimodal Model Combining CNN and LSTM Audio-Visual Features
14. Improving Automated Behavior Scoring in Marital Therapy with a Bidirectional LSTM Combining Multiple Temporal Granularities of Text
15. Improving Emotion Recognition on a Chinese Theater Performance Corpus Using Interaction Features from Performance Transcripts
 