
Detailed Record

Author (Chinese): 呂紹豪
Author (English): Lu, Shao-Hao
Title (Chinese): 轉換偵測解碼器於語者預測
Title (English): Dialogical speaker decoder with transition detection for next speaker prediction
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee members (Chinese): 冀泰石, 陳冠宇, 曹昱
Committee members (English): Chi, Tai-Shih; Chen, Kuan-Yu; Tsao, Yu
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061708
Year of publication (ROC calendar): 112 (2023)
Graduation academic year: 111
Language: Chinese
Number of pages: 33
Keywords (Chinese): 語者預測, 轉換偵測
Keywords (English): next speaker prediction, transition detection
Abstract (Chinese):
In small-group interaction and human-machine interaction, next speaker prediction and turn-change prediction are critical tasks. The fluent conversation and interaction of everyday life require us to resolve three questions at once: who is speaking now, who the next speaker will be, and when the next speaker will take the turn. The subtle behavioral differences that humans pick up on during conversation are very difficult for machine learning to capture. Many researchers have used different behavioral features as cues for predicting the next speaker, such as gaze direction, speaking prosody, and body posture or gesture. In this work, I propose a model that jointly considers each speaker's past speaking information and whether a change of speaking state is about to occur. It uses the speakers' talk tendency and gaze direction to predict who the next speaker will be, and then combines past speaking history with turn-transition detection to refine the prediction. The model achieves a UAR of 78.11%, outperforming the winning model of the MultiMediate challenge 2021 by 3.41%.
Abstract (English):
Next speaker prediction and turn-change prediction are two important tasks in group interaction and human-agent interaction. To hold a fluent and understandable conversation, we need to coordinate three questions: who is currently speaking, who will speak next, and when the next speaker should start to speak. Many researchers have therefore studied the subtle human behaviors that accompany these interactions; behaviors such as gaze direction, speaking prosody, and gestures have been used as turn-taking cues for models that predict the next speaker. In this work, I propose a decoder-based model, the dialogical speaker decoder (DSD), for next speaker prediction. The DSD combines speaker behavior features such as talk tendency and gaze pattern, each speaker's past speaking history, and a speaking-state transition detection model with time awareness and behavior divergence. It achieves a UAR of 78.11% on next speaker prediction, a 3.41% improvement over the champion model of the MultiMediate challenge 2021.
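To make the decoding idea above concrete, here is a minimal sketch, not the thesis's actual implementation: it assumes a base model that scores each participant as the next speaker and a separate transition detector that estimates whether the turn will change, and it only switches away from the current speaker when a transition is likely. The function names, the 0.5 threshold, and the toy inputs are illustrative assumptions; UAR is the unweighted (macro-averaged) recall used as the evaluation metric above.

```python
import numpy as np

def predict_next_speaker(base_scores, current_speaker, p_transition, threshold=0.5):
    """Combine base next-speaker scores with a turn-transition probability.

    base_scores: per-participant next-speaker scores from the base model.
    current_speaker: index of the participant currently holding the turn.
    p_transition: probability (from a transition detector) that the turn will change.
    """
    if p_transition < threshold:
        # No turn change expected: the current speaker keeps the floor.
        return current_speaker
    # Turn change expected: pick the most likely participant other than the current speaker.
    scores = base_scores.copy()
    scores[current_speaker] = -np.inf
    return int(np.argmax(scores))

def unweighted_average_recall(y_true, y_pred):
    """UAR = recall computed per class, then averaged with equal class weights."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy usage: four participants, participant 1 currently speaking, transition likely.
pred = predict_next_speaker(np.array([0.1, 0.6, 0.2, 0.1]), current_speaker=1, p_transition=0.8)
print(pred)                                                   # 2, best non-current participant
print(unweighted_average_recall([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.833...
```

In the thesis, the decoder additionally weights these decisions with the speakers' past speaking history; this sketch only captures the base-plus-transition combination.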
Table of Contents

Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
2 Data Corpus
  2.1 Annotation method
    2.1.1 Eye contact
    2.1.2 Next speaker
3 Behavior Analysis
  3.1 Talk Tendency
  3.2 Gaze Behavior
    3.2.1 Speaker gaze pattern and turn-changing
    3.2.2 Listener gaze pattern and turn-changing
    3.2.3 Timing structure of eye contact and turn-changing
4 Method
  4.1 Task definition
  4.2 Feature extraction
    4.2.1 Active speaker detection model
    4.2.2 Gaze model
  4.3 Dialogical Speaker Decoder
    4.3.1 Base prediction model
    4.3.2 Transition model
    4.3.3 Speaking assignment process
    4.3.4 Decoding
5 Results
  5.1 Dialogical speaker decoder performance
  5.2 Base model performance
  5.3 Transition model performance
    5.3.1 Transition model analysis
6 Conclusion
References
 
 
 
 

Related Theses

1. 透過語音特徵建構基於堆疊稀疏自編碼器演算法之婚姻治療中夫妻互動行為量表自動化評分系統
2. 基於健保資料預測中風之研究並以Hadoop作為一種快速擷取特徵工具
3. 一個利用人類Thin-Slice情緒感知特性所建構而成之全時情緒辨識模型新框架
4. 應用多任務與多模態融合技術於候用校長演講自動評分系統之建構
5. 基於多模態主動式學習法進行樣本與標記之間的關係分析於候用校長評鑑之自動化評分系統建置
6. 透過結合fMRI大腦血氧濃度相依訊號以改善語音情緒辨識系統
7. 結合fMRI之迴旋積類神經網路多層次特徵 用以改善語音情緒辨識系統
8. 針對實體化交談介面開發基於行為衡量方法於自閉症小孩之評估系統
9. 一個多模態連續情緒辨識系統與其應用於全域情感辨識之研究
10. 整合文本多層次表達與嵌入演講屬性之表徵學習於強健候用校長演講自動化評分系統
11. 利用聯合因素分析研究大腦磁振神經影像之時間效應以改善情緒辨識系統
12. 利用LSTM演算法基於自閉症診斷觀察量表訪談建置辨識自閉症小孩之評估系統
13. 利用多模態模型混合CNN和LSTM影音特徵以自動化偵測急診病患疼痛程度
14. 以雙向長短期記憶網路架構混和多時間粒度文字模態改善婚姻治療自動化行為評分系統
15. 透過表演逐字稿之互動特徵以改善中文戲劇表演資料庫情緒辨識系統
 