
Detailed Record

Author (Chinese): 呂紹豪
Author (English): Lu, Shao-Hao
Title (Chinese): 轉換偵測解碼器於語者預測
Title (English): Dialogical speaker decoder with transition detection for next speaker prediction
Advisor (Chinese): 李祈均
Advisor (English): Lee, Chi-Chun
Committee members (Chinese): 冀泰石, 陳冠宇, 曹昱
Committee members (English): Chi, Tai-Shih; Chen, Kuan-Yu; Tsao, Yu
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 108061708
Year of publication (ROC calendar): 112 (2023)
Graduation academic year: 111
Language: Chinese
Number of pages: 33
Keywords (Chinese): 語者預測, 轉換偵測
Keywords (English): next speaker prediction, transition detection
Abstract (Chinese):
In small-group interaction and human-machine interaction, next speaker prediction and turn-change prediction are critical tasks. The fluent conversation and interaction of everyday life require us to resolve three questions at once: who is speaking now, who the next speaker will be, and when the next speaker will take the turn. The subtle behavioral differences that humans pick up on during conversation are very difficult for machine learning to capture. Many researchers have used different behavioral features as cues for predicting the next speaker, such as gaze direction, speaking prosody, and body posture or gesture. In this work, I propose a model that jointly considers each speaker's past speaking information and whether a change of speaking state is about to occur. It uses the speakers' talk tendency and gaze direction to predict who the next speaker will be, and then combines past speaking history with turn-transition detection to refine the prediction. The model achieves a UAR of 78.11%, outperforming the winning model of the MultiMediate challenge 2021 by 3.41%.
Abstract (English):
Next speaker prediction and turn-change prediction are two important tasks in group interaction and human-agent interaction. To hold a fluent and understandable conversation, we need to coordinate three questions: who is currently speaking, who will speak next, and when the next speaker should start to speak. Many researchers have therefore studied the subtle human behaviors that accompany these interactions; behaviors such as gaze direction, speaking prosody, and gestures have been used as turn-taking cues for models that predict the next speaker. In this work, I propose a decoder-based model, the dialogical speaker decoder (DSD), for next speaker prediction. The DSD combines speaker behavior features such as talk tendency and gaze pattern, each speaker's past speaking history, and a speaking-state transition detection model with time awareness and behavior divergence. It achieves a UAR of 78.11% on next speaker prediction, a 3.41% improvement over the champion model of the MultiMediate challenge 2021.
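To make the decoding idea above concrete, here is a minimal sketch, not the thesis's actual implementation: it assumes a base model that scores each participant as the next speaker and a separate transition detector that estimates whether the turn will change, and it only switches away from the current speaker when a transition is likely. The function names, the 0.5 threshold, and the toy inputs are illustrative assumptions; UAR is the unweighted (macro-averaged) recall used as the evaluation metric above.

```python
import numpy as np

def predict_next_speaker(base_scores, current_speaker, p_transition, threshold=0.5):
    """Combine base next-speaker scores with a turn-transition probability.

    base_scores: per-participant next-speaker scores from the base model.
    current_speaker: index of the participant currently holding the turn.
    p_transition: probability (from a transition detector) that the turn will change.
    """
    if p_transition < threshold:
        # No turn change expected: the current speaker keeps the floor.
        return current_speaker
    # Turn change expected: pick the most likely participant other than the current speaker.
    scores = base_scores.copy()
    scores[current_speaker] = -np.inf
    return int(np.argmax(scores))

def unweighted_average_recall(y_true, y_pred):
    """UAR = recall computed per class, then averaged with equal class weights."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy usage: four participants, participant 1 currently speaking, transition likely.
pred = predict_next_speaker(np.array([0.1, 0.6, 0.2, 0.1]), current_speaker=1, p_transition=0.8)
print(pred)                                                   # 2, best non-current participant
print(unweighted_average_recall([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.833...
```

In the thesis, the decoder additionally weights these decisions with the speakers' past speaking history; this sketch only captures the base-plus-transition combination.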
Table of Contents

Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
2 Data Corpus
  2.1 Annotation method
    2.1.1 Eye contact
    2.1.2 Next speaker
3 Behavior Analysis
  3.1 Talk Tendency
  3.2 Gaze Behavior
    3.2.1 Speaker gaze pattern and turn-changing
    3.2.2 Listener gaze pattern and turn-changing
    3.2.3 Timing structure of eye contact and turn-changing
4 Method
  4.1 Task definition
  4.2 Feature extraction
    4.2.1 Active speaker detection model
    4.2.2 Gaze model
  4.3 Dialogical Speaker Decoder
    4.3.1 Base prediction model
    4.3.2 Transition model
    4.3.3 Speaking assignment process
    4.3.4 Decoding
5 Results
  5.1 Dialogical speaker decoder performance
  5.2 Base model performance
  5.3 Transition model performance
    5.3.1 Transition model analysis
6 Conclusion
References
 
 
 
 

Related Theses

1. 透過語音特徵建構基於堆疊稀疏自編碼器演算法之婚姻治療中夫妻互動行為量表自動化評分系統
2. 基於健保資料預測中風之研究並以Hadoop作為一種快速擷取特徵工具
3. 一個利用人類Thin-Slice情緒感知特性所建構而成之全時情緒辨識模型新框架
4. 應用多任務與多模態融合技術於候用校長演講自動評分系統之建構
5. 基於多模態主動式學習法進行樣本與標記之間的關係分析於候用校長評鑑之自動化評分系統建置
6. 透過結合fMRI大腦血氧濃度相依訊號以改善語音情緒辨識系統
7. 結合fMRI之迴旋積類神經網路多層次特徵 用以改善語音情緒辨識系統
8. 針對實體化交談介面開發基於行為衡量方法於自閉症小孩之評估系統
9. 一個多模態連續情緒辨識系統與其應用於全域情感辨識之研究
10. 整合文本多層次表達與嵌入演講屬性之表徵學習於強健候用校長演講自動化評分系統
11. 利用聯合因素分析研究大腦磁振神經影像之時間效應以改善情緒辨識系統
12. 利用LSTM演算法基於自閉症診斷觀察量表訪談建置辨識自閉症小孩之評估系統
13. 利用多模態模型混合CNN和LSTM影音特徵以自動化偵測急診病患疼痛程度
14. 以雙向長短期記憶網路架構混和多時間粒度文字模態改善婚姻治療自動化行為評分系統
15. 透過表演逐字稿之互動特徵以改善中文戲劇表演資料庫情緒辨識系統
 