|
[1] B. O’Conaill, S. Whittaker, and S. Wilbur, “Conversations over video conferences: An evaluation of the spoken aspects of video-mediated communication,” Human-computer interaction, vol. 8, no. 4, pp. 389–428, 1993. [2] L. Mondada, “Multimodal resources for turn-taking: Pointing and the emergence of possible next speakers,” Discourse studies, vol. 9, no. 2, pp. 194–225, 2007. [3] S. C. Levinson and F. Torreira, “Timing in turn-taking and its implications for processing models of language,” Frontiers in Psychology, vol. 6, 2015. [4] G. Skantze, “Turn-taking in conversational systems and human-robot interaction: a review,” Computer Speech & Language, vol. 67, p. 101178, 2021. [5] Z. Degutyte and A. Astell, “The role of eye gaze in regulating turn taking in conversations: a systematized review of methods and findings,” Frontiers in Psychology, vol. 12, p. 616471, 2021. [6] V. Srinivasan and R. Murphy, “A survey of social gaze,” in Proceedings of the 6th International Conference on Human-Robot Interaction, HRI ’11, (New York, NY, USA), p. 253– 254, Association for Computing Machinery, 2011. [7] E. Calisgan, A. Haddadi, H. Van der Loos, J. A. Alcazar, and E. Croft, “Identifying nonverbal cues for automated human-robot turn-taking,” in 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, pp. 418–423, 2012. [8] U. Malik, J. Saunier, K. Funakoshi, and A. Pauchet, “Who speaks next? turn change and next speaker prediction in multimodal multiparty interaction,” in 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 349–354, 2020. [9] J.-P. Noel, M. A. De Niear, N. S. Lazzara, and M. T. Wallace, “Uncoupling between multisensory temporal function and nonverbal turn-taking in autism spectrum disorder,” IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 4, pp. 973–982, 2018. [10] K. Jokinen, K. Harada, M. Nishida, and S. Yamamoto, “Turn-alignment using eye-gaze and speech in conversational interaction,” in Eleventh Annual Conference of the International Speech Communication Association, 2010. [11] R. Ishii, S. Kumano, and K. Otsuka, “Predicting next speaker based on head movement in multi-party meetings,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2319–2323, IEEE, 2015. [12] T. Kawahara, T. Iwatate, and K. Takanashi, “Prediction of turn-taking by combining prosodic and eye-gaze information in poster conversations,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012. [13] J. Yang, P. Wang, Y. Zhu, M. Feng, M. Chen, and X. He, “Gated multimodal fusion with contrastive learning for turn-taking prediction in human-robot dialogue,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7747–7751, IEEE, 2022. [14] Y. Liang and Q. Zhou, “Detect turn-takings in subtitle streams with semantic recall transformer encoder,” in 2020 International Conference on Asian Language Processing (IALP), pp. 1–6, IEEE, 2020. [15] I. De Kok and D. Heylen, “Multimodal end-of-turn prediction in multi-party meetings,” in Proceedings of the 2009 international conference on Multimodal interfaces, pp. 91–98, 2009. [16] P. Müller, D. Schiller, D. Thomas, G. Zhang, M. Dietz, P. Gebhard, E. André, and A. Bulling, “Multimediate: Multi-modal group behaviour analysis for artificial mediation,” in Proc. ACM Multimedia (MM), pp. 4878–4882, 2021. [17] P. Müller, M. X. Huang, and A. Bulling, “Detecting low rapport during natural interactions in small groups from non-verbal behaviour,” in 23rd International Conference on Intelligent User Interfaces, pp. 153–164, 2018. [18] R. Ishii, K. Otsuka, S. Kumano, and J. Yamato, “Prediction of who will be the next speaker and when using gaze behavior in multiparty meetings,” ACM Trans. Interact. Intell. Syst., vol. 6, may 2016. [19] R. Tao, Z. Pan, R. K. Das, X. Qian, M. Z. Shou, and H. Li, “Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 3927–3935, 2021. [20] T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” arXiv preprint arXiv:1804.04121, 2018. [21] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018. [22] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE transactions on pattern analysis and machine intelligence, 2018. [23] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. [24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018. [25] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker recognition,” arXiv preprint arXiv:2003.11982, 2020. [26] C. Birmingham, K. Stefanov, and M. J. Mataric, “Group-level focus of visual attention for improved next speaker prediction,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 4838–4842, 2021. [27] S.-L. Yeh, Y.-S. Lin, and C.-C. Lee, “A dialogical emotion decoder for speech emotion recognition in spoken dialog,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6479–6483, IEEE, 2020. [28] T. Pham, T. Tran, D. Phung, and S. Venkatesh, “Deepcare: A deep dynamic memory model for predictive medicine,” in Pacific-Asia conference on knowledge discovery and data mining, pp. 30–41, Springer, 2016. [29] I. M. Baytas, C. Xiao, X. Zhang, F. Wang, A. K. Jain, and J. Zhou, “Patient subtyping via time-aware lstm networks,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 65–74, 2017. [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017. [31] D. M. Blei and P. I. Frazier, “Distance dependent chinese restaurant processes.,” Journal of Machine Learning Research, vol. 12, no. 8, 2011. [32] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their applications, vol. 13, no. 4, pp. 18–28, 1998. |