[1] M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, pp. 56–76, 2020.
[2] D. Yu and L. Deng, Automatic Speech Recognition. Springer, 2016.
[3] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791–5795, IEEE, 2019.
[4] D. Braga, A. M. Madureira, L. Coelho, and R. Ajith, “Automatic detection of Parkinson's disease based on acoustic analysis of speech,” Engineering Applications of Artificial Intelligence, vol. 77, pp. 148–158, 2019.
[5] L. Tóth, I. Hoffmann, G. Gosztolya, V. Vincze, G. Szatlóczki, Z. Bánréti, M. Pákáski, and J. Kálmán, “A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech,” Current Alzheimer Research, vol. 15, no. 2, pp. 130–138, 2018.
[6] B. Schuller and A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Wiley, 2013.
[7] J. L. Kröger, O. H.-M. Lutz, and P. Raschke, Privacy Implications of Voice and Speech Analysis – Information Disclosure by Inference, pp. 242–258. Cham: Springer International Publishing, 2020.
[8] R. Tatman, “Gender and dialect bias in YouTube's automatic captions,” in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 53–59, 2017.
[9] M. Sap, D. Card, S. Gabriel, Y. Choi, and N. A. Smith, “The risk of racial bias in hate speech detection,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1668–1678, 2019.
[10] B. M. L. Srivastava, A. Bellet, M. Tommasi, and E. Vincent, “Privacy-preserving adversarial representation learning in ASR: Reality or illusion?,” in Proc. Interspeech 2019, pp. 3700–3704, 2019.
[11] R. Aloufi, H. Haddadi, and D. Boyle, “Emotionless: Privacy-preserving speech analysis for voice assistants,” arXiv preprint arXiv:1908.03632, 2019.
[12] M. Jaiswal and E. M. Provost, “Privacy enhanced multimodal neural representations for emotion recognition,” in AAAI, 2020.
[13] M. Xia, A. Field, and Y. Tsvetkov, “Demoting racial bias in hate speech detection,” in Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media, (Online), pp. 7–14, Association for Computational Linguistics, July 2020.
[14] W.-N. Hsu, Y. Zhang, and J. Glass, “Learning latent representations for speech generation and transformation,” in Proc. Interspeech 2017, pp. 1273–1277, 2017.
[15] L. Li, D. Wang, Y. Chen, Y. Shi, Z. Tang, and T. F. Zheng, “Deep factorization for speech signal,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5094–5098, 2018.
[16] E. Creager, D. Madras, J.-H. Jacobsen, M. Weis, K. Swersky, T. Pitassi, and R. Zemel, “Flexibly fair representation learning by disentanglement,” in Proceedings of the 36th International Conference on Machine Learning (K. Chaudhuri and R. Salakhutdinov, eds.), vol. 97 of Proceedings of Machine Learning Research, pp. 1436–1445, PMLR, 09–15 Jun 2019.
[17] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, 2014.
[18] Y.-L. Huang, B.-H. Su, Y.-W. P. Hong, and C.-C. Lee, “An Attribute-Aligned Strategy for Learning Speech Representation,” in Proc. Interspeech 2021, pp. 1179–1183, 2021.
[19] M. Bancroft, R. Lotfian, J. Hansen, and C. Busso, “Exploring the intersection between speaker verification and emotion recognition,” in 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 337–342, 2019.
[20] R. Pappagari, T. Wang, J. Villalba, N. Chen, and N. Dehak, “x-vectors meet emotions: A study on dependencies between emotion and speaker recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7169–7173, IEEE, 2020.
[21] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
[22] R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,” IEEE Transactions on Affective Computing, vol. 10, pp. 471–483, October–December 2019.
[23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
[24] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, (Online), pp. 38–45, Association for Computational Linguistics, Oct. 2020.
[25] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, Jan. 2014.
[27] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International Conference on Machine Learning, pp. 1180–1189, PMLR, 2015.
[28] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, (Lisbon, Portugal), pp. 1412–1421, Association for Computational Linguistics, Sept. 2015.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[30] L. Tarantino, P. N. Garner, and A. Lazaridis, “Self-attention for speech emotion recognition,” in Interspeech, pp. 2578–2582, 2019.
[31] N. Gui, D. Ge, and Z. Hu, “AFS: An attention-based mechanism for supervised feature selection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3705–3713, 2019.
[32] B. Škrlj, S. Džeroski, N. Lavrač, and M. Petković, “Feature importance estimation with self-attention networks,” in ECAI 2020 - 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, August 29 – September 8, 2020, Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), vol. 325 of Frontiers in Artificial Intelligence and Applications, pp. 1491–1498, IOS Press, 2020.
[33] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, OpenReview.net, 2017.
[34] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in ICML, vol. 2, p. 7, 2016.
[35] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220, 2017.
[36] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.