|
|