
Detailed Record

Author (Chinese): 高聖哲
Author (English): Kao, Sheng-Che
Title (Chinese): 基於注意力機制實現知識蒸餾之特徵萃取於圖像分類
Title (English): Knowledge Distillation via Representative-based feature Extracting with Attention Mechanism for Image Classification
Advisor (Chinese): 黃之浩
Advisor (English): Huang, Chih-Hao Scott
Committee Members (Chinese): 鍾偉和, 鍾耀梁
Committee Members (English): Chung, Wei-Ho; Chung, Yao-Liang
Degree: Master's
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 109064518
Publication Year (ROC calendar): 111 (2022)
Graduation Academic Year: 110
Language: English
Number of Pages: 51
Keywords (Chinese): 知識蒸餾, 模型壓縮, 注意力機制, 遷移學習, 影像分類
Keywords (English): Knowledge Distillation, Model Compression, Attention Mechanism, Transfer Learning, Image Classification
Abstract (Chinese): In recent years, deep learning has achieved remarkable results in both academia and industry. However, when a large, well-trained model needs to be deployed on devices such as mobile phones, tablets, or embedded systems, its heavy computation and storage requirements often make it impractical to run. Knowledge distillation has therefore become one of the representative model-compression techniques and still receives considerable attention in the deep learning community. In general, knowledge distillation is a transfer-learning technique that transfers the knowledge learned by a large model to a smaller model, allowing the small model to improve its overall performance. Owing to this success, more and more knowledge distillation methods have been developed; in particular, most studies focus on how to extract knowledge from the teacher's intermediate layers. However, such approaches usually choose the distillation positions manually and ignore the correlation between teacher and student, so the student may learn useless information, which limits distillation efficiency. This problem has been widely discussed, and distillation methods based on attention mechanisms have recently been proposed; they judge the relevance between teacher and student and adjust the distillation intensity accordingly, achieving excellent performance among knowledge distillation methods. This inspired us to build the framework of this thesis on top of such an existing model.
To further improve the student's distillation ability, after obtaining the correlation between teacher and student features as described above, we propose the concept of a Representative Teacher Key (RTK). During each training step, a top-k selection is used to pick the teacher-layer positions to distill from, while the remaining, unneeded features are not distilled. The resulting Representative Attention Matrix helps improve the effect of feature extraction. Experimental results on CIFAR-10, CIFAR-100, SVHN, and CINIC-10 show that the student obtains good distillation performance in every case, and our method also stands out in comparison with other knowledge distillation methods.
Abstract (English): In recent years, applications of deep neural networks have been successful in both industry and academia. However, it is still difficult to deploy cumbersome deep models on devices because of their large storage and high computation requirements. Knowledge Distillation (KD) has therefore become one of the representative methods for training efficient networks, and it continues to receive considerable attention from the community. In general, knowledge distillation is a transfer-learning technique that transfers knowledge from a large, high-capacity teacher model to a small student network, enabling the student network to achieve better performance.
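To make the teacher-to-student transfer above concrete, the following is a minimal sketch of the classic soft-target KD loss of Hinton et al., assuming a PyTorch setting; the function name, temperature, and weighting are illustrative assumptions rather than the exact loss used in this thesis.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic response-based KD: temperature-softened KL term plus hard-label CE term."""
    # Soften both output distributions with temperature T and match them with KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale by T^2 so the soft-target gradients keep a comparable magnitude
    # Ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In practice the teacher's logits are computed with gradients disabled, so only the student's parameters are updated by this loss.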
With the rapid success of KD, many related methods have been developed. In particular, most studies extract knowledge from the intermediate features of the teacher and the student. However, manually choosing the distillation positions in this way ignores the similarity between teacher and student, so useless knowledge may be distilled from the teacher, which limits distillation efficiency. This problem has been widely discussed, and a knowledge distillation method based on attention mechanisms was recently proposed; it identifies teacher-student similarities to control the distillation intensity and achieves excellent performance among knowledge distillation methods. Inspired by this, we take that work as a reference and construct the architecture proposed in this thesis.
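The attention-based matching idea can be sketched briefly. The snippet below only illustrates one plausible way to compute teacher-student layer similarities with scaled dot-product attention and use them to weight per-pair feature losses; the pooling, normalization, and names are assumptions and do not reproduce the exact formulation of the referenced attention-based method or of this thesis.

```python
import torch
import torch.nn.functional as F

def attention_weighted_feature_loss(student_feats, teacher_feats):
    """Sketch: weight each (student layer, teacher layer) feature-matching term
    by a derived similarity, so dissimilar pairs contribute less to distillation.

    student_feats, teacher_feats: lists of globally pooled feature vectors,
    each of shape (batch, dim), assumed already projected to a common dimension.
    """
    # Stack into (batch, num_layers, dim) and L2-normalize along the channel axis.
    q = F.normalize(torch.stack(student_feats, dim=1), dim=-1)  # student "queries"
    k = F.normalize(torch.stack(teacher_feats, dim=1), dim=-1)  # teacher "keys"

    # Scaled dot-product similarity between every student and teacher layer,
    # turned into attention weights over teacher layers for each student layer.
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, S, T)

    # Pairwise squared L2 feature distances, weighted by the attention matrix.
    dist = torch.cdist(q, k) ** 2  # (B, S, T)
    return (attn * dist).sum(dim=(1, 2)).mean()
```

The softmax row for each student layer acts as the distillation intensity over teacher layers: teacher features that are dissimilar to the student receive small weights and contribute little to the loss.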
In this thesis, we improve the distillation ability of the student network. After identifying the feature similarities between teacher and student with the method above, we propose the idea of a Representative Teacher Key (RTK). In every training step, a top-k selection is used to choose the teacher-layer positions we need, while the other, unnecessary features are not distilled. In this way, we form a new Representative Attention Matrix that helps make the distillation process more efficient. We validate our method in extensive experiments, showing that it achieves significant performance on four image classification datasets (CIFAR-10, CIFAR-100, SVHN, and CINIC-10) and that our proposed approach outperforms other state-of-the-art KD methods in accuracy.
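The RTK selection described above amounts to keeping, for each student layer, only the top-k most similar teacher keys before distilling. The sketch below is a hypothetical illustration of such a top-k mask over the attention matrix; the names and details are assumptions, not the thesis's actual implementation.

```python
import torch

def representative_attention(attn, k=2):
    """Illustrative RTK sketch: keep only the top-k teacher entries per student layer.

    attn: (batch, num_student_layers, num_teacher_layers) attention weights.
    Returns a sparsified, renormalized 'representative attention matrix' in which
    non-selected teacher positions receive zero weight and are not distilled.
    """
    topk_vals, topk_idx = attn.topk(k, dim=-1)                  # k strongest teacher keys per student layer
    mask = torch.zeros_like(attn).scatter_(-1, topk_idx, 1.0)   # 1 at the selected positions, 0 elsewhere
    masked = attn * mask
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # renormalize over the kept keys
```

The resulting sparse matrix would replace the dense attention weights in a feature-matching loss like the one sketched earlier, so only the selected teacher positions contribute to distillation.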
Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Contents iv
List of Figures vi
List of Tables viii
1. Introduction......................................................1
1.1 Motivation and Problem Description...............................1
1.2 Contribution.....................................................3
1.3 Thesis Organization..............................................3
2. Related Works.....................................................4
2.1. Knowledge Distillation..........................................4
2.1.1. Response-Based Knowledge......................................5
2.1.2. Feature-Based Knowledge.......................................7
2.1.3. Relation-Based Knowledge......................................8
2.2. Attention Mechanism............................................10
2.2.1. Scaled Dot-Product Attention.................................10
2.2.2. Positional Encoding..........................................12
2.3. SoftPool.......................................................14
2.3.1 Down-sampled Theory...........................................14
2.3.2 Feature preservation..........................................16
3. Methodology......................................................17
3.1. Attention-Based Feature Distillation...........................17
3.2. Attention Transfer.............................................19
3.3. Representative Teacher Key (RTK)...............................21
4. Experiment.......................................................26
4.1. Datasets.......................................................26
4.1.1. SVHN.........................................................26
4.1.2. CIFAR........................................................27
4.1.3. CINIC10......................................................28
4.2. Experiment Setup...............................................28
4.3. Experiment Results for Distilling Performance..................31
4.4. Visualization and Analysis.....................................36
5. Conclusion and Future Work.......................................46
Reference...........................................................47


 
 
 
 