Author (Chinese): 周家興
Author (English): Zhou, Jia-Xing
Title (Chinese): 可微分查找矩陣乘法用於壓縮Transformer網路
Title (English): Differentiable Lookup-Based Matrix Multiplication for Compressing Transformer Network
Advisor (Chinese): 林永隆
Advisor (English): Lin, Young-Long
Committee members (Chinese): 王廷基, 吳凱強
Committee members (English): Wang, Ting-Chi; Wu, Kai-Chiang
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Student ID: 110062647
Publication year: 113 (ROC calendar; 2024 CE)
Academic year of graduation: 112
Language: English
Pages: 34
Keywords (Chinese): 基於查找的矩陣乘法, 壓縮, transformer 網路
Keywords (English): lookup-based matrix multiplication, compression, transformer network
摘要 (translated from Chinese):

In recent years, researchers have pursued more efficient deep neural networks, with particular focus on reducing the computational cost of multiply-accumulate operations. Traditional strategies such as knowledge distillation, pruning, and quantization have been explored in depth. Because multiplication is energy-intensive, newer strategies such as AdderNet and ShiftCNN have emerged, aiming to replace these operations to save energy.

More recently, MADDNESS proposed an entirely new strategy that directly replaces multiply-accumulate operations with lookup-accumulate operations. Subsequent works such as PECAN and LUT-NN have followed this direction. Our work further refines LUT-NN and proposes an end-to-end training method. Results on the ImageNet dataset show that our method improves LUT-NN's baseline accuracy by up to 11%.
Abstract:

In recent years, the quest for efficient Deep Neural Networks (DNNs) has centered on reducing the computational burden of multiply-accumulate (MAC) operations. Traditional avenues such as Knowledge Distillation (KD), pruning, and quantization have been explored extensively. With the energy cost of multiplication operations being a significant concern, alternative methodologies like AdderNet and ShiftCNN have emerged, focusing on the direct substitution of operations to save energy.

Recently, a novel approach called MADDNESS took this further by entirely replacing MAC operations with lookup-accumulate (LAC) operations. Several subsequent works, including PECAN and LUT-NN, have followed suit. Our research builds on and notably improves the latest of these methods, LUT-NN, by introducing an end-to-end training procedure. Tested on the ImageNet dataset, our proposed method significantly enhances the efficiency of DNNs, improving upon the baseline LUT-NN model's accuracy by up to 11%.
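The lookup-accumulate idea behind MADDNESS and LUT-NN can be sketched in a few lines: split the input dimension into subspaces, precompute each subvector-prototype's product with the weights into a table, then replace the matrix multiply at inference with index lookups and additions. The following NumPy toy is only an illustration of this general scheme, not the thesis's implementation; the function names (`precompute_tables`, `lut_matvec`, `soft_lut_matvec`) and the plain softmax used for the differentiable variant are this sketch's own assumptions.

```python
import numpy as np

def precompute_tables(W, prototypes):
    """W: (D, N) weights; prototypes: list of C arrays of shape (K, V), C*V == D.
    Returns C tables of shape (K, N): tables[c][k] = prototypes[c][k] @ W_c."""
    D, N = W.shape
    C = len(prototypes)
    V = D // C
    return [prototypes[c] @ W[c * V:(c + 1) * V, :] for c in range(C)]

def lut_matvec(x, prototypes, tables):
    """Approximate x @ W by encoding each subvector to its nearest prototype,
    then accumulating the precomputed table rows (no multiplies at inference)."""
    C = len(prototypes)
    V = x.shape[0] // C
    y = np.zeros(tables[0].shape[1])
    for c in range(C):
        sub = x[c * V:(c + 1) * V]
        # Hard encoding: nearest prototype by squared Euclidean distance.
        k = np.argmin(((prototypes[c] - sub) ** 2).sum(axis=1))
        y += tables[c][k]  # lookup-accumulate
    return y

def soft_lut_matvec(x, prototypes, tables, temperature=1.0):
    """Soft (differentiable) encoding: a softmax over prototype distances replaces
    the argmin, so gradients can flow to prototypes and tables during training."""
    C = len(prototypes)
    V = x.shape[0] // C
    y = np.zeros(tables[0].shape[1])
    for c in range(C):
        d = ((prototypes[c] - x[c * V:(c + 1) * V]) ** 2).sum(axis=1)
        w = np.exp(-d / temperature)
        w /= w.sum()          # softmax weights over the K prototypes
        y += w @ tables[c]    # soft lookup: weighted sum of table rows
    return y
```

As the temperature shrinks, the soft encoding approaches the hard argmin; learning the temperature per layer (as the table of contents' Section 4.1.2 suggests) lets training anneal from soft to near-hard encoding.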
Table of Contents

Acknowledgements
摘要 (Chinese abstract)
Abstract
1 Introduction
2 Background
2.1 Scalar Quantization
2.2 Product Quantization
2.2.1 Lookup-Based Matrix Multiplication
2.2.2 Time and Space Complexity Analysis
2.3 Vision Transformer
2.3.1 Multi-Head Self-Attention
2.3.2 Transformer Block for Images
2.3.3 The Class Token
2.4 ResMLP
3 Related Work
3.1 MADDNESS
3.1.1 MADDNESSHASH
3.1.2 Prototype Optimization
3.2 PECAN
3.2.1 Angle-Based Similarity
3.2.2 L1 Norm-Based Similarity
3.3 LUT-NN
4 Proposed Methods
4.1 Differentiable Product Quantization
4.1.1 Differentiable Encoding
4.1.2 Learned Temperature
4.1.3 Updating Prototypes via Gradient Descent
4.2 Scalar Quantization-Aware Training at Table Level
4.3 K-Means Clustering for Initialization on More Samples
4.4 Self-Knowledge Distillation
4.4.1 Soft Distillation
4.4.2 Hard Distillation
5 Experimental Results
5.1 Layer Compression
5.2 Two Types of Knowledge Distillation
5.3 The Impact of Data Types of Table
5.4 Comparison Between Our Work and LUT-NN
5.5 Ablation Study
5.6 Evaluating Our Method's Impact on ResMLP-S12
6 Conclusion and Future Work
References
[1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition,” arXiv preprint arXiv:1409.1556, 2014.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
neural networks,” Advances in neural information processing systems, vol. 25,
2012.
[3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
with region proposal networks,” Advances in neural information processing systems,
vol. 28, 2015.
[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 3431–3440, 2015.
[5] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted
residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 4510–4520, 2018.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer,
“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,”
arXiv preprint arXiv:1602.07360, 2016.
[7] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,”
in International conference on machine learning, pp. 6105–6114, PMLR, 2019.
[8] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for
the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pp. 11976–11986, 2022.
[9] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2:
Co-designing and scaling convnets with masked autoencoders,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16133–16142,
2023.
[10] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
arXiv preprint arXiv:1503.02531, 2015.
[11] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, “What is the state of neural
network pruning?,” Proceedings of machine learning and systems, vol. 2, pp. 129–146,
2020.
[12] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, “Sparsity in deep learning:
Pruning and growth for efficient inference and training in neural networks,” The Journal
of Machine Learning Research, vol. 22, no. 1, pp. 10882–11005, 2021.
[13] V. Natesh, A. Sabot, H. Kung, and M. Ting, “Rosko: Row skipping outer products for
sparse matrix multiplication kernels,” arXiv preprint arXiv:2307.03930, 2023.
[14] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and
D. Kalenichenko, “Quantization and training of neural networks for efficient
integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2704–2713, 2018.
[15] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, “Up or down?
adaptive rounding for post-training quantization,” in International Conference on Machine
Learning, pp. 7197–7206, PMLR, 2020.
[16] Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak, “Lsq+: Improving low-bit
quantization through learnable offsets and better initialization,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 696–
697, 2020.
[17] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014
IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC),
pp. 10–14, 2014.
[18] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification
using binary convolutional neural networks,” in European conference on computer vision,
pp. 525–542, Springer, 2016.
[19] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “Addernet: Do we really
need multiplications in deep learning?,” in Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pp. 1468–1477, 2020.
[20] Y. Xu, C. Xu, X. Chen, W. Zhang, C. Xu, and Y. Wang, “Kernel based progressive distillation
for adder neural networks,” Advances in Neural Information Processing Systems,
vol. 33, pp. 12322–12333, 2020.
[21] D. A. Gudovskiy and L. Rigazio, “Shiftcnn: Generalized low-precision architecture for
inference of convolutional neural networks,” arXiv preprint arXiv:1706.02393, 2017.
[22] D. Blalock and J. Guttag, “Multiplying matrices without multiplying,” in International
Conference on Machine Learning, pp. 992–1004, PMLR, 2021.
[23] X. Tang, Y. Wang, T. Cao, L. L. Zhang, Q. Chen, D. Cai, Y. Liu, and M. Yang,
“Lut-nn: Towards unified neural network inference by table lookup,” arXiv preprint
arXiv:2302.03213, 2023.
[24] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,”
IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 1, pp. 117–
128, 2010.
[25] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,”
in Proceedings of the fifth Berkeley symposium on mathematical statistics and
probability, pp. 281–297, Oakland, CA, USA, 1967.
[26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers
for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems,
vol. 30, 2017.
[28] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint
arXiv:1606.08415, 2016.
[29] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint
arXiv:1607.06450, 2016.
[30] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence
to sequence learning,” in International conference on machine learning, pp. 1243–1252,
PMLR, 2017.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[32] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard,
A. Joulin, G. Synnaeve, J. Verbeek, et al., “Resmlp: Feedforward networks for image
classification with data-efficient training,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 45, no. 4, pp. 5314–5321, 2022.
[33] J. Ran, R. Lin, J. C. L. Li, J. Zhou, and N. Wong, “Pecan: A product-quantized content
addressable memory network,” in 2023 Design, Automation & Test in Europe Conference
& Exhibition (DATE), pp. 1–6, IEEE, 2023.
[34] T. Chen, L. Li, and Y. Sun, “Differentiable product quantization for end-to-end embedding
compression,” in International Conference on Machine Learning, pp. 1617–1626, PMLR,
2020.
[35] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft
quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the
IEEE/CVF international conference on computer vision, pp. 4852–4861, 2019.
[36] A. Fan, P. Stock, B. Graham, E. Grave, R. Gribonval, H. Jegou, and A. Joulin,
“Training with quantization noise for extreme model compression,” arXiv preprint
arXiv:2004.07320, 2020.
[37] V. Markovtsev, “Kmcuda.” https://github.com/src-d/kmcuda, 2020.
[38] Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, “Yinyang k-means: A
drop-in replacement of the classic k-means with consistent speedup,” in International
conference on machine learning, pp. 579–587, PMLR, 2015.
[39] H. Touvron, M. Cord, and H. Jégou, “Deit iii: Revenge of the vit,” in European Conference
on Computer Vision, pp. 516–533, Springer, 2022.