
Detailed Record

Author (Chinese): 陳奕君
Author (English): Chen, I-Chun
Title (Chinese): TAMP: 用於高效部署大型MoE模型的任務無關合併流程
Title (English): TAMP: Task-Agnostic Merging Pipeline for Efficient Deployment of Large MoE Models
Advisor (Chinese): 李濬屹
Advisor (English): Lee, Chun-Yi
Committee Members (Chinese): 許晏彰, 楊奕軒
Committee Members (English): Hsu, Yen-Chang; Yang, Yi-Hsuan
Degree: Master's
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 111062610
Publication Year (ROC): 113 (2024)
Graduation Academic Year: 112
Language: Chinese
Number of Pages: 35
Keywords: task-agnostic, mixture of experts, zero-shot language benchmark, model merging, model compression
Statistics:
  • Recommendations: 0
  • Views: 558
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese):
Large Language Models (LLMs) have achieved remarkable performance, but this comes with higher inference costs and latency. Mixture-of-Experts (MoE) models mitigate these issues by activating only a subset of parameters during computation, yet they face challenges such as high memory requirements, communication costs, and redundancy among experts. Existing methods that address these challenges often require extensive fine-tuning or lack generalization across architectures.
We propose TAMP (Task-Agnostic Merging Pipeline), an efficient expert-merging method that requires no retraining. TAMP optimizes the energy and serving costs of pre-trained MoE models while preserving their general capabilities on zero-shot benchmarks. Its main components are knowledge-based dominant expert selection, expert grouping based on output similarity, an enhanced ZipIt merge that incorporates router information, and a weight order preservation technique.
Experiments on models such as Qwen1.5-MoE-A2.7B and Mixtral 8x7B show that TAMP can reduce the number of experts by 25%, improve performance by 23% over the baseline, and stay within 3.59% of the original model's performance. These results highlight TAMP's potential to significantly improve the efficiency and deployability of large MoE models while keeping performance degradation within an acceptable range.
Abstract (English):
Large Language Models (LLMs) have achieved remarkable performance but come with increased inference costs and latency. Mixture of Experts (MoE) models mitigate these issues by activating only a subset of parameters during computation; however, they face challenges such as high memory requirements, communication costs, and redundancy among experts. Existing methods to address these challenges often require fine-tuning or lack generalization across architectures.
We propose TAMP (Task-Agnostic Merging Pipeline), an efficient method for merging MoE experts without retraining. TAMP optimizes pre-trained MoE models for energy and serving costs while maintaining general capabilities across zero-shot benchmarks. Key components include knowledge-based dominant expert selection, expert grouping based on output similarity, an enhanced ZipIt merge incorporating router information, and a weight order preservation technique.
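To make the grouping and merging steps above concrete, below is a minimal sketch in Python/NumPy of how experts could be grouped by output similarity around a set of dominant experts and then averaged with router-frequency weighting. The function names, the cosine-similarity threshold, and the simple weighted average are illustrative assumptions; the thesis's knowledge-based selection, ZipIt-based merge, and weight order preservation are not reproduced here.

    # Illustrative sketch only: grouping by output similarity and a
    # router-frequency-weighted merge. Not the thesis's actual implementation.
    import numpy as np

    def group_by_output_similarity(expert_outputs, dominant_ids, threshold=0.9):
        """Assign each non-dominant expert to its most similar dominant expert.

        expert_outputs: (num_experts, hidden_dim) mean expert activations
                        collected on calibration inputs.
        dominant_ids:   indices of the experts kept as group anchors.
        """
        normed = expert_outputs / np.linalg.norm(expert_outputs, axis=1, keepdims=True)
        groups = {d: [d] for d in dominant_ids}
        for e in range(expert_outputs.shape[0]):
            if e in dominant_ids:
                continue
            sims = {d: float(normed[e] @ normed[d]) for d in dominant_ids}
            best = max(sims, key=sims.get)
            if sims[best] >= threshold:  # only fold in sufficiently similar experts
                groups[best].append(e)
        return groups

    def merge_group(expert_weights, group, router_freq):
        """Average one group's weight matrices, weighted by router usage frequency."""
        freqs = np.array([router_freq[e] for e in group], dtype=np.float64)
        freqs /= freqs.sum()
        return sum(f * expert_weights[e] for f, e in zip(freqs, group))

In this sketch, experts that the router rarely selects contribute less to the merged weights, which follows the intuition of keeping dominant experts as anchors while folding redundant experts into them.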
Experiments with models like Qwen1.5-MoE-A2.7B and Mixtral 8x7B show that TAMP can reduce the number of experts by 25%, achieving a 23% improvement over the baseline and performing within 3.59% of the original model. These results highlight TAMP's potential to significantly improve the efficiency and deployability of large MoE models with minimal performance degradation.
Contents
Abstract (Chinese)
Acknowledgements (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Works
2.1 Mixtral of Experts Model
2.2 Model Merging
3 Preliminaries
3.1 Mixtral of Experts (MoE)
3.2 MC-SMoE
3.3 ZipIt Merging
3.4 Knowledge Computation
4 Methodology
4.1 Overview of TAMP
4.2 Knowledge-based Dominant Expert Selection
4.3 Experts Grouping
4.4 ZipIt Experts Merging
4.5 Fix-Dominant Expert Merging
5 Experimental Results
5.1 Experimental Setups
5.2 Performance Comparison
5.2.1 Qwen1.5-MoE-A2.7B
5.2.2 Mixtral 8x7B
5.3 Ablation Study
6 Conclusion and Future Works
Bibliography
7 Appendix
7.1 Frequency Analysis of TinyLLama-4x1.1B-MoE
7.2 Frequency Analysis of Mixtral 8x7B
7.3 Evaluation Benchmarks