Author: Hsu, Keng-Jui
Title (Chinese): UNUS: 兼具編程性、跨平台、可靠性、及計算效能的聯邦學習計算系統
Title (English): UNUS: A Reliable and Efficient Federated Learning Framework with Programmability and Portability
Advisor: Chou, Jerry
Committee members: Lee, Che-Rung; Yang, Chia-Ling
Keywords: Deep Learning; Machine Learning; Edge Computing; Federated Learning; Data Privacy; Stale Synchronous Parallel; Framework
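In recent years, machine learning has gradually become part of our daily lives, and many products use it to optimize their services. To provide a better user experience, we need to collect more data to improve model accuracy, and end devices such as smartphones, personal computers, and IoT devices are well suited to collecting it. While collecting data, however, we must protect users' data privacy, and federated learning is a machine learning setting proposed to solve this problem. With federated learning, the model is trained on these devices using locally collected data, so no raw data is transmitted over the network, which avoids direct data leakage. However, current machine learning frameworks such as TensorFlow and PyTorch are not designed for federated learning, so implementing a federated learning model requires many techniques that work around the limitations of these frameworks, which hinders federated learning research and development. To solve this problem, we propose UNUS, an application framework for federated learning. UNUS is high-performance and fully implements the underlying computation, communication, and fault-tolerance mechanisms, and it provides a high-level API that lets users quickly adopt and customize federated learning algorithms, thereby accelerating federated learning research and development. UNUS also implements Stale Synchronous Parallel, so the framework can provide both synchronous and asynchronous training mechanisms.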
In recent years, machine learning has become deeply integrated into our lives, and many products rely on it to optimize their services. Providing a better user experience requires collecting more data, and end devices such as smartphones, PCs, and IoT devices are well suited to this. However, the privacy of users' data must be protected while it is collected. Federated Learning (FL) was proposed to address this problem. With FL, models are trained on the local device using only local data. No raw data is uploaded to the server, so direct data leakage is prevented. Nevertheless, current ML/DL frameworks such as TensorFlow and PyTorch are not designed for FL. Building an FL model on top of them takes a great deal of hacking, which hinders FL research. We therefore propose UNUS, a high-performance FL framework that provides a complete infrastructure for Federated Learning. UNUS handles the loss of messages and clients during training, and its customizable API makes FL development simpler and faster. UNUS also integrates Stale Synchronous Parallel, which allows it to provide both synchronous and asynchronous training mechanisms.
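The two mechanisms named in the abstract, aggregating locally trained models without moving raw data and bounding how far clients may drift apart under Stale Synchronous Parallel, can be sketched as follows. This is a minimal illustration and not the UNUS API: the function names `federated_average` and `may_proceed`, the plain-list weight format, and the per-client clock dictionary are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Average client weight vectors, weighted by each client's sample count.

    `updates` holds (weights, num_samples) pairs; hypothetical format.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one client with data")
    dim = len(updates[0][0])
    avg = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            avg[i] += w * n / total
    return avg


def may_proceed(client: str, clocks: Dict[str, int], staleness: int) -> bool:
    """Stale-synchronous rule: a client may start its next round only if it
    is at most `staleness` rounds ahead of the slowest client."""
    return clocks[client] - min(clocks.values()) <= staleness


# Only weight vectors and sample counts leave the clients, never raw data.
global_weights = federated_average([([1.0, 2.0], 10), ([3.0, 4.0], 30)])

# Completed rounds per client (hypothetical values).
clocks = {"a": 5, "b": 3, "c": 4}
```

With a staleness bound of 0 every client must wait for the slowest one, which corresponds to fully synchronous training; larger bounds let fast clients run ahead and approach asynchronous behavior.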
1 Introduction
2 Related Work
2.1 Algorithms
2.2 Frameworks
3 Design Goal
4 System Design & Implementation
4.1 Execution Model
4.1.1 Workflow
4.1.2 Metadata
4.1.3 Modules
4.1.4 Synchronization Model
4.2 Implementation Details
4.2.1 Registration
4.2.2 Communication
4.2.3 Synchronization
4.2.4 Model Extraction
4.2.5 Fault Tolerance
4.3 API Usage
5 Case Study
5.1 Federated Averaging
5.2 Asynchronous Averaging
5.3 Customization
6 Evaluation
6.1 Paper Result Reproduce
6.1.1 Setup
6.1.2 Federated Averaging
6.1.3 Distributed Selective SGD
6.2 Staleness
6.3 Fault Tolerance
6.4 Performance Comparison with PaddleFL
7 Conclusion
