The rapid advancement of machine learning has changed how software systems are built. For instance, such systems can improve input processing by uncovering hidden patterns in input features without being explicitly programmed. At the same time, machine learning models have grown exponentially in size, largely owing to the scalability of the recently introduced Transformer architecture. Hardware resources, however, have struggled to keep pace with this growth, making large machine learning models extremely costly to deploy.

In this work, we focus on how machine learning and software systems can enhance each other. Specifically, we demonstrate how a database system can optimize transaction processing by estimating transaction latencies before execution, which reduces service level agreement (SLA) violations by 36%. Furthermore, we show how expertise in building systems can be used to substantially reduce the GPU memory consumption of large machine learning models at deployment, which allows us to run a 120 GB model on an 8 GB GPU.
1. Introduction
2. Background
2.1 Database Management System
2.1.1 Deterministic Database Management System
2.1.2 Deterministic Latency Estimator
2.1.3 Service Level Agreement
2.2 Machine Learning
2.2.1 Neural Network
2.2.2 Quantization
2.2.3 Memory Offloading
3. Enhanced Sequencer for Database Systems
3.1 Drop
3.2 Reorder
3.3 Hybrid
4. Evaluation – ML for Database System
4.1 Performance on SLA Violations
4.2 System Throughput
5. Reducing Memory Usage for Large Transformer Models
5.1 Low Rank Decomposition
5.2 Layer Offloading
6. Evaluation – System for ML
6.1 GPU Memory Usage
6.2 Inference Latency
6.3 IO vs Compute
6.4 Post Training Quantization
6.5 Quantized Inference Latency
6.6 Deployment Cost Analysis
7. Conclusions and Future Works
8. References
