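Author (Chinese): 蕭宇彤
Author (English): Hsiao, Yu-Tung
Thesis title (Chinese): Oracle: 預測並優化複雜查詢計算流程工作的深度學習模型
Thesis title (English): Oracle: A Deep Learning Model for Predicting and Optimizing Complex Query Workflows
Advisor (Chinese): 周志遠
Advisor (English): Chou, Chi-Yuan
Committee members (Chinese): 金仲達; 李哲榮
Committee members (English): King, Chung-Ta; Lee, Che-Rung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 105062539
Year of publication (ROC calendar): 107 (2018)
Academic year of graduation: 106
Language: English
Number of pages: 37
Keywords (Chinese): Hive 查詢; DAG 執行計劃; 數據分析; 深度學習; 預測; 優化
Keywords (English): Hive query; DAG execution plan; Data analytic; Deep learning; Prediction; Optimization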
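Hive is a widely used open-source data warehouse framework built on the Hadoop execution engine and distributed storage system. Hive uses SQL statements to simplify the development of big data analytics applications. As Hive performance optimization attracts more attention, execution time estimation has become an important factor in tuning parameter settings. However, because the execution time of a Hive query is affected by hundreds of configuration settings, which in turn lead to different execution plans and job behaviors, performance prediction becomes more challenging.

In this thesis, we propose Oracle, a data-driven solution based on deep learning techniques, to build a two-step prediction model for estimating the running time of Hive queries and to substantially reduce job execution time through optimization. In the design, we use a recurrent neural network (RNN) to account for the dependencies within DAG workflows. In the experiments, we conduct an in-depth evaluation on an in-house cluster using TPC-H benchmark queries of varying complexity. The results show that Oracle achieves an error rate of 5.8% and outperforms three other comparison methods. In addition, based on the prediction model, we can achieve a 40% performance improvement without any modification to the architecture.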
Hive is a widely used open-source framework for data warehouse systems. Built on the Hadoop execution engine and distributed storage systems, Hive adopts high-level SQL statements to simplify the development of big data analytics applications. As more attention has been drawn to optimizing the performance of Hive, performance estimation has come to play an important role in finding appropriate parameters. However, since the execution time of Hive queries is affected by hundreds of configurations, which result in different execution plans and job behaviors, performance prediction becomes more challenging.

In this thesis, we propose a time prediction model for optimizing the execution of Hive. The proposed model, called Oracle, is a data-driven solution based on deep learning techniques. It employs recurrent neural networks (RNNs) to account for dependencies between stages in a DAG workflow. We implemented Oracle and evaluated it intensively on TPC-H benchmark queries of varying complexity running on an in-house cluster. The experimental results show that Oracle achieves a 5.8% error rate and outperforms three other comparison approaches. Based on the prediction models, we can achieve about a 40% performance improvement without any modification to the architecture.
Table of contents
Introduction
Background
System overview
Data collection
Performance prediction
Optimization
Experiment Setup
Experiments
Related work
Conclusion
Reference
[1] Apache. Apache spark. [Online]. Available: https://spark.apache.org.
[2] Apache. Hbase. [Online]. Available: https://hbase.apache.org.
[3] Apache. Hive. [Online]. Available: https://hive.apache.org.
[4] Apache. Hiveql. [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
[5] Apache. Mapreduce history server rest api. [Online]. Available: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html.
[6] Chen, C. O., Zhuo, Y. Q., Yeh, C. C., Lin, C. M., and Liao, S. W. Machine learning-based configuration parameter tuning on hadoop system. In 2015 IEEE International Congress on Big Data (June 2015), pp. 386–392.
[7] Dokeroglu, T., Cınar, M. S., Sert, S. A., Cosar, A., and Yazıcı, A. Improving Hadoop Hive Query Response Times Through Efficient Virtual Resource Allocation. Springer International Publishing, Cham, 2016, pp. 215–225.
[8] Dokeroglu, T., Ozal, S., Bayir, M. A., Cinar, M. S., and Cosar, A. Improving the performance of hadoop hive by sharing scan and computation tasks. Journal of Cloud Computing 3, 1 (Jul 2014), 12.
[9] Gandhi, A., Thota, S., Dube, P., Kochut, A., and Zhang, L. Autoscaling for hadoop clusters. In 2016 IEEE International Conference on Cloud Engineering (IC2E) (April 2016), pp. 109–118.
[10] Google. Tensorflow. [Online]. Available: https://www.tensorflow.org.
[11] Haryono, G. P., and Zhou, Y. Profiling apache hive query from run time logs. 2016 International Conference on Big Data and Smart Computing (BigComp) (2016), 61–68.
[12] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770–778.
[13] Hortonworks. hive-testbench. [Online]. Available: https://github.com/hortonworks/hive-testbench.
[14] Hu, S., Liu, W., Rabl, T., Huang, S., Liang, Y., Xiao, Z., Jacobsen, H. A., Pei, X., and Wang, J. Dualtable: A hybrid storage model for update optimization in hive. In 2015 IEEE 31st International Conference on Data Engineering (April 2015), pp. 1340–1351.
[15] Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E. N., O’Malley, O., Pandey, J., Yuan, Y., Lee, R., and Zhang, X. Major technical advancements in apache hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2014), SIGMOD ’14, ACM, pp. 1235–1246.
[16] Kadirvel, S., and Fortes, J. A. B. Grey-box approach for performance prediction in map-reduce based platforms. 2012 21st International Conference on Computer Communications and Networks (ICCCN) (2012), 1–9.
[17] Lama, P., and Zhou, X. Aroma: Automated resource allocation and configuration of mapreduce environment in the cloud. In Proceedings of the 9th International Conference on Autonomic Computing (New York, NY, USA, 2012), ICAC ’12, ACM, pp. 63–72.
[18] Lee, R., Luo, T., Huai, Y., Wang, F., He, Y., and Zhang, X. Ysmart: Yet another sql-to-mapreduce translator. In 2011 31st International Conference on Distributed Computing Systems (June 2011), pp. 25–36.
[19] Ng, J. Y., Hausknecht, M. J., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. Beyond short snippets: Deep networks for video classification. CoRR abs/1503.08909 (2015).
[20] Reinsel, D., Gantz, J., and Rydning, J. Data age 2025: The evolution of data to life-critical. Don’t focus on big data; focus on the data that’s big. IDC White Paper, sponsored by Seagate (2017).
[21] Sak, H., Senior, A. W., Rao, K., and Beaufays, F. Fast and accurate recurrent neural network acoustic models for speech recognition. CoRR abs/1507.06947 (2015).
[22] Sangroya, A., and Singhal, R. Performance assurance model for hiveql on large data volume. In 2015 IEEE 22nd International Conference on High Performance Computing Workshops (Dec 2015), pp. 26–33.
[23] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature 529 (2016), 484–503.
[24] Verma, A., Cherkasova, L., and Campbell, R. H. Aria: Automatic resource inference and allocation for mapreduce environments. In Proceedings of the 8th ACM International Conference on Autonomic Computing (New York, NY, USA, 2011), ICAC ’11, ACM, pp. 235–244.
[25] Wang, G., Butt, A. R., Pandey, P., and Gupta, K. A simulation approach to evaluating design decisions in mapreduce setups. In 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (Sept 2009), pp. 1–11.
[26] Wang, K., Bian, Z., Chen, Q., Wang, R., and Xu, G. Simulating hive cluster for deployment planning, evaluation and optimization. In 2014 IEEE 6th International Conference on Cloud Computing Technology and Science (Dec 2014), pp. 475–482.
[27] Wang, Y., Xu, Y., Liu, Y., Chen, J., and Hu, S. Qmapper for smart grid: Migrating sql-based application to hive. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2015), SIGMOD ’15, ACM, pp. 647–658.
[28] Wu, G., Greathouse, J. L., Lyashevsky, A., Jayasena, N., and Chiou, D. Gpgpu performance and power estimation using machine learning. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (Feb 2015), pp. 564–576.
[29] Zhang, J., and Zong, C. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems 30, 5 (Sept 2015), 16–25.
[30] Zhang, Z., Cherkasova, L., and Loo, B. T. Autotune: Optimizing execution concurrency and resource usage in mapreduce workflows. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13) (San Jose, CA, 2013), USENIX, pp. 175–181.
 
 
 
 