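Author (Chinese): 吳姿嫻
Author (English): Wu, Tzu-Hsien
Thesis title (Chinese): 透過區塊索引技術的實作、分析和優化減少搜尋資料所需的時間與空間需求
Thesis title (English): Reduce Space and Time Requirements for Searching Large Data Files Through the Implementation, Analysis and Optimization of Block Index Technique
Advisor (Chinese): 周志遠
Advisor (English): Chou, Jerry
Oral defense committee (Chinese): 金仲達
李哲榮
Oral defense committee (English): King, Chung-Ta
Lee, Che-Rung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 104062571
Year of publication (ROC calendar): 106 (2017)
Graduation academic year: 105
Language: English
Number of pages: 44
Keywords (Chinese): 科學資料; 索引; I/O系統; 模型化; 效能分析
Keywords (English): Scientific data; Indexing; I/O system; Modeling; Performance analysis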
Scientific experiments, observations, and large-scale simulations generate massive amounts of data, with dataset sizes typically ranging from hundreds of gigabytes to tens of petabytes. Indexing has therefore become an essential tool that lets scientists directly access the most relevant data records instead of sifting through the whole dataset. In recent years, many data management tools and techniques have been developed to accelerate data access, including ADIOS, SciDB, and FastBit. However, the time and space required to build and store their indexes are often too expensive. In this thesis, we propose a lightweight indexing technique called "block index", which exploits the I/O characteristics of storage systems to significantly reduce index size and index building time without sacrificing query performance. After investigating the challenges and benefits of the block index technique, we further develop three optimization techniques to improve query performance. All of these techniques are driven by our extensive effort in characterizing and modeling real scientific datasets and HPC I/O systems. As a result, our optimizations improve query performance by up to a factor of 2.3 compared with the original block index implementation.
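The abstract does not spell out how a block index is laid out. As a rough illustration only, the sketch below assumes a common min/max formulation: the data is split into fixed-size blocks, the index keeps one (min, max) pair per block, and a range query reads only the blocks whose range overlaps the query. The function names, the block size, and the sample data are hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple


def build_block_index(values: List[float], block_size: int) -> List[Tuple[float, float]]:
    """Return one (min, max) summary per fixed-size block of `values`.

    Assumption: a block index stores only these per-block summaries.
    """
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def range_query(
    values: List[float],
    index: List[Tuple[float, float]],
    block_size: int,
    lo: float,
    hi: float,
) -> Tuple[List[int], int]:
    """Return positions of values in [lo, hi] and the number of blocks read."""
    hits = []
    blocks_read = 0
    for b, (bmin, bmax) in enumerate(index):
        if bmax < lo or bmin > hi:
            continue  # the block's range misses the query, so skip reading it
        blocks_read += 1
        start = b * block_size
        # Scan the candidate block; in a real system this is the disk read.
        for offset, v in enumerate(values[start:start + block_size]):
            if lo <= v <= hi:
                hits.append(start + offset)
    return hits, blocks_read


# Hypothetical sample data: three blocks of four values each.
data = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 5.0, 6.0, 20.0, 21.0]
index = build_block_index(data, block_size=4)
hits, blocks_read = range_query(data, index, 4, 10.5, 12.5)
print(index, hits, blocks_read)
```

Under this assumption the index holds one pair per block rather than one entry per record, which is why it stays small and cheap to build. In the sample run the third block is read even though it contains no match, because its wide value range overlaps the query; such reads are the price of the coarser index.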
Contents
1 Introduction
2 Block Index Overview
2.1 Basic Design
2.2 Challenges and Proposed Solutions
3 Block Index Performance Modeling
4 Block Index Optimization
4.1 Merge Read
4.2 Adaptive Dynamic Schedule
4.3 Partial Sort
5 Datasets Characteristics Study
5.1 Datasets
5.2 Data locality
5.3 Data variance
6 Experimental Evaluations
6.1 Index Performance
6.2 Query Performance
6.3 Merge Read Optimization
6.4 Adaptive Dynamic Schedule Optimization
6.5 Partial Sort Optimization
6.6 Overall Optimized Performance
6.7 Performance on ADIOS
7 Related work
8 Conclusion