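Author (Chinese): 邱宣德
Author (English): Chiu, Hsuan Te
Thesis title (Chinese): 科學資料之內存運算查詢系統
Thesis title (English): In-memory query system for scientific datasets
Advisor (Chinese): 周志遠
Advisor (English): Chou, Jerry
Oral defense committee (Chinese): 李哲榮, 蕭宏璋
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 102062540
Year of publication (ROC calendar): 104 (2015)
Academic year of graduation: 103
Language: English
Number of pages: 52
Keywords (Chinese): indexing, scientific data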
Keywords (English): in-situ computing, query-driven analysis, indexing, scientific data, distributed shared memory
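As computing power keeps growing and data volumes keep rising, limited I/O bandwidth cannot scale proportionally. The widening performance gap between the two has turned the traditional post-simulation data processing method into a performance bottleneck. In-situ computing and query-driven data analysis are therefore important techniques for shortening the data movement path. We implemented an indexing system that combines bitmap indexing, spatial data reorganization, distributed shared memory, and location-aware parallel execution, and evaluated it on a NERSC supercomputer as a real environment, using two real scientific simulation datasets. The results show that, compared with traditional query systems that rely on parallel storage file systems, our system achieves a speedup of more than 10x.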
The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has made the traditional post-simulation data processing method a bottleneck. Hence in-situ computing and query-driven data analysis are important techniques for minimizing data movement. By taking advantage of the growing memory capacity of supercomputers, we developed an in-memory query system for scientific data analysis. Our approach combines bitmap indexing, spatial data layout reorganization, distributed shared memory, and location-aware parallel execution. Our evaluations on a NERSC supercomputer using two real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750 GB simulation dataset without transferring data to remote nodes or storage systems. Compared with traditional solutions based on out-of-core parallel file systems, we achieve a speedup of more than 10x. Therefore, our system can support interactive queries and serve as a vehicle for steering simulations.
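To make the bitmap indexing idea in the abstract concrete, here is a minimal single-node sketch of a binned bitmap index answering a range query. It is not the thesis's implementation: the function names `build_bitmap_index` and `range_query`, the cut points, and the use of uncompressed NumPy boolean arrays are all assumptions chosen for illustration. The point it shows is that bins fully inside the query range are answered from the index alone, while only the edge bins need a check against the raw values.

```python
import numpy as np

def build_bitmap_index(values, cut_points):
    """Build one boolean bitmap per bin (hypothetical helper, not from the thesis).

    Bin b covers [edges[b], edges[b + 1]); the first and last bins are open-ended.
    """
    cut_points = np.asarray(cut_points, dtype=float)
    edges = np.concatenate(([-np.inf], cut_points, [np.inf]))
    bin_ids = np.digitize(values, cut_points)  # bin number of every element
    bitmaps = [bin_ids == b for b in range(len(edges) - 1)]
    return edges, bitmaps

def range_query(values, edges, bitmaps, low, high):
    """Return the sorted indices i with low <= values[i] < high."""
    hits = np.zeros(len(values), dtype=bool)
    for b, bitmap in enumerate(bitmaps):
        lo_edge, hi_edge = edges[b], edges[b + 1]
        if hi_edge <= low or lo_edge >= high:
            continue  # bin lies entirely outside the query range
        if low <= lo_edge and hi_edge <= high:
            hits |= bitmap  # bin fully covered: the index alone answers it
        else:
            # Edge bin: the bitmap only gives candidates, so check raw values.
            hits |= bitmap & (values >= low) & (values < high)
    return np.flatnonzero(hits)

values = np.array([0.5, 2.5, 7.0, 3.2, 9.9, 4.0])
edges, bitmaps = build_bitmap_index(values, cut_points=[2.0, 4.0, 6.0, 8.0])
print(range_query(values, edges, bitmaps, low=3.0, high=8.0).tolist())  # [2, 3, 5]
```

In a distributed setting each node would presumably hold the bitmaps for its own partition and run the same query locally, but that part is beyond what this sketch covers.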
Contents
1 Introduction
2 Related Work
2.1 Array-based database systems
2.2 Query and indexing techniques
2.3 In-memory & parallel processing
3 System Overview
3.1 Design Principle
3.2 Data Abstraction
3.3 API & Use Case Example
4 DSM Storage Layer
4.1 Data Partition
4.2 Data Replication
4.3 Swap Management
5 Variable Creation
6 Variable Transformation
7 Query & Indexing
7.1 Spatial Indexing
7.2 Spatial Query
8 Experimental Evaluation
8.1 Experimental Setup
8.2 Range Query Indexing & Query
8.3 Spatial Indexing & Query
8.4 Data Caching & Processing
8.5 Compared with SciDB
9 Conclusion