
Detailed Record

Author (Chinese): 余修丞
Author (English): Yu, Hsiu-Cheng
Title (Chinese): 以雲端平台實作瘦長QR分解及其應用
Title (English): Implementations of TSQR for Cloud Platforms and Its Applications of SSVD and Collaborative Filtering
Advisor (Chinese): 李哲榮
Advisor (English): Lee, Che-Rung
Committee Members: 周志遠, 劉炳傳
Degree: Master's
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 101062526
Publication Year (ROC): 103 (2014)
Graduation Academic Year: 103
Language: English
Number of Pages: 70
Keywords (Chinese): QR decomposition, TSQR decomposition, SVD decomposition, Stochastic SVD decomposition, Collaborative filtering, Distributed computation
Keywords (English): QR, TSQR, SVD, Stochastic SVD, Collaborative filtering, Distributed computation, MapReduce, Apache Hadoop, Apache Spark
Abstract (Chinese):
Scalable algorithm implementations ensure that computational performance can grow with the number of machines, which is one of the key factors in big-data processing performance. Today, the number of machines and the amount of storage can grow in step with the data being stored; without scalable algorithms, however, adding more machines can actually slow down data processing.
In this thesis, we explore and improve the QR decomposition algorithm and implement it for tall-and-skinny matrices on cloud platforms. Our implementation is based on the TSQR (Tall-and-Skinny QR) algorithm proposed by Demmel et al., which has been shown to be optimal in communication cost for factoring tall-and-skinny matrices. Our analysis indicates, however, that the performance of MapReduce is heavily dominated by disk I/O. We therefore implement TSQR on Apache Spark, a programming model that processes data in memory in distributed computing environments.
We apply our TSQR implementation to SSVD-based Collaborative Filtering. The collaborative filtering kernel is widely used in e-commerce, for example in Amazon product recommendations, Google Ads, and Facebook friend suggestions. Compared with other algorithms for the same purpose, SSVD-based collaborative filtering offers superior performance and accuracy, but the QR decomposition inside its SSVD (Stochastic SVD) step is a bottleneck. Our experiments show that our Spark implementation of TSQR performs better, improving performance by up to 400% over the Hadoop MapReduce implementation on several benchmarks.
Abstract (English):
Scalability of algorithms and implementations, which ensures that computational efficiency can be sustained as machines are added, is one of the most crucial performance factors in big data processing. Nowadays, the number of machines and the amount of storage can easily be extended to match the growth of data. However, without scalable algorithms, adding more machines can even slow down data processing.
In this thesis, we investigated and improved the scalability of the algorithms and implementations of the QR decomposition for tall-and-skinny matrices on cloud platforms. Our algorithm is based on the TSQR (Tall-and-Skinny QR) algorithm, proposed by Demmel et al., which has been shown to be optimal in communication cost for QR-decomposing tall-and-skinny matrices. However, our analysis shows that disk I/O dominates the overall performance of the MapReduce implementation. Therefore, we implemented TSQR using Apache Spark, an in-memory processing programming model for distributed computing environments.
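To illustrate the TSQR idea described above (a rough single-process sketch, not the thesis's actual distributed code): the tall matrix is split into row blocks, each block is QR-factored locally (the "map" stage), and the small stacked R factors are QR-factored again to combine them (the "reduce" stage). The function and block count below are illustrative choices, not names from the thesis.

```python
import numpy as np

def tsqr(A, num_blocks=4):
    """One reduction step of TSQR: QR each row block locally,
    then QR the stacked R factors to combine them."""
    blocks = np.array_split(A, num_blocks, axis=0)
    # Local QR of each tall block (the "map" stage).
    qs, rs = zip(*[np.linalg.qr(b) for b in blocks])
    # QR of the stacked small R factors (the "reduce" stage);
    # only these n x n blocks need to be communicated.
    Q2, R = np.linalg.qr(np.vstack(rs))
    # Recover the full Q by applying the rows of Q2 to each local Q.
    offsets = np.cumsum([0] + [r.shape[0] for r in rs])
    Q = np.vstack([q @ Q2[offsets[i]:offsets[i + 1]]
                   for i, q in enumerate(qs)])
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))   # tall-and-skinny, full column rank
Q, R = tsqr(A)
assert np.allclose(Q @ R, A)         # valid QR factorization
assert np.allclose(np.triu(R), R)    # R is upper triangular
```

The point of the two-level structure is that only the small n-by-n R factors move between workers, which is what makes the algorithm communication-avoiding; iterating or nesting the reduce stage gives the tree-shaped variants.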
We applied our TSQR implementation to SSVD-based Collaborative Filtering (CF). CF is a computational kernel commonly used in e-commerce, such as Amazon recommendations, Google Ads, and Facebook friend suggestions. SSVD-based CF has superior performance and accuracy compared to existing methods. However, it has a performance bottleneck in the QR decomposition of the SSVD (Stochastic SVD) step. Experiments show that our implementation of TSQR in Spark is more efficient than that in Hadoop MapReduce, and the overall performance of TSQR can be improved by up to 400% on several benchmarks.
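The QR bottleneck mentioned above arises because the randomly projected matrix in SSVD is tall and skinny, which is exactly where TSQR applies. A schematic single-machine sketch of randomized SVD in the style of Halko et al. (function name and oversampling default are illustrative, not the thesis's distributed jobs):

```python
import numpy as np

def ssvd(A, rank, oversample=10, seed=0):
    """Stochastic (randomized) SVD: project A onto a random
    subspace, orthonormalize with QR, then SVD the small matrix."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, rank + oversample))
    # Y = A @ Omega is tall-and-skinny: this QR is the step a
    # distributed implementation would replace with TSQR.
    Q, _ = np.linalg.qr(A @ Omega)
    # Small (rank + oversample)-by-n problem, solved exactly.
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]

# Exact recovery of a matrix whose true rank is within the target rank.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 60))
U, s, Vt = ssvd(A, rank=5)
assert np.allclose(U * s @ Vt, A)
```

In an item-based CF setting, A would be the user-item rating matrix, and the truncated factors U, s, Vt give the low-rank item similarity structure used to score recommendations.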
Abstract (Chinese) i
Abstract ii
Contents iii
List of Figures vi
List of Tables ix
List of Algorithms x

Chapter 1 - Introduction 1

Chapter 2 - Background 3
2.1. Apache Hadoop 3
2.1.1. Hadoop MapReduce 4
2.1.2. Hadoop MapReduce I/O Mechanism 5
2.2. Apache Spark 5
2.2.1. Spark Resilient Distributed Dataset (RDD) 6
2.2.2. Spark I/O Mechanism 6
2.3. Recommendation System 7
2.3.1. Content-based 7
2.3.2. Collaborative Filtering 7
2.3.3. Long Tail 8
2.4. Mahout 9
2.4.1. Mahout Item-Based Distributed Recommendation 9

Chapter 3 - Previous Work 11
3.1. Stochastic SVD Item-Based Recommendation System 11
3.1.1. SVD Approximation 11
3.1.2. Stochastic SVD Decomposition 12
3.1.3. Derivation of Stochastic Item-Based Recommendation 13
3.1.4. Implementation in MapReduce 14
3.1.4.1. Preparation-Job 14
3.1.4.2. StochasticSVD-Job 16
3.1.4.3. Recommendation Job 17
3.2. Tall-and-Skinny QR (TSQR) Factorization 17
3.2.1. Communication-Avoiding 17
3.2.2. TSQR 18
3.2.3. TSQR in MapReduce 19
3.2.3.1. Indirect TSQR 19
3.2.3.2. Direct TSQR 21

Chapter 4 - Algorithms and Implementations 23
4.1. Iterative TSQR Algorithm 23
4.1.1. FirstQ-Job 23
4.1.2. BuildQ-Job 23
4.2. Iterative TSQR Stochastic SVD Item-Based Collaborative Filtering 26
4.2.1. Preparation-Job 26
4.2.2. SSVD-Job 27
4.2.2.1. Q-Job 27
4.2.2.2. Bt-Job 27
4.2.2.3. U-Job 27
4.2.2.4. V-Job 28
4.2.3. Recommendation-Job 29
4.3. Iterative TSQR Stochastic SVD Item-Based Collaborative Filtering in Spark Implementation 30
4.3.1. ITSQR Stochastic SVD Collaborative Filtering in Spark implementation 32
4.3.1.1. Preparation-Job 32
4.3.1.2. SSVD-Job 33
4.3.1.3. Recommendation-Job 36
4.4. Optimization of Iterative Stochastic SSVD Item-Based Collaborative Filtering in MapReduce Implementation 37
4.4.1. BLAS3 37
4.4.2. TSQR 38

Chapter 5 - Experiments and Results 39
5.1. Experimental Settings 39
5.1.1. Hardware and Software Specifications 39
5.1.2. Library of Hadoop Implementation 39
5.1.3. Hadoop Configuration 39
5.1.4. Library of Spark Implementation 40
5.1.5. Spark Configuration 40
5.2. Evaluation 40
5.2.1. Datasets 40
5.2.2. Quality Measurement 41
5.3. Performance Tuning of Iterative TSQR Stochastic SVD Item-Based Collaborative Filtering 43
5.3.1. Hadoop Implementation 43
5.3.2. Varying Number of Map Tasks 44
5.3.3. Varying Number of Reduce Tasks 46
5.3.4. Varying Rows of Block 47
5.3.5. Suggestion of Setting of Argument 48
5.4. Comparison of Direct TSQR and Iterative TSQR 49
5.4.1. Performance Comparison of Direct TSQR and Iterative TSQR 49
5.4.2. Comparison of Iterative TSQR with Iterative QR Decomposition and Single QR Decomposition 50
5.5. Comparison of SSVD-CF and ITSSVD-CF 51
5.5.1. Real Dataset 52
5.5.2. Synthesized Dataset - Varying Number of Users 54
5.5.3. Synthesized Dataset - Varying Oversampling 56
5.6. Comparison Performance and Accuracy of Various CF algorithms 58
5.6.1. Comparison of performance 58
5.6.2. Comparison of Accuracy 60
5.7. Comparison of Iterative TSQR in Hadoop and Spark Implementation 63
5.7.1. Performance Comparison of Iterative TSQR in Hadoop MapReduce Implementation 63
5.7.2. Comparison of Iterative TSQR with Iterative QR Decomposition and Single QR Decomposition in Spark Implementation 65

Chapter 6 - Conclusion and Future Works 67

Chapter 7 - References 68
[1] M. Anderson. Communication-Avoiding QR Decomposition for GPUs.
[2] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR factorizations. CoRR, arxiv.org/abs/0806.2159, 2008.
[3] Mahout Stochastic SVD Working Note, MAHOUT-376.
[4] P. G. Constantine. Tall and Skinny QR factorizations in MapReduce architectures.
[5] A. R. Benson. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures.
[6] K. Bosteels. Dumbo. http://projects.dumbotics.com/dumbo/, 2012.
[7] MapReduce: Simplified Data Processing on Large Clusters.
[8] M. Snir and S. Graham, editors. Getting up to Speed: The Future of Supercomputing. National Research Council, 2004. 227 pages.
[9] J. Demmel. Slides of the Communication-Avoiding Algorithms course, UC Berkeley.
[10] Apache Mahout Official Website. https://mahout.apache.org/
[11] Wikipedia - Recommender System. http://en.wikipedia.org/wiki/Recommender_system
[12] Implement MapReduce version of stochastic SVD - SSVD working notes. https://issues.apache.org/jira/browse/MAHOUT-376
[13] Ya-Fang Chang and Che-Rung Lee. MapReduce Implementations of Distributed Collaborative Based Recommendation System.
[14] S. Owen, R. Anil, T. Dunning, and E. Friedman. Mahout in Action. Manning Publications Co., 2012.
[15] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
[16] A. Rajaraman, J. Leskovec, and J. D. Ullman. Mining of Massive Datasets.
[17] Apache Hadoop Official Website. http://hadoop.apache.org/
[18] Apache Spark Official Website. https://spark.incubator.apache.org/
[19] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets.
[20] matrix-toolkits-java. https://github.com/fommil/matrix-toolkits-java
[21] JLAPACK package. http://www.netlib.org/java/f2j/
[22] MovieLens Web Site. http://grouplens.org/datasets/movielens/
[23] Netflix Prize competition.
[24] The Echo Nest Taste Profile Subset Web Site. http://labrosa.ee.columbia.edu/millionsong/tasteprofile
[25] A. Gunawardana, G. Shani, and L. Ungar. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. Journal of Machine Learning Research 10 (2009), pp. 2935-2962.
[26] J. L. Herlocker et al. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, 2004.
[27] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of Recommendation Algorithms for E-Commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce (EC'00). ACM, New York, pp. 158-167.
[28] Introduction of the Mahout Implementation of ALS Recommendation. https://mahout.apache.org/users/recommender/intro-als-hadoop.html
[29] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale Parallel Collaborative Filtering for the Netflix Prize. HP Labs, 1501 Page Mill Rd, Palo Alto, CA 94304.
[30] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., vol. 16, no. 1, pp. 1-17, Mar. 1990.
[31] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc., OSDI, 2004.
[32] Apache Mahout Official Website: Introduction to Item-Based Recommendations with Hadoop. https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html
[33] A Guide to Python Frameworks for Hadoop. http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
[34] Introduction of Hadoop Streaming. http://hadoop.apache.org/docs/r1.2.1/streaming.html
[35] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM '08, pages 337-348, Berlin, Heidelberg, 2008. Springer-Verlag.