Thesis Detail
Author (Chinese): 張佳泰
Author (English): Chang, Chia-Tai
Title (Chinese): 於半度量空間中對資料點進行取樣、分群、及嵌入的統一架構
Title (English): A Unified Framework for Sampling, Clustering and Embedding Data Points in Semi-Metric Spaces
Advisor (Chinese): 張正尚
Advisor (English): Chang, Cheng-Shang
Committee members (Chinese): 李端興; 林華君; 黃之浩
Committee members (English): Lee, Duan-Shin; Lin, Hwa-Chun; Huang, Chih-Hao
Degree: Master's
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 104064540
Publication year: 2017 (ROC year 106)
Graduation academic year: 105 (2016–2017)
Language: English
Number of pages: 51
Keywords (Chinese): 分群; 取樣; 嵌入; 半度量空間
Keywords (English): Clustering; Sampling; Embedding; Semi-metric spaces; Softmax function
In this thesis, we propose a unified framework for sampling, clustering, and embedding data points in semi-metric spaces. For a set of data points Omega = {x_1, x_2, ..., x_n} in a semi-metric space, there is a semi-metric that measures the distance between any two points. Our idea for sampling the data points in a semi-metric space is to consider a complete graph with n nodes and n self edges, and then map each data point in Omega to a node in the graph, with the edge weight between two nodes being the distance between the corresponding two points in Omega. By doing so, several well-known sampling techniques developed for community detection in graphs can be applied to clustering data points in a semi-metric space. One particularly interesting sampling technique is exponentially twisted sampling, in which one can specify the desired average distance of the sampling distribution to detect clusters at various resolutions.
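To illustrate the idea, the following is a minimal sketch (under the simplifying assumption that the twisted distribution takes the Gibbs form over node pairs, not the thesis's exact formulation) of exponentially twisted sampling from a distance matrix: the twisting parameter theta biases the bivariate sampling distribution toward nearby pairs, so lowering theta lowers the expected distance and sharpens the resolution.

```python
import numpy as np

def twisted_distribution(d, theta):
    """Exponentially twisted bivariate distribution over node pairs.

    d     : (n, n) symmetric matrix of semi-metric distances.
    theta : twisting parameter; theta = 0 gives the uniform distribution,
            theta < 0 puts more probability mass on nearby pairs.
    Returns an (n, n) matrix p with p.sum() == 1.
    """
    w = np.exp(theta * d)
    return w / w.sum()

def expected_distance(d, theta):
    """Average distance under the twisted sampling distribution."""
    return (twisted_distribution(d, theta) * d).sum()

# Toy example: three points on a line at 0, 1, and 10.
x = np.array([0.0, 1.0, 10.0])
d = np.abs(x[:, None] - x[None, :])

# The expected distance decreases as theta decreases, so theta can be
# tuned to hit a desired average distance (i.e., a desired resolution).
for theta in (0.0, -0.5, -2.0):
    print(theta, expected_distance(d, theta))
```

Because the derivative of the expected distance with respect to theta is a variance (and hence nonnegative), the expected distance is monotone in theta, which is what makes specifying a desired average distance well posed.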
Each sampling distribution leads to a covariance matrix that measures how correlated two points are. Using a covariance matrix as input, we also propose a softmax clustering algorithm that can be used not only for clustering but also for embedding data points from a semi-metric space into a low-dimensional Euclidean space. Our experimental results show that, after a certain number of "training" iterations, our softmax algorithm can reveal the "topology" of data from a high-dimensional Euclidean space by using only the pairwise distances. To provide further theoretical support for our findings, we show that the eigendecomposition of the covariance matrix is equivalent to principal component analysis (PCA) when the squared Euclidean distance is used as the semi-metric for high-dimensional data.
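The PCA connection can be checked numerically via a standard double-centering identity (a textbook fact, not the thesis's exact derivation): when the semi-metric is the squared Euclidean distance, double-centering the distance matrix recovers the centered Gram matrix, and its eigendecomposition yields the same embedding as PCA up to column signs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))          # 40 points in a 6-dimensional space
n = X.shape[0]

# Squared Euclidean distances used as the semi-metric.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Double-centering -1/2 * J @ sq @ J recovers the centered Gram matrix.
J = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * J @ sq @ J
Xc = X - X.mean(axis=0)
assert np.allclose(G, Xc @ Xc.T)

# Top eigenvectors of G, scaled by sqrt(eigenvalue), give the
# classical-MDS embedding ...
vals, vecs = np.linalg.eigh(G)
order = np.argsort(vals)[::-1]
embed = vecs[:, order[:2]] * np.sqrt(vals[order[:2]])

# ... which matches the first two PCA scores up to column signs.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U[:, :2] * S[:2]
assert np.allclose(np.abs(embed), np.abs(pca_scores))
```

This is the sense in which only the pairwise distances are needed: the embedding is computed from the distance matrix alone, yet it coincides with PCA applied to the raw coordinates.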
To deal with the hierarchical structure of clusters, our softmax clustering algorithm can also be used with a hierarchical agglomerative clustering algorithm. For this, we propose in this thesis an iterative partitional-hierarchical algorithm, called iPHD. Both the softmax clustering algorithm and the iPHD algorithm are modularity maximization algorithms; on the other hand, the K-means and K-sets algorithms in the literature are based on maximization of the normalized modularity. We compare our algorithms with these existing algorithms to show how the choices of the objective function and the distance measure affect the performance of the clustering results. Our experimental results show that algorithms based on maximization of the normalized modularity tend to balance the sizes of the detected clusters and thus do not perform well when the ground-truth clusters differ in size. Also, using a metric is better than using a semi-metric, as a semi-metric need not satisfy the triangle inequality, which makes clustering more prone to errors.
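As background for the partitional-hierarchical combination, a bare-bones average-linkage agglomerative pass over a pairwise distance matrix might look as follows (a generic baseline that such hybrids build on, not the thesis's iPHD algorithm):

```python
import numpy as np

def agglomerate(d, k):
    """Average-linkage agglomerative clustering from a pairwise distance
    matrix d down to k clusters. Returns a list of clusters, each a list
    of point indices."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest average distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = d[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two well-separated groups on a line: {0, 1, 2} near 0 and {3, 4} near 10.
x = np.array([0.0, 0.3, 0.6, 10.0, 10.4])
d = np.abs(x[:, None] - x[None, :])
print(agglomerate(d, 2))  # merges into {0, 1, 2} and {3, 4}
```

Note that this baseline happily produces clusters of very different sizes; objectives that normalize by cluster size behave differently, which is the effect the performance comparisons in Chapter 3 examine.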
Contents
List of Figures
1 Introduction
2 Clustering and embedding data points in semi-metric spaces
  2.1 Semi-metrics and semi-cohesion measures
  2.2 Exponentially twisted sampling
  2.3 Clusters in a sampled graph
  2.4 The softmax clustering algorithm
  2.5 An illustrating experiment with three rings
  2.6 Further supporting evidence by using the eigendecomposition of the semi-cohesion measure
  2.7 Connections to PCA
  2.8 Using the softmax clustering algorithm with a hierarchical agglomerative clustering
3 Performance comparisons
  3.1 Choice of the objective function
  3.2 Choice of the similarity/dissimilarity measure
4 Conclusion