Graph Algorithms · Spark Programming Guide (Simplified Chinese Edition)

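# Graph Algorithms

GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are contained in the `org.apache.spark.graphx.lib` package and can be accessed directly.

## PageRank

PageRank measures the importance of each vertex in a graph, assuming that an edge from u to v represents an endorsement of v's importance. For example, a Twitter user who is followed by many other users will be ranked highly. GraphX comes with static and dynamic implementations of PageRank, available as methods on the [PageRank object](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.PageRank$). Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge. [GraphOps]() allows these algorithms to be called directly as methods on a graph.

GraphX also includes an example social network dataset on which we can run PageRank. The set of users is given in `graphx/data/users.txt`, and the relationships between users are given in `graphx/data/followers.txt`. We compute the PageRank of each user as follows.

```scala
import org.apache.spark.graphx.GraphLoader

// `sc` is assumed to be an existing SparkContext (for example, in spark-shell)
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
// Run dynamic PageRank until convergence; 0.0001 is the convergence tolerance
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("graphx/data/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
```

## Connected Components

The connected components algorithm labels each connected component of the graph with an ID, using the ID of the lowest-numbered vertex in the component. For example, in a social network, connected components can approximate clusters. GraphX contains an implementation of the algorithm in the [ConnectedComponents object](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ConnectedComponents$), and we compute the connected components of the example social network dataset as follows.

```scala
import org.apache.spark.graphx.GraphLoader

// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("graphx/data/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, cc)) => (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))
```

## Triangle Counting

A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the [TriangleCount object](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$) that computes the number of triangles passing through each vertex. Note that when computing the triangle count of the social network dataset, `TriangleCount` requires the edges to be in canonical orientation (`srcId < dstId`) and the graph to be partitioned using `Graph.partitionBy`.

```scala
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Load the edges in canonical order (third argument) and partition the graph for triangle count
val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
// Find the triangle count for each vertex
val triCounts = graph.triangleCount().vertices
// Join the triangle counts with the usernames
val users = sc.textFile("graphx/data/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val triCountByUsername = users.join(triCounts).map {
  case (id, (username, tc)) => (username, tc)
}
// Print the result
println(triCountByUsername.collect().mkString("\n"))
```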
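
To make the triangle definition and the canonical orientation requirement (`srcId < dstId`) more concrete, here is a minimal plain-Scala sketch that runs without Spark. It is an illustration only and not how GraphX implements `TriangleCount`. The object name `TriangleCountSketch`, the helpers `canonical` and `triangleCounts`, and the toy edge list are all hypothetical and chosen for this example.

```scala
// Plain-Scala illustration (no Spark required) of a per-vertex triangle count.
// All names and data here are hypothetical; this is not the GraphX implementation.
object TriangleCountSketch {
  // Toy "follower" graph given as directed (srcId, dstId) pairs.
  val rawEdges: Seq[(Long, Long)] = Seq(
    (2L, 1L), (1L, 2L), // both directions between 1 and 2 collapse to one edge
    (3L, 1L), (2L, 3L), // closes the triangle 1-2-3
    (4L, 3L)            // vertex 4 hangs off vertex 3 and is in no triangle
  )

  // Canonical orientation: drop self-loops, force srcId < dstId, remove duplicates.
  def canonical(edges: Seq[(Long, Long)]): Set[(Long, Long)] =
    edges.collect { case (a, b) if a != b => if (a < b) (a, b) else (b, a) }.toSet

  // For each vertex, count the pairs of its neighbours that are themselves connected.
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    val canon = canonical(edges)
    // Undirected adjacency sets built from the canonical edges.
    val neighbours: Map[Long, Set[Long]] =
      canon.toSeq
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }
    neighbours.map { case (v, ns) =>
      val closedPairs = for {
        a <- ns.toSeq
        b <- ns.toSeq
        if a < b && canon.contains((a, b))
      } yield (a, b)
      v -> closedPairs.size
    }
  }

  def main(args: Array[String]): Unit =
    triangleCounts(rawEdges).toSeq.sortBy(_._1).foreach { case (v, n) =>
      println(s"vertex $v is part of $n triangle(s)")
    }
}
```

On this toy graph, vertices 1, 2, and 3 each belong to one triangle and vertex 4 belongs to none. Canonicalizing first means that the edge between 1 and 2, which appears in both directions, is counted only once.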