The Property Graph · Spark Programming Guide (Simplified Chinese Edition)

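# The Property Graph

The [property graph](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph) is a directed multigraph with user-defined objects attached to each vertex and edge. A directed multigraph is a directed graph in which multiple parallel edges may share the same source and destination vertex. Support for parallel edges simplifies modeling scenarios in which the same pair of vertices has several relationships (for example, co-worker and friend). Each vertex is keyed by a unique 64-bit long identifier (`VertexId`). GraphX does not impose any ordering constraints on the vertex identifiers. Similarly, each edge has corresponding source and destination vertex identifiers.

The property graph is parameterized over the vertex (`VD`) and edge (`ED`) types. These are the types of the objects associated with each vertex and each edge, respectively.

In some cases it may be desirable to have vertices with different property types in the same graph. This can be accomplished through inheritance. For example, to model users and products as a bipartite graph we might do the following:

~~~
import org.apache.spark.graphx.Graph

// Common base type for all vertex properties
class VertexProperty()
case class UserProperty(name: String) extends VertexProperty
case class ProductProperty(name: String, price: Double) extends VertexProperty
// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null
~~~

Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph are reused in the new graph, which reduces the cost of this inherently functional data structure. The graph is partitioned across the executors using a range of vertex partitioning heuristics. As with RDDs, each partition of the graph can be recreated on a different machine in the event of a failure.

Logically, the property graph corresponds to a pair of typed collections (RDDs) encoding the properties of each vertex and edge. As a consequence, the graph class contains members to access the vertices and edges of the graph:

~~~
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}
~~~

The classes `VertexRDD[VD]` and `EdgeRDD[ED]` extend, and are optimized versions of, `RDD[(VertexId, VD)]` and `RDD[Edge[ED]]` respectively. Both `VertexRDD[VD]` and `EdgeRDD[ED]` provide additional functionality built around graph computation and leverage internal optimizations.

### Example Property Graph

Suppose we want to construct a property graph consisting of the various collaborators on the GraphX project. The vertex property might contain the username and occupation. We could annotate each edge with a string describing the relationship between the collaborators.

![Property graph](https://box.kancloud.cn/2015-08-16_55d04e997662d.png)

The resulting graph would have the type signature:

~~~
val userGraph: Graph[(String, String), String]
~~~

There are numerous ways to construct a property graph from raw files or RDDs. Probably the most general method is to use the [Graph object](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph$). The following code constructs a property graph from a collection of RDDs:

~~~
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assume a SparkContext named sc has already been constructed (for example, in spark-shell)
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
~~~

In the example above we make use of the [Edge](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Edge) case class. An edge has a `srcId` and a `dstId`, corresponding to the source and destination vertex identifiers. In addition, the `Edge` class has an `attr` member which stores the edge property.

We can deconstruct a graph into its vertex and edge views by using the `graph.vertices` and `graph.edges` members respectively.

~~~
// graph: Graph[(String, String), String], constructed above
// Count all users which are postdocs
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count
// Count all the edges where src > dst
graph.edges.filter(e => e.srcId > e.dstId).count
~~~

Note that `graph.vertices` returns a `VertexRDD[(String, String)]`, which extends `RDD[(VertexId, (String, String))]`, so we can use the Scala `case` expression to deconstruct the tuple. On the other hand, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects. We could also have used the case class type constructor, as in the following:

~~~
graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count
~~~

In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. The triplet view logically joins the vertex and edge properties, yielding an `RDD[EdgeTriplet[VD, ED]]` containing instances of the [EdgeTriplet](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.EdgeTriplet) class. This join can be expressed with the following SQL expression:

~~~
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
FROM edges AS e
LEFT JOIN vertices AS src ON e.srcId = src.id
LEFT JOIN vertices AS dst ON e.dstId = dst.id
~~~

or graphically as:

![triplet](https://box.kancloud.cn/2015-08-16_55d04e99926f1.png)

The `EdgeTriplet` class extends the `Edge` class by adding the `srcAttr` and `dstAttr` members, which contain the source and destination properties respectively. We can use the triplet view of a graph to render a collection of strings describing the relationships between users.

~~~
// graph: Graph[(String, String), String], constructed above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
  graph.triplets.map(triplet =>
    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
facts.collect.foreach(println(_))
~~~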
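To make the join behind the triplet view more concrete, here is a minimal sketch that models it with plain Scala collections and no Spark dependency. It is only an illustration of the logic, not how GraphX implements the view. The names `TripletSketch`, `SimpleEdge`, `SimpleTriplet`, and `triplets` are hypothetical and are not part of the GraphX API. The sketch also assumes that an endpoint missing from the vertex collection falls back to the default user, mirroring the role of `defaultUser` when building the graph.

~~~scala
// Plain-Scala model of the triplet view. All names here are hypothetical
// stand-ins for illustration; none of them come from the GraphX API.
object TripletSketch {
  // Simplified counterparts of GraphX's Edge and EdgeTriplet for this example.
  final case class SimpleEdge(srcId: Long, dstId: Long, attr: String)
  final case class SimpleTriplet(
      srcId: Long, srcAttr: (String, String),
      dstId: Long, dstAttr: (String, String),
      attr: String)

  // Same sample data as the guide's example, held in local collections.
  val users: Map[Long, (String, String)] = Map(
    (3L, ("rxin", "student")),
    (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")),
    (2L, ("istoica", "prof")))

  val relationships: Seq[SimpleEdge] = Seq(
    SimpleEdge(3L, 7L, "collab"),
    SimpleEdge(5L, 3L, "advisor"),
    SimpleEdge(2L, 5L, "colleague"),
    SimpleEdge(5L, 7L, "pi"))

  val defaultUser: (String, String) = ("John Doe", "Missing")

  // Join each edge with the attributes of its source and destination vertex.
  // Endpoints that are absent from `vertices` receive the `default` attribute.
  def triplets(
      vertices: Map[Long, (String, String)],
      edges: Seq[SimpleEdge],
      default: (String, String)): Seq[SimpleTriplet] =
    edges.map { e =>
      SimpleTriplet(
        e.srcId, vertices.getOrElse(e.srcId, default),
        e.dstId, vertices.getOrElse(e.dstId, default),
        e.attr)
    }

  // Render one sentence per triplet, as in the guide's `facts` example.
  val facts: Seq[String] =
    triplets(users, relationships, defaultUser).map { t =>
      s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}"
    }

  def main(args: Array[String]): Unit = facts.foreach(println)
}
~~~

Running `TripletSketch.main` prints one line per edge, starting with `rxin is the collab of jgonzal`. Because this sketch uses an ordered local `Seq`, the lines follow the order of `relationships`; no such ordering is claimed here for the distributed `RDD` version.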