Spark Overview · <center>Apache Spark Official Documentation, Chinese Edition</center>

# Spark Overview

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Security

Security in Spark is off by default, which means a default installation may be vulnerable to attack. Please read [Spark Security](https://spark.apache.org/docs/latest/security.html) before downloading and running Spark.

# Downloading

Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version 2.4.4. Spark uses Hadoop's client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions. Users can also download a "Hadoop free" binary and run Spark with any Hadoop version by [augmenting Spark's classpath](hadoop-provided.html). Scala and Java users can include Spark in their projects using its Maven coordinates, and Python users can also install Spark from PyPI.

If you would like to build Spark from source, visit [Building Spark](building-spark.html).

Spark runs on both Windows and UNIX-like systems (for example, Linux and Mac OS). It is easy to run locally on one machine: all you need is to have `java` available on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation.

Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.4 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).

Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 was removed as of Spark 2.2.0. Support for Scala 2.10 was removed as of Spark 2.3.0. Support for Scala 2.11 is deprecated as of Spark 2.4.1 and will be removed in Spark 3.0.

# Running the Examples and Shell

Spark comes with several sample programs. Scala, Java, Python and R examples are in the `examples/src/main` directory. To run one of the Java or Scala sample programs, use `bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this invokes the more general [`spark-submit` script](submitting-applications.html) for launching applications.) For example,

```
./bin/run-example SparkPi 10
```

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

```
./bin/spark-shell --master local[2]
```

The `--master` option specifies the [master URL for a distributed cluster](submitting-applications.html#master-urls), or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using `local` for testing. For a full list of options, run `spark-shell` with the `--help` option.

Spark also provides a Python API. To run Spark interactively in a Python interpreter, use `bin/pyspark`:

```
./bin/pyspark --master local[2]
```

Example applications are also provided in Python. For example,

```
./bin/spark-submit examples/src/main/python/pi.py 10
```

Spark also provides an experimental [R API](sparkr.html) since 1.4 (only the DataFrames APIs are included). To run Spark interactively in an R interpreter, use `bin/sparkR`:

```
./bin/sparkR --master local[2]
```

Example applications are also provided in R. For example,

```
./bin/spark-submit examples/src/main/r/dataframe.R
```

# Launching on a Cluster

The Spark [cluster mode overview](cluster-overview.html) explains the key concepts in running on a cluster. Spark can run both by itself and on top of several existing cluster managers. It currently provides several options for deployment:

* [Standalone Deploy Mode](spark-standalone.html): the simplest way to deploy Spark on a private cluster
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)
* [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)

# Where to Go from Here

**Programming guides:**

- [Quick Start](https://spark.apache.org/docs/latest/quick-start.html): a quick introduction to the Spark API; start here!
- [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html): an overview of Spark basics: RDDs (the core but older API), accumulators, and broadcast variables
- [Spark SQL, Datasets, and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html): processing structured data with relational queries (a newer API than RDDs)
- [Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html): processing structured data streams with relational queries (using Datasets and DataFrames, a newer API than DStreams)
- [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html): processing data streams using DStreams (the older API)
- [MLlib](https://spark.apache.org/docs/latest/ml-guide.html): applying machine learning algorithms
- [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html): processing graphs

**API docs:**

- [Spark Scala API (Scaladoc)](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package)
- [Spark Java API (Javadoc)](https://spark.apache.org/docs/latest/api/java/index.html)
- [Spark Python API (Sphinx)](https://spark.apache.org/docs/latest/api/python/index.html)
- [Spark R API (Roxygen2)](https://spark.apache.org/docs/latest/api/R/index.html)
- [Spark SQL, Built-in Functions (MkDocs)](https://spark.apache.org/docs/latest/api/sql/index.html)

**Deployment guides:**

- [Cluster Overview](https://spark.apache.org/docs/latest/cluster-overview.html): an overview of concepts and components when running on a cluster
- [Submitting Applications](https://spark.apache.org/docs/latest/submitting-applications.html): packaging and deploying applications
- Deployment modes:
  - [Amazon EC2](https://github.com/amplab/spark-ec2): scripts that let you launch a cluster on EC2 in about 5 minutes
  - [Standalone Deploy Mode](https://spark.apache.org/docs/latest/spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
  - [Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html): deploy a private cluster using [Apache Mesos](https://mesos.apache.org/)
  - [YARN](https://spark.apache.org/docs/latest/running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
  - [Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html): deploy Spark on top of Kubernetes
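
As a small recap of the `--master` values used in the shell commands earlier on this page, the sketch below maps a local master URL to the number of threads it implies: `local` means one thread and `local[N]` means N threads. The helper name `local_thread_count` is hypothetical and written only for illustration; it is not part of Spark and does not reproduce Spark's own parsing. The `spark://host:7077` value is just a sample cluster-style URL.

```python
import re
from typing import Optional

# Hypothetical helper for illustration only; this is not Spark's own parser.
# It covers just the two local forms described on this page.
_LOCAL_N = re.compile(r"^local\[([1-9]\d*)\]$")


def local_thread_count(master: str) -> Optional[int]:
    """Return the thread count implied by a local master URL.

    Returns None for anything that is not `local` or `local[N]`
    with a positive N (for example, a cluster master URL).
    """
    if master == "local":
        return 1  # plain `local` runs with a single thread
    match = _LOCAL_N.match(master)
    if match:
        return int(match.group(1))  # `local[N]` runs with N threads
    return None


if __name__ == "__main__":
    for url in ("local", "local[2]", "spark://host:7077"):
        print(url, "->", local_thread_count(url))
```

Under these assumptions, the `--master local[2]` argument passed to `spark-shell`, `pyspark` and `sparkR` above corresponds to two local threads, while any other master URL is left for the cluster manager to interpret.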