Getting started · Prometheus official documentation (Chinese translation)

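# **Getting started**

This guide is a "Hello World"-style tutorial that shows how to install, configure, and use Prometheus in a simple example setup. You will download and run Prometheus locally, configure it to scrape itself and an example application, and then use queries, rules, and graphs to work with the collected time series data.

## **Downloading and running Prometheus**

Download the [latest release of Prometheus](https://prometheus.io/download) for your platform, then extract it and change into the resulting directory:

~~~
tar xvfz prometheus-*.tar.gz
cd prometheus-*
~~~

Before starting Prometheus, let's configure it.

## **Configuring Prometheus to monitor itself**

Prometheus collects metrics from monitored targets by scraping an HTTP endpoint on each target. Because Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health. A Prometheus server that collects only its own metrics is not very useful in practice, but it makes a good starting example. Save the following basic Prometheus configuration as a file named `prometheus.yml`:

~~~
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (e.g. federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape.
# Here it is Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
~~~

For a complete description of the configuration options, see the [configuration documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).

## **Starting Prometheus**

To start Prometheus with the newly created configuration file, change to the directory containing the Prometheus binary and run:

~~~
# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).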
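./prometheus --config.file=prometheus.yml
~~~

Prometheus should now be running, and you can browse its status page at `localhost:9090`. Give it a few seconds to collect data about itself from its own HTTP metrics endpoint. You can also verify that Prometheus is serving metrics about itself by visiting its metrics endpoint: `localhost:9090/metrics`

## **Using the expression browser**

Let's look at some of the data Prometheus has collected about itself. To use Prometheus's built-in expression browser, go to `localhost:9090/graph` and choose the "Console" view within the "Graph" tab. As you can see at `localhost:9090/metrics`, one of the metrics Prometheus exports about itself is named `prometheus_target_interval_length_seconds` (the actual amount of time between target scrapes). Enter it into the expression console:

~~~
prometheus_target_interval_length_seconds
~~~

This returns a number of different time series (along with the latest value recorded for each), all with the metric name `prometheus_target_interval_length_seconds` but with different labels. These labels designate different latency percentiles and target group intervals. If we are interested only in the 99th percentile latency, we can retrieve it with the following query:

~~~
prometheus_target_interval_length_seconds{quantile="0.99"}
~~~

To count the number of returned time series, you can write:

~~~
count(prometheus_target_interval_length_seconds)
~~~

For more about the expression language, see the [expression language documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).

## **Using the graphing interface**

To graph expressions, use the "Graph" tab. For example, enter the following expression to graph the per-second rate of chunks being created in the self-scraped Prometheus:

~~~
rate(prometheus_tsdb_head_chunks_created_total[1m])
~~~

Experiment with the graph range parameters and other settings.

## **Starting up some sample targets**

Let's make this more interesting and start some sample targets for Prometheus to scrape. The Go client library includes an example that exports fictional RPC latencies for three services with different latency distributions. Make sure you have the Go compiler installed and a working Go build environment (with a correct GOPATH) set up. Download the Go client library for Prometheus and run three of these example processes:

~~~
# Fetch the client library code and compile the example.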
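git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# Start the 3 example targets in separate terminals:
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082
~~~

You should now have example targets listening on `http://localhost:8080/metrics`, `http://localhost:8081/metrics`, and `http://localhost:8082/metrics`.

## **Configuring Prometheus to monitor the sample targets**

Now we will configure Prometheus to scrape these new targets. Let's group all three endpoints into one job called `example-random`. However, imagine that the first two endpoints are production targets, while the third one represents a canary instance. To model this in Prometheus, we can add several groups of endpoints to a single job and attach extra labels to each group of targets. In this example, we add the `group="production"` label to the first group of targets and `group="canary"` to the second. To do this, add the following job definition to the `scrape_configs` section of your `prometheus.yml` and restart your Prometheus instance:

~~~
scrape_configs:
  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
~~~

Go to the expression browser and verify that Prometheus now has information about the time series these example endpoints expose, such as `rpc_durations_seconds`.

## **Configuring rules for aggregating scraped data into new time series**

Although it is not a problem in our example, queries that aggregate over thousands of time series can become slow when computed ad hoc. To make this more efficient, Prometheus lets you prerecord expressions into completely new persisted time series via configured recording rules. Suppose we are interested in the per-second rate of example RPCs (`rpc_durations_seconds_count`), averaged over all instances (but preserving the `job` and `service` dimensions) and measured over a 5-minute window. We could write this as:

~~~
avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
~~~

Try graphing this expression. To record the time series resulting from this expression into a new metric called `job_service:rpc_durations_seconds_count:avg_rate5m`, create a file with the following recording rule and save it as `prometheus.rules.yml`:

~~~
groups:
- name: example
  rules:
  - record: job_service:rpc_durations_seconds_count:avg_rate5m
    expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
~~~

To make Prometheus pick up this new rule, add a `rule_files` statement to your `prometheus.yml`. The configuration should now look like this:

~~~
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these extra labels to all timeseries collected by this Prometheus instance.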
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'
~~~

Restart Prometheus with the new configuration, then query or graph it in the expression browser to verify that a new time series with the metric name `job_service:rpc_durations_seconds_count:avg_rate5m` is now available.
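
The same check can also be scripted against the Prometheus HTTP API instead of the expression browser. The following is a minimal sketch, assuming Prometheus is listening on `localhost:9090` as configured in this guide and that the instant-query endpoint `/api/v1/query` is reachable. The helper names (`build_query_url`, `parse_instant_vector`, `query`, `main`) are hypothetical and chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus runs at the default address used throughout this guide.
PROMETHEUS_URL = "http://localhost:9090"
# The metric produced by the recording rule in prometheus.rules.yml.
RULE_METRIC = "job_service:rpc_durations_seconds_count:avg_rate5m"


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Convert an instant-query JSON response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "number"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(base_url, expr):
    """Run an instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=10) as resp:
        return parse_instant_vector(json.load(resp))


def main():
    """Print one line per recorded series; an empty result means the rule has not produced data yet."""
    for labels, value in query(PROMETHEUS_URL, RULE_METRIC):
        print(labels.get("job"), labels.get("service"), value)
```

Calling `main()` while Prometheus and the three sample targets are running should list one value per `job`/`service` combination. If nothing is printed, wait for at least one `evaluation_interval` after the restart and try again.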