Getting Started · Prometheus Chinese Documentation

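## Getting started

---

This is a "Hello, World"-style walkthrough that shows how to install, configure, and run Prometheus in a simple demo setup. You will download and run Prometheus locally, write a configuration file that has it monitor itself and an example application, and then work with queries, rules, and graphs over the collected samples.

### Downloading and running Prometheus

Download the latest release for your platform from the [download page](https://prometheus.io/download), then extract it and change into the directory:

```shell
tar zxvf prometheus-*.tar.gz
cd prometheus-*
```

Before starting Prometheus, we need to configure it.

### Configuring Prometheus to monitor itself

Prometheus collects samples from its targets by scraping them over HTTP. Because Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health.

A Prometheus server that collects only its own data is not very useful in practice, but it makes a good starting example. Save the following configuration as `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Scrape configuration for Prometheus itself.
scrape_configs:
  # The job name is added as a `job` label to every sample scraped from this
  # config, along with an `instance` label holding the target's host:port.
  - job_name: 'prometheus'

    # Override the global default and scrape this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
```

For the complete list of configuration options, see the [configuration documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/).

### Starting Prometheus

Start Prometheus and point it at the configuration file:

```shell
./prometheus --config.file=prometheus.yml
```

Prometheus should now be running. Open `http://localhost:9090` in a browser to see its web interface.

You can also open `http://localhost:9090/metrics` to see the raw, current metrics that Prometheus exposes about itself.

### Using the expression browser

To use Prometheus's built-in expression browser, navigate to `http://localhost:9090/graph` and choose the "Console" view within the "Graph" tab.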
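Among the metrics Prometheus exports about itself is `prometheus_target_interval_length_seconds`, the actual amount of time between two scrapes of a target. Enter it into the expression console:

```promql
prometheus_target_interval_length_seconds
```

This should return a number of different time series. All of them have the metric name `prometheus_target_interval_length_seconds`, but each has a different set of label values. These labels designate different latency percentiles and target group intervals.

If we are only interested in the 99th percentile latencies, we can narrow the result with this query:

```promql
prometheus_target_interval_length_seconds{quantile="0.99"}
```

To count the number of returned time series, you can write:

```promql
count(prometheus_target_interval_length_seconds)
```

For more about the expression language, see the [expression language documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/).

### Using the graphing interface

To graph expressions, navigate to `http://localhost:9090/graph` and use the "Graph" tab.

For example, enter the following expression to graph the per-second rate of chunks being created, computed over the last minute:

```promql
rate(prometheus_tsdb_head_chunks_created_total[1m])
```

### Starting some sample targets

The Go client library includes an example that exports fictional RPC latencies for three services.

You need a working Go development environment to run it. Download the Go client library for Prometheus and run three instances of the example:

```shell
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# Start the three example targets, each in its own terminal
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082
```

The example targets should now be serving their metrics at `http://localhost:8080/metrics`, `http://localhost:8081/metrics`, and `http://localhost:8082/metrics`.

### Configuring Prometheus to monitor the sample targets

Now we will configure Prometheus to scrape the three new targets. We group all three endpoints into one job called `example-random`.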
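However, imagine that the first two endpoints are production targets while the third is a test target. To model this, we add several groups of endpoints to a single job and attach an extra label to each group: `group="production"` for the first group and `group="test"` for the second. Add the following job to the `scrape_configs` section of `prometheus.yml` and restart Prometheus:

```yaml
scrape_configs:
  - job_name: 'example-random'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'test'
```

Go to the expression browser and enter `rpc_durations_seconds` to verify that every scraped series carries a `group` label, and that the label takes only the two values `production` and `test`.

### Configuring rules for aggregating scraped data into new time series

The example above works, but queries that aggregate over thousands of time series become slow when they are computed ad hoc. To improve performance, Prometheus lets you prerecord expressions into new, persisted time series through recording rules defined in a rule file. Suppose we want to record the per-second rate of the example RPCs (`rpc_durations_seconds_count`), averaged over all instances but preserving the `job` and `service` dimensions, measured over a 5-minute window. We could write this as:

```promql
avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
```

To record the result of this expression into a new metric named `job_service:rpc_durations_seconds_count:avg_rate5m`, create a file with the following recording rule and save it as `prometheus.rules.yml`:

```yaml
groups:
  - name: example
    rules:
      - record: job_service:rpc_durations_seconds_count:avg_rate5m
        expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
```

Then add a top-level `rule_files` statement to `prometheus.yml`, and set `evaluation_interval` in the `global` section so the rule is evaluated periodically. The configuration file should now look like this:

```yaml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'test'
```

Restart Prometheus with the new configuration, then query `job_service:rpc_durations_seconds_count:avg_rate5m` to verify that the new time series is available.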
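
The restart-and-verify step can also be done from a terminal. The following is a minimal sketch, not part of the original walkthrough. It assumes that the `promtool` binary shipped in the release tarball sits next to `prometheus`, that Prometheus listens on `localhost:9090`, and that `curl` is installed. The function names `check_config` and `instant_query` and the variables `PROMTOOL` and `PROM_URL` are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Sketch under stated assumptions: promtool is in the current directory and
# Prometheus is reachable on localhost:9090. Override via PROMTOOL / PROM_URL.
PROMTOOL="${PROMTOOL:-./promtool}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Validate the configuration file (defaults to prometheus.yml) before restarting.
# promtool also checks the rule files that the configuration references.
check_config() {
  "$PROMTOOL" check config "${1:-prometheus.yml}"
}

# Run an instant query through the HTTP API and print the raw JSON response.
# $1 is the expression, for example the name of the recorded series.
instant_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

After sourcing the snippet, `check_config prometheus.yml` reports problems in the configuration or rule file, and, once Prometheus has been restarted and has evaluated the rule at least once, `instant_query 'job_service:rpc_durations_seconds_count:avg_rate5m'` should return a JSON document whose `result` array contains the recorded series.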