Deploying a Nagios monitoring server · Puppet 2.7 Cookbook (Chinese edition)

### Deploying a Nagios monitoring server

> My roommate lost his pet elephant. It’s in the apartment somewhere.
>
> -- Steven Wright

We can't keep an eye on everything all the time. The question is: how do you know when a server has failed? The wrong answer is, "My customers call and tell me the server is down." Yet you would be surprised how many organizations run systems without any automated monitoring. Automated monitoring is simple to set up, and there are many excellent free and open source monitoring tools available, including Nagios, Icinga, Zabbix, and Zenoss.

Nagios is one of the oldest and most sophisticated of these, although it has a reputation (largely deserved) for being hard to configure. This recipe shows you how to use Puppet to build a Nagios-based monitoring server, and how to have Puppet configure every machine that Nagios monitors.

#### Getting ready

You will need the Apache module we created in the [Managing Apache servers](#ch07sec01) section of Chapter 7.

#### How to do it...

1. Create a nagios module:

```
# mkdir /etc/puppet/modules/nagios
# mkdir /etc/puppet/modules/nagios/files
# mkdir /etc/puppet/modules/nagios/manifests
```

2. Create the file /etc/puppet/modules/nagios/manifests/server.pp with the following contents:

```
class nagios::server {
    include apache

    package { [ "nagios3", "nagios-images", "nagios-nrpe-plugin" ]:
        ensure => installed,
    }

    service { "nagios3":
        ensure  => running,
        enable  => true,
        require => Package["nagios3"],
    }

    # Runs only when notified: verify the configuration, then restart Nagios
    exec { "nagios-config-check":
        command     => "/usr/sbin/nagios3 -v /etc/nagios3/nagios.cfg && /usr/sbin/service nagios3 restart",
        refreshonly => true,
    }

    file { "/etc/apache2/sites-available/nagios.conf":
        source  => "puppet:///modules/nagios/nagios.conf",
        notify  => Service["apache2"],
        require => Package["apache2-mpm-prefork"],
    }

    file { "/etc/apache2/sites-enabled/nagios.conf":
        ensure  => symlink,
        target  => "/etc/apache2/sites-available/nagios.conf",
        require => Package["apache2-mpm-prefork"],
    }

    file { [ "/etc/nagios3/generic-service_nagios2.cfg",
             "/etc/nagios3/services_nagios2.cfg",
             "/etc/nagios3/hostgroups_nagios2.cfg",
             "/etc/nagios3/extinfo_nagios2.cfg",
             "/etc/nagios3/localhost_nagios2.cfg",
             "/etc/nagios3/contacts_nagios2.cfg",
             "/etc/nagios3/conf.d" ]:
        ensure => absent,
        force  => true,
    }

    # Deploy one Nagios config file from the module and re-check the config
    define nagios-config() {
        file { "/etc/nagios3/${name}":
            source  => "puppet:///modules/nagios/${name}",
            require => Package["nagios3"],
            notify  => Exec["nagios-config-check"],
        }
    }

    nagios-config { [ "htpasswd.nagios",
                      "nagios.cfg",
                      "cgi.cfg",
                      "hostgroups.cfg",
                      "hosts.cfg",
                      "host_templates.cfg",
                      "service_templates.cfg",
                      "services.cfg",
                      "timeperiods.cfg",
                      "contacts.cfg",
                      "commands.cfg" ]:
    }

    file { "/var/lib/nagios3":
        # see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=478889
        mode    => "0751",
        require => Package["nagios3"],
        notify  => Service["nagios3"],
    }

    file { "/var/lib/nagios3/rw":
        # see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=478889
        mode    => "2710",
        require => Package["nagios3"],
        notify  => Service["nagios3"],
    }
}
```

3. Create the file /etc/puppet/modules/nagios/files/nagios.cfg with the following contents:

```
# Config files to read
cfg_file=/etc/nagios3/commands.cfg
cfg_file=/etc/nagios3/service_templates.cfg
cfg_file=/etc/nagios3/host_templates.cfg
cfg_file=/etc/nagios3/timeperiods.cfg
cfg_file=/etc/nagios3/contacts.cfg
cfg_file=/etc/nagios3/hostgroups.cfg
cfg_file=/etc/nagios3/hosts.cfg
cfg_file=/etc/nagios3/services.cfg

# Nagios settings
log_file=/var/log/nagios3/nagios.log
illegal_macro_output_chars=`~$&|'"<>
check_result_path=/var/lib/nagios3/spool/checkresults
nagios_user=nagios
nagios_group=nagios
command_file=/var/lib/nagios3/rw/nagios.cmd
lock_file=/var/run/nagios3/nagios3.pid
p1_file=/usr/lib/nagios3/p1.pl
check_external_commands=1
resource_file=/etc/nagios3/resource.cfg
```

4.
   Create the file /etc/puppet/modules/nagios/files/service_templates.cfg with the following contents:

```
define service{
    name                            generic_service ; The 'name' of this service template
    active_checks_enabled           1       ; Active service checks are enabled
    passive_checks_enabled          1       ; Passive service checks are enabled/accepted
    parallelize_check               1       ; Active service checks should be parallelized
                                            ; (disabling this can lead to major performance problems)
    obsess_over_service             1       ; We should obsess over this service (if necessary)
    check_freshness                 0       ; Default is to NOT check service 'freshness'
    notifications_enabled           1       ; Service notifications are enabled
    event_handler_enabled           1       ; Service event handler is enabled
    flap_detection_enabled          1       ; Flap detection is enabled
    failure_prediction_enabled      1       ; Failure prediction is enabled
    process_perf_data               1       ; Process performance data
    retain_status_information       1       ; Retain status information across program restarts
    retain_nonstatus_information    1       ; Retain non-status information across program restarts
    notification_interval           0       ; Only send notifications on status change by default.
    is_volatile                     0
    check_period                    24x7
    normal_check_interval           5
    retry_check_interval            2
    max_check_attempts              3
    notification_period             24x7
    notification_options            c,r
    contact_groups                  sysadmin
    register                        0       ; DONT REGISTER THIS DEFINITION
                                            ; - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

# Defaults
define service {
    name                    every_5_mins
    normal_check_interval   5
    use                     generic_service
    register                0
}

define service {
    name                    every_hour
    normal_check_interval   60
    use                     generic_service
    register                0
}

define service {
    name                    every_day
    normal_check_interval   1440
    use                     generic_service
    register                0
}
```

5. Create the file /etc/puppet/modules/nagios/files/services.cfg with the following contents:

```
define service {
    hostgroup_name          all
    service_description     Disk
    check_command           check_nrpe!check_all_disks!20%!10%
    use                     every_day
}

define service {
    hostgroup_name          all
    service_description     Load
    check_command           check_nrpe!check_load!10,10,10!15,15,15
    use                     every_hour
}
```

6.
   Create the file /etc/puppet/modules/nagios/files/cgi.cfg with the following contents:

```
main_config_file=/etc/nagios3/nagios.cfg
physical_html_path=/usr/share/nagios3/htdocs
url_html_path=/nagios3
show_context_help=1
use_pending_states=1
nagios_check_command=/usr/lib/nagios/plugins/check_nagios /var/cache/nagios3/status.dat 5 '/usr/sbin/nagios3'
use_authentication=1
use_ssl_authentication=0
authorized_for_system_information=nagios
authorized_for_configuration_information=nagios
authorized_for_system_commands=nagios
authorized_for_all_services=nagios
authorized_for_all_hosts=nagios
authorized_for_all_service_commands=nagios
authorized_for_all_host_commands=nagios
default_statusmap_layout=5
default_statuswrl_layout=4
ping_syntax=/bin/ping -n -U -c 5 $HOSTADDRESS$
refresh_rate=90
escape_html_tags=1
action_url_target=_blank
notes_url_target=_blank
lock_author_names=1
```

7. Create the file /etc/puppet/modules/nagios/files/host_templates.cfg with the following contents:

```
define host{
    name                            generic_host
    check_command                   check-host-alive
    max_check_attempts              3
    checks_enabled                  1
    failure_prediction_enabled      1
    retain_status_information       1
    retain_nonstatus_information    1
    notification_interval           0
    notification_options            d,u,r
    check_interval                  300
    contact_groups                  sysadmin
    register                        0
}
```

8. Create the file /etc/puppet/modules/nagios/files/contacts.cfg with the following contents (use your own e-mail address, or at least that of someone who won't mind getting a lot of mail from your monitoring server):

```
define contact {
    contact_name                    helen
    alias                           Helen Highwater
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,r
    service_notification_commands   notify-service-by-email
    host_notification_commands      notify-host-by-email
    email                           helen@example.com
}

define contactgroup {
    contactgroup_name   sysadmin
    alias               Sysadmins
    members             helen
}
```

9. Create the file /etc/puppet/modules/nagios/files/hostgroups.cfg with the following contents:

```
define hostgroup {
    hostgroup_name  all
    alias           All Servers
    members         *
}
```

10.
    Create the file /etc/puppet/modules/nagios/files/timeperiods.cfg with the following contents:

```
define timeperiod {
    timeperiod_name 24x7
    alias           24 Hours A Day, 7 Days A Week
    sunday          00:00-24:00
    monday          00:00-24:00
    tuesday         00:00-24:00
    wednesday       00:00-24:00
    thursday        00:00-24:00
    friday          00:00-24:00
    saturday        00:00-24:00
}
```

11. Create the file /etc/puppet/modules/nagios/files/hosts.cfg with the following contents (replace the values with the details of your own server):

```
define host {
    host_name   cookbook
    address     cookbook.bitfieldconsulting.com
    use         generic_host
}
```

12. Create the file /etc/puppet/modules/nagios/files/commands.cfg with the following contents:

```
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$ $ARG5$
}

define command{
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H '$HOSTADDRESS$' -w 5000,100% -c 5000,100% -p 1
}

define command{
    command_name    check_all_disks
    command_line    /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e
}

define command{
    command_name    notify-host-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}

define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
```

13.
    Create the file /etc/puppet/modules/nagios/files/nagios.conf with the following contents (replace the ServerName value with your own server):

```
ScriptAlias /cgi-bin/nagios3 /usr/lib/cgi-bin/nagios3
ScriptAlias /nagios3/cgi-bin /usr/lib/cgi-bin/nagios3
Alias /nagios3/stylesheets /etc/nagios3/stylesheets
Alias /nagios3 /usr/share/nagios3/htdocs
Alias / /usr/share/nagios3/htdocs/

<DirectoryMatch (/usr/share/nagios3/htdocs|/usr/lib/cgi-bin/nagios3|/etc/nagios3/stylesheets)>
    Options FollowSymLinks
    DirectoryIndex index.html
    AllowOverride AuthConfig
    Order Allow,Deny
    Allow From All
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /etc/nagios3/htpasswd.nagios
    require valid-user
</DirectoryMatch>

<VirtualHost *:80>
    ServerName nagios.bitfieldconsulting.com
    ErrorLog /var/log/apache2/nagios-error_log
    CustomLog /var/log/apache2/nagios-access_log common
    DocumentRoot /usr/share/nagios3
</VirtualHost>
```

14. Create a password file to control access to the Nagios web interface:

```
# htpasswd -c /etc/puppet/modules/nagios/files/htpasswd.nagios nagios
Password: (type password)
```

15. If the htpasswd program is not yet installed on your system, run the following command:

```
# apt-get install apache2-utils
```

16. Create an /etc/hosts entry or DNS record for the ServerName you specified in the preceding code. In this example the hostname is:

```
nagios.bitfieldconsulting.com
```

17. In the node definition for your Nagios server, include the following:

```
include nagios::server
```

18. Create the file /etc/puppet/modules/nagios/files/nrpe.cfg with the following contents (replace the allowed_hosts value with the hostname or IP address of your own monitoring server):

```
log_facility=daemon
pid_file=/var/run/nagios/nrpe.pid
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
allowed_hosts=cookbook.bitfieldconsulting.com
dont_blame_nrpe=1
debug=0
command_timeout=60
connection_timeout=300
command[check_load]=/usr/lib/nagios/plugins/check_load -w $ARG1$ -c $ARG2$
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e -A -i '.gvfs'
```

19.
    Create the file /etc/puppet/modules/nagios/manifests/target.pp with the following contents:

```
class nagios::target {
    package { [ "nagios-nrpe-server",
                "nagios-plugins",
                "nagios-plugins-basic",
                "nagios-plugins-standard",
                "nagios-plugins-extra" ]:
        ensure => installed,
    }

    service { "nagios-nrpe-server":
        enable  => true,
        ensure  => running,
        pattern => "/usr/sbin/nrpe",
        require => Package["nagios-nrpe-server"],
    }

    # Restart NRPE whenever its configuration changes
    file { "/etc/nagios/nrpe.cfg":
        source  => "puppet:///modules/nagios/nrpe.cfg",
        require => Package["nagios-nrpe-server"],
        notify  => Service["nagios-nrpe-server"],
    }
}
```

20. Include this class on every node you want to monitor (this should include the Nagios server itself):

```
include nagios::target
```

21. Run Puppet on the Nagios server:

```
# puppet agent --test
```

22. Open the Nagios web interface in your browser (log in with the user name nagios and the password you set earlier) and check that you see the Nagios welcome screen, as shown:

![https://box.kancloud.cn/2016-05-12_5733eeea46a80.png](https://box.kancloud.cn/2016-05-12_5733eeea46a80.png)

23. Click the **Host detail** menu; you should see a list of your target nodes.

24. Click the name of a node, then choose "Schedule a check of all services on this host" from the **Host commands** menu.

25.
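    Tick the "Force check" box and click "Commit". It will take a few seconds for Nagios to run the checks. Click the "Service detail" menu; you should see a list of services shown in green, as in the following screenshot:

![https://box.kancloud.cn/2016-05-12_5733eeea7033f.png](https://box.kancloud.cn/2016-05-12_5733eeea7033f.png)

#### How it works...

Although this recipe is quite long, the Puppet manifests themselves are simple; there are no techniques here that we haven't already used in other recipes. In essence, all we do is install the Nagios packages, configure an Apache virtual host to serve the web interface, and deploy a set of configuration files that tell Nagios which hosts to check, which services to check on them, and various other odds and ends.

On the client side, every node monitored by Nagios needs the nagios-nrpe-server package (**NRPE** is the protocol that lets Nagios execute commands securely on remote servers), along with a configuration file that tells the NRPE service which commands it is allowed to run.

> ![Note](https://box.kancloud.cn/2016-05-12_5733eec619643.png)
> You should be aware that the dont_blame_nrpe setting in nrpe.cfg is a potential security hole, because it allows commands to be executed on the remote host with user-supplied arguments. This is a very useful feature, since it means you can change alert thresholds or other parameters without reconfiguring every monitored machine. If you do not need it, however, it is safer to disable dont_blame_nrpe.

#### There's more...

The hardest part of using Nagios is getting it installed, configured, and running for the first time. Although the setup shown here is very basic (it checks only disk usage and CPU load), you can use it as a starting point for your own Nagios setup, adding more services and hosts to check. You may want to add some of the following:

* Host groups (for example, a web servers group or a database servers group): you can configure a check that is automatically applied to every member of a host group.
* Website checks: the Nagios check_http plugin is quite sophisticated; it can handle redirects, SSL, authentication, and matching text in a web page.
* Process checks: monitoring a specified process on a host is a common requirement. You can do this with the check_procs plugin.
* Different check frequencies: I have defined every_hour, every_day, and every_5_mins in service_templates.cfg; you may want to add some new frequencies.
* New time periods: timeperiods.cfg currently defines only one time period, 24x7, but you may want to create your own. For example, if database maintenance runs from midnight to 1 a.m., you can define a time period that excludes this window so that you don't get false alarms from the database server.

To find out more about configuring Nagios, see the documentation: [http://nagios.sourceforge.net/docs/nagioscore/3/en/toc.html](http://nagios.sourceforge.net/docs/nagioscore/3/en/toc.html).

Puppet also has some built-in support for Nagios; you can have Puppet generate host and service definitions from your manifests, which is a powerful and useful feature. Although I use it on some production sites, I regretfully had to leave it out for reasons of space. If you want to find out more, see the official Puppet documentation and Mike Gurski's excellent article on the subject: [http://blog.gurski.org/index.php/2010/01/28/automatic-monitoring-with-puppet-and-nagios/](http://blog.gurski.org/index.php/2010/01/28/automatic-monitoring-with-puppet-and-nagios/).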
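
As a rough illustration of the maintenance-window idea mentioned under "New time periods", the following sketch shows what could be appended to timeperiods.cfg and service_templates.cfg. The names `24x7_except_maintenance` and `every_5_mins_outside_maintenance` are hypothetical and not part of the recipe, and the hours assume the midnight to 1 a.m. window from the example; adjust them to your own schedule.

```
# Hypothetical addition to timeperiods.cfg:
# every day, all hours except 00:00-01:00
define timeperiod {
    timeperiod_name 24x7_except_maintenance
    alias           24 Hours A Day Except 00:00-01:00
    sunday          01:00-24:00
    monday          01:00-24:00
    tuesday         01:00-24:00
    wednesday       01:00-24:00
    thursday        01:00-24:00
    friday          01:00-24:00
    saturday        01:00-24:00
}

# Hypothetical addition to service_templates.cfg:
# inherits from every_5_mins, but checks and notifies only outside the window
define service {
    name                    every_5_mins_outside_maintenance
    use                     every_5_mins
    check_period            24x7_except_maintenance
    notification_period     24x7_except_maintenance
    register                0
}
```

A service defined for the database server could then say `use every_5_mins_outside_maintenance` instead of `use every_5_mins`. Because both files are already listed in nagios.cfg and deployed by the nagios-config resources, no manifest change should be needed for this sketch.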