Monitoring processes with monit and collectd

Instances built on Engine Yard come with monit and collectd installed. Together they monitor the state of the system and of critical processes.

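When monit detects a problem with a process or daemon it monitors, it sends an alert and, where possible, restarts the process. Keep monit in mind when you log in over SSH and restart a service by hand, because monit may react to a service it finds stopped.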
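
One way to keep monit from reacting during a manual restart is to pause its monitoring of that service first. The following is a minimal sketch, not an Engine Yard tool: manual_restart is a hypothetical helper name, and it assumes the service name matches the name monit uses for it. It relies only on monit's own unmonitor and monitor actions.

```shell
#!/usr/bin/env bash
# Sketch: pause monit's monitoring of one service around a manual restart.
# MONIT can be overridden (for example MONIT=echo) for a dry run.
MONIT="${MONIT:-sudo monit}"

manual_restart() {
  local service="$1"
  shift
  # Ask monit to stop watching the service so it does not react to the stop.
  $MONIT unmonitor "$service" || return 1
  # Run the restart command supplied by the caller and keep its exit status.
  local rc=0
  "$@" || rc=$?
  # Put the service back under monit's control, whatever the restart returned.
  $MONIT monitor "$service" || return 1
  return "$rc"
}
```

A call might look like manual_restart redis sudo /etc/init.d/redis restart, where the init script path is an assumption about the host. If a plain restart is all that is needed, sudo monit restart redis asks monit to do the restart itself.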

To check the state of the processes that monit is monitoring, log in to the target host over SSH and run the following command.

$ sudo monit status
The Monit daemon 5.3.2 uptime: 8d 4h 10m 

Process 'redis'
  status                            Running
  monitoring status                 Monitored
  pid                               30717
  parent pid                        1
  uptime                            295d 1h 11m 
  children                          0
  memory kilobytes                  620
  memory kilobytes total            620
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.3%
  cpu percent total                 0.3%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'php-fpm'
  status                            Running
  monitoring status                 Monitored
  pid                               5792
  parent pid                        1
  uptime                            15h 43m 
  children                          3
  memory kilobytes                  8104
  memory kilobytes total            148264
  memory percent                    0.4%
  memory percent total              8.7%
  cpu percent                       0.0%
  cpu percent total                 5.3%
  unix socket response time         0.000s to /var/run/engineyard/php-fpm_candycane.sock [DEFAULT]
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'newrelic-daemon'
  status                            Running
  monitoring status                 Monitored
  pid                               31421
  parent pid                        1
  uptime                            8d 4h 11m 
  children                          1
  memory kilobytes                  44
  memory kilobytes total            5176
  memory percent                    0.0%
  memory percent total              0.3%
  cpu percent                       0.0%
  cpu percent total                 0.6%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'memcache_11211'
  status                            Running
  monitoring status                 Monitored
  pid                               14959
  parent pid                        1
  uptime                            295d 1h 23m 
  children                          0
  memory kilobytes                  188
  memory kilobytes total            188
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'collectd_httpd'
  status                            Running
  monitoring status                 Monitored
  pid                               31739
  parent pid                        1
  uptime                            295d 1h 11m 
  children                          1
  memory kilobytes                  4
  memory kilobytes total            80
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'collectd_fcgi'
  status                            Running
  monitoring status                 Monitored
  pid                               31746
  parent pid                        1
  uptime                            295d 1h 11m 
  children                          0
  memory kilobytes                  104
  memory kilobytes total            104
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 02 Apr 2014 06:33:58

System 'system_ip-10-132-70-223.ap-northeast-1.compute.internal'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.58] [0.42] [0.34]
  cpu                               5.0%us 3.9%sy 0.5%wa
  memory usage                      540284 kB [31.8%]
  swap usage                        35100 kB [3.8%]
  data collected                    Wed, 02 Apr 2014 06:33:58
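
On a host with many entries, it can help to show only the ones that need attention. The sketch below filters the status text for entries whose status is not Running or whose monitoring status is not Monitored. unhealthy_services is a hypothetical helper name, and the parsing assumes the layout shown in the output above (Process and System entries only).

```shell
#!/usr/bin/env bash
# Sketch: list entries in `monit status` output that are not healthy.
# Reads the status text on stdin.
unhealthy_services() {
  awk -v q="'" '
    # Join fields from..NF back into one string (values may contain spaces).
    function rest(from,   i, s) {
      s = $from
      for (i = from + 1; i <= NF; i++) s = s " " $i
      return s
    }
    # Entry header: remember the name with its quotes stripped.
    /^(Process|System) / { name = $2; gsub(q, "", name); next }
    # "status <value>" line
    $1 == "status" && $2 != "Running" { print name ": status is " rest(2) }
    # "monitoring status <value>" line
    $1 == "monitoring" && $2 == "status" && $3 != "Monitored" {
      print name ": monitoring status is " rest(3)
    }
  '
}
```

Run as sudo monit status | unhealthy_services. For the output above, where every entry is Running and Monitored, it prints nothing.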

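MySQL often cannot be restarted automatically after a crash, so it is not monitored by monit. Instead, a custom script run through collectd monitors it and sends an alert when, for example, a connection cannot be established. The script makes a real connection and, if there is a problem, creates a touch file, which is how the alert is detected.
The monitoring script is at the following path.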

 /engineyard/bin/check_mysql.sh
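
The contents of that script are not shown here. As an illustration of the touch-file approach only, the following is a hypothetical sketch: the flag file path, the function names, and the probe command are all assumptions, and clearing the flag on success is a design choice of this sketch rather than documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a touch-file check.
# This is NOT the contents of /engineyard/bin/check_mysql.sh.
FLAG_FILE="${FLAG_FILE:-/tmp/mysql-connect-failed}"   # assumed path

# Assumed probe: open a real connection and run a trivial query.
# Credentials are expected to come from the client defaults.
probe_mysql() {
  mysql -e 'SELECT 1' >/dev/null 2>&1
}

check_connect() {
  if probe_mysql; then
    # Connection works: clear any earlier failure flag.
    rm -f "$FLAG_FILE"
    return 0
  fi
  # Connection failed: leave a flag file for the alerting side to pick up.
  touch "$FLAG_FILE"
  return 1
}
```

Keeping the probe in its own function makes it easy to replace with a different check without touching the flag-file logic.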

The collectd configuration file looks like the following and shows which items are monitored.

$ cat /etc/engineyard/collectd.conf 
#
# Config file for collectd(1).
# Please read collectd.conf(5) for a list of options.
# http://collectd.org/
#
# This file is managed by Chef and will be overwritten on the
# next rebuild.
#
# DO NOT MODIFY
#

FQDNLookup   true
BaseDir     "/var/lib/collectd"
PIDFile     "/var/run/collectd.pid"
PluginDir   "/usr/lib/collectd"
Interval     30

# LOAD THESE PLUGINS
LoadPlugin logfile
LoadPlugin processes
LoadPlugin syslog
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memcached
LoadPlugin memory
LoadPlugin mysql
LoadPlugin rrdtool
LoadPlugin swap
LoadPlugin exec
LoadPlugin filecount
LoadPlugin threshold

# PLUGIN CONFIG
# Ignore (don't monitor) /dev, /dev/shm, /var/log (it's really /mnt/log)
# Report the reserved disk space as being used, instead of free ... 'cause it's not
<Plugin "df">
  ReportReserved true
  FSType "ext3"
</Plugin>

<Plugin "memcached">
  Host "127.0.0.1"
  Port "11211"
</Plugin>

<Plugin "mysql">
  # The role of this machine implies a db should be running,
  # so let's monitor it
  <Database "candycane">
    Host "localhost"
    User "root"
    Password "foovar"
  </Database>
</Plugin>

<Plugin "processes">
  # Watch mysqld process
  Process mysqld

  #Get some more stats about cron
  Process cron
  Process collectd
</Plugin>

#Make sure cron is updating the check file
#The check file is touched by cron every minute
#cron_nanny is used to make sure cron is running
#This is a fall back alert.
<Plugin "filecount">
  <Directory "/tmp">
    Instance "cron-check"
    Name "cron-check"
    MTime 300
  </Directory>
</Plugin>

# RRD configuration
<Plugin "rrdtool">
  DataDir "/var/lib/collectd/rrd"
  CacheTimeout 120
  CacheFlush   900
</Plugin>

# This script get's fired off for Thresholds
# It's written dynamically by chef
<Plugin "exec">
  Exec "mysql" "/engineyard/bin/check_mysql.sh" "connect"
  Exec "mysql" "/engineyard/bin/check_mysql.sh" "connections"
  Exec "deploy" "/engineyard/bin/check_readonly.sh"
  NotificationExec "deploy" "/engineyard/bin/ey-alert.rb"
</Plugin>

# THRESHHOLD CONFIG
# These are the things we alert on

  <Plugin "load">
    <Type "load">
      WarningMin    0.00
      WarningMax    4.00
      FailureMin    0.00
      FailureMax    10.00
      DataSource "shortterm"
    </Type>
  </Plugin>

  <Plugin "filecount-cron-check">
    <Type "files">
      FailureMin 0.00
      DataSource "value"
    </Type>
  </Plugin>

  # let's monitor to make sure it's running.
  # This is kind of a hack, let's see if it works well.
  <Plugin "processes-mysqld">
    <Type "ps_count">
      FailureMin 1.00
      FailureMax 100.00
      DataSource "processes"
    </Type>
  </Plugin>

  #Alert if cron process count > 100 || < 1
  <Plugin "processes-cron">
    <Type "ps_count">
      FailureMin 1.00
      FailureMax 100.00
      DataSource "processes"
    </Type>
  </Plugin>

  #Alert if collectd process count > 10 || < 1
  <Plugin "processes-collectd">
    <Type "ps_count">
      FailureMin 1.00
      FailureMax 10.00
      DataSource "processes"
    </Type>
  </Plugin>

  <Plugin "swap">
    <Type "swap-used">
      WarningMin    0.00
      WarningMax    469760000.0
      FailureMin    0.00
      FailureMax    657664000.0
      DataSource "value"
    </Type>
  </Plugin>

  # Disk space alerts
  # Thresholds are pulled from a library
  <Plugin "df-root">
    <Type "df_complex">
      Instance "free"
      Invert true
      WarningMin    0.00
      WarningMax    3170680832.0
      FailureMin    0.00
      FailureMax    1585340416.0
      DataSource "value"
    </Type>
  </Plugin>

  <Plugin "df-db">
    <Type "df_complex">
      Instance "free"
      Invert true
      WarningMin    0.00
      WarningMax    3170680832.0
      FailureMin    0.00
      FailureMax    1585340416.0
      DataSource "value"
    </Type>
  </Plugin>

  <Plugin "df-data">
    <Type "df_complex">
      Instance "free"
      Invert true
      WarningMin    0.00
      WarningMax    3170680832.0
      FailureMin    0.00
      FailureMax    1585340416.0
      DataSource "value"
    </Type>
  </Plugin>

  <Plugin "df-mnt">
    <Type "df_complex">
      Instance "free"
      Invert true
      WarningMin    0.00
      WarningMax    31518524211.2
      FailureMin    0.00
      FailureMax    15759262105.6
      DataSource "value"
    </Type>
  </Plugin>
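
To read the threshold values above, it may help to see how a single measurement would be classified. The sketch below is a simplified model, assuming collectd's documented threshold semantics: a value outside the Min..Max range is a breach, and with Invert true a value inside the range is a breach instead. classify is a hypothetical helper, not part of collectd.

```shell
#!/usr/bin/env bash
# Sketch: classify one value against Warning/Failure ranges.
# usage: classify VALUE WARN_MIN WARN_MAX FAIL_MIN FAIL_MAX [INVERT]
classify() {
  awk -v v="$1" -v wmin="$2" -v wmax="$3" -v fmin="$4" -v fmax="$5" \
      -v inv="${6:-false}" '
    # A range is breached when the value is outside it,
    # or inside it when the range is inverted.
    function breached(x, lo, hi,   inside) {
      inside = (x + 0 >= lo + 0 && x + 0 <= hi + 0)
      return (inv == "true") ? inside : !inside
    }
    BEGIN {
      if (breached(v, fmin, fmax)) print "FAILURE"
      else if (breached(v, wmin, wmax)) print "WARNING"
      else print "OKAY"
    }'
}
```

With the load settings above, classify 0.58 0 4 0 10 prints OKAY and classify 5.2 0 4 0 10 prints WARNING. With the df-root settings, classify 2000000000 0 3170680832.0 0 1585340416.0 true prints WARNING, because about 2 GB of free space falls inside the inverted warning range but above the failure range.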

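On a solo instance, where the application and the database share one host, memory shortages are more likely, and so are processes terminating abnormally. If you expect a meaningful amount of load, we strongly recommend considering a configuration that separates the application from the database.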
