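Instances built on Engine Yard come with monit and collectd installed, which monitor the state of the system and of important processes.
When monit detects a problem with a process or daemon it supervises, it sends an alert and, where possible, restarts the process. If you log in directly over SSH to restart a service by hand, keep in mind that monit is also watching it.
To check the state of the processes that monit supervises, log in to the target host over SSH and run the following command.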
$ sudo monit status
The Monit daemon 5.3.2 uptime: 8d 4h 10m

Process 'redis'
  status                            Running
  monitoring status                 Monitored
  pid                               30717
  parent pid                        1
  uptime                            295d 1h 11m
  children                          0
  memory kilobytes                  620
  memory kilobytes total            620
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.3%
  cpu percent total                 0.3%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'php-fpm'
  status                            Running
  monitoring status                 Monitored
  pid                               5792
  parent pid                        1
  uptime                            15h 43m
  children                          3
  memory kilobytes                  8104
  memory kilobytes total            148264
  memory percent                    0.4%
  memory percent total              8.7%
  cpu percent                       0.0%
  cpu percent total                 5.3%
  unix socket response time         0.000s to /var/run/engineyard/php-fpm_candycane.sock [DEFAULT]
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'newrelic-daemon'
  status                            Running
  monitoring status                 Monitored
  pid                               31421
  parent pid                        1
  uptime                            8d 4h 11m
  children                          1
  memory kilobytes                  44
  memory kilobytes total            5176
  memory percent                    0.0%
  memory percent total              0.3%
  cpu percent                       0.0%
  cpu percent total                 0.6%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'memcache_11211'
  status                            Running
  monitoring status                 Monitored
  pid                               14959
  parent pid                        1
  uptime                            295d 1h 23m
  children                          0
  memory kilobytes                  188
  memory kilobytes total            188
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'collectd_httpd'
  status                            Running
  monitoring status                 Monitored
  pid                               31739
  parent pid                        1
  uptime                            295d 1h 11m
  children                          1
  memory kilobytes                  4
  memory kilobytes total            80
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 02 Apr 2014 06:33:58

Process 'collectd_fcgi'
  status                            Running
  monitoring status                 Monitored
  pid                               31746
  parent pid                        1
  uptime                            295d 1h 11m
  children                          0
  memory kilobytes                  104
  memory kilobytes total            104
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 02 Apr 2014 06:33:58

System 'system_ip-10-132-70-223.ap-northeast-1.compute.internal'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.58] [0.42] [0.34]
  cpu                               5.0%us 3.9%sy 0.5%wa
  memory usage                      540284 kB [31.8%]
  swap usage                        35100 kB [3.8%]
  data collected                    Wed, 02 Apr 2014 06:33:58
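When many services are supervised, it can help to reduce this output to the entries that need attention. The following is a minimal sketch, not an Engine Yard tool: the function name `monit_not_running` is hypothetical, and it assumes the usual `monit status` layout in which each block starts with a `Process '...'` or `System '...'` line followed by one field per line.

```shell
#!/bin/sh
# Hypothetical helper: read `monit status` text on stdin and print every
# service whose "status" field is anything other than "Running".
monit_not_running() {
  awk -v q="'" '
    # Block header, e.g.  Process <quote>redis<quote>  -> remember the bare name.
    $1 == "Process" || $1 == "System" {
      name = $2
      gsub(q, "", name)
      next
    }
    # Only the line whose first field is exactly "status" is inspected;
    # "monitoring status ..." starts with "monitoring" and is skipped.
    $1 == "status" {
      $1 = ""
      sub(/^ +/, "")
      if ($0 != "Running") print name ": " $0
    }
  '
}
```

On a host it would be used as `sudo monit status | monit_not_running`; no output means every supervised entry reported `Running`.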
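Because MySQL often cannot be restarted automatically after a crash, it is monitored through collectd rather than monit, using a dedicated script that sends an alert when, for example, a connection cannot be established. The script makes a real connection to the database and, if it finds a problem, creates a touch file; the presence of that file is how the alert condition is detected.
The monitoring script is located at the following path.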
/engineyard/bin/check_mysql.sh
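The touch-file idea can be illustrated in a few lines. This is only a sketch of the general pattern under stated assumptions and does not reproduce the contents of check_mysql.sh: the function name `flag_on_failure` and the flag path in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical touch-file health check (NOT the real check_mysql.sh).
# usage: flag_on_failure FLAG_FILE COMMAND [ARGS...]
# Runs COMMAND as a probe. On failure it creates FLAG_FILE so that a
# separate monitor can notice it; on success it removes a stale flag.
flag_on_failure() {
  _flag=$1
  shift
  if "$@" >/dev/null 2>&1; then
    rm -f "$_flag"     # probe succeeded: clear any earlier marker
    return 0
  fi
  touch "$_flag"       # probe failed: leave a marker behind
  return 1
}
```

For a MySQL probe, one possible call is `flag_on_failure /tmp/mysql-down mysqladmin --host=localhost ping`, where `/tmp/mysql-down` is an assumed path chosen only for illustration.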
The collectd configuration file looks like the following and shows which items are monitored.
$ cat /etc/engineyard/collectd.conf
#
# Config file for collectd(1).
# Please read collectd.conf(5) for a list of options.
# http://collectd.org/
#
# This file is managed by Chef and will be overwritten on the
# next rebuild.
#
# DO NOT MODIFY
#
FQDNLookup true
BaseDir "/var/lib/collectd"
PIDFile "/var/run/collectd.pid"
PluginDir "/usr/lib/collectd"
Interval 30

# LOAD THESE PLUGINS
LoadPlugin logfile
LoadPlugin processes
LoadPlugin syslog
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memcached
LoadPlugin memory
LoadPlugin mysql
LoadPlugin rrdtool
LoadPlugin swap
LoadPlugin exec
LoadPlugin filecount
LoadPlugin threshold

# PLUGIN CONFIG
# Ignore (don't monitor) /dev, /dev/shm, /var/log (it's really /mnt/log)
# Report the reserved disk space as being used, instead of free ... 'cause it's not
<Plugin "df">
  ReportReserved true
  FSType "ext3"

Host "127.0.0.1"
Port "11211"

# The role of this machine implies a db should be running,
# so let's monitor it
<Database "candycane">
  Host "localhost"
  User "root"
  Password "foovar"

# Watch mysqld process
Process mysqld
#Get some more stats about cron
Process cron
Process collectd

#Make sure cron is updating the check file
#The check file is touched by cron every minute
#cron_nanny is used to make sure cron is running
#This is a fall back alert.
<Directory "/tmp">
  Instance "cron-check"
  Name "cron-check"
  MTime 300

# RRD configuration
DataDir "/var/lib/collectd/rrd"
CacheTimeout 120
CacheFlush 900

# This script get's fired off for Thresholds
# It's written dynamically by chef
Exec "mysql" "/engineyard/bin/check_mysql.sh" "connect"
Exec "mysql" "/engineyard/bin/check_mysql.sh" "connections"
Exec "deploy" "/engineyard/bin/check_readonly.sh"
NotificationExec "deploy" "/engineyard/bin/ey-alert.rb"

# THRESHHOLD CONFIG
# These are the things we alert on
<Plugin "load">
  <Type "load">
    WarningMin 0.00
    WarningMax 4.00
    FailureMin 0.00
    FailureMax 10.00
    DataSource "shortterm"

<Plugin "filecount-cron-check">
  <Type "files">
    FailureMin 0.00
    DataSource "value"

# let's monitor to make sure it's running.
# This is kind of a hack, let's see if it works well.
<Plugin "processes-mysqld">
  <Type "ps_count">
    FailureMin 1.00
    FailureMax 100.00
    DataSource "processes"

#Alert if cron process count > 100 || < 1
<Plugin "processes-cron">
  <Type "ps_count">
    FailureMin 1.00
    FailureMax 100.00
    DataSource "processes"

#Alert if collectd process count > 10 || < 1
<Plugin "processes-collectd">
  <Type "ps_count">
    FailureMin 1.00
    FailureMax 10.00
    DataSource "processes"

<Plugin "swap">
  <Type "swap-used">
    WarningMin 0.00
    WarningMax 469760000.0
    FailureMin 0.00
    FailureMax 657664000.0
    DataSource "value"

# Disk space alerts
# Thresholds are pulled from a library
<Plugin "df-root">
  <Type "df_complex">
    Instance "free"
    Invert true
    WarningMin 0.00
    WarningMax 3170680832.0
    FailureMin 0.00
    FailureMax 1585340416.0
    DataSource "value"

<Plugin "df-db">
  <Type "df_complex">
    Instance "free"
    Invert true
    WarningMin 0.00
    WarningMax 3170680832.0
    FailureMin 0.00
    FailureMax 1585340416.0
    DataSource "value"

<Plugin "df-data">
  <Type "df_complex">
    Instance "free"
    Invert true
    WarningMin 0.00
    WarningMax 3170680832.0
    FailureMin 0.00
    FailureMax 1585340416.0
    DataSource "value"

<Plugin "df-mnt">
  <Type "df_complex">
    Instance "free"
    Invert true
    WarningMin 0.00
    WarningMax 31518524211.2
    FailureMin 0.00
    FailureMax 15759262105.6
    DataSource "value"
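To read the threshold section, it may help to see how a single value would be classified. The sketch below is an illustration and not part of collectd: the function name `classify_threshold` is hypothetical, and it assumes the behaviour described in the collectd threshold documentation, namely that a value outside the Warning range yields a warning, a value outside the Failure range yields a failure, and `Invert true` flips each range so that values inside it are the problematic ones.

```shell
#!/bin/sh
# Hypothetical illustration of collectd-style threshold evaluation.
# usage: classify_threshold VALUE WARN_MIN WARN_MAX FAIL_MIN FAIL_MAX [true|false]
# The optional last argument corresponds to "Invert" (default: false).
classify_threshold() {
  awk -v v="$1" -v wmin="$2" -v wmax="$3" -v fmin="$4" -v fmax="$5" \
      -v inv="${6:-false}" '
    # True when x lies outside the closed range [lo, hi].
    function outside(x, lo, hi) { return (x < lo || x > hi) }
    BEGIN {
      f = outside(v + 0, fmin + 0, fmax + 0)
      w = outside(v + 0, wmin + 0, wmax + 0)
      # With Invert, values INSIDE the range are the ones that are not okay.
      if (inv == "true") { f = !f; w = !w }
      if (f)      print "FAILURE"
      else if (w) print "WARNING"
      else        print "OKAY"
    }'
}
```

With the `load` values above, `classify_threshold 5 0 4 0 10` reports a warning. With the `df-root` values and `Invert true`, `classify_threshold 2000000000 0 3170680832 0 1585340416 true` also reports a warning, since roughly 2 GB of free space falls inside the warning range but above the failure range.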
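On a solo instance, where the application and the database share a single instance, memory shortages are more likely, and with them abnormal process termination. If you expect a meaningful amount of load, we strongly recommend considering a configuration that separates the application from the database.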