Engine Yard monitors the health of your PostgreSQL database using a combination of our own custom checks and Bucardo’s check_postgres scripts. Collectd or Nagios (depending on your stack and features) consumes the results of these checks and presents them on the Engine Yard dashboard as alerts. The alerts we show you follow this format:
severity -- The exit status of the check, for example OK, WARNING, FAILURE, CRITICAL, or UNKNOWN.
environment-name -- The name of the environment that originated the alert.
originating-process -- The process that generated the alert.
check-name -- The name of the check that ran the process.
additional information -- Extra information reported by the failing check.
Alert(CRITICAL) MyappProduction process-postgresql: POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 16204 seconds ago
This sample alert means that in the MyappProduction environment, the POSTGRES_CHECKPOINT check raised an alert on the PostgreSQL process with a severity of CRITICAL. The associated message says that the database has not had a checkpoint for 16204 seconds, or about 4.5 hours.
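If you want to process alerts like the sample above in your own tooling, the fields can be split apart with a regular expression. The following is a minimal sketch that assumes every alert has the same layout as the single sample shown here. The names `Alert`, `ALERT_RE`, and `parse_alert` are hypothetical and are not part of any Engine Yard tool.

```python
import re
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    """Fields of one dashboard alert (hypothetical container)."""
    severity: str
    environment: str
    process: str
    check: str
    info: str


# Assumed layout, inferred only from the sample alert:
#   Alert(<severity>) <environment-name> <originating-process>: <check-name> <additional information>
ALERT_RE = re.compile(
    r"^Alert\((?P<severity>[A-Z]+)\)\s+"
    r"(?P<environment>\S+)\s+"
    r"(?P<process>[^\s:]+):\s+"
    r"(?P<check>\S+)\s+"
    r"(?P<info>.*)$"
)


def parse_alert(line: str) -> Optional[Alert]:
    """Return the parsed alert, or None if the line does not match the assumed layout."""
    match = ALERT_RE.match(line.strip())
    return Alert(**match.groupdict()) if match else None


sample = (
    "Alert(CRITICAL) MyappProduction process-postgresql: "
    "POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 16204 seconds ago"
)
alert = parse_alert(sample)
```

With the sample line, `alert.environment` is `MyappProduction` and `alert.check` is `POSTGRES_CHECKPOINT`. Lines that do not fit the assumed layout return `None` instead of raising an error.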
We specify the severity of the monitoring checks based on the thresholds defined when your database was created. This section discusses the most important checks for PostgreSQL and their meanings.
The Connections check
The connections check verifies that the database process is functioning and connections can be established to it.
When do we warn you? We test the connection to your database every 60 seconds and warn you when a connection attempt fails.
What to do if you see this check? Contact Engine Yard Support immediately because your site may be down.
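For a rough idea of what a reachability test involves, the sketch below tries to open a TCP connection to the database port. It is not the check Engine Yard runs, and it is weaker than a real database connection because it does not authenticate or run a query. The function name `can_connect`, the default port 5432, and the timeout value are assumptions chosen for illustration.

```python
import socket


def can_connect(host: str, port: int = 5432, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduler that calls `can_connect` once every 60 seconds and raises an alert on `False` would mirror the cadence described above.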
The Checkpoint check
This check determines how long it has been since the last checkpoint ran. A checkpoint is a point in the transaction log sequence at which all data files have been updated to reflect the information in the log and flushed to disk. If your system crashes, recovery will start from the last known checkpoint. The checkpoint check helps us confirm two things:
Your database consistently moves forward the position from which recovery starts.
In the case of replicas, your standby is keeping up with its master (because the activity the replica sees is what the master has sent it).
When do we warn you? We issue a WARNING severity when checkpoint delays range from 20 to 30 minutes. For delays that exceed 30 minutes, the severity of the alert is CRITICAL.
What to do if you see this check? Contact Engine Yard Support if you see a severity of CRITICAL or FAILURE.
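The thresholds above can be expressed as a small classification function. This is only a sketch: the function name `checkpoint_severity` is hypothetical, the 20 and 30 minute defaults are taken from the text, and the thresholds actually applied to your database may differ because they are defined when the database is created.

```python
def checkpoint_severity(
    seconds_since_checkpoint: float,
    warning_minutes: float = 20,
    critical_minutes: float = 30,
) -> str:
    """Map the time since the last checkpoint to an alert severity."""
    # Delays that exceed the critical threshold are CRITICAL.
    if seconds_since_checkpoint > critical_minutes * 60:
        return "CRITICAL"
    # Delays from the warning threshold up to the critical threshold are WARNING.
    if seconds_since_checkpoint >= warning_minutes * 60:
        return "WARNING"
    return "OK"
```

For the sample alert earlier in this section, 16204 seconds is well beyond 30 minutes (1800 seconds), so the function returns `CRITICAL`.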
The Snaplock check
The snaplock check alerts us to inconsistent snapshots. Before taking a database snapshot, we attempt to lock the database to prevent writes and ensure that the snapshot is consistent.
When do we warn you? We will warn you when we have failed to obtain a lock before a database snapshot.
What to do if you see this check? Contact Engine Yard Support if the source of your snapshots shows this alert. For example, if you have moved your snapshots to the replica and we cannot lock it before a snapshot, you may have no snapshots that are consistent and usable for recovery.