Intermittent "Error 503 Service Temporarily Unavailable" errors following deployment

Overview

After changing configuration (e.g. a configuration variable) and redeploying an otherwise healthy application, a small percentage of requests may intermittently fail with "503 Service Temporarily Unavailable".

Solution

Root cause

By design, there is a race condition in the way Kubernetes de-provisions pods.

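In a nutshell, when a Pod is terminated, Kubernetes removes it from the Service endpoints and signals the kubelet to stop its containers at the same time. If the application shuts down before the endpoint removal has propagated, requests that are still routed to the Pod fail with a 503. See "Graceful shutdown and zero downtime deployments in Kubernetes" for more information.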

[Figure: Graceful shutdown in Kubernetes]

Solution Steps

One of the ways to address this in a raw Kubernetes implementation is to use a PreStop hook.
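
For reference, a PreStop hook in raw Kubernetes might look like the sketch below, which writes an example manifest to a temporary file. The Pod name web and the image example/web:latest are placeholders, the 30-second sleep and 60-second grace period simply mirror the values used later in this article, and the hook assumes the image ships a sleep binary.

```shell
#!/bin/bash
# Sketch only: "web" and "example/web:latest" are hypothetical placeholders.
manifest="$(mktemp)"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: web
      image: example/web:latest
      lifecycle:
        preStop:
          exec:
            # Keep serving while the endpoint removal propagates;
            # SIGTERM is sent to the container only after this hook returns.
            command: ["sleep", "30"]
EOF

echo "Wrote example manifest to $manifest"
```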

However, given the abstraction layer introduced by the platform, this is not possible in EYK; the same level of control can nevertheless be achieved by using traps. In a nutshell, a trap intercepts a signal and runs a series of commands in response, which gives you complete control over the lifecycle of the process.

To achieve that we need to:

  1. Use dumb-init so that it always runs as PID 1 and forwards signals to its children
  2. Create an entrypoint.sh that uses trap to run a series of actions when SIGTERM arrives, and that starts appcontrol.sh
  3. Create an appcontrol.sh that uses dumb-init to rewrite 15:0, i.e. to drop SIGTERM (signal 15) instead of forwarding it to the application, and that includes the application run command

The above solution requires the following:

  1. dumb-init package
  2. trap (already available)
  3. entrypoint.sh bash script
  4. appcontrol.sh bash script
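
To illustrate what ignoring SIGTERM means in practice, here is a small sketch that runs without dumb-init. It uses trap '' TERM as a stand-in for the 15:0 rewrite (an assumption made for the demo, not how dumb-init is implemented): in both cases the wrapped command never acts on the SIGTERM. The sleep commands and durations are arbitrary placeholders for a real application.

```shell
#!/bin/bash
# Process with the default SIGTERM disposition: it is terminated by the signal.
sleep 5 &
plain_pid=$!

# Process that ignores SIGTERM (stand-in for `dumb-init --rewrite 15:0`).
# An ignored signal stays ignored across exec, so `sleep` inherits it.
bash -c 'trap "" TERM; exec sleep 2' &
shielded_pid=$!

sleep 1                                   # let both processes start
kill -TERM "$plain_pid" "$shielded_pid"   # same signal to both

wait "$plain_pid"
plain_status=$?                           # 128 + 15 when killed by SIGTERM
wait "$shielded_pid"
shielded_status=$?                        # 0 when sleep ran to completion

echo "plain=$plain_status shielded=$shielded_status"
```

The shielded process only stops when something else ends it, which is why the entrypoint's trap has to stop the application explicitly.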

Below are the specific details of the solution steps:

  1. Dockerfile

    Include dumb-init package

    RUN apt-get update && apt-get install -y dumb-init

    Although our implementation relies on a Procfile to pass the process instructions, you may also set the entrypoint in the Dockerfile:

    ENTRYPOINT ["/usr/bin/dumb-init","--"]

    Note that no signal rewriting is configured at this point; dumb-init simply forwards signals as they arrive.

  2. Procfile

    Here we invoke dumb-init and have it run the entrypoint.sh script:

    web: /usr/bin/dumb-init -- ./script/entrypoint.sh

    script/entrypoint.sh

    Here we are adding the trap followed by the appcontrol.sh script:

    #!/bin/bash
    
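    # On SIGTERM: keep serving for 30 seconds, then stop Puma gracefully.
    trap "echo SIGTERM received - sleeping 30 seconds; sleep 30; echo Slept 30 Seconds - stopping Puma; pkill -TERM -f '^([^ ]*/)?puma '; exit 0" TERM
    # Run the app in the background and wait on it: bash defers a trap until the foreground command finishes.
    ./script/appcontrol.sh &
    wait "$!"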

    With the trap above, once the pod receives SIGTERM (the default signal that Docker and Kubernetes send during scale-down and rolling updates), it runs the following commands in order:

    • echo SIGTERM received - sleeping 30 seconds

    An informational message that we have received SIGTERM

    • sleep 30

    This delays the next command by 30 seconds, so the application keeps serving requests while the endpoint removal propagates

    • echo Slept 30 Seconds - stopping Puma

    An informational message that we are about to stop Puma

    • pkill -TERM -f '^([^ ]*/)?puma '

    This is the command that actually stops Puma gracefully, by sending SIGTERM to the process whose command line matches the pattern

    • exit 0

    This exits the trap, and with it the entrypoint script, with a success status.

  3. script/appcontrol.sh

    Here we are using dumb-init to ignore SIGTERM (15:0) and start Puma the usual way:

    /usr/bin/dumb-init --rewrite 15:0 -- bundle exec puma -p 3000

    Finally, commit and push the changes (git add, git commit, git push).
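
Before deploying, the drain-then-stop behaviour of the trap can be tried out locally. The following is a scaled-down sketch rather than the production script: it assumes a plain sleep process as a stand-in for Puma, waits 1 second instead of 30, and uses a temporary directory and file names invented for the demo.

```shell
#!/bin/bash
workdir="$(mktemp -d)"

# Miniature entrypoint with the same structure as the trap-based script:
# start the app, trap SIGTERM, drain, then stop the app and exit 0.
cat > "$workdir/entrypoint.sh" <<'EOF'
#!/bin/bash
sleep 60 &      # stand-in for the application (Puma in this article)
app_pid=$!
trap 'echo "SIGTERM received - draining"; sleep 1; echo "Drained - stopping app"; kill -TERM "$app_pid"; exit 0' TERM
wait "$app_pid" # waiting on a background job lets the trap run immediately
EOF

bash "$workdir/entrypoint.sh" > "$workdir/log" 2>&1 &
entry_pid=$!

sleep 1                      # give the entrypoint time to install its trap
kill -TERM "$entry_pid"      # simulates the SIGTERM sent at pod termination
wait "$entry_pid"
entry_status=$?              # exit status of the entrypoint script

cat "$workdir/log"
```

If the two log messages appear about a second apart and the entrypoint exits with status 0, the trap is draining before it stops the application.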

Note:

If you deploy the image with eyk pull, make sure the YAML string that supplies the Procfile to the application matches the Procfile above. In addition, because this solution adds a 30-second delay to the pod's termination, it is advisable to raise the termination grace period so that the pod is not killed before Puma has finished shutting down:

eyk config:set KUBERNETES_POD_TERMINATION_GRACE_PERIOD_SECONDS=60 -a appname
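
The grace period has to cover both the drain delay and the time the application needs to shut down, because the container is killed once the period expires. A minimal sketch of that arithmetic follows; the helper name fits_grace_period and the 20-second shutdown budget are hypothetical and should be replaced with values measured for your application.

```shell
#!/bin/bash
# Succeeds when drain delay + shutdown time fits within the grace period.
# All arguments are in seconds.
fits_grace_period() {
  local drain_seconds=$1 shutdown_seconds=$2 grace_seconds=$3
  [ $(( drain_seconds + shutdown_seconds )) -le "$grace_seconds" ]
}

# 30 s drain (from the trap) + an assumed 20 s for Puma to finish requests.
if fits_grace_period 30 20 60; then
  echo "60 seconds is enough"
fi

# With a 30-second grace period the drain delay alone uses up the whole budget.
if ! fits_grace_period 30 20 30; then
  echo "30 seconds is too short"
fi
```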
