Customers may experience a small percentage of intermittent "503 Service Temporarily Unavailable" errors after making configuration changes (e.g. changing a configuration variable) and re-deploying an otherwise healthy application that was running fine before.
By design, there is a race condition in the way Kubernetes de-provisions pods.
In short, when a Pod is terminated, Kubernetes removes its endpoint and sends the termination signal through the kubelet at the same time, so traffic can still be routed to a Pod that is already shutting down. See "Graceful shutdown and zero downtime deployments in Kubernetes" for more information.
One of the ways to address this in a raw Kubernetes implementation is to use a PreStop hook.
However, given the abstraction layer introduced by the platform, this is not yet possible in EYK. The same level of control can be achieved by using traps: a trap can step in and run a series of commands that provide complete control over the lifecycle of the process.
To achieve that we need to:
- Use dumb-init to make sure that it always takes PID 1 and passes the signals to its children
- Create an entrypoint.sh that uses trap to take a series of actions before it runs appcontrol.sh
- Create an appcontrol.sh that uses dumb-init to rewrite 15:0, i.e. map TERM to signal 0 so that dumb-init drops the signal altogether, and that includes the application run command
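The trap mechanism in the steps above can be tried locally with a small stand-alone sketch. It does not use dumb-init or the EYK scripts: `sleep 300` stands in for the application, `DRAIN_SECONDS` is a hypothetical variable (the article uses a fixed 30 seconds), and the outer function plays the role of the platform sending SIGTERM.

```shell
#!/bin/bash
# Local demo only; names below are illustrative assumptions.
DRAIN_SECONDS="${DRAIN_SECONDS:-1}"

# Plays the role of entrypoint.sh: installs the trap, then runs the app.
wrapper() {
  # On TERM: announce, wait, stop the application, then exit cleanly.
  trap 'echo "TERM received, waiting"
        sleep "$DRAIN_SECONDS"
        echo "stopping application"
        kill -TERM "$app_pid"
        exit 0' TERM

  sleep 300 &            # stand-in for the application process
  app_pid=$!

  # "wait" returns as soon as a trapped signal arrives, so the handler
  # runs while the application is still alive.
  wait "$app_pid"
}

# Plays the role of the platform: starts the wrapper and sends it SIGTERM.
trap_demo() {
  wrapper &
  wrapper_pid=$!
  sleep 1                # give the wrapper time to install its trap
  kill -TERM "$wrapper_pid"
  wait "$wrapper_pid"    # returns the wrapper's exit status
}

trap_demo
```

The application keeps running during the delay and is only stopped by the handler, which is the ordering the entrypoint.sh script relies on.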
The above solution requires changes to the Dockerfile and the Procfile, plus two new scripts (entrypoint.sh and appcontrol.sh).
Below are the specific details of each step:
In the Dockerfile, include the dumb-init package:
RUN apt-get update && apt-get install -y dumb-init
Although in our implementation we rely on a Procfile to pass the process instructions, you may also add it to the Dockerfile:
From the above, you can see that we are not changing the mapping here just yet.
In the Procfile, we include both dumb-init and the entrypoint.sh script:
web: /usr/bin/dumb-init -- ./script/entrypoint.sh
In entrypoint.sh, we add the trap followed by the appcontrol.sh script:
#!/bin/bash
trap "echo SIGTERM received - sleeping 30 seconds; sleep 30; echo Slept 30 Seconds - stopping Puma; pkill -TERM -f '^([^ ]*/)?puma '; exit 0" TERM
./script/appcontrol.sh
As per the trap options above, once the pod receives the TERM signal (i.e. SIGTERM, the default for the Docker/Kubernetes scale-down and rolling-update processes), it runs the following commands in order:
echo SIGTERM received - sleeping 30 seconds
An informational message that we have received SIGTERM
sleep 30
This holds off the next command for 30 seconds
echo Slept 30 Seconds - stopping Puma
An informational message that we are about to stop Puma
pkill -TERM -f '^([^ ]*/)?puma '
This is the command that actually stops Puma gracefully
exit 0
This is where we exit the trap.
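The pkill pattern is anchored so that it targets Puma itself and not the wrapper processes whose command lines merely contain the word "puma". A rough way to check this locally is to run the same extended regular expression through `grep -E`, since `pkill -f` matches against the full command line. The sample command lines below are illustrative assumptions (including the Puma process title format), not output captured from a real pod.

```shell
#!/bin/bash
# Sketch: test the article's pkill pattern against sample command lines.
PATTERN='^([^ ]*/)?puma '

# Returns 0 when the given command line would match the pattern.
matches_pattern() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for cmdline in \
  'puma 5.6.4 (tcp://0.0.0.0:3000) [app]' \
  '/usr/local/bin/puma -p 3000' \
  '/usr/bin/dumb-init --rewrite 15:0 -- bundle exec puma -p 3000'
do
  if matches_pattern "$cmdline"; then
    echo "match:    $cmdline"
  else
    echo "no match: $cmdline"
  fi
done
```

Under these assumptions the first two lines match and the dumb-init line does not, so the TERM sent by pkill reaches Puma directly instead of being dropped by the 15:0 rewrite.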
In appcontrol.sh, we use dumb-init to ignore SIGTERM (15:0) and start Puma the usual way:
/usr/bin/dumb-init --rewrite 15:0 -- bundle exec puma -p 3000
Finally, we are good to git add/commit/push.
If you use eyk pull to deploy the image, ensure that you have updated the YAML string used to supply a Procfile to the application so that it matches the Procfile above. In addition, since we are introducing a 30-second delay into the pod's termination process, it is advisable to also set a longer termination grace period by running this command:
eyk config:set KUBERNETES_POD_TERMINATION_GRACE_PERIOD_SECONDS=60 -a appname
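The reason for the longer grace period is simple arithmetic: Kubernetes force-kills the container once the grace period expires, and Kubernetes' own default is 30 seconds, which the 30-second sleep alone would use up. The sketch below checks that a chosen grace period covers the delay plus an allowance for Puma to finish. The `grace_period_ok` helper and the 20-second shutdown budget are hypothetical values for illustration, not something EYK provides.

```shell
#!/bin/bash
# Sketch: sanity-check that the grace period leaves room for the drain delay.
grace_period_ok() {
  drain_delay=$1       # seconds slept in the trap (30 in this article)
  shutdown_budget=$2   # assumed seconds Puma needs to stop gracefully
  grace_period=$3      # value of KUBERNETES_POD_TERMINATION_GRACE_PERIOD_SECONDS
  [ $((drain_delay + shutdown_budget)) -le "$grace_period" ]
}

if grace_period_ok 30 20 60; then
  echo "grace period covers the delay plus the shutdown budget"
else
  echo "grace period too short"
fi
```

If the sleep in entrypoint.sh is ever changed, the grace period should be revisited with the same check.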