An in-depth introduction to Canary Deployments
Canary Deployment is a deployment pattern that rolls out changes to a limited set of users before rolling them out to 100% of the traffic.
We compare the vitals of the old setup and the canary servers side-by-side to ensure everything is as expected. If all is okay, we incrementally roll out to a wider audience. If not, we immediately roll back the changes from the canaries.
Canary Deployment thus acts as an early warning indicator that prevents a potential outage.
Why is it called a canary deployment?
In the early 1900s, coal miners used to carry caged canaries into the mines. If the gases in the mine were toxic, the canaries would die first, alerting the miners to evacuate immediately and thus saving their lives.
In canary deployment, the canary servers are the caged canaries that alert us when anything goes wrong.
Implementing canary deployment
Canary deployments are implemented through a setup where a few servers serve the newer version while the rest serve the old version.
A router (load balancer / API gateway) is placed in front of the setup and it routes some traffic to the new fleet while the other requests continue to go to the old one.
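To make the routing concrete, here is a minimal sketch of how a router could split traffic deterministically. The function names and the fleet names (`canary-fleet`, `stable-fleet`) are hypothetical; real load balancers and API gateways expose this as weighted routing configuration rather than application code. Hashing the request's user/session ID (instead of picking randomly per request) keeps each user pinned to one version.

```python
import hashlib

def is_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically bucket a request into 0..99; the same ID
    always lands in the same bucket, so a user sticks to one version."""
    digest = hashlib.md5(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < canary_percent

def route(request_id: str, canary_percent: float) -> str:
    """Send roughly canary_percent% of traffic to the canary fleet."""
    if is_canary(request_id, canary_percent):
        return "canary-fleet"
    return "stable-fleet"
```

With `canary_percent = 5`, about 5% of users consistently hit the new fleet while everyone else continues to hit the old one.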
Pros of Canary Deployment
we test our changes on real traffic
rollbacks are much faster
if something's wrong, only a fraction of users are affected
zero downtime deployments
we can gradually roll out the changes to users
we can power A/B Testing
Cons of Canary Deployment
engineers will get habituated to testing things in production
a slightly more complex setup
a parallel monitoring setup is required to compare vitals side-by-side
How to select users/servers for a canary deployment?
The selection is use-case specific, but the common strategies are:
geographical selection to power regional roll-out
specific user cohorts, e.g., beta users
random selection
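The three strategies above can be combined into one selection check. This is a minimal sketch under assumed names: the `BETA_USERS` cohort and the `CANARY_REGIONS` set are hypothetical placeholders for whatever cohort and region data your system actually keeps.

```python
import random

BETA_USERS = {"alice", "bob"}       # hypothetical beta cohort
CANARY_REGIONS = {"ap-south-1"}     # hypothetical regional rollout

def in_canary(user_id: str, region: str, percent: float) -> bool:
    """Decide if this user should be served by the canary fleet."""
    if user_id in BETA_USERS:        # cohort-based selection
        return True
    if region in CANARY_REGIONS:     # geographical selection
        return True
    return random.random() * 100 < percent  # random selection
```

In practice you would make the random branch sticky (e.g., by hashing the user ID) so a user does not flip between versions across requests.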
When we absolutely need Canary Deployments
Say you own the Auth service, written in Java, and you chose to rewrite it in Golang. When taking it to production, you would NOT want to do a direct 100% roll-out, given that the new codebase might have a lot of bugs.
This is where a canary is super-helpful: we have a fraction of servers serving requests from the Golang service while the others continue to run the existing Java setup. We forward 5% of the traffic to the new servers and observe how they behave.
Once we have enough confidence in the newer setup, we increase the roll-out fraction to 15%, 50%, 75%, and eventually 100%. Canary setup thus gives us a seamless transition from our old server to a newer one.
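The staged roll-out above can be sketched as a simple loop over rollout stages, advancing only while the canary's vitals stay close to the stable fleet's baseline. The function names, the error-rate metric, and the tolerance value are illustrative assumptions, not a prescribed implementation.

```python
ROLLOUT_STAGES = [5, 15, 50, 75, 100]  # percent of traffic on canary

def next_stage(current: int) -> int:
    """Advance to the next rollout percentage; stay put at 100%."""
    for stage in ROLLOUT_STAGES:
        if stage > current:
            return stage
    return current

def vitals_healthy(canary_error_rate: float, stable_error_rate: float,
                   tolerance: float = 0.01) -> bool:
    """Compare canary vitals against the stable fleet's baseline.
    Healthy means the canary is no worse than stable plus a tolerance."""
    return canary_error_rate <= stable_error_rate + tolerance
```

A deployment controller would call `vitals_healthy` after each stage: if it returns True, bump the routing weight to `next_stage(...)`; if False, drop the canary weight back to 0, which is the instant rollback.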
Here's the video of me explaining this in-depth 👇 do check it out
Deployments are stressful; what if something goes wrong? What if you forgot to handle an edge case that was also missed during unit tests, integration tests, or an internal QA iteration?
Putting such code into production can take down your entire infrastructure and cause a massive outage. In order to handle such a situation gracefully and get an early warning that something's wrong, we have Canary Deployment.
In this video, we take an in-depth look at canary deployments: we learn why canary deployments are called canary deployments, understand how they are actually implemented, talk about the pros and cons of this deployment pattern, and conclude with one really solid use case where you absolutely need them.
Outline:
00:00 Agenda
03:05 Introduction to Canary Deployment
06:06 Why Canary Deployments are called Canary Deployments?
08:04 How to implement Canary Deployments?
10:03 Pros of having Canary Deployments
16:21 How to pick servers and users for a rollout?
19:08 Cons of having Canary Deployments
21:25 When we absolutely need Canary Deployment
You can also
Subscribe to the YT Channel Asli Engineering
Listen to this on the go on Spotify
Thank you so much for reading 🖖 If you found this helpful, do spread the word about it on social media; it would mean the world to me.
You can also follow me on your favourite social media: LinkedIn and Twitter.
Yours truly,
Arpit
arpitbhayani.me
Until next time, stay awesome :)