Dissecting GitHub Outage - When master failover failed
A master failover failed at GitHub, leading to a 5-hour-long incident. Let's see what happened.
Incident Summary
For five hours, GitHub users observed delays in data being visible on the interface and API after it was written to the database. This happened during planned maintenance, when they were switching the master DB.
Planned Maintenance
Planned maintenance is a popular way for companies to take a small, scheduled downtime and execute maintenance activities. Some activities for which we plan database maintenance are
applying security patches
applying version upgrades
tuning parameters
replacing hardware
performing periodic reboots
A popular activity during database maintenance is switching the master node, i.e., shifting the traffic from the current master to a new node so that the old instance can be patched.
For a very short duration, while the config switch is happening, the database is unavailable, leading to a small outage; this is expected behavior.
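To make this concrete, here is a minimal Python sketch of the sequence an orchestrator could follow during such a master switch. The helper functions, hostnames, and binlog coordinates are hypothetical stand-ins, not GitHub's actual tooling; the order of the steps is the part that matters.

```python
# A hypothetical sketch of a planned master switchover; the helpers stand in
# for real orchestration (proxy config, service discovery, MySQL admin commands).

def set_read_only(host: str, enabled: bool) -> None:
    print(f"SET GLOBAL read_only = {int(enabled)} on {host}")

def wait_for_catch_up(replica: str, source: str) -> None:
    print(f"waiting for {replica} to apply every event from {source}")

def get_binlog_coordinates(host: str) -> tuple[str, int]:
    # In MySQL, SHOW MASTER STATUS reports the current binlog file and position.
    return ("binlog.000123", 4567)  # made-up values for illustration

def repoint_traffic(writer: str) -> None:
    print(f"pointing application writes at {writer}")

def switch_master(old_master: str, new_master: str) -> tuple[str, int]:
    set_read_only(old_master, enabled=True)           # 1. stop new writes on the old master
    wait_for_catch_up(new_master, source=old_master)  # 2. let the new master fully catch up
    coords = get_binlog_coordinates(new_master)       # 3. note coordinates for later remediation
    repoint_traffic(writer=new_master)                # 4. point application writes at the new master
    return coords

switch_master("db-old.internal", "db-new.internal")
```

The brief window between making the old master read-only (step 1) and repointing traffic (step 4) is the small, expected outage mentioned above.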
Database Crash
During the failover, when the traffic was moved to the new database, the mysqld process crashed, and incoming writes started failing. To quickly mitigate the issue, the team moved the traffic back to the old database. This resolved the issue and the site was back up and running.
Something interesting happened
Before crashing, the new database served write traffic for 6 seconds. So, when the traffic was redirected back to the old database after the crash, the old database did not have the data written during that 6-second window.
This is a huge concern, as it would lead to bad UX and, in the worst case, consistency failures. So, how do we remediate this issue?
Remediating master failovers
To remediate this, we take the help of the write-ahead log, or commit log, of the database. Whenever we do a failover, we always keep track of the BINLOG coordinates.
Once the traffic is moved back to the old database, all we have to do is iterate through the new database's BINLOG and apply, on the old database, all the changes that happened after the noted coordinates.
This re-creates, on the old database, the exact data that was written to the new database, leading to zero data loss and no consistency breach.
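As a rough illustration of the replay, here is a small Python sketch that pipes the new master's binlog, from the noted coordinates onward, into the old master using MySQL's standard mysqlbinlog and mysql clients. The hostnames, binlog file name, and position are made up, and this shows the generic MySQL technique rather than GitHub's exact runbook.

```python
import subprocess

# Coordinates noted at the moment of the failover (made-up values).
BINLOG_FILE = "binlog.000123"
START_POS = 4567

# 1. Extract every event the new master wrote after the noted position.
events = subprocess.run(
    [
        "mysqlbinlog",
        "--read-from-remote-server",
        "--host=db-new.internal",
        f"--start-position={START_POS}",
        BINLOG_FILE,
    ],
    check=True,
    capture_output=True,
).stdout

# 2. Apply those events to the old master we failed back to, so the writes
#    from the 6-second window are not lost.
subprocess.run(["mysql", "--host=db-old.internal"], input=events, check=True)
```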
Cleaning up the mess
Typically, after such a failover, it is better to rebuild the read replicas, and hence the GitHub team rotated all of them. Creating a read replica takes time, given the scale of GitHub.
It took them 4 hours to set up the replicas and 1 hour to re-configure the cluster; hence, the incident affected users for over 5 hours.
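For a sense of where those hours go, here is a simplified, hypothetical sketch of rotating a single replica. The backup command, hostnames, and coordinates are made up; the replication statements are standard MySQL (8.0.23+), and restoring the full dataset is typically the slow part at this scale.

```python
# A simplified, hypothetical sketch of rotating one read replica after the
# incident: restore a full data snapshot, point replication at the master,
# and wait for the lag to drain before serving reads again.

RESTORE_CMD = "xtrabackup --copy-back --target-dir=/backups/latest"  # hypothetical backup tooling

CONFIGURE_REPLICATION_SQL = """
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'db-old.internal',
    SOURCE_LOG_FILE = 'binlog.000123',  -- made-up coordinates from the master
    SOURCE_LOG_POS = 4567;
START REPLICA;
"""

def rotate_replica(host: str) -> None:
    print(f"[{host}] 1. restore full data snapshot: {RESTORE_CMD}")
    print(f"[{host}] 2. configure and start replication: {CONFIGURE_REPLICATION_SQL.strip()}")
    print(f"[{host}] 3. wait for replication lag ~0, then put the replica back in the read pool")

for replica in ("db-replica-01.internal", "db-replica-02.internal"):
    rotate_replica(replica)
```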
Here's the video of me explaining this in-depth 👇 do check it out
Companies announce their planned maintenance; what happens during it? Could something go wrong while running maintenance?
The GitHub team was switching their master database from one node to another; while doing this, something went wrong and the new database crashed. This led to data divergence and a production incident that lasted over 5 hours.
In this video, we dissect this incident and understand what happens during planned maintenance, what went wrong at GitHub, how GitHub mitigated it, and some really cool things about switching databases and resolving data divergence.
Outline:
00:00 Agenda
02:42 What happened?
03:29 Scaling reads with Read Replicas
04:40 Planned Database Maintenance
10:08 Database crashed and quick mitigation
11:44 Data Divergence between two masters
13:54 Remediating Data Divergence
18:23 Read Replica taking time to spin up
You can also
Subscribe to the YT Channel Asli Engineering
Listen to this on the go on Spotify
Thank you so much for reading 🖖 If you found this helpful, do spread the word about it on social media; it would mean the world to me.
You can also follow me on your favourite social media: LinkedIn and Twitter.
Yours truly,
Arpit
arpitbhayani.me
Until next time, stay awesome :)