Dissecting GitHub Outage: Downtime due to an Edge Case
An edge case took down GitHub 🤯
GitHub experienced an outage where their MySQL database went into a degraded state. The investigation found that the outage was caused by an edge case. So, how can an edge case take down a database?
What happened?
The outage happened because of an edge case that led to the generation of an inefficient SQL query, which was executed very frequently on the database. The database was thus put under a massive load, which eventually made it crash, leading to an outage.
Could retry have helped?
Automatic retries usually help in recovering from a transient issue, but during this outage they made things worse: each retry added more load to a database that was already under stress.
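One common way to keep retries from piling onto a stressed database is to cap the number of attempts and back off between them. The sketch below illustrates that general idea only; it is not what GitHub runs. The names `retryWithBackoff`, `errBusy`, and `noSleep` are hypothetical, and the sleep function is injectable purely so the example can be exercised without waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBusy is a hypothetical error standing in for "the database is overloaded".
var errBusy = errors.New("database busy")

// noSleep is a stand-in sleeper that returns immediately (handy for demos and tests).
var noSleep = func(time.Duration) {}

// retryWithBackoff calls op at most maxAttempts times and doubles the wait
// between attempts. It returns how many attempts were made and the last error.
func retryWithBackoff(op func() error, maxAttempts int, base time.Duration, sleep func(time.Duration)) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			sleep(delay) // wait before the next attempt
			delay *= 2   // exponential backoff
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	alwaysFails := func() error {
		calls++
		return errBusy
	}
	attempts, err := retryWithBackoff(alwaysFails, 3, 10*time.Millisecond, time.Sleep)
	fmt.Println("attempts:", attempts, "calls:", calls, "err:", err)
}
```

With a cap of 3, a failing query hits the database at most three times per request, instead of being retried indefinitely.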
Fictional Example
Now, we take a look at a fictional example where an edge case could potentially take down a DB.
Say we have an API that returns the number of commits made by a user in the last n days. One way to implement this API is to accept start_time as an integer through a query parameter; the API server could then fire a SQL query like
SELECT count(id) FROM commits WHERE user_id = 123 AND start_time > :start_time
To fire the query, we convert the string start_time to an integer, build the query, and execute it. In the regular case, we get valid input, compute the number of commits, and respond.
But as an edge case, what if the query parameter is missing or is not an integer? Depending on the language at hand, we may end up using the default integer value, like 0, as our start_time.
This is very likely to happen in Golang, where the zero value of an integer is 0 and a failed conversion whose error is ignored leaves start_time at 0. In such a case, the query that gets executed would be
SELECT count(id) FROM commits WHERE user_id = 123 AND start_time > 0
When executed, the above query iterates through every row of the table for that user instead of only the rows for the last few days (say, 7), making it extremely inefficient and expensive. It puts a huge load on the database, and frequent invocations can take down the entire database.
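The pitfall is easy to reproduce. The sketch below is purely illustrative: `naiveStartTime` and `buildQuery` are hypothetical helpers, and the query is assembled with string formatting only to make the resulting SQL visible. `strconv.Atoi` returns 0 together with an error for missing or non-integer input, so ignoring that error silently turns bad input into 0.

```go
package main

import (
	"fmt"
	"strconv"
)

// naiveStartTime mimics a handler that ignores the parse error.
// On failure strconv.Atoi returns 0 plus an error, so bad input becomes 0.
func naiveStartTime(raw string) int {
	n, _ := strconv.Atoi(raw) // error deliberately ignored to show the pitfall
	return n
}

// buildQuery renders the SQL that would be sent to the database.
// String formatting is used here only for illustration.
func buildQuery(userID, startTime int) string {
	return fmt.Sprintf(
		"SELECT count(id) FROM commits WHERE user_id = %d AND start_time > %d",
		userID, startTime)
}

func main() {
	// A valid value, a missing parameter, and a non-integer value.
	for _, raw := range []string{"1628000000", "", "abc"} {
		fmt.Printf("%q -> %s\n", raw, buildQuery(123, naiveStartTime(raw)))
	}
}
```

Both the empty and the non-integer input end up as `start_time > 0`, which is exactly the full-history scan described above.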
Ways to avoid such situations
Always sanitize the input before executing the query
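Put guard rails that prevent you from iterating over the entire table. For example, a LIMIT 1000 on the rows being scanned would have capped the work at 1000 rows in the worst case. For an aggregate such as count, the LIMIT has to sit in a subquery that selects the rows, because a LIMIT on the outer query only limits the single result row.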
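Both suggestions can be sketched together in Go. Everything here is an assumption made for illustration: that start_time is a Unix timestamp in seconds, the 90-day `maxLookbackSeconds` policy, the `rowCap` of 1000, and the helper names `parseStartTime` and `commitCountQuery`. The query uses `?` placeholders, as accepted by common MySQL drivers, so the value is never spliced into the SQL string.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"time"
)

// maxLookbackSeconds is an assumed policy: never look back more than 90 days.
const maxLookbackSeconds int64 = 90 * 24 * 60 * 60

// rowCap is the guard rail: at most this many rows are counted.
const rowCap = 1000

var errBadStartTime = errors.New("start_time must be a positive integer")

// parseStartTime rejects missing or non-integer input instead of falling
// back to 0, and clamps very old timestamps to nowUnix - maxLookbackSeconds.
func parseStartTime(raw string, nowUnix int64) (int64, error) {
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || n <= 0 {
		return 0, errBadStartTime
	}
	if oldest := nowUnix - maxLookbackSeconds; n < oldest {
		n = oldest
	}
	return n, nil
}

// commitCountQuery returns a parameterized query and its arguments.
// The LIMIT sits in the subquery so that it bounds the rows being counted.
func commitCountQuery(userID, startTime int64) (string, []interface{}) {
	q := fmt.Sprintf(
		"SELECT count(*) FROM (SELECT id FROM commits WHERE user_id = ? AND start_time > ? LIMIT %d) AS recent",
		rowCap)
	return q, []interface{}{userID, startTime}
}

func main() {
	now := time.Now().Unix()
	inputs := []string{"", "abc", "1", strconv.FormatInt(now-3600, 10)}
	for _, raw := range inputs {
		start, err := parseStartTime(raw, now)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", raw, err)
			continue
		}
		query, args := commitCountQuery(123, start)
		fmt.Println(query, args)
	}
}
```

Rejecting bad input returns an error to the caller instead of running an expensive query, and the clamp plus the row cap bound the cost even when a technically valid but very old timestamp slips through.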
Here's the video of me explaining this in depth 👇 do check it out
In August 2021, GitHub experienced an outage where their MySQL Master database went into a degraded state. Upon investigation, they found out it was due to an edge case. So, how can an edge case take down a database?
In this video, we understand what happened, take a fictional example of how an edge case can put extra load on the database, and conclude with an exciting way to make our SQL queries fool-proof.
Outline:
00:00 Agenda
00:27 Outage Overview
07:38 What could have happened?
08:51 A fictional example of an edge case taking down DB
17:18 How to avoid edge cases on SQL queries
You can also subscribe to the YT channel Asli Engineering and listen to this on the go on Spotify.
Thank you so much for reading 🖖 If you found this helpful, do spread the word about it on social media; it would mean the world to me.
You can also follow me on your favourite social media: LinkedIn and Twitter.
Yours truly,
Arpit
arpitbhayani.me
Until next time, stay awesome :)