Yesterday afternoon Facebook experienced the worst outage that the company has had “in over four years”, causing the site to go down for most users for “approximately 2.5 hours”. One of the company’s engineers followed up with a blog post explaining exactly what went wrong. The cause sounds relatively complicated, but the upshot was that the company had to restart the entire site.
According to Robert Johnson:
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves.
Now, over to Fahim (Admin):
If you don’t understand what he’s talking about, that’s OK. Most people probably won’t follow the details, but it sounds as though the site got stuck in one of those infinite loops of death: every failed database query led to even more queries. You don’t need to be an advanced programmer to understand how bad infinite loops are, but you do need some engineering know-how to see how this one got started.
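For the curious, here is a toy Python simulation of the feedback loop Johnson describes. It is only a sketch under simplifying assumptions, not Facebook's actual code: there is a single shared cache key, time moves in discrete ticks, the database can answer only a fixed number of queries per tick, and every name in it (`simulate`, `db_capacity`, `delete_on_error`, and so on) is made up for illustration.

```python
VALID = "valid"
INVALID = "invalid"


def simulate(num_clients, db_capacity, ticks, fix_at, delete_on_error=True):
    """Return how many database queries are issued on each tick.

    Hypothetical toy model: one shared cache key, one persistent store,
    and a database that answers at most `db_capacity` queries per tick.
    """
    store = INVALID   # persistent copy of the config value, initially bad
    cache = INVALID   # shared cached copy (None means the key was deleted)
    queries_per_tick = []

    for tick in range(ticks):
        if tick == fix_at:
            store = VALID  # the original mistake is corrected here

        # Every client reads the cache at the start of the tick. If the
        # value is missing or invalid, every client tries to repair it.
        queries = num_clients if cache != VALID else 0
        queries_per_tick.append(queries)

        for i in range(queries):
            if i < db_capacity:
                cache = store   # query succeeded: refresh cache from store
            elif delete_on_error:
                cache = None    # error treated as "invalid": delete the key

    return queries_per_tick


# Overloaded database, errors delete the cache key: the storm never ends,
# even though the stored value is fixed at tick 2.
print(simulate(5, 2, 6, fix_at=2))                         # [5, 5, 5, 5, 5, 5]

# Same overload, but errors leave the cache alone: load stops after the fix.
print(simulate(5, 2, 6, fix_at=2, delete_on_error=False))  # [5, 5, 5, 0, 0, 0]
```

In the first run the database can never catch up, because each failed query wipes out the good value that an earlier successful query just wrote. In this toy model the only ways out are to stop treating errors as invalid values or to cut the number of clients below what the database can handle, which fits with the company having to restart the entire site.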
The bottom line is that this was one of the worst crashes the company has ever experienced, and it is working to make sure it never happens again!