Facebook was down for more than two hours Thursday afternoon, marking its longest outage in about four years. The Facebook Engineering blog has posted a detailed explanation of what happened. “The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition,” writes Facebook’s Robert Johnson. “An automated system for verifying configuration values ended up causing much more damage than it fixed.”
In short: A configuration change created a feedback loop that overwhelmed a database cluster. The only way to fix the problem was to take the whole cluster offline – which meant downtime for the web site. Read the Engineering blog for more details.
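To make the feedback loop concrete, here is a toy Python model of the general failure pattern. It is only a sketch, not Facebook's actual system: the function `simulate`, its parameters, and all the numbers are hypothetical. It assumes that clients who see a bad configuration value query the database to repair it, and that an overloaded database cannot answer, so the bad state never clears.

```python
def simulate(clients, db_capacity, rounds, fix_round):
    """Toy model (hypothetical): return the database queries issued per round."""
    stored_value_ok = False  # assumption: the bad config change is in the database
    cache_ok = False         # assumption: clients reject the value they currently see
    history = []
    for r in range(rounds):
        if r == fix_round:
            stored_value_ok = True  # an operator corrects the stored value
        # Every client that rejects its cached value asks the database to repair it.
        queries = 0 if cache_ok else clients
        history.append(queries)
        db_overloaded = queries > db_capacity
        # The cache is repaired only if the database could answer with a good value.
        # When overloaded, clients get errors, keep retrying, and the load never drops.
        if queries > 0 and not db_overloaded and stored_value_ok:
            cache_ok = True
    return history


if __name__ == "__main__":
    # Load within capacity: the system recovers once the value is fixed.
    print(simulate(clients=50, db_capacity=100, rounds=5, fix_round=1))
    # Load above capacity: the fix never takes hold and the loop sustains itself.
    print(simulate(clients=1000, db_capacity=100, rounds=5, fix_round=1))
```

In this model the second run never recovers even after the stored value is corrected, which illustrates why stopping traffic to the cluster can be the only way to break such a loop.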