Incident command is a reasonably new area of focus for SBG. In a nutshell, we nominate a technical person, known as the Incident Commander (IC), who gives direction in order to resolve an incident and restore service as quickly as possible.
This blog post contains some of the insights and ‘lessons learned’ by our teams from their experiences in live incidents and exercises (known internally as fire drills) as they work to improve their skills and reduce our Mean Time To Resolution (MTTR).
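For readers who have not met the metric before, here is a minimal sketch of how Mean Time To Resolution can be calculated: the average of the time between detecting an incident and resolving it. The function name, the `(detected, resolved)` tuple layout, and the sample timestamps are all made up for illustration and are not taken from our tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the average resolution time for (detected, resolved) pairs.

    `incidents` is a hypothetical list of (detected_at, resolved_at)
    datetime tuples; the layout is an assumption for this sketch.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute an average")
    # Sum the individual durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Illustrative data only: one 30-minute and one 90-minute incident.
example_incidents = [
    (datetime(2016, 1, 4, 10, 0), datetime(2016, 1, 4, 10, 30)),
    (datetime(2016, 1, 9, 14, 0), datetime(2016, 1, 9, 15, 30)),
]

print(mean_time_to_resolution(example_incidents))  # prints 1:00:00
```

Averaging durations like this treats every incident equally, so a single long outage can dominate the figure; that is one reason to look at individual incidents as well as the mean.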
We collect a lot of metrics about our production systems using Graphite time series databases. To improve the performance of Graphite and reduce the load on our SAN, we purpose-built and tuned some very fast dedicated hardware for our Graphite databases.