Service unavailable across multiple instances

On Friday, 24th January, at 10:47, our monitoring systems alerted us to a problem with our configuration servers. We have identified the root cause: an error that caused our configuration database to replicate only partially. During this time, some customer instances were affected and exhibited extended API response times, which for all intents and purposes made those instances unavailable.

The problem was detected and rectified by 10:58, 11 minutes after the first alert was raised.

We have levels of redundancy built into our configuration database deployment, but this was a new failure scenario that we had not previously seen, and our resilience strategy failed under these conditions. We have now implemented a fix and deployed it to production so that, under similar circumstances in the future, we will not see the same failure mode.

We apologize for any inconvenience this caused.

Service unavailable across multiple instances Friday 24th January 2020 10:47:00