GitHub says multi-outage week “not acceptable”
Microsoft-owned GitHub has published a post-mortem on last week’s outages that affected the services it offers developers, saying they have now been mitigated.
GitHub’s chief security officer Mike Hanley said the code repository wanted to be very transparent about the string of outages, and what it has done to prevent a repeat.
Hanley said the May 9 (US time) database outage that affected eight out of 10 services was due to a configuration change intended to prevent connection saturation.
However, shortly after the configuration change was rolled out, GitHub’s database cluster experienced a failover.
A rollback of the configuration change was attempted but failed due to an internal infrastructure error, Hanley said.
An app authentication token issuance incident on May 10 degraded six out of 10 main GitHub services, with the failure rate peaking at 76 percent for a short time.
This was caused by an inefficient implementation of an application programming interface for managing GitHub App permissions, Hanley said.
A further database cluster crash on May 11 (all times US) ended up degrading eight out of 10 main GitHub services.
“This is not acceptable nor the standard we hold ourselves to,” Hanley wrote.
GitHub has now taken multiple steps to prevent similar incidents, such as reviewing internal processes and making adjustments to ensure changes are deployed safely.