Microsoft has blamed multiple outages which locked users out of applications such as Outlook and Teams on a software bug in the Azure Active Directory backend service Safe Deployment Process system.
For about three hours last week, Azure Active Directory, Microsoft’s cloud-based authentication system, and Azure AD B2C, a white-label authentication service for businesses, misfired and the outage had a worldwide impact.
Office 365 went on to suffer a second global outage later that week. Authentication errors saw users unable to log into applications Teams, Outlook and other Office 365 and Azure services.
According to sources at CRN, Microsoft may be suffering a DevOps problem as the company struggles to keep up with the overwhelming online demand and user capacity due to the COVID-19 pandemic.
“They have so much going on right now, rolling Teams out at a breakneck pace. I think they are running into an issue where code tested out fine but there is a configuration problem when they deploy it,” an insider told the publication.
However, a Microsoft spokesperson refused to confirm if the outages were related to a DevOps issue.
In a statement Microsoft said: “No cloud vendor is immune to downtime. Our number one priority is to get to resolution as quickly as possible and ensure our customers stay updated along the way, as was the case here. We continuously invest in the resilience of our platform and focus on learning from these incidents to ultimately reduce the impact of inevitable outages.”
According to a Microsoft incident report, Australian users had a 37 per cent success rate for the duration of the outage.
The US was hit hardest by the server crash, with 83 per cent of users unable to log into the Office 365 and Azure platforms.
Microsoft described the root cause in the incident report as “a service update targeting an internal validation test ring was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process.”
The company also apologised for the scope of the impact and confirmed it has fixed the latent code defect in the Azure AD backend SDP system.
“We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future,” the incident report reads.