Unintentional data leak
It recently came to light that researchers from Microsoft’s AI department accidentally exposed several terabytes of sensitive data. This happened while publishing a storage bucket of open-source training data via GitHub.
The discovery of the leak
Cloud security startup Wiz stumbled upon the accidental disclosure of cloud-hosted data via a GitHub repository owned by Microsoft’s AI research division. What started as a simple release of open-source code and AI models for image recognition quickly turned into a nightmare for Microsoft’s security team.
The extent of the data leak
When accessing the provisioned Azure storage URL, Wiz discovered that it was configured to grant permissions to the entire storage account. This led to the unintentional exposure of 38 terabytes of sensitive data. Among the exposed data were personal backups from two Microsoft employees’ computers, passwords for Microsoft services, secret keys, and thousands of internal Microsoft Teams messages.
The root of the problem
It turned out that the problem was not the storage account itself, but an overly permissive shared access signature (SAS) token embedded in the URL. SAS tokens are an Azure mechanism that lets users grant others access to data in an Azure storage account, with configurable permissions, scope, and expiry time.
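To make the risk concrete, the sketch below parses the standard query parameters of a SAS URL (`sp` for permissions, `se` for expiry, `srt` for account-SAS resource scope) and flags the kinds of settings that caused this incident: write/delete rights, a long-lived expiry, and account-wide scope. The URL and the audit logic are illustrative assumptions, not an official Azure tool; real SAS enforcement happens on the Azure side.

```python
from urllib.parse import urlparse, parse_qs
from datetime import datetime, timezone

def audit_sas_url(url: str) -> list[str]:
    """Flag risky settings in an Azure SAS URL (illustrative check only).

    Inspects standard SAS query parameters:
      sp  - granted permissions (r=read, l=list, w=write, d=delete, ...)
      se  - expiry time (ISO 8601)
      srt - resource types of an account SAS (s=service, c=container, o=object)
    """
    findings = []
    qs = parse_qs(urlparse(url).query)

    # Anything beyond read/list is a red flag for a public sharing link.
    perms = qs.get("sp", [""])[0]
    for risky in "acwd":
        if risky in perms:
            findings.append(f"permission '{risky}' grants more than read/list access")

    # A far-future expiry means the link stays live for decades if leaked.
    expiry = qs.get("se", [""])[0]
    if expiry:
        exp = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if (exp - datetime.now(timezone.utc)).days > 90:
            findings.append(f"token valid until {expiry} (long-lived)")

    # An account SAS ('srt' present) scopes access to the whole storage
    # account rather than a single container or blob.
    if "srt" in qs:
        findings.append("account-level SAS: scope is the entire storage account")

    return findings

# Hypothetical URL shaped like the one in the incident: full permissions,
# account-wide scope, expiry decades in the future.
sample = ("https://example.blob.core.windows.net/data"
          "?sv=2020-08-04&ss=b&srt=sco&sp=racwdl"
          "&se=2051-10-09T00:00:00Z&sig=FAKESIGNATURE")
for finding in audit_sas_url(sample):
    print(finding)
```

A narrowly scoped service SAS with read-only permissions and a short expiry would produce no findings here; the combination of broad permissions, account scope, and a multi-decade lifetime is what turned a sharing link into an open door.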
The consequences and Microsoft’s reaction
After Wiz shared its findings with Microsoft on 22 June, Microsoft revoked the SAS token on 24 June. When the investigation concluded in August, Microsoft stressed that no customer data had been exposed. In direct response to the discovery, Microsoft also expanded its secret scanning service on GitHub to detect overly permissive SAS tokens, so that such incidents can be caught in the future.
This incident highlights the growing challenges in cybersecurity, especially in an era where AI and cloud technologies dominate. It is a wake-up call for organisations to rethink their security protocols and ensure that human error does not lead to serious data breaches.