Microsoft details Project Tardigrade, an Azure resiliency initiative

Microsoft has been making improvements to Azure reliability in recent months. In July, Azure CTO Mark Russinovich detailed some of these in a blog post, including a reference to Project Tardigrade, which was first announced at this year's Build conference. For those unaware, it is a new service aimed at improving Azure resiliency, and is appropriately named after the "nearly indestructible" microscopic water bears.

Today, Russinovich has expanded upon the initiative, explaining its actual functionality in more detail, while also highlighting upcoming improvements to the service.

Tardigrade comprises mitigation strategies that protect Azure virtual machines (VMs) against unanticipated platform failures. To that end, the service includes self-healing mechanisms that initiate quick recovery and reduce the impact on user workloads. Not only is the state of each VM preserved even in the face of critical failures, such as kernel-level failures and firmware issues, but the underlying causes of these problems are also addressed to prevent them from recurring. The service's implementation spans multiple hardware and software layers of Microsoft's cloud computing platform, putting platform resiliency and high availability of services at the forefront.

An example recovery workflow that executes when VM operations fail due to host server issues has been described as follows:

Phase 1:

This step has no impact to running customer VMs. It simply recycles all services running on the host. In the rare case that the faulted service does not successfully restart, we proceed to Phase 2.

Phase 2:

Our diagnostics service runs on the host to collect all relevant logs/dumps systematically, to ensure that we can thoroughly diagnose the reason for failure in Phase 1. This comprehensive analysis allows us to ‘root cause’ the issue and thereby prevent reoccurrences in the future.

Phase 3:

At a high level, we reset the OS into a healthy state with minimal customer impact to mitigate the host issue. During this phase we preserve the states of each VM to RAM, after which we begin to reset the OS into a healthy state. While the OS swiftly resets underneath, running applications on all VMs hosted on the server briefly ‘freeze’ as the CPU is temporarily suspended. This experience is similar to a network connection temporarily lost but quickly resumed due to retry logic. After the OS is successfully reset, VMs consume their stored state and resume normal activity, thereby circumventing any potential VM reboots.
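The escalation logic of the three phases can be illustrated with a short sketch. This is not Microsoft's implementation: the `Host` class, the `restart_services` callback, and the `recover` function are hypothetical names invented for illustration, and the "OS reset" is simulated by clearing and restoring a dictionary. It only shows the control flow described above, where Phase 2 and Phase 3 run solely when the service restart in Phase 1 fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Host:
    """Hypothetical stand-in for a host server; not an Azure API."""
    vm_states: Dict[str, str]             # VM name -> in-guest state
    restart_services: Callable[[], bool]  # True if the faulted service came back
    log: List[str] = field(default_factory=list)  # record of actions taken


def recover(host: Host) -> str:
    # Phase 1: recycle host services; running VMs are not touched.
    host.log.append("phase1: recycle host services")
    if host.restart_services():
        return "recovered in phase 1"

    # Phase 2: collect logs/dumps so the Phase 1 failure can be diagnosed.
    host.log.append("phase2: collect diagnostics")

    # Phase 3: preserve VM state in memory, reset the host OS, resume VMs.
    preserved = dict(host.vm_states)   # snapshot standing in for state kept in RAM
    host.log.append("phase3: preserve VM state")
    host.vm_states.clear()             # simulated host OS reset
    host.log.append("phase3: reset host OS")
    host.vm_states.update(preserved)   # VMs consume their stored state
    host.log.append("phase3: resume VMs")
    return "recovered in phase 3"
```

Calling `recover(Host({"vm1": "running"}, lambda: False))` walks through all three phases and leaves the VM state intact, while a callback that returns `True` stops after Phase 1.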

Although the above-mentioned workflow is used in current implementations of Tardigrade, further host failure scenarios are also being tested to explore more recovery paths. In the future, machine learning will be used to detect abnormal resource utilization patterns on the host. Other machine learning algorithms are also planned to assist with repair tasks.

Microsoft believes that platform resiliency is a significant component of Azure. As such, the tech giant will continue improving reliability across the cloud computing platform. Other updates to Azure in recent days include new reservation and pre-purchase plans for some services, along with some more security features for Files entering general availability.