Studenten Net Twente Syscom information

Cloud outage of August 2022

Introduction

This page documents the cloud outages that occurred between 2022-07-30 and 2022-08-21.

Rough timeline:

  • 2022-07-30: Failing disk replaced
  • 2022-07-30: Start of maintenance on Cloud Platform.
  • 2022-08-05: Engine and host krentenwegge upgraded to oVirt 4.4.
  • 2022-08-07: Upgrade of host mergpijp to oVirt 4.4.
  • 2022-08-09: Upgrade of host kletsmajoor to oVirt 4.4.
  • 2022-08-13: Upgrade of Engine and host kletsmajoor to oVirt 4.5. Start of I/O issues.
  • 2022-08-14: Upgrade of hosts krentenwegge and mergpijp to oVirt 4.5.
  • 2022-08-14: Reinstall of host krentenwegge with oVirt 4.5.
  • 2022-08-14: Two more failing disks identified.
  • 2022-08-15: Failing disk replaced. New disks ordered.
  • 2022-08-20: Last failing disk replaced.
  • 2022-08-20: Issues identified and fixed.

Architecture of our Cloud Platform

SNT’s cloud platform (Virtual Colocation) runs on oVirt, a free and open source virtualization platform that is the basis for Red Hat Virtualization. oVirt makes it easy to manage and maintain a highly available setup, in which virtual machines (VMs) can be migrated to other hosts without downtime.

We run oVirt on three hosts: kletsmajoor, krentenwegge and mergpijp. Two of these machines are located in the primary server room of the University of Twente (Seinhuis) and one is located in a secondary location (Teehuis).

VMs and the hosts are managed using the oVirt Engine, which is a special VM that runs on the platform it manages. When a single host is deemed ‘broken’, it is disabled and its VMs are migrated to the other hosts. When all three hosts are disconnected, all VMs are shut down to avoid data corruption.
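
As a sketch of how this is operated in practice, the hosted-engine tool that oVirt ships on each host can be used to inspect and steer this HA machinery:

    # Show which host currently runs the Engine VM and each host's HA score
    hosted-engine --vm-status

    # Enable global HA maintenance, e.g. before upgrading the Engine itself
    hosted-engine --set-maintenance --mode=global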

Data, such as the VM hard drive images, is stored using GlusterFS, a network filesystem that stores the data on all three hosts. Gluster ensures that all data is always written to at least two of the three machines, but preferably all three. Writes are synchronous: while hosts are connected, a write action only succeeds once the data has been written to all connected systems. At least two of the three hosts must be connected and ‘up to date’ at all times to ensure data integrity.
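
For illustration, a volume with these guarantees is a ‘replica 3’ Gluster volume, with client quorum refusing writes when fewer than two bricks are reachable. A minimal sketch — the volume name and brick paths are illustrative, not our actual layout:

    # Keep a copy of all data on each of the three hosts
    gluster volume create vmstore replica 3 \
        kletsmajoor:/gluster/vmstore/brick \
        krentenwegge:/gluster/vmstore/brick \
        mergpijp:/gluster/vmstore/brick

    # 'auto' quorum: writes are only allowed while a majority (2 of 3)
    # of the bricks is reachable
    gluster volume set vmstore cluster.quorum-type auto
    gluster volume start vmstore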

Maintenance

As with all software, oVirt requires regular updates and upgrades to stay current and secure. The upgrade to oVirt 4.4 is relatively complex, as it involves a reinstall of both the Engine and all three hosts with a newer OS version. Other upgrades (e.g. to 4.3 or 4.5) are almost entirely managed by oVirt itself.
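
For the managed upgrades, the rough flow on the Engine looks as sketched below; the exact repository package depends on the OS version, so treat the names as an assumption rather than a recipe:

    # On the Engine VM: enable the oVirt 4.5 repositories and re-run setup,
    # which performs the in-place upgrade
    dnf install -y centos-release-ovirt45
    engine-setup

    # Hosts are then upgraded one by one from the Administration Portal, which
    # first puts each host into maintenance and live-migrates its VMs away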

Because the update to oVirt 4.4 could involve some downtime (when oVirt deems things too broken), it was planned for a maintenance weekend on the 30th and 31st of July. The upgrade was attempted during this weekend but not finished; it was continued in the following week, which caused unexpected (but limited) downtime.

After all hosts had been reinstalled with oVirt 4.4 and everything seemed to run smoothly, the upgrade to oVirt 4.5 was started on the 13th of August.

This did not go entirely smoothly: two hosts (kletsmajoor and mergpijp) developed severe I/O issues, which seemed to trace back to each having a failing SSD. The third host (krentenwegge) had a broken VDSM configuration and had to be reinstalled.

Diagnosing issues

With severely degraded I/O performance, we were in trouble. Remember that writes to GlusterFS are synchronous: if a single SSD in the cluster blocks on I/O, that I/O blocks on all hosts. This also makes it harder to trace issues back to a single host: before turning a host off, we want the other two to be in sync, and with very slow I/O it took days for hosts to get back in sync.
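
Whether the replicas are in sync can be followed with Gluster’s self-heal status; a host should only be taken down once the pending-heal counts reach zero. A sketch, with an illustrative volume name:

    # List the files each brick still needs to heal (sync) from the others
    gluster volume heal vmstore info

    # Overall brick and self-heal daemon status of the volume
    gluster volume status vmstore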

One of the failing SSDs (in mergpijp) was replaced on the 15th and another SSD was ordered to replace the other failing disk. The first replacement did not alleviate all of the blocking I/O on that host, but it did seem to help performance somewhat, especially when the other failing host (kletsmajoor) was disabled, so things were slowly brought back online on the 16th.

This was somewhat premature, as the I/O issues came back on the 17th.

The last failing SSD was replaced on the 20th. Issues, however, remained.

Cause

As we had three hosts, two of which were having problems, we searched for differences in setup and configuration.

One difference we found was that the two hosts with issues were accessing their SSDs via multipath. I/O statistics indeed showed blocking I/O on the multipath devices, but not on the underlying disks. While the use of multipath had not changed during the upgrades and it is a mature part of the Linux kernel, we decided to disable it on the two hosts with issues anyway; at worst it would make diagnosing the issues easier.
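
The pattern showed up in per-device statistics roughly as sketched below (device names are illustrative):

    # Extended I/O statistics every 5 seconds: watch the 'await' and '%util'
    # columns. The multipath devices (dm-*) showed saturation while the
    # underlying disks (sd*) stayed mostly idle.
    iostat -xd 5

    # Show the multipath topology: which dm device maps onto which physical disk
    multipath -ll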

After disabling multipath on one host (kletsmajoor), the blocking I/O disappeared on that host. The same was done for the other failing host (mergpijp), which resolved the outage.
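
On oVirt hosts VDSM manages /etc/multipath.conf itself, so local changes belong in a drop-in file. A sketch of the kind of override used — the blacklist pattern is illustrative and must match the local disks:

    # /etc/multipath/conf.d/local.conf -- keep the local SSDs out of multipath
    blacklist {
        devnode "^sd[a-z]"
    }

    # Then re-read the configuration and flush the now-unused multipath maps
    multipathd reconfigure
    multipath -F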

Impact

On the 16th we identified data loss on some VMs. The cloud platform should be resilient against data loss and corruption because of the use of GlusterFS, but in this scenario that was not enough. Some possible causes are:

  • Failing SSDs on all three hosts. Data stored on these disks may have been corrupted by the storage.

  • Failing writes by the hypervisors. The images on the SSDs are stored as qcow2 images. The hypervisors running the virtual machines on the hosts may not have been able to write updated metadata to these images, corrupting them (see the sketch after this list).

  • Failing writes by the guests. The guest OSes in the VMs might not have been able to always write data to disk, possibly corrupting the filesystem in the process.
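
For the hypervisor-level corruption in particular, qcow2 metadata can be inspected and sometimes repaired with qemu-img, as in this sketch (the image path is illustrative; only run this while the VM is powered off):

    # Report leaked clusters and corrupted qcow2 metadata
    qemu-img check /path/to/disk-image.qcow2

    # Attempt to repair all errors qemu-img knows how to fix
    qemu-img check -r all /path/to/disk-image.qcow2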

We have identified all unbootable VMs; these will be restored with a clean disk image matching the original OS and SSH key. A repaired version of the corrupted disk will be mounted as well, along with any uncorrupted disks.

For these VMs, we recommend restoring data from backups.