Bernhard Bock

Confidential Computing with AMD SEV: Down the rabbit hole

Today I investigated confidential computing (CC) in more detail, as there is some customer interest in using it to secure workloads in non-EU clouds. The primary question is: Can CC help make a non-EU-based cloud strictly GDPR- and TKG-compliant? I won’t dive into legal details, but the high-level assumption is: If the provider verifiably cannot get access to unencrypted data, it is compliant. Anything else is complicated™.

Overview

The technology base for CC investigated today is AMD’s Secure Encrypted Virtualization (SEV), which encrypts the memory of virtual machines with a dedicated key to make hypervisor-based attacks harder or impossible. Both Google and Azure advertise AMD SEV as their technical implementation for CC without application changes.

Please note that this is a fundamentally different approach than Intel SGX or AWS Nitro Enclaves, which are out of scope for this blog post.

Let’s take the Google Cloud Platform as an example. Enabling CC is nothing more than a checkbox while creating the instance.

Starting at the CPU

To evaluate how our instance is protected now, we start at the AMD SEV specification. Without going through all the details here, we basically learn that there’s a host API to set up encryption keys and context, and then the CPU internally AES-encrypts all memory pages with a VM-specific key:

The programming and management of these keys and secure data transfer between host hypervisor and guest VM memory is handled by the SEV firmware running AMD Secure Processor.

And further down:

The firmware maintains a guest policy provided by the guest owner. This policy is enforced by the firmware and restricts what configuration and operational commands can be performed on this guest by the hypervisor.

This opens three questions to follow up on:

  • Do we trust the AMD SEV firmware?
  • How can we check inside the VM that SEV is operating as intended?
  • How are those keys managed?

Regarding the trust in the firmware, we will not have a choice: The firmware is closed-source and we cannot verify its inner workings, so we’ll have to trust AMD. We can verify with a certificate that it wasn’t changed by the cloud operator or a hostile hypervisor, but we need to trust that AMD did the right thing. At least, basic release notes are available at the AMD SEV page. They indicate that security fixes were made in the past.

This raises the question: Which firmware version does your cloud use? This information is not easily available, and I did not investigate it further.

The SEV status can be obtained inside the VM. This is supported by Linux and we should implement monitoring for it. Google Cloud Monitoring also has support for it. We’ll come back to it later in this post.
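As a starting point for such monitoring, here is a minimal Python sketch that checks SEV from inside the guest. It is not what Linux or Google Cloud Monitoring do internally; it assumes the SEV status MSR `0xC0010131` (named `MSR_AMD64_SEV` in the Linux kernel sources) with bit 0 for SEV, bit 1 for SEV-ES and bit 2 for SEV-SNP, and it assumes the `msr` kernel module is loaded and the script runs as root. The function names are made up for illustration.

```python
import os
import struct

# Assumption: SEV status MSR and bit layout as used by the Linux kernel
# (MSR_AMD64_SEV); check the AMD manuals for your CPU generation.
SEV_STATUS_MSR = 0xC0010131
SEV_STATUS_BITS = {0: "SEV", 1: "SEV-ES", 2: "SEV-SNP"}


def decode_sev_status(value):
    """Return the names of the SEV features whose status bit is set."""
    return [name for bit, name in SEV_STATUS_BITS.items() if value & (1 << bit)]


def read_sev_status(cpu=0):
    """Read the raw 64-bit MSR value through the Linux msr driver.

    Needs root and a loaded msr module; raises OSError otherwise.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, SEV_STATUS_MSR))[0]
    finally:
        os.close(fd)
```

A monitoring check could call `decode_sev_status(read_sev_status())` and alert when `"SEV"` is missing from the result. Keep in mind that this only tells you what the guest sees; it does not replace attestation.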

Regarding the key management, we do not know what the cloud providers are doing exactly, but apart from the AMD docs, we can have a look at the open source KVM and libvirt to understand what’s going on.

In order to verify the initialization, the platform provider can offer its platform public key for verification. I wasn’t able to obtain any certificates and verify a platform, so this may be a subject for further investigation. For now, this means I have to trust the cloud provider and cannot check the SEV setup.

KVM and libvirt

Recent versions of KVM and libvirt support SEV and give us a few knobs to turn. Libvirt even has an excellent documentation page on how to use SEV.

During launch of an instance, we could optionally provide libvirt with a key to establish a trusted channel between the VM and SEV. This is, as far as I can tell, not supported by any cloud provider. Like above, we therefore cannot check that SEV was not tampered with and have to trust the cloud provider on correct SEV firmware state.
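To make this a bit more concrete, the following Python sketch builds the `launchSecurity` element that libvirt expects in a domain XML for an SEV guest. The element names (`cbitpos`, `reducedPhysBits`, `policy`, `dhCert`, `session`) are the ones from the libvirt documentation; the helper function itself and the example values are only illustrative. In practice, `cbitpos` and `reducedPhysBits` come from the host's domain capabilities, and the base64 blobs for `dhCert` and `session` are generated by the guest owner.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_launch_security(policy: int, cbitpos: int, reduced_phys_bits: int,
                          dh_cert_b64: Optional[str] = None,
                          session_b64: Optional[str] = None) -> str:
    """Return a <launchSecurity type="sev"> fragment as an XML string."""
    sec = ET.Element("launchSecurity", {"type": "sev"})
    ET.SubElement(sec, "cbitpos").text = str(cbitpos)
    ET.SubElement(sec, "reducedPhysBits").text = str(reduced_phys_bits)
    ET.SubElement(sec, "policy").text = f"0x{policy:04x}"
    # Optional guest owner material for the trusted channel.
    if dh_cert_b64 is not None:
        ET.SubElement(sec, "dhCert").text = dh_cert_b64
    if session_b64 is not None:
        ET.SubElement(sec, "session").text = session_b64
    return ET.tostring(sec, encoding="unicode")


# Example values only; query the real ones from your host.
print(build_launch_security(policy=0x0001, cbitpos=47, reduced_phys_bits=1))
```

Without `dhCert` and `session`, the guest still runs encrypted, but the guest owner has no cryptographic channel to the SEV firmware, which is exactly the situation on the public clouds described above.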

Of course, all data entering or exiting the VM needs to go to the virtualized hardware unencrypted, as the device emulation is done in the context of the hypervisor, not the VM, and it needs to see the data that is destined for it.

This is no problem for network I/O if we use transport-level encryption on the VM itself (TLS, IPsec). We cannot rely on the cloud provider to do that in the underlying infrastructure, as it would need to see the plaintext data and this would render SEV pointless. But as using TLS everywhere is already state of the art, this isn’t unreasonable to expect. Of course, this limits which services one should use (L7 load balancers!).

More interesting is the handling of disks. While cloud providers offer to encrypt disks transparently, this would happen outside of the VM (in the hypervisor) and therefore outside the control of SEV. If we want the security of SEV end-to-end, we need to encrypt the disk inside the VM (e.g. Linux dm-crypt), and we must manage the corresponding keys ourselves. This will generate quite a lot of operational complexity and render the cloud-provided disk encryption features useless.

All DMA operations inside the guest must be performed on (hypervisor) shared memory, as those pages are unencrypted. For private pages, the SEV hardware enforces the memory encryption. In order to put DMA pages in shared mode, we need to enable IOMMU mode for virtio drivers, which is not supported by all virtio devices. For example, we lose the virtio GPU.

We also lose PXE boot functionality on virtio network interfaces, as the PXE boot firmware is not SEV-aware.

This provides the path for further investigation: Let’s look at the firmware involved.

Firmware: BIOS and UEFI

While I found multiple sources stating that SEV only works with UEFI, I could not find the detailed technical reasons for that. UEFI provides better security features (like Secure Boot) anyway, so we stick to it.

Open Source UEFI is provided by the OVMF firmware. Here, open source QEMU/KVM and cloud providers differ: UEFI firmware is proprietary for the big public clouds. The UEFI firmware runs inside your VM and therefore has unlimited access to the memory. You’d have to trust your cloud provider that it works as advertised and is secure.

Of course, the boot loader firmware must be loaded into memory and encrypted with the VM key before the VM can be booted. This is done by the CPU itself and orchestrated by the hypervisor. Together with the encrypted boot loader, a “guest policy” can be installed which controls things like live migration or debug access via the hypervisor. In GCP, you can get details of this attestation process in Stackdriver logs. This is a start, but there’s no way to validate what Google is printing in the logs there. So in the end, you still have to trust Google that those logs are correct. This is the best option I could find; other solutions will not give you any insight into that process at all.
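For reference, the guest policy is a 32-bit bitfield. The sketch below decodes it in Python, assuming the layout from the AMD SEV API specification as I read it: bit 0 NODBG, bit 1 NOKS, bit 2 ES, bit 3 NOSEND, bit 4 DOMAIN, bit 5 SEV, and the minimum firmware API version in bits 16 to 31. The function name and the returned dictionary shape are my own choice, not part of any library.

```python
# Assumed bit layout of the SEV guest policy (see the AMD SEV API spec).
POLICY_FLAGS = {
    0: "NODBG",   # hypervisor debug access to the guest is disallowed
    1: "NOKS",    # key sharing with other guests is disallowed
    2: "ES",      # SEV-ES is required
    3: "NOSEND",  # sending the guest to another platform is disallowed
    4: "DOMAIN",  # guest must stay on platforms inside the domain
    5: "SEV",     # guest must stay on SEV-capable platforms
}


def decode_guest_policy(policy):
    """Split a raw policy value into flag names and the minimum API version."""
    return {
        "flags": [name for bit, name in POLICY_FLAGS.items() if policy & (1 << bit)],
        "api_major": (policy >> 16) & 0xFF,
        "api_minor": (policy >> 24) & 0xFF,
    }


# A policy that forbids debugging and migration to another platform.
print(decode_guest_policy(0x0009))
```

If a provider exposes the policy value of your instance anywhere (for example in launch logs), a missing NODBG flag would be the first thing to look for.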

Speaking of live migration policy: While SEV provides a way to securely transport memory encryption keys from one CPU to another, at least for now I didn’t find any platform that supports it. Therefore, I didn’t investigate security details of this process any further. For now, confidential VMs terminate on host maintenance. Just make sure your application can handle those terminations!

Secure Boot

Combining SEV with Secure Boot (SB) makes a lot of sense, because only in conjunction with SB can you be sure that only trusted software is executed and has access to decrypted memory. SB is a topic in itself, but as long as you stick to distribution-provided OS images and kernels, it should not be a huge deal. Just be aware that the OS vendor is another entity that you must trust in this case.

If you want to build and verify your own kernel, this can mean considerable effort and complexity that you need to factor in.

Summary

Confidential computing is a major step towards securing sensitive workloads in the cloud. It increases the separation between VMs on the same host and therefore further increases the overall security of multi-tenant platforms.

With technologies like Firecracker or Kata containers, CC also applies to containers. This is an area of active development, which might be of interest in the future.

Unfortunately, it is hard to secure the whole system against access by cloud provider staff. Doing that without loopholes is currently not possible, so you still have to trust the cloud provider.

Going back to the initial question regarding GDPR and TKG laws, at the moment you cannot validate that nobody has access to data in CC instances. The cloud provider would still be able to access data inside your VMs, be it via UEFI functions or via the hypervisor debug function (although you might be able to detect the latter). You also still need to trust operating system vendors.

— Jun 16, 2021