Intel LazyFP vulnerability: Exploiting lazy FPU state switching

Posted on June 6, 2018 by Thomas Prescher, Julian Stecklina, Jacek Galowicz

After Meltdown (see also our article about Meltdown) and Spectre, which were publicly disclosed in January 2018, the Spectre V3a and V4 vulnerabilities followed in May (see also our article about Spectre V4). According to the German IT news publisher Heise, the latter might be among a total of eight new vulnerabilities that are going to be disclosed in the course of the year.

Earlier this year, Julian Stecklina (Amazon) and Thomas Prescher (Cyberus Technology) jointly discovered and responsibly disclosed another vulnerability that might be one of these. We call it LazyFP. LazyFP (CVE-2018-3665) is an attack targeting operating systems that use lazy FPU switching. This article describes what the attack means, outlines how it can be mitigated, and explains how it actually works.

For further details, see the current draft of the LazyFP paper: Link to Paper

Summary and Implications

The public disclosure of this vulnerability was originally embargoed until August, as is typical for responsible disclosure, but early rumors led to this date being dropped.

The register state of the floating point unit (FPU), which includes the MMX, SSE and AVX register sets, can be leaked across protection domain boundaries. This includes leaking across process and virtual machine boundaries.

The FPU state may contain sensitive information such as cryptographic keys. As an example, the Intel AES instruction set (AES-NI) uses FPU registers to store round keys. The vulnerability can only be exploited when the underlying operating system or hypervisor uses lazy FPU switching.

Another interesting use case of this vulnerability is building high-performance covert channels between VMs. By using all register sets and controlling the scheduling frequency of the core, it is possible to reach a throughput in the range of MB/s.

Users are affected when they run an affected operating system on an affected processor.

Mitigation requires system software to use eager FPU switching instead of lazy FPU switching.

Press Contact

If you have any questions about the variants of Meltdown and Spectre and their impact or the involvement of Cyberus Technology GmbH, please contact:

Werner Haas, CTO
Tel +49 152 3429 2889
Mail werner.haas@cyberus-technology.de

Technical Background

This vulnerability is similar to Meltdown (CVE-2017-5754). While Meltdown allowed a user space program to read protected memory contents, this new attack allows reading certain register contents across protection domain boundaries.

In order to exploit the vulnerability, the attacker accesses FPU registers that they are not allowed to access yet. The forbidden value is fed into an address calculation that is used for a subsequent memory access, so that the address being read or written contains the register value as part of its offset.

The CPU performs this series of instructions speculatively. Once it detects its mistake, it squashes the results, so the register content never becomes architecturally visible. The speculative execution nevertheless leaves traces in the cache, which can leak the hidden information to the attacker.

To have a better understanding of how this attack actually works, it is necessary to dive deeper into the inner workings of the FPU, and how it is used by operating systems.

The Floating Point Unit (FPU)

In the early days of x86, the FPU (also called math coprocessor) was an external co-processor that could be added to Intel’s now widely adopted x86 processor architecture. The Intel 8087 was the first floating point math co-processor of this kind. Its purpose was to accelerate mathematical operations on floating point numbers, such as division and multiplication. With the Intel 486DX CPU model (released in 1989), the FPU was integrated into the processor chip itself, so that no additional co-processor was needed anymore.

Over the years, the processor was extended to support Single Instruction, Multiple Data (SIMD) instruction sets, i.e. MMX/SSE/AVX. SIMD instructions perform the same mathematical operation on multiple pairs of operands at the same time in order to improve performance. Each of these instruction set extensions introduced new register sets that continue to be managed as part of the FPU register state. On recent Intel processors, the FPU register state can contain more than 2 kB of data (AVX-512 offers 32 registers of 512 bits each, which translates into 2 kB of additional processor state). Because these instructions are useful beyond floating point arithmetic, this register set may contain not only floating point values but also other data such as integer values.
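The 2 kB figure can be checked with a few lines of arithmetic. The sketch below only accounts for the AVX-512 vector registers mentioned above; the variable names are illustrative, and other parts of the FPU state are deliberately ignored.

```python
# Size of the AVX-512 vector registers as stated in the text:
# 32 registers of 512 bits each. Other state components are not counted.
NUM_REGISTERS = 32
BITS_PER_REGISTER = 512

total_bits = NUM_REGISTERS * BITS_PER_REGISTER
total_bytes = total_bits // 8

print(f"{total_bytes} bytes = {total_bytes / 1024:.0f} KiB")
```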

Loading and Storing the FPU state

To enable multi-tasking, operating systems periodically interrupt processes to give other processes a chance to run. Otherwise a single process that loops forever could grind the system to a halt.

When the operating system switches from one process to another, it needs to save the state of the previous process and restore the state of the process that is about to be run. This state consists of the values stored in general purpose registers and FPU registers.

The x86 instruction set architecture provides several instructions to load/store the register state of the FPU from/to memory (for example FXSAVE/FXRSTOR and XSAVE/XRSTOR). Given the large size of the FPU state, one does not want to read and write this much data on every context switch unless it is actually necessary, because not all processes use the FPU.

Eager and Lazy FPU Switching

Eager FPU switching is comparable to saving the general purpose register state on a context switch. For each process, the operating system reserves an area in memory to save the FPU state. When switching from one process to another, it executes an FPU store instruction to transfer the current FPU content to the state save area. It then loads the new FPU state from the state save area of the process that is about to be scheduled.
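As a rough illustration, eager switching can be modelled in a few lines. This is a toy model and not kernel code: the names `Task`, `EagerKernel` and `fpu_save_area` are made up for this sketch, and the whole FPU state is represented by a single integer.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    fpu_save_area: int = 0   # per-task memory reserved for the FPU state


class EagerKernel:
    """Toy model: saves and restores the FPU state on every context switch."""

    def __init__(self):
        self.fpu_regs = 0     # live FPU register contents
        self.current = None

    def context_switch(self, next_task):
        if self.current is not None:
            # Models the FPU store instruction for the outgoing task.
            self.current.fpu_save_area = self.fpu_regs
        # Models the FPU load instruction for the incoming task.
        self.fpu_regs = next_task.fpu_save_area
        self.current = next_task


kernel = EagerKernel()
a, b = Task("a"), Task("b")

kernel.context_switch(a)
kernel.fpu_regs = 0x1234      # task a computes something in the FPU
kernel.context_switch(b)      # a's state is saved, b's state is loaded

print(hex(a.fpu_save_area), hex(kernel.fpu_regs))
```

In this model, the live registers never hold another task's values after a switch, which is the property that matters for the attack discussed later.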

Lazy FPU switching optimizes this procedure for the case where not every process uses the FPU all the time.

After a context switch, the FPU is disabled until it is first used. Only then is the old FPU state saved and the correct one restored from memory. Up to this point, the FPU keeps the register state of the process or VM that used it last.

To implement this optimization, the operating system kernel temporarily disables the FPU by setting a certain control register bit (CR0.TS). If this bit is set, any attempt to access the FPU will cause an exception (#NM, Device Not Available, No Math Coprocessor). When the exception occurs, the kernel can save the old state to the respective state area and restore the state of the current process. Processes that simply do not use the FPU will never trigger this exception and hence never trigger FPU state stores and loads.
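The mechanism can be sketched as a small simulation. It is a toy model under simplifying assumptions, not how any particular kernel implements it: `Cpu`, `LazyKernel`, `handle_nm` and the other names are hypothetical, the FPU state is a single integer, and the #NM exception is represented by a Python exception.

```python
from dataclasses import dataclass


class DeviceNotAvailable(Exception):
    """Stands in for the #NM exception raised while CR0.TS is set."""


@dataclass
class Task:
    name: str
    fpu_state: int = 0        # per-task state save area


class Cpu:
    def __init__(self):
        self.fpu_regs = 0     # live FPU register contents
        self.ts = False       # models the CR0.TS bit

    def fpu_write(self, value):
        if self.ts:
            raise DeviceNotAvailable
        self.fpu_regs = value


class LazyKernel:
    def __init__(self, cpu):
        self.cpu = cpu
        self.current = None
        self.fpu_owner = None  # task whose state is live in the FPU
        self.switches = 0      # number of FPU state switches performed

    def context_switch(self, task):
        self.current = task
        # The FPU contents are left untouched; only the trap is armed
        # if the registers belong to a different task.
        self.cpu.ts = self.fpu_owner is not task

    def handle_nm(self):
        if self.fpu_owner is not None:
            self.fpu_owner.fpu_state = self.cpu.fpu_regs  # save old state
        self.cpu.fpu_regs = self.current.fpu_state        # restore new state
        self.fpu_owner = self.current
        self.cpu.ts = False
        self.switches += 1

    def task_uses_fpu(self, value):
        try:
            self.cpu.fpu_write(value)
        except DeviceNotAvailable:
            self.handle_nm()
            self.cpu.fpu_write(value)  # retry the faulting instruction


cpu = Cpu()
kernel = LazyKernel(cpu)
victim, other = Task("victim"), Task("other")

kernel.context_switch(victim)
kernel.task_uses_fpu(0xC0FFEE)   # first use traps, then the state is switched
kernel.context_switch(other)     # 'other' never touches the FPU

stale = cpu.fpu_regs             # still holds the victim's value
print(hex(stale), cpu.ts, kernel.switches)
```

The final state is the interesting part: `other` is running, the FPU is disabled, and the registers still contain the value written by `victim`.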

Starting with the introduction of the x86-64 architecture, the presence of at least some SIMD instruction set extensions is mandated and their use has become more widespread. As such, the underlying assumption of lazy FPU switching is no longer valid. The performance gain from lazy FPU switching has become negligible, and some kernels have already removed it in favor of eager switching.

The Attack

The attack works by reading the FPU register contents while the FPU is disabled and disclosing the contents via a cache-based side channel. At this point, the FPU still contains the register content of the last process that used it.

To make this attack practical, it is necessary to hide the exception that would normally happen when accessing the FPU. We found three different ways to do so:

Method 1: Executing the attack in the shadow of a page fault

All three versions work similarly. A simplified page fault version first executes an instruction that causes a page fault. In the shadow of this fault, it reads the xmm0 FPU register, masks a single bit with an and instruction, and uses the result as an offset for a memory access into a buffer called mem.

The trick here is that the #NM exception never occurs, because the instruction that touches the xmm0 FPU register is only speculatively executed. Yet, the memory location that is touched depends on the value of xmm0, and this value can be recovered using the same flush-and-reload attack that is also used in Meltdown and Spectre. mem is just an ordinary memory buffer that is accessible to the application. Once the page fault has been handled, we are back to the initial state and can retry the attack to leak the entire register state bit by bit (always masking a different bit in the and step).
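The bit-by-bit recovery loop can be illustrated with a purely software model. The following sketch is not an exploit and performs no speculative execution or timing measurement: `ToyCache` is a hypothetical stand-in for the real cache, `transient_access` models the speculatively executed instructions, and the stride of 64 bytes (one cache line per bit value) is an assumption of this sketch.

```python
CACHE_LINE = 64  # assumed stride so that each bit value maps to its own line


class ToyCache:
    """Tracks which offsets of mem are cached; stands in for access timing."""

    def __init__(self):
        self.cached = set()

    def flush(self, offset):
        self.cached.discard(offset)

    def touch(self, offset):
        self.cached.add(offset)

    def is_fast(self, offset):
        return offset in self.cached


def transient_access(cache, xmm0, bit):
    """Models the speculative part: read xmm0, mask one bit (the and step),
    and touch mem at an offset derived from that bit."""
    value = (xmm0 >> bit) & 1
    cache.touch(value * CACHE_LINE)


def leak_register(xmm0, width=128):
    """Recovers the register value one bit per attempt via the toy cache."""
    cache = ToyCache()
    recovered = 0
    for bit in range(width):
        # Flush both candidate lines of mem.
        cache.flush(0)
        cache.flush(CACHE_LINE)
        # The transient access leaves a trace in exactly one of them.
        transient_access(cache, xmm0, bit)
        # Reload: the line that is now fast reveals the bit.
        if cache.is_fast(CACHE_LINE):
            recovered |= 1 << bit
    return recovered


secret = 0x0123456789ABCDEF_FEDCBA9876543210
print(leak_register(secret) == secret)
```

In the model, `leak_register` only learns about `xmm0` through the state of the toy cache, which mirrors how the real attack only observes which part of mem became fast to access.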

Please note that the sample code in this article just reads a single bit for the sake of simplicity. Our proof of concept reads multiple bytes at the same time in order to further reduce the size of the required time window.

Method 2: Executing the attack in a transactional memory (TSX) transaction

The TSX version works in the same way.

With TSX it is possible to perform multiple reads and writes from/to memory within a transaction, which fails if the affected memory cells were modified from elsewhere during the transaction, or if any kind of protection violation occurred. The hardware rolls back the changes of the failed transaction and signals the failure in a way that is much cheaper than a page fault.

The key difference is that the #NM exception is generated inside a hardware transaction (RTM), which is always aborted by the transient #NM. This version leaks the desired register content much faster as it removes the expensive page fault handling from the execution path.

Method 3: Executing the attack in the shadow of a branch / Retpoline

Hiding the #NM exception in the shadow of a mispredicted branch, as suggested in this article about the microarchitectural details, is even faster because no recovery path is needed at all.

“Hiding in the shadow of a mispredicted branch” means that the attack code tampers with the return address on the stack, which sabotages the CPU’s branch prediction for the next return instruction. Code that is then executed based on this induced misspeculation will successfully trigger the side channel without causing #NM exceptions.

Practicability

The practicability of the attack depends on the method used. The attacker needs a time window without interruption by the operating system in order to read all bits of the register, because every preemption may change the register state and render the result useless. Thus, the faster the attack method is, the more likely it is to succeed.

Different throughput per method

Method        Cycles    Eff. Throughput
Page fault    359.9k    0.22 MiB/s
Intel TSX      25.4k    3.12 MiB/s
Retpoline      24.0k    3.30 MiB/s
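A quick consistency check of the table: if the Cycles column is read as the cost of one attack attempt (an assumption of this sketch), throughput should be inversely proportional to it, so the product of both columns should be roughly constant. The snippet below only uses the numbers from the table.

```python
# (method, cycles, effective throughput in MiB/s), copied from the table.
measurements = [
    ("Page fault", 359_900, 0.22),
    ("Intel TSX", 25_400, 3.12),
    ("Retpoline", 24_000, 3.30),
]

# If throughput is inversely proportional to cycles, this product is constant.
products = {name: cycles * mib_s for name, cycles, mib_s in measurements}
for name, product in products.items():
    print(f"{name:<10} cycles x throughput = {product:,.0f}")

# Ratio between the largest and smallest product; close to 1 means consistent.
spread = max(products.values()) / min(products.values())
print(f"spread: {spread:.4f}")

# Speedup of each method relative to the page fault version, based on cycles.
baseline = measurements[0][1]
speedups = {name: baseline / cycles for name, cycles, _ in measurements}
```

The products agree to within about 0.1 percent, and the cycle counts put the TSX and Retpoline versions at roughly 14 to 15 times the speed of the page fault version.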

Mitigation

On GNU/Linux with kernel version 3.7 or later, this attack can be mitigated by adding “eagerfpu=on” to the kernel’s boot parameters. We are not aware of a workaround for older versions. For all other operating systems, install the official update(s) from the vendor.
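Whether the parameter was actually passed to the running kernel can be checked by looking at the kernel command line, which Linux exposes in /proc/cmdline. The helper below is a minimal sketch; the function names are made up for illustration, and the example command line is invented.

```python
from pathlib import Path


def eagerfpu_setting(cmdline: str):
    """Return the value of the eagerfpu= parameter, or None if it is absent.

    If the parameter is repeated, this sketch reports the last occurrence.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("eagerfpu="):
            value = token.split("=", 1)[1]
    return value


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text()


# Invented example command line; on a real system use read_kernel_cmdline().
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet eagerfpu=on"
print(eagerfpu_setting(example))
```

Note that a missing parameter does not imply that lazy switching is in use. In that case the kernel's default applies, which depends on the kernel version.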

We are not aware of a performance penalty caused by using eager FPU switching. The commit message of the Linux kernel patch that made eager FPU switching the default on Linux systems even points out that the assumptions which justified lazy FPU switching in the past do not hold any longer on the majority of modern systems.

