Understanding Spectre and Meltdown Vulnerability – Part 3

In this blog post, we are going to talk about the Spectre vulnerability and how it affects current systems all around the globe. As already discussed in previous blog posts, the processor uses speculative execution along with a branch predictor to speculatively execute instructions for better utilization of CPU cycles.

Spectre attacks involve speculatively executing some instructions which would otherwise never get executed. These speculatively executed instructions bring about changes in the microarchitectural state, e.g. the CPU cache, which can be leveraged in side-channel attacks. These side-channel attacks include the Flush+Reload attack ( the same attack used in the Meltdown vulnerability ).

Understanding Spectre

The basic difference between the Spectre and Meltdown attacks is that in Spectre we trick a process into revealing its own data or secrets, unlike Meltdown, where we trick the kernel into revealing data present in kernel memory.

Imagine you are running a web browser. In that browser, you are running multiple apps, and all those apps share some common address space. Now how does the underlying VM make sure that these apps cannot access the contents of this common address space, which might contain secret data or passwords? This is ensured by having relevant checks before accessing any memory location.

Say I have an app A in a browser, which creates and uses some arrays in JavaScript.
( Just FYI: this JavaScript code segment gets executed in our browser 😀 😀 )

var fruits = ["Banana", "Orange", "Apple", "Mango"];
var fLen = fruits.length;
var text = "<ul>";
for (var i = 0; i < fLen; i++) {
    text += "<li>" + fruits[i] + "</li>";
}
text += "</ul>";

Question: What stops this piece of code from accessing fruits[1000]?
Answer:    If this piece of code tries to access anything beyond the array length, the VM returns undefined ( JavaScript specifics ), because internally, for every array access, the VM makes sure that the access made is within the array bounds, i.e. less than the array size. So internally, every array access is effectively guarded by a check like this:

var value = (x < fruits.length) ? fruits[x] : undefined;

In this way, the underlying VM prevents the apps running on a single VM from accessing the secret data or passwords stored in the common address space.

Spectre attacks provide a way to break this isolation provided by VMs ( browsers in our case ). With this attack, we can get hold of any secrets or passwords stored in this common memory space, given that we know beforehand the address at which these secrets/passwords are stored.

Deep Dive into Spectre Attack

A Spectre attack starts with training the branch predictor to take a particular branch in a particular code segment. This is done by invoking the target code with enough values which result in that particular branch being taken. After this training, the processor starts speculatively executing the instructions in the predicted branch. At this moment, the attacker passes a malicious value to the target code. The processor still speculatively executes the instructions in the predicted branch, even for that malicious value.

As soon as the processor realizes that it has taken the incorrect branch, it quickly discards the state ( registers etc ) of the speculatively executed instructions. But during this speculative execution window, some microarchitectural changes have already been done which could be used by the attacker to gain information about some secret information hidden inside process memory.

Let’s understand this with an example:

if ( x < array1.size()) {
  int value = array2[array1[x] * 4096]; // branch 1
}

Note: In this simple example, we are simply checking whether x is within the array limits. If it is, then we fetch the value at offset x in array array1 and use it to index into array2. Also, array2 is a large enough array to accommodate all the values it may be indexed with.

So initially, the attacker executes this code segment with valid values of x which are inside the array limits, and during this, the processor, with the help of the branch predictor, starts speculatively executing branch 1, i.e. array2[array1[x] * 4096]. Once the branch predictor is trained to speculatively execute branch 1, the attacker does the following things:

  • The attacker passes a value of x which is way outside the array limits of array1, i.e. x > array1.size().
  • Before starting the Spectre attack, the attacker makes sure that the relevant cache lines are flushed from the CPU cache, in particular the line holding array1.size(). This ensures the processor stalls while array1.size() is fetched from memory, and during this stall it starts speculatively executing the branch 1 instructions.
  • During this speculative execution window, the processor fetches the value at address (array1 + x). As the offset x is way beyond the array limits, the address (array1 + x) may easily point at some hidden secrets or passwords within the process’s memory.
  • Now this value is brought into registers and the rest of the instructions in the branch are executed, i.e. array2[ Secret * 4096 ]. After these instructions complete, we can be sure that the value array2[ Secret * 4096 ] is now present in the CPU cache.
  • The rest of the steps are more or less similar to the steps already explained for the Meltdown attack, in which we iterate over all the possible values of Secret and check how much time it takes to load the corresponding value from memory. See [LINK] for more details about the steps.

In this way, we can gain access to some secret information stored inside the process’s memory ( passwords or documents ) byte by byte.

Note: In JavaScript, we don’t have CLFLUSH to ensure the CPU cache is flushed, but the same effect can be achieved indirectly by using the Evict+Reload technique.

See this link for more details.

In Evict+Reload, eviction is achieved by forcing contention on the cache set that stores the line, e.g. by accessing other memory locations which get brought into the cache and ( due to the limited size of the cache ) cause the processor to evict the line that is subsequently probed.

Dealing with Spectre Vulnerability

There is no straightforward way to mitigate this vulnerability. One of the few ways ( discussed in this paper ) by which we can mitigate it is by disabling speculative execution of instructions in critical sections of code. These code segments can be decided by the VMs on the basis of two things:

  • Could these segments potentially be used to speculatively execute instructions and then get hold of the secrets or keys in protected areas of process memory?
  • How much of a performance hit would the application take if we disable speculative execution in those code segments?

Instructions like LFENCE make sure that we block the execution of further instructions until all the instructions up to that point have completed:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

 

References:

Understanding Spectre and Meltdown Vulnerability – Part 2

One of the core security features of a modern operating system is to provide isolation between the different processes running on a system and make sure that one process is not able to see the data used by another process. In our current ecosystem, we often have multiple VMs running as different processes on a single machine; assuming one might be a victim and the other an attacker, we need this process isolation more than ever.

We already know from our previous blog post that this isolation in modern processors is provided by privilege levels. A process running with a lower privilege level, i.e. a user process, does not have access to memory regions having higher privileges, i.e. kernel memory.

With the Meltdown vulnerability, this security feature crumbles. With Meltdown, a process can now access kernel memory, which might contain critical data about other processes. Meltdown uses a combination of the Flush+Reload attack and speculative execution to melt these boundaries between the kernel and user processes.

Understanding Meltdown

Meltdown works on the core concepts of speculative execution and the Flush+Reload attack. Speculative execution gives us the liberty to allow an instruction to execute even though the previous instruction’s execution has not completed. We exploit this behavior of modern processors to execute some instructions which would not have been executed otherwise. There are two parts to this:

  • Transmitting the secret from kernel memory to CPU Cache via Speculative Execution
  • Receiving the secret from CPU cache to User Memory via Flush-Reload Attack

Let’s take this code example, which will bring the secret information stored in kernel memory into the CPU cache.

uint8_t p = *(uint8_t*)(kernel_address);
uint8_t val = probe_array[p * 4096];

In the above example, we are essentially trying to read the memory location at kernel_address ( into a variable p ) from a userspace program. After this first statement, we access the contents of a large probe_array at an offset of ( p * 4096 ).

So when we execute this code, ideally we should receive a segmentation fault while executing the first statement itself, due to the invalid access. But due to speculative execution, while we are executing the first statement, the processor already starts with the execution of the second instruction.

Figure 1: Showing Basic Overview of Modern Processor ( Source: wikichip )

Every instruction given to the processor for execution is broken down into a sequence of µOPs. These µOPs are the basic units of execution on modern Intel processors. There can be multiple µOPs executing simultaneously on a single processor. If any of the µOPs in this speculative execution window errors out ( e.g. raises an exception ), then all the subsequent instructions or µOPs are discarded and their state is cleared from the registers.

So here is the sequence of events which happens:

  1. Before starting the instructions, we will make sure that we have flushed out all the contents of the probe_array and there is no entry of probe_array in CPU cache.
  2. The processor starts with the instruction for reading the memory contents from kernel_address.
  3. When the kernel address is loaded in statement 1, it is likely that the CPU has already issued the subsequent instructions ( i.e. p * 4096 and probe_array[p * 4096] ) as part of speculative execution, and that their corresponding µOPs are waiting in the reservation station for the content of the kernel address to arrive. As soon as the fetched data ( i.e. the value stored at kernel_address ) is observed on the common data bus, these µOPs can begin their execution.
  4. When these µOPs finish their execution, they retire in order and are checked for possible exceptions in order. As our first statement, i.e. loading the kernel address, throws an exception, the pipeline is flushed to eliminate all the results of the instructions which were executed speculatively. In this way, we throw away the computations done speculatively which otherwise would not have been executed at all.
  5. Now let’s take the instruction accessing probe_array[p * 4096]. We know that during the execution of the above µOPs in the speculation window, this value was loaded into some physical register, and the processor threw away that result once it knew the memory lookup at kernel_address was illegal. But when probe_array[p * 4096] was loaded from memory into the register, the processor also changed the cache state: this value got loaded into the cache.
  6. So at the end of this speculation window (after making illegal access to kernel_address), the only change we made to the CPU Cache state is that value at probe_array[p * 4096] will now be cached ( because step 1 flushes out all the values for probe_array )
  7. Now we will iterate over all the possible values of p, i.e. from 0 to 255, and for each value of p, we will check whether probe_array[p * 4096] is cached or not. This can be done easily using the Flush+Reload attack from the previous blog post.
  8. If it is cached, then we have identified the value of p; otherwise, we keep iterating until we find such a p.

Note: This value of 4096, or 4 KB, has been chosen to ensure that there is a large spatial distance between any two values of p, so that the hardware prefetcher does not cache other addresses of probe_array ( corresponding to other values of p ) into the L1 or L2 cache, which might lead to fuzzy results.

Dealing with this Meltdown Vulnerability

This vulnerability can be used to attack any system running on cloud platforms which share resources with other systems. Even our standalone desktops and laptops are at risk, because of browsers and JavaScript: JavaScript allows websites to run custom code on our system.

So we need to mitigate Meltdown ASAP. Here are some of the mitigation techniques employed by the operating system vendors.

Kernel Page Table Isolation ( employed by Linux Kernel )

We already know from the previous blog post that kernel memory and user memory are mapped into a single page table. Although access to the contents of kernel memory is prohibited in user mode because of the lower privilege level, they are still mapped into a single page table. And because both are mapped into a single page table, a user process is able to speculatively get hold of the values stored in kernel memory ( which we just read about ).

This mapping of both kernel memory and user memory into a single page table is essential for performance: when we make a system call, we need not load a new page table into the CR3 register or repopulate the TLB with the kernel page table entries.

Kernel Page Table Isolation proposes to segregate the kernel page table entries from the user process page table. So when a process is running in user mode ( lower privilege mode ), it has no access whatsoever to any of the kernel page entries ( barring a few entries which are essential ). But when the process makes a system call, then in kernel mode, the kernel has access to all the page table entries of the kernel + process.


While executing in kernel mode ( privileged mode ), it is necessary to have the page table entries for the user process as well, because the nature of the system call might be such that the kernel needs to copy some data from kernel memory to user memory, which is possible only when both user + kernel page table entries are visible while executing in privileged mode.

Read this for more details about kernel page table isolation.

References:

Understanding Spectre and MeltDown Vulnerabilities – Part 1

In this blog series, we will go through one of the biggest security vulnerabilities of recent times, i.e. Spectre and Meltdown. This article is mostly centered around understanding the concepts which will be necessary for understanding the internals of these two vulnerabilities.

How is a program executed?

A program is simply a series of instructions which are present in memory. These instructions are executed by our processor one by one. Every instruction executed by the CPU is executed within a privilege level.

From Wikipedia

A privilege level in the x86 instruction set controls the access of the program currently running on the processor to resources such as memory regions, I/O ports, and special instructions.

So it essentially means that any instruction executed on a processor running within a particular privilege level might have access to some restrictive subset of system resources ( e.g. memory region, IO ports ).

The Intel x86 architecture offers a total of 4 privilege levels, which might or might not be used by operating system vendors. Linux, for that matter, uses only two privilege levels:

  • Ring 0 ( highest privilege )
    • The kernel operates in this mode. This privilege mode makes sure we have access to all the hardware ( ports ), instruction sets, and memory. This is necessary because the kernel needs to access all the hardware devices and different processes’ memory regions. So it makes complete sense to put the kernel in full privilege mode and let it access every hardware device and every memory region.
  • Ring 3 ( lowest privilege )
    • All user processes run in this mode. In this mode, a user process is limited to using its own segment of memory. For any hardware-related task ( be it disk IO or network IO ), it has to involve the kernel by making appropriate system calls. System calls are the way to change the privilege mode from user mode to kernel mode.


Note: Just to add to this, it’s not only memory regions / IO ports; these levels also prohibit privileged instructions, like HLT and RDMSR, from getting executed in user mode. Read this for more details.

Memory Isolation

Memory is divided among the different processes as well as the kernel via the concept of page tables. Every process has a page table, and this page table stores entries pointing to the physical pages in RAM.

Processes use virtual addresses instead of physical addresses to store/load content. This implicit conversion from virtual address to physical address is done with the help of page tables, which store the addresses of the physical pages. These page tables are themselves stored in memory, so any virtual-to-physical address conversion involves multiple memory accesses ( around 100 cycles for each memory access ), which might slow down our processing. For that, we have the TLB ( Translation Lookaside Buffer ), which is essentially a fast cache for this virtual address to physical address mapping.

This page table is also divided into two segments: one for the user process page table entries and another for the kernel page table entries. The kernel page table is essential for the memory addressing which happens while executing in kernel mode ( privileged mode ) for accessing kernel data structures.


As we already know, every user process stores some information in memory and addresses these memory locations via virtual addresses. We already have the TLB for storing the virtual address to physical address mapping, but we finally need to hit memory to get the contents stored at that memory location ( physical address ). If our application involves a lot of memory accesses ( which is generally the case ), this might slow down our processing. To save these memory accesses, we have CPU caches in place. These CPU caches cache the contents of those physical addresses and save us those costly memory accesses.

For a better understanding of these CPU caches, read this.

Up to this point, I hope you have a decent understanding of CPU architecture in general. But before going into the Spectre and Meltdown vulnerabilities, let’s understand the building blocks of these attacks.

Flush Reload Attack

In this attack, the attacker exploits the cache behavior to identify the accesses the victim process makes to memory. The L3 cache is shared among different processes running on different cores, so essentially, with the help of this attack, we can monitor the instructions executed by the victim process.

Question: But how can we exploit the cache behavior?

Answer:  We already know that if a memory location is cached in the L3 cache, a tremendous number of CPU cycles is saved, which essentially means that the time taken for an uncached read, i.e. from RAM, is much higher compared to a cached read, i.e. from the CPU cache ( L3 in this case ).

So basically, if an attacker wants to figure out whether a memory location has been accessed by another process running on a different core, the attacker just needs to measure the time it takes to access that particular memory location. If that time is on the higher side ( for which we need to train a simple classifier ), then it has not been read by the process; but if the time is on the lower side, then we know for sure that the process has recently accessed that particular memory location. One prerequisite is that, before starting this attack, we need to be sure that the memory location ( or line ) has been flushed out.

So with the help of this attack, the attacker can figure out what the victim is essentially doing and which segment of code it is executing.

One of the other interesting observations made in this paper is that we can also figure out the data on which the victim operates. This is a bit non-trivial in itself, so let’s understand it with the help of an example.

Victim Process A

for (int i = 0; i < PUBLICLY_NOT_KNOWN; i++) {
    performFunction(PUBLICLY_KNOWN);
}

  • We have a number PUBLICLY_KNOWN
  • We perform a certain operation, i.e. performFunction, on this number
  • This function will be called PUBLICLY_NOT_KNOWN times
  • Our motive is to find the number PUBLICLY_NOT_KNOWN

Attacker Process B

  • We already know PUBLICLY_KNOWN
  • We already know the memory address of the performFunction function
  • We also know that this function takes around t ms
  • Let’s start by flushing the cache line for the memory location of performFunction
  • Also, let’s initialise PUBLICLY_NOT_KNOWN = 0
  • Now, after every t ms, we will check whether this function has been accessed. This can be done with the above-mentioned Flush+Reload technique.
  • If yes, then increment the current value of PUBLICLY_NOT_KNOWN by 1. If no, then it means that the loop has terminated.
  • At the end of this, we know the value of PUBLICLY_NOT_KNOWN

So this Flush+Reload attack can be used by the attacker to identify other secrets inside the victim process’s memory. The above-mentioned methodology can be used to find an RSA decryption key in the same way as we have explained. See this for more details.

Speculative Execution

From Wikipedia

Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

Earlier processors performed in-order processing of instructions, i.e. processing instructions one by one. But with speculative execution, a processor can make certain speculations regarding the control flow of the program and pipeline the appropriate instructions. Speculative execution has increased the performance of modern processors tremendously.

To understand speculative execution, let’s take this simple example:

if (x < p.size) { // first instruction
  int b = p[x];   // second instruction
} else {
  int b = 1;      // third instruction
}

In this code, we can see that we are checking for bounds for x and if x is within the bounds, then we are accessing the data at x offset in the array p.

In-order processing would mean that each and every time, x would be checked against the bounds, and only after checking those bounds would we fetch the memory location at offset x. In other words, we would execute instructions serially, one after the other.

But with speculative execution, we need not wait for the first instruction, i.e. the bounds check, to complete before starting any further instructions. Speculative execution together with the branch predictor says:

“As most of the times during this particular code execution, branch 1 is taken, so this time also lemme take branch 1”

So speculative execution, with the help of the branch predictor, takes branch 1, i.e. the second instruction, and reads the memory location at offset x in the array p.

Note: We might also end up in situations where we have made a wrong speculation and executed the wrong branch. In those cases, the speculatively executed instructions are squashed by the processor and all the state ( registers ) associated with them is cleared.
