Handling Large Amounts of Data with Parquet – Part 2

Parquet provides various configurations that let applications control how the library handles writes. In this blog post, we will go over these configurations and understand how they affect the overall throughput of writes and reads as well as the compression achieved.

Parquet exposes the following configurations, which applications can tweak to fit their use case:

  • Row Group Size
  • Compression Codecs
  • Dictionary for Column Values
  • Data Page Size and Dictionary Size

Row Group Size

The row group size threshold decides when we need to flush the in-memory data structures into a row group and append it to the parquet file.

if (memSize > nextRowGroupSize) {
  flushRowGroupToStore();
  initStore();
}

Note: Computing the size of the parquet writer's in-memory data structures is a somewhat costly operation, so it is not done for every record that is added. Instead, the check runs after every k records, and k is re-estimated each time the check runs.
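The snippet below is a rough sketch of how such an adaptive check might look (the field and method names are assumptions for illustration, not the actual Parquet implementation):

// A sketch of the adaptive size check (illustrative only).
public abstract class AdaptiveSizeCheckSketch {
  private static final long MIN_RECORDS_BETWEEN_CHECKS = 100;
  private static final long MAX_RECORDS_BETWEEN_CHECKS = 10_000;

  private long recordCount;                                           // records buffered so far
  private long recordCountForNextMemCheck = MIN_RECORDS_BETWEEN_CHECKS;
  private final long nextRowGroupSize;                                // row group threshold in bytes

  protected AdaptiveSizeCheckSketch(long nextRowGroupSize) {
    this.nextRowGroupSize = nextRowGroupSize;
  }

  // Called once per record; the (costly) buffered-size computation only runs
  // when recordCount reaches recordCountForNextMemCheck.
  protected void recordWritten() {
    recordCount++;
    if (recordCount < recordCountForNextMemCheck) {
      return;
    }
    long memSize = bufferedMemorySize();
    if (memSize > nextRowGroupSize) {
      flushRowGroupToStore();
      initStore();
      recordCount = 0;
      recordCountForNextMemCheck = MIN_RECORDS_BETWEEN_CHECKS;
    } else {
      // Estimate how many more records fit, assuming future records are about
      // the same size as the average record buffered so far.
      long avgRecordSize = Math.max(1, memSize / Math.max(1, recordCount));
      long remaining = (nextRowGroupSize - memSize) / avgRecordSize;
      recordCountForNextMemCheck = recordCount
          + Math.min(Math.max(remaining / 2, MIN_RECORDS_BETWEEN_CHECKS), MAX_RECORDS_BETWEEN_CHECKS);
    }
  }

  protected abstract long bufferedMemorySize();
  protected abstract void flushRowGroupToStore();
  protected abstract void initStore();
}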

Compression Codecs

Parquet 1.10 supports the following 6 compression codecs:

  • Snappy
  • GZIP
  • LZO
  • Brotli
  • LZ4
  • ZSTD

Along with these 6 compression codecs, there is also the option of applying no compression at all, i.e. UNCOMPRESSED, which stores the data page bytes and dictionary page bytes as they are.

Dictionary for the Column Values

As discussed in the previous blog article, parquet provides the ability to encode the values of a column in a dictionary. The application can control whether dictionary encoding is enabled. Maintaining a dictionary has several advantages:

  • In practice, the total size of the dictionary plus the encoded values is usually smaller than the total size of the raw values. This is because values are often duplicated across rows, so building a dictionary removes those duplicate entries from the column data.
  • Dictionary encoding also enables quick filtering over column values. As discussed in the previous blog, the dictionary can be used to reject row groups that do not contain particular terms. E.g. if the dictionary for a column does not contain the term “Sheldon”, we do not need to read each column value to figure out which records contain it; we can simply skip the row groups whose column dictionary does not contain that term.

Data Page Size and Dictionary Size

Parquet also allows us to specify the data page size and dictionary page size. These configurations define the maximum in-memory size of each column's data values and dictionary values respectively. The dictionary page size is also used to decide whether a column should fall back to the plain value writer instead of the dictionary value writer.

@Override
public boolean shouldFallBack() {
  // if the dictionary reaches the max byte size or the values can not be encoded on 4 bytes anymore.
  return dictionaryByteSize > maxDictionaryByteSize
      || getDictionarySize() > MAX_DICTIONARY_ENTRIES;
}
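For reference, here is a hedged sketch of how these knobs might be set through the parquet-avro writer builder (builder method names can vary slightly across Parquet versions, so treat this as illustrative):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriterConfigExample {
  public static ParquetWriter<GenericRecord> buildWriter(Path path, Schema schema) throws Exception {
    return AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema)
        .withRowGroupSize(8 * 1024 * 1024)        // flush a row group roughly every 8 MB
        .withPageSize(1024 * 1024)                // max in-memory size of a column data page
        .withDictionaryPageSize(512 * 1024)       // max dictionary size before falling back to plain encoding
        .withDictionaryEncoding(true)             // enable dictionary encoding for columns
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
  }
}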

 

Encodings In Parquet

Apart from dictionary encoding and plain encoding, parquet uses multiple other encoding schemes to store the data optimally.

  • Run Length Bit Packing Hybrid Encoding
    • This encoding uses a combination of run length + bit packing encoding to store data more efficiently. In parquet, it is used for encoding boolean values.
      Eg. 
      
      We have a list of boolean values say 
      0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1 (0 = false and 1 = true)
      
      This will get encoded to
      1000,0000,1000,0001
      
      where 1000 => 8, which is the number of occurrences of 0
            0000 => 0, which is the value being repeated
            1000 => 8, which is the number of occurrences of 1
            0001 => 1, which is the value being repeated

      Note: There are some other nuances as well, which have been left out of this article for simplicity. A simplified sketch of the run-length half of this encoding appears after this list.

  • Delta Binary Packing Encoding
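Below is a simplified sketch of the run-length half of the hybrid encoding applied to the boolean example above; the real Parquet encoding also bit-packs values and uses a different on-disk layout, so this is purely illustrative.

import java.util.ArrayList;
import java.util.List;

public class RunLengthSketch {
  // Encodes a sequence of 0/1 values into (runLength, value) pairs.
  public static List<int[]> encode(int[] values) {
    List<int[]> runs = new ArrayList<>();
    int i = 0;
    while (i < values.length) {
      int value = values[i];
      int runLength = 0;
      while (i < values.length && values[i] == value) {
        runLength++;
        i++;
      }
      runs.add(new int[] { runLength, value });
    }
    return runs;
  }

  public static void main(String[] args) {
    int[] booleans = { 0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1 };
    // Prints "8 x 0" and "8 x 1", matching the (1000,0000,1000,0001) example above.
    for (int[] run : encode(booleans)) {
      System.out.println(run[0] + " x " + run[1]);
    }
  }
}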

Some Common Questions

Question:  What is the value one should keep for row group size? 
Answer: This depends entirely on your application's resource requirements. If your application demands a low memory footprint for its parquet writers, you have little choice but to keep the row group size small, e.g. 100 KB to 1 MB. But if your application can sustain higher memory pressure, you can afford a larger limit, say 10 MB or more. With larger row group sizes, we get better compression guarantees and better overall write throughput for each column.

Question: Which compression codec should one choose for my application?
Answer: The right compression codec depends on the data your application handles. Compression and decompression speed also play a crucial role in the decision: some applications can take a hit in compression speed as long as they get the best compression ratios. There is no “one size fits all” compression algorithm, but if I had to choose one, I would go with ZSTD because of the high compression ratios and decompression speed it offers. See this for more details.


Handling Large Amounts of Data with Parquet – Part 1

In this era of technological advancement, we are producing data like never before. Take, for example, the Large Hadron Collider, which produces data at a rate of roughly 1 PB per second. Given that we are producing data at these volumes, we need efficient storage formats which can provide:

  1. High compression ratios for data containing multiple fields
  2. High read throughput for analytics use cases.

Parquet is a widely accepted solution for providing these guarantees. It offers better compression ratios as well as better read throughput for analytical queries thanks to its columnar storage format. Because of this columnar layout, values for a particular column are aligned and stored together, which provides:

  • Better compression
    • All the values of a column are stored side by side, and because of the very similar nature of that data, compression algorithms perform much better on it.
  • Better throughput
    • Analytics use cases mostly involve querying on particular dimensions, which means with parquet we have the benefit of extracting only the fields that are absolutely necessary for the query to execute, e.g. select count(*) from parquet_files where student_name like “%martin%”. For this query, we only extract the column student_name and save the precious IO cycles that would otherwise be wasted reading the unnecessary fields.
    • Parquet also stores some metadata for each of the row groups, which helps us avoid reading whole blocks and saves precious CPU cycles.

Understanding Parquet Layout

Before looking into the layout of the parquet file, let’s understand these terms.

  • Row Group
    • Column Chunk
      • Data Page
      • Dictionary Page
  • Metadata Information


Row Groups

A parquet file is divided into smaller row groups, each of which contains a subset of the rows. The number of rows in each row group is governed by the block size specified in the ParquetWriter. The ParquetWriter keeps adding rows to the current in-memory row group, and when its memory size crosses a threshold, the in-memory row group is flushed to the file. So essentially, the collection of rows which are flushed together constitutes a row group.

Column Chunks

Each row group stores its subset of rows in column chunks. Say the data we are writing to the parquet file contains 4 columns; then all the values of column1 for this subset of rows are stored contiguously, followed by the values of column2 for this subset of rows, and so on.


Data Page and Dictionary Page

Each column chunk is divided into data pages and, if dictionary encoding is enabled, a dictionary page.

When dictionary encoding is enabled, we store the keys in the dictionary and the references to these keys in the data columns. On flushing, the references in the data columns are written to the data pages, and the actual keys in the dictionary are written to the dictionary pages. When dictionary encoding is not enabled, we store the actual data in these data columns instead of the references.
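Conceptually, a dictionary-encoding column writer behaves like the sketch below (illustrative only; the real Parquet writers also handle fallback, binary layouts and many value types):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of dictionary encoding for one string column.
public class DictionaryColumnSketch {
  private final Map<String, Integer> dictionary = new LinkedHashMap<>(); // value -> id
  private final List<Integer> encodedValues = new ArrayList<>();         // ids for the data page

  public void write(String value) {
    Integer id = dictionary.get(value);
    if (id == null) {
      id = dictionary.size();
      dictionary.put(value, id);   // the key goes to the dictionary page on flush
    }
    encodedValues.add(id);         // the reference goes to the data page on flush
  }

  public Map<String, Integer> dictionaryPageContents() { return dictionary; }
  public List<Integer> dataPageContents() { return encodedValues; }
}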


Parquet Metadata Information

Each parquet file stores some metadata at its end, i.e. in the footer. This metadata contains information about the following:

  • Row Groups References
    • Column Chunk References inside of those Row Group References
      • Dictionary Page References inside those column chunks
      • Data Page References inside those column chunks
      • Column Chunk Statistics
      • Column Chunk Encodings
      • Column Chunk Sizes
    • Number of rows in the Row Group
    • Size of the data in the Row Group
  • Some Additional File Metadata

Writing to a Parquet File

As explained in the previous sections, parquet stores data in row groups, and each row group stores its records as column chunks.

The ParquetWriter keeps the column values of the last k records in memory, so writing a record to a parquet file usually just appends these values to in-memory buffers. Once the memory limit for these buffered column values is exceeded, we flush them to the parquet file. All the records which were buffered in memory together constitute a row group. This is done to save disk IOs and to improve write throughput.


Reading from a Parquet File

Before reading the records from the parquet file stream, we need to be aware of the layout of the file. By layout, we mean the following things

  • Row Groups Offsets
  • Column Chunks Offsets within those row groups
  • Data Page and Dictionary Page Offsets

To learn this layout, we first read the file metadata, which is always stored at a known location at the end of the file. After reading this metadata, we infer the offsets of the different row groups and of the column chunks stored within them. With these offsets, we can iterate over the row groups and column chunks to get the desired values.
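A rough sketch of reading that footer with the parquet-hadoop API might look like the following (method names are from parquet-mr 1.10-era APIs and may differ across versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterExample {
  public static void printLayout(Path path, Configuration conf) throws Exception {
    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      ParquetMetadata footer = reader.getFooter();
      for (BlockMetaData rowGroup : footer.getBlocks()) {
        System.out.println("row group: rows=" + rowGroup.getRowCount()
            + " totalByteSize=" + rowGroup.getTotalByteSize());
        rowGroup.getColumns().forEach(column ->
            System.out.println("  column " + column.getPath()
                + " startingAt=" + column.getStartingPos()
                + " codec=" + column.getCodec()
                + " encodings=" + column.getEncodings()));
      }
    }
  }
}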

Parquet File also offers the capability of filtering these row groups. This filtering for those row groups can be done via

  • Filter Predicates
    • These filter predicates are applied at job-submission time to check whether they can be used to drop entire row groups, which saves I/O and can improve application performance tremendously. The same predicates are also applied during record assembly to drop individual records, if any. A hedged sketch using Parquet's filter API follows after this list.

Examples described below

Filter Predicate: Filter all records which have the term "Sheldon" 
                  for column 1

Row Group Dictionaries
  Row-Group 1, Column 1 -> "1: Leonard, 2: Howard, 3: Raj"
  Row-Group 2, Column 1 -> "1: Penny, 2: Amy, 3: Bernadette"
  Row-Group 3, Column 1 -> "1: Stuart, 2: Memo, 3: Sheldon"

Now with the help of dictionary pages, we can easily filter out 
the row groups that do not contain any records with the 
term "Sheldon" in column 1

 

Filter Predicate: Filter all records which have the rating > 50 

Row Groups
   Row-Group 1, Column Rating -> (Min -> 20, Max -> 30)
   Row-Group 2, Column Rating -> (Min -> 45, Max -> 80)
   Row-Group 3, Column Rating -> (Min -> 40, Max -> 45)

Now with the help of the column chunk statistics, we can easily 
filter out the row groups which do not contain any records with rating > 50.
  • Metadata Filters
    • Metadata filters let you filter row groups based on their row-group metadata fields, e.g. the start and end offsets of those row groups. We can create a metadata filter which keeps only the row groups whose start and end offsets fall within a given offset range.
    • For more details see OffsetMetadataFilter, RangeMetadataFilter
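As a concrete illustration of filter predicates, the rating example above could be expressed with Parquet's filter2 API roughly as follows (a hedged sketch; the column name "rating" and the use of the Avro reader are assumptions):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;

public class FilterExample {
  public static void readHighRatings(Path path) throws Exception {
    // rating > 50: row groups whose (min, max) stats cannot match are skipped entirely
    FilterPredicate ratingAbove50 = FilterApi.gt(FilterApi.intColumn("rating"), 50);

    try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path)
             .withFilter(FilterCompat.get(ratingAbove50))
             .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}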

See Next Blog Article for more details on various configurations provided by parquet for encoding your data.


Exactly Once Semantics with Non-Idempotent Requests

Problem Statement

Currently, in the case of non-idempotent requests, e.g. incrementing a simple counter, we don't really have a good way to deal with failures and retries. This is because it is very difficult to know whether a failed request died before or after the operation completed on the server. This leads to some well-known problems:

  • Duplicate data in case of multiple retries
    • Before 1st Increment ( INCR ) Request: Counter X = 10
    • After 1st Increment Request: Counter X = 11. But the client did not receive the ack for the request due to some network issue.
    • Client Retries again, 2nd Increment Request: Counter X = 12. This time client did receive the ack and hence did not retry again.
  • Loss of data in case of no retries.
    • Before 1st Increment ( INCR ) Request: Counter X = 10
    • The increment request was not able to complete due to a networking issue. But the client did not retry again as the client was not sure if the request was successfully executed on the server


Today's distributed systems typically use the following mechanism to deal with this situation:

  • Generating a RequestID
  • Making the request to the server with this RequestID
  • The server makes sure that this RequestID has not been processed before by looking it up in an in-memory cache of RequestIDs it has already processed.
  • If not processed, it processes the request and adds the RequestID to its cache.
  • After t seconds, it cleans the entries in this cache. So that this cache does not end up consuming a lot of space/memory.
// Pseudocode for the server-side handler
def putRequest(id: RequestId, key: K, value: V) = {
  if (isDuplicateRequest(id)) {
    return ERROR_CODE_DUPLICATE_REQUEST;
  } else {
    store.put(key, value)
    markRequestAsProcessed(id)   // remember this RequestID in the in-memory cache
  }
}

But this methodology is not robust; there are cases where the strategy does not work out. Let's see one where it fails:

  • A Client sends a request to the server. The server successfully serves the request and makes the update.
  • Request Times Out and now the client is network partitioned so the client cannot make the request to the server for the next t seconds.
  • Now when the client can talk back to the server, it tries to issue the request again. But unfortunately, the requestID has already expired from the cache, so the request will succeed again and the update will be applied twice 🙁 🙁

So how can we ensure exactly once writes or updates in a data store?

Solution

So for that, we need to break our non-idempotent operation into these two operations.

  • PREPARE Operation
  • COMMIT Operation

 

Figure: PREPARE and COMMIT Operations

Now let’s dig deep into implementation details for these two operations.

PREPARE Operation

  • The client issuing the request generates a random requestID.
    • In this step, we just add an entry on the server against the requestID, and along with the requestID we also store the operation OP we want to carry out, i.e. INC or DEC.
    • This requestID represents an ongoing operation.
    • Adding this requestID to the datastore is idempotent, i.e. we can add it multiple times with the same outcome.
  • Even if this PREPARE operation fails, we can simply retry it: being idempotent, it can be executed multiple times without any side effect.
PREPARE(id: requestID, key: K)

Insert Entry (X, OP)  for requestID in Table Ongoing_Requests

COMMIT Operation

  • In this operation, we COMMIT the results for this requestID.
    • For this requestID in the datastore, we apply the operation OP on Key X.
    • After applying the operation, we remove the entry for requestID from the datastore.
  • Even if this COMMIT operation is repeated multiple times, still we won’t have any side effects or unwanted state. 
    • In case of COMMIT operation being repeated, we won’t carry the same operation again because we would have already deleted the requestID in case of the first COMMIT operation. Hence the COMMIT operation is idempotent.
COMMIT(id: requestID) 

Update Value = OP(X) for Key X in Table Data
Delete Entry (X, OP) for requestID in Table Ongoing_Requests
  • In the Ongoing_Requests table, we store entries for the keys which have update operations in flight
  • In the DATA Table, we have the actual Key Value Pairs ( Key, Value )
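A minimal sketch of these two operations against an in-memory key-value store is shown below (the store API and class names are assumptions; in a real datastore the COMMIT would update both tables in one transaction, which synchronized only approximates here):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntUnaryOperator;

// Sketch: both operations can be retried any number of times with the same outcome.
public class ExactlyOnceCounter {

  private static class PendingOp {
    final String key;
    final IntUnaryOperator op;     // e.g. x -> x + 1 for INC
    PendingOp(String key, IntUnaryOperator op) { this.key = key; this.op = op; }
  }

  // "Data" table: key -> current value
  private final Map<String, Integer> data = new ConcurrentHashMap<>();
  // "Ongoing_Requests" table: requestId -> (key, operation)
  private final Map<String, PendingOp> ongoingRequests = new ConcurrentHashMap<>();

  // PREPARE: idempotent -- re-inserting the same requestId has no additional effect.
  public void prepare(String requestId, String key, IntUnaryOperator op) {
    ongoingRequests.putIfAbsent(requestId, new PendingOp(key, op));
  }

  // COMMIT: idempotent -- once the requestId has been removed, a retried COMMIT is a no-op.
  public synchronized void commit(String requestId) {
    PendingOp pending = ongoingRequests.remove(requestId);
    if (pending == null) {
      return;                       // already committed (or never prepared)
    }
    data.compute(pending.key, (k, old) -> pending.op.applyAsInt(old == null ? 0 : old));
  }
}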

Conclusion

So breaking a non-idempotent request into PREPARE and COMMIT operations changes its nature to idempotent, which means we are fine with any number of failures or retries. Even in the case of failures and network partitions, retrying either operation (PREPARE or COMMIT) preserves the idempotent behaviour.


Building Replicated Distributed Systems with Kafka

Just a few days I was having this conversation with one of my colleagues. The conversation went as follows:

Me: Why are we going ahead with Design Y and not Design X, even though the latter could improve the performance by a whole lot?
Friend: That was because implementing Design X meant that we had to support replication for fault tolerance. Replication seems to be one of the most challenging problems in computer science, so sadly we went ahead with a sub-optimal design as we did not have any other choice.
Me: Hmm yeah, building replicated systems is really challenging.

In today’s world, we are building and designing systems at scale like never seen before. Every microservice that we build is either stateful or stateless. Now let’s talk a bit about these two kinds of microservices

Stateless Service

These services are relatively easy to manage, and their scaling semantics are simpler compared to stateful services. We don't need to worry much about nodes going down or an instance becoming unserviceable; because these services are stateless, one can directly start using another node (service instance).

Stateful Service

These services are more challenging to manage because they hold state, so we need to design them with the ramifications of nodes going down and the service becoming unserviceable in mind. Essentially, we need to make these services fault tolerant.

There might be two kinds of stateful services:

  • Some Services serve as a cache in front of another stateful service, to guarantee better performance. These services do have a fallback, so fault tolerance is not that big of an issue for these services strictly from state loss perspective.
  • Some other services do not have any fallback, so state which is present on the service is not present anywhere else. So we need to make sure that we provide durability, so that even if some nodes are down (up to a certain limit), there is no data/state loss.


In the context of this blog, we are going to talk about how we can make sure that these stateful services without a fallback remain fault tolerant.

Let's first see how we can make these services fault-tolerant:

  • Provide Replication Support in your Service
    • This means that whenever you make a write request to one of your primary servers, make sure you send this write request to all the other replicas as well.
  • Another option is to use an existing open source solution or service which fits your use case, but that might involve a lot of changes to the source code of that solution.
    • Repurposing an open source solution to one's use case is difficult because you need to understand the whole implementation and its intricate details. Managing a separate fork of the open source project is also often painful.
  • There are also managed solutions provided by different cloud service providers, e.g. DynamoDB, S3 or RDS by Amazon.
    • These managed services cannot always be repurposed to fit all of our use cases.

So essentially, we might need to build our own systems which have to support replication for availability.


Question: So what's the big deal? We just need to support replication, which shouldn't be too hard, right?

Answer: You are totally mistaken here. Solving the replication problem, i.e. keeping two systems consistent, is a challenging task. There is a whole body of literature focussed just on how you can make sure that two systems are in the same state, aka replication. See this for more details.

A Brief Introduction to Replication

From Wikipedia,

Replication involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

There are two kinds of replication models:

  • Synchronous Replication: This guarantees exactly the same replicas at the expense of higher overhead for write calls to the primary server. The overhead arises because every replica has its own latency, and all of these latencies sit in the critical path of every WRITE REQUEST made via the primary server.
  • Asynchronous Replication: This guarantees better response times for writes, but at the expense of durability, because data that was present on the primary but not yet synced to its replicas is lost if the primary dies between two syncs.


Both of these replication techniques have their downsides: as already said, synchronous replication guarantees no data loss at the expense of higher WRITE latencies, whereas asynchronous replication provides much better WRITE latencies (fewer calls on the write path, batched syncs) at the expense of potential data loss.

Note: Also, both of these techniques are not so straightforward to implement and need some deep understanding of the common pitfalls of the replication process in general. See this link for more details.

Kafka to the Rescue

Kafka is an open source stream processing system. We will repurpose Kafka to solve our replication use case. As already mentioned, replication is a challenging problem to implement unless you have had prior experience with it at production scale.

Kafka provides these following guarantees which we will leverage to implement replication for our service.

  • Exactly once Semantics while writing to the Kafka service
    • The Kafka service, together with the Kafka client, makes sure that all writes made to Kafka are idempotent, i.e. there are no duplicate messages and no uncommitted messages.
  • Ordered Delivery of Messages
    • Kafka Service also makes sure that the messages written by producers in a particular order are also read in that particular order by the consumers.
  • Atomic Updates to the partitions
    • Kafka service also makes sure that you can write messages in an atomic fashion to multiple Kafka partitions. Read this for more details.

FYI: You need a bit of understanding of Kafka partitions to follow along; read this to learn more about Kafka. If you don't have time to go through Kafka partitions, consider them logical entities to which producers write data and from which consumers read data in an ordered fashion; a partition is replicated across different nodes, so you need not worry about its fault tolerance. Every piece of data ever written to a Kafka partition is written at a particular offset, and while reading, the consumer specifies the offset from which it wants to consume the data.

Using Kafka to solve replication

Firstly, we need to make sure that we have one Kafka partition for every Replica for our primary server.

Eg. If you want to have 2 Replicas for every primary server, then you need to have 2 Kafka Partitions, one for each of these Replicas.


After we have this setup ready

  1. The primary server needs to make sure that for every WRITE Request made, apart from updating its internal state, it will also write these WRITE Requests to all the Kafka partitions belonging to the different replicas for this primary server itself.
  2. Every Kafka consumer running on those Replicas will make sure that it consumes all the WRITE Requests from the assigned Kafka partition and update its internal state corresponding to the WRITE Request.
  3. So eventually all these replicas will have the same state as of primary server.

Some of the replicas might lag behind because of systemic issues (which are not at all under our control), but eventually all of the replicas will reach the same state.
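A rough sketch of the two sides described above, using the plain Kafka Java client, is shown below (topic name, partition assignment and the applyToLocalState hook are assumptions; the producer is assumed to be configured with a transactional.id and to have called initTransactions()):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ReplicationSketch {

  // Primary side: apart from updating its own state, the primary writes every
  // WRITE request to one partition per replica, inside a single transaction.
  static void replicateWrite(KafkaProducer<String, String> producer,
                             String topic, int numReplicas, String writeRequest) {
    producer.beginTransaction();
    for (int partition = 0; partition < numReplicas; partition++) {
      producer.send(new ProducerRecord<>(topic, partition, null, writeRequest));
    }
    producer.commitTransaction();   // atomic across all replica partitions
  }

  // Replica side: each replica consumes only its own partition, in order,
  // and applies every WRITE request to its local state.
  static void runReplica(Properties consumerProps, String topic, int myPartition) {
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      consumer.assign(Collections.singletonList(new TopicPartition(topic, myPartition)));
      while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
          applyToLocalState(record.value());   // hypothetical state-update hook
        }
      }
    }
  }

  static void applyToLocalState(String writeRequest) { /* update in-memory state */ }
}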

Figure: This diagram shows the interaction between the client and the primary server along with the replicas.

Implementation

One of the pros of this design is that it should not take many man-hours to implement. With this design, we need not worry about which replication scheme to use or what guarantees we want to give in our system; it should work out of the box for most systems.

A few things we would still need to implement in this design:

  • Mechanism to figure out which are the partitions associated with the replicas and if not already there, create and assign partitions to those replicas.
  • After consuming the write request, making atomic updates to all of the replica partitions. Kafka already exposes APIs for making atomic updates to multiple partitions.
  • Reading at a particular offset from Kafka. This should be a really minimal task, as there are many open source libraries (or code segments) which will do the same stuff for you.

Now let’s discuss the positive points of this design

  • We can easily guarantee eventual consistency with this design, and if you want monotonic consistency, that can be achieved by making sure reads are always served by the most up-to-date replica, which can easily be figured out by checking the unconsumed offsets of the replicas' Kafka partitions.


  • High write throughput for most requests on the Kafka cluster; see this for more details. In most published benchmarks, Kafka provides single-digit latencies for most requests. Let's dig into the request latencies:
    Old_Request_Latency = PWT_1 + PWT_2 + ....... + PWT_n
    New_Request_Latency = PWT_1 + KWT_2 + .... + KWT_n
    
    where 
    PWT_1 = Time taken to process request on Node 1
    PWT_2 = Time taken to process request on Node 2
    and so on
    
    KWT_2 = Time taken to write the request to the Kafka partition for replica 2
    KWT_3 = Time taken to write the request to the Kafka partition for replica 3
    and so on
    
    Old_Request_Latency covers writing the request to all 
    of the available replicas.
    New_Request_Latency covers writing the request to the 
    primary server and making sure the write request is written 
    to all the concerned Kafka partitions.

    Essentially, latencies cannot be compared directly between these two designs, but having said that, the extra hop introduced by Kafka adds some overhead, which should be minimal considering Kafka's latencies.

  • If one of the replicas has high latencies at some moment (because of high GC pauses or a slow disk), that would normally increase the latency of WRITE requests in general. But in our case, using Kafka, we easily get over this issue: we just add the WRITE REQUEST to the REPLICA'S KAFKA PARTITION, and it is the responsibility of the REPLICA to sync itself whenever it has some CPU cycles.

Apart from these pros, there are obviously some cons in this design as well.

  • Although Kafka gives high write throughput for the majority of requests, this design adds the overhead of another network hop 🙁. The impact of adding Kafka should be small, because Kafka is heavily optimized for write latency, but there will be some impact nonetheless.

Note: Write latencies for smaller payloads will increase by a higher percentage compared to latencies for bigger payloads. This is because for small payloads the majority of the time is spent on the network round trip, and if we increase the number of round trips, that time is bound to increase. So if we can batch the write requests into a single write request and then write it to a Kafka partition, the overhead of this approach should be minimal.

Also, there is one important optimization that we can do in this current design to improve the write latency.

In the current design, we have n Kafka partitions for the n replicas. If instead we have only a single Kafka partition shared by all n replicas, we get better write throughput, because we no longer need to make a write request to every Kafka partition (n partitions in total). The only requirement is that the n Kafka consumers (running on the n different replicas) must belong to different consumer groups, because Kafka does not allow two consumers of a single consumer group to consume the same topic partition.

So having these n Kafka consumers, each belonging to a different consumer group, should improve the write throughput and latency, as all n consumers read the WRITE requests from a single topic partition.
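The only change needed on the consumer side for this single-partition variant is that each replica uses its own group.id, something along these lines (a hedged sketch; the broker address and naming are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplicaConsumerFactory {
  // Each replica gets its own consumer group, so all of them independently
  // read the same single partition containing the WRITE requests.
  static KafkaConsumer<String, String> consumerForReplica(int replicaId) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "replica-group-" + replicaId);   // one group per replica
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
    return new KafkaConsumer<>(props);
  }
}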

Figure: Diagram explaining the working of replication with a single partition used across all replicas


Understanding Spectre and Meltdown Vulnerability – Part 3

In this blog post, we are going to talk about the spectre vulnerability and how it affects current systems all around the globe. As discussed in the previous blog posts, the processor uses speculative execution along with a branch predictor to speculatively execute instructions and better utilize CPU cycles.

Spectre attacks involve speculatively executing some of the instructions which would otherwise never get executed. These speculatively executed instructions bring about some changes in microarchitectural state i.e. CPU cache which might lead to side channel attacks. These side channel attacks include Flush-Reload Attack (the same attack which is used in Meltdown vulnerability). 

Understanding Spectre

The basic difference between spectre and meltdown attacks is that in spectre we trick the process into revealing its own data or secret, unlike meltdown where we trick the kernel into revealing the data present in kernel memory.

Think about running a web browser. In that browser you are running multiple apps, and all those apps share some common address space. How does the underlying VM make sure that these apps cannot access the contents of this common address space, which might contain secret data or passwords? It does so by performing relevant checks before any memory location is accessed.

Say I have an app A on a browser, which creates and uses some arrays in javascript
( Just FYI This javascript code segment gets executed on our browser 😀 😀 )

var fruits = ["Banana", "Orange", "Apple", "Mango"];
var fLen = fruits.length;
var text = "<ul>";
for (i = 0; i < fLen; i++) {
    text += "<li>" + fruits[i] + "</li>";
}

Question: What stops this piece of code from accessing fruits[1000]?
Answer: If this code tries to access anything beyond the array length, the VM returns undefined (a javascript specific), because for every array access the VM internally makes sure that the access is within the array bounds, i.e. less than the array size. So internally, every array access is guarded by a check like this:

var value;
if (x < fruits.length) {
  value = fruits[x];
}

So in this way, the underlying VM prevents the apps running on a single VM from accessing the secret data or passwords stored in the common address space.

Spectre attacks provide just a way to break this isolation provided by VMs, browsers in our case. With this attack, we can get hold of any secrets or passwords stored in this common memory space, given we know the address at which these secrets/passwords are stored beforehand.

Deep Dive into Spectre Attack

Spectre attacks are first accompanied by training the branch predictor to take a particular branch in a particular code segment. This is done by invoking the target code with enough values which result in that particular branch being taken. After training the branch predictor, the processor starts speculatively executing the instructions in the predicted branch. At this moment, the attacker passes a malicious value to the target code. The processor still tries to speculatively execute instructions even for that malicious value.

As soon as the processor realizes that it has taken the incorrect branch, it quickly discards the state ( registers etc ) of the speculatively executed instructions. But during this speculative execution window, some microarchitectural changes have already been done which could be used by the attacker to gain information about some secret information hidden inside process memory.

Lets’ understand this with an example:

if ( x < array1.size()) {
  int value = array2[array1[x] * 4096]; // branch 1
}

Note: In this simple example, we simply check whether x is within the array limits; if it is, we fetch the value at offset x in array1 and use it to index into array2. Also, array2 is a large enough array to accommodate many values.

Initially, the attacker executes this code segment with valid values of x which are inside the array limits, and during this phase the processor, with the help of the branch predictor, starts speculatively executing branch 1, i.e. array2[array1[x] * 4096]. Once the branch predictor is trained, the attacker does the following:

  • Attacker passes a value of X which is way outside the array limits of array1 i.e. X > array1.size()
  • Before triggering the malicious access, the attacker makes sure the CPU cache is flushed. This guarantees that the processor has to wait while the bounds value, i.e. array1.size(), is fetched from memory, and during that wait it speculatively executes the branch 1 instructions.
  • During this speculative execution window, the processor gathers the value at location (array1 + x) address. As this X offset is way beyond the array limits, this (array1 + x) address may easily contain some hidden secrets or passwords within the process’s memory.
  • Now this memory location is brought onto registers and rest of the instructions in the branch are executed i.e. array2 [ Secret * 4096 ]. After the computation of these instructions, we will be sure that this value array2 [ Secret * 4096 ] would now be present in CPU cache.
  • Rest of the steps are more or less similar to already explained steps in the previous meltdown attack in which we iterate over all the possible values of Secret and check how much time it takes to load the value from memory. See [LINK] for more details about the steps.

In this way, we can gain access to some secret information stored inside the process’s memory ( passwords or documents ) byte by byte.

Note: In javascript, we don't have CLFLUSH to ensure the CPU cache is flushed, but the same effect can be achieved indirectly by using the Evict+Reload technique.

See this link, for more details

In Evict+Reload, eviction is achieved by forcing contention on the cache set that stores the line, e.g., by accessing other memory locations which get brought into the cache and (due to the limited size of the cache) cause the processor to evict the line that is subsequently probed.

Dealing with Spectre Vulnerability

There is no straightforward way to mitigate this vulnerability. One of the few ways (discussed in this paper) to mitigate it is to disable speculative execution of instructions in critical sections of code. The VM can decide which code segments qualify on the basis of two things:

  • Could these segments be potentially used to speculatively execute instructions and then get hold of the secrets or keys in protected areas of process memory.
  • How much of performance hit the application would take if we disable the speculative execution in those code segments.

Instructions like LFENCE make sure that we block the execution of further instructions until all the instructions up to that point have completed.

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

 


Understanding Spectre and Meltdown Vulnerability – Part 2

One of the core security features of the modern operating system is to provide isolation between the different processes running on a system and make sure that one process should not be able to see the data used by another process. In our current ecosystem, we often have multiple VMs running as different processes on a single machine, assuming one might be a victim and other might be an attacker, so we need this process isolation more than ever.

We already know from our previous blog post that this isolation in modern processors is provided by privilege levels. A process running with a lower privilege level i.e. user process does not have access to the memory region having higher privileges i.e. kernel memory

With the Meltdown vulnerability, this security feature crumbles. With Meltdown, a process can access kernel memory, which might contain critical data about other processes. Meltdown uses a combination of the flush-reload attack and speculative execution to melt these boundaries between the kernel and user processes.

Understanding Meltdown

Meltdown works on the core concepts of speculative execution and the flush-reload attack. Speculative execution gives the processor the liberty to execute an instruction even though the execution of previous instructions has not completed. We exploit this behavior of modern processors to execute some instructions which would not have been executed otherwise. There are two parts to this:

  • Transmitting the secret from kernel memory to CPU Cache via Speculative Execution
  • Receiving the secret from CPU cache to User Memory via Flush-Reload Attack

Let’s take this code example which will bring the secret information stored from kernel memory to CPU cache.

uint8_t p = *(uint8_t*)(kernel_address);
uint8_t val = probe_array[p * 4096];

In the above example, we are essentially trying to access this memory location at kernel_address ( into a variable p ) from a userspace program. After this first statement, we are trying to access some contents of a large probe_array at an offset of ( p * 4096 ).

So when we execute this code, we should ideally receive a segmentation fault on the first statement itself, due to the invalid access. But because of speculative execution, while the first statement is still executing, the processor already starts executing the second instruction.

Figure 1: Basic overview of a modern processor (Source: wikichip)

Every instruction given to the processor for execution is broken down into a sequence of µOPs. These µOPs are the basic unit of execution on modern Intel processors, and multiple µOPs can be in flight simultaneously on a single processor. If any µOP in this speculative execution window errors out (i.e. raises an exception or similar), then all the subsequent instructions or µOPs are squashed and their state is cleared from the registers.

So here is the sequence of events which happens

  1. Before starting the instructions, we will make sure that we have flushed out all the contents of the probe_array and there is no entry of probe_array in CPU cache.
  2. The processor starts with the instruction for reading the memory contents from kernel_address.
  3. When the kernel address is loaded in statement 1, it is likely that the CPU already issued the subsequent instructions ( i.e. p*4096 and probe_array[p * 4096]) as part of the speculative execution, and that their corresponding µOPs wait in the reservation station for the content of the kernel address to arrive. As soon as the fetched data ( i.e. value stored at kernel_address ) is observed on the common data bus, the µOPs can begin their execution.
  4. When these µOPs finish their execution, they retire in order and are checked for possible exceptions in order. As our first statement, i.e. loading the kernel address, throws an exception, the pipeline is flushed to eliminate all the results of the instructions which were executed speculatively. In this way, we throw away the computations done speculatively which otherwise would not have been executed at all.
  5. Now let's take the instruction accessing probe_array[p * 4096]. During the execution of the above µOPs in the speculation window, this value would have been loaded into some physical register, and the processor would have thrown away the result on realizing that the memory lookup at kernel_address was illegal. But when probe_array[p * 4096] was loaded from memory into a register, the cache state changed: that value is now present in the cache.
  6. So at the end of this speculation window (after making illegal access to kernel_address), the only change we made to the CPU Cache state is that value at probe_array[p * 4096] will now be cached ( because step 1 flushes out all the values for probe_array )
  7. Now we iterate over all 256 possible values of p, and for each value we check whether probe_array[p * 4096] is cached or not. This can be done easily using the flush-reload technique from the previous blog post.
  8. If it is cached, then we have identified the value of p; otherwise, we keep iterating until we find such a p.

Note: The value of 4096 (4 KB) is chosen to ensure a large spatial distance between any two values of p, so that the hardware prefetcher does not cache other addresses of probe_array corresponding to other values of p, which might lead to fuzzy results.

Dealing with this Meltdown Vulnerability

This vulnerability can be used to attack any system running on a cloud platform that shares resources with other systems. Even our standalone desktops and laptops are at risk because of browsers and javascript, since javascript allows websites to run custom code on our systems.

So we need to mitigate Meltdown ASAP. Here are some of the mitigation techniques employed by the operating system vendors.

Kernel Page Table Isolation ( employed by Linux Kernel )

We already know from the previous blog post that kernel memory and user memory are mapped into a single page table. Although the access to the contents of kernel memory is prohibited in user mode because of lower privilege level, still they are mapped into a single page table. As both kernel memory and user memory is mapped into a single page table, so user process is able to speculatively get a hold of the values stored in the kernel memory ( which we just read about ).

This mapping of both kernel memory and user memory in a single page table is essential for performance: when we make a system call, we need not load a new page table for the kernel page table entries (i.e. update the CR3 register and re-populate the TLB).

Kernel Page Table Isolation proposes to segregate the kernel page table entries from the user process page table. So when a process is running in user mode (lower privilege mode), it has no access whatsoever to any of the kernel page entries (barring a few entries which are essential). But when the process makes a system call, the kernel running in kernel mode has access to all the page table entries of the kernel + process.


While executing in kernel mode (privileged mode), it is necessary to have the page table entries for the user process as well, because the nature of the system call might be such that the kernel needs to copy some data from kernel memory to user memory, which is possible only when both user and kernel page table entries are visible while executing in privileged mode.

Read this for more details about kernel page table isolation.


Understanding Spectre and MeltDown Vulnerabilities – Part 1

In this blog series, we will go through one of the biggest security vulnerabilities of recent times i.e. Spectre and Meltdown. This article is mostly centered around understanding the concepts which will be necessary for then understanding the internals of these two vulnerabilities.

How is a program executed?

A program is simply a series of instructions which are present in memory. These instructions are executed by our processor one by one. Every instruction which is executed by the CPU or the processor is executed within a privilege level.

From Wikipedia

privilege level in the x86 instruction set controls the access of the program currently running on the processor to resources such as memory regions, I/O ports, and special instructions.

So it essentially means that any instruction executed on a processor running within a particular privilege level might have access to some restrictive subset of system resources ( e.g. memory region, IO ports ).

Intel x86 architecture offers a total of 4 privilege levels which might or might not be used by operating system vendors. Linux for that matter uses only two privilege levels

  • Ring 0 ( kernel mode, most privileged )
    • The kernel operates in this mode. This privilege level has access to all the hardware ( ports ), instruction sets and memory. This is necessary because the kernel needs to access all the hardware devices and the memory regions of different processes. So it makes complete sense to run the kernel in the fully privileged mode and let it access every hardware device and every memory region.
  • Ring 3 ( user mode, least privileged )
    • All user processes run in this mode. In this mode a process is limited to its own segment of memory. For any hardware related task ( be it disk IO or network IO ), it has to involve the kernel by making the appropriate system calls. System calls are the way to change the privilege mode from user mode to kernel mode.


Note: Just to add to this, it's not only memory regions / IO ports; these levels also prohibit privileged instructions, like HLT or RDMSR, from being executed in user mode. Read this for more details.

Memory Isolation

Memory is divided among different processes as well as kernel via a concept of page tables. Every process has a page table and this page table stores entries to the physical pages in RAM.

Processes use virtual addresses instead of physical addresses to store/load content. This implicit conversion from virtual address to physical address is done with the help of page tables, which store the addresses of the physical pages. These page tables are themselves stored in memory, so any virtual-to-physical address conversion involves several memory accesses (around 100 cycles each), which can slow down our processing. For that, we have the TLB (Translation Lookaside Buffer), which is essentially a fast cache for this virtual-address-to-physical-address mapping.

This page table is also divided into two segments. One is for user process page table and another is for storing kernel page table entries. Kernel Page table is essential for the memory addressing which would happen while executing in kernel mode ( privileged mode ) for accessing kernel data structures.


As we already know, every user process stores some information in memory and addresses those memory locations via virtual addresses. We already have the TLB for storing the virtual-to-physical address mapping, but finally we need to hit memory to get the contents stored at that physical address. If our application involves a lot of memory accesses (which is generally the case), this can slow down our processing. To save these memory accesses, we have CPU caches in place: they cache the contents of those physical addresses and save us those costly memory seeks.

For a better understanding of these CPU caches, read this.

Up to this point, I hope you have a decent understanding of the CPU architecture in general. Before going into the spectre and meltdown vulnerabilities, let's understand the building blocks of these attacks.

Flush Reload Attack

In this attack, the attacker exploits the cache behavior to identify the victim process's accesses to memory. The L3 cache is shared among processes running on different cores, so with the help of this attack we can essentially monitor the instructions executed by the victim process.

Question: But how can we exploit the cache behavior?

Answer:  We already know that if a memory location is cached in the L3 cache, then there is a tremendous amount of CPU cycles saved which essentially means that time taken for uncached read i.e. from RAM is much higher when compared to cached read i.e. from CPU cache ( L3 in this case ).

So if an attacker wants to figure out whether a memory location has been accessed by another process running on a different core, the attacker just needs to measure the time it takes to access that particular memory location. If that time is on the higher side (for which we need to train a simple classifier), the location has not been read by the victim process; if the time is on the lower side, we know for sure that the victim process has recently accessed it. One prerequisite is that before starting this attack we must make sure that the memory location (or cache line) has been flushed out.

So with the help of this attack, the attacker can figure out what victim is essentially doing and executing which segment of code.

One of the other interesting observations, made in this paper, is that we can also figure out the data on which victim operates. This is a bit non-trivial in itself, so let’s understand this with the help of an example.

Victim Process A

for (int i = 0; i < PUBLICLY_NOT_KNOWN; i++) {
    performFunction(PUBLICLY_KNOWN);
}
  • We have a number PUBLICLY_KNOWN
  • We will perform certain operation i.e. performFunction on this number
  • This function performFunction will be called PUBLICLY_NOT_KNOWN times
  • Now our motive is to find this number PUBLICLY_NOT_KNOWN

Attacker Process B

  • We already know the PUBLICLY_KNOWN
  • We already know the memory address of the performFunction function
  • We also know that this function takes around t ms
  • Let’s start by flushing the cache line for memory location of performFunction
  • Also let’s initialise PUBLICLY_NOT_KNOWN = 0
  • Now after every t ms, we will check whether this function has been accessed. This can be done with the above-mentioned flush reload technique to figure out whether this memory location has been accessed.
  • If yes then increment the current known value of PUBLICLY_NOT_KNOWN by 1. If no, then it means that the loop has terminated.
  • At the end of this, we know the value of the PUBLICLY_NOT_KNOWN

So this flush-reload attack can be used by the attacker to learn other secrets inside the victim process's memory. The methodology described above can be used to recover an RSA decryption key in much the same way. See this for more details.

Speculative Execution

From Wikipedia

Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

Earlier processors performed in-order processing of instructions, i.e. processing instructions one by one. But with speculative execution, a processor can make certain speculations regarding the control flow of the program and pipeline the appropriate instructions. Speculative execution has increased the performance of modern processors tremendously.

To understand speculative execution, let’s take this simple example:

if (x < p.size) { // first instruction
  int b = p[x]    // second instruction
} else {
  int b = 1       // third instruction
}

In this code, we can see that we are checking for bounds for x and if x is within the bounds, then we are accessing the data at x offset in the array p.

In-order processing would mean that every time, x is first checked against the bounds, and only after that check do we fetch the memory location at offset x. In other words, we would execute the instructions serially, one after the other.

But with speculative execution, we need not wait for the first instruction i.e. bounds check to complete before starting with any further instructions. Speculative execution together with branch predictor says

“As most of the times during this particular code execution, branch 1 is taken, so this time also lemme take branch 1”

So speculative execution along with the help of branch predictor takes the branch 1 i.e. second instruction and then reads the memory location at x offset in the array p.

Note: We might also end up in situations where we made a wrong speculation and executed the wrong branch. In those cases, the speculatively executed instructions are squashed, and all the state (registers) associated with them is cleared.


Exploring Code Generation with Janino

In this blog post, we are going to talk about the potential advantages of generating a custom execution plan for a query rather than using the traditional iterator model, in which query execution is composed of many chained operators.

The iterator model comes from times when we did not pay much attention to writing performant code and rather focussed on writing more readable code (one cannot deny the readability of the iterator model). But nowadays we are frequently bottlenecked on CPU instructions, so running optimized instruction sequences is the need of the hour.

Let’s take this simple query for example. We have got a list of numbers, we need to apply these three operations on a list of numbers.

  • Add a number n to each of the numbers in the list
  • Subtract a number m from each of the numbers in the list
  • Return the final list of numbers after applying both the operations

With iterator model, we wrote this simple and easy code for the above-mentioned operations in this way:

We have an abstract Operator class, and each concrete operation specified above is implemented as a different operator (like AddOperator, SubtractOperator) and chained to the next via composition. The chained operators act as a single operator, and we can iterate through the numbers emitted by this single operator to get the final numbers with all the transformations applied.

abstract class Operator {
    abstract public boolean hasNext();
    abstract public int getNext();
}

class AddOperator extends Operator      { /* ... */ }
class SubtractOperator extends Operator { /* ... */ }
class SourceOperator extends Operator   { /* ... */ }

SourceOperator sourceOperator = new SourceOperator(arrayList);
AddOperator addOperator = new AddOperator(sourceOperator, 10);
SubtractOperator subtractOperator = new SubtractOperator(addOperator, 15);
while (subtractOperator.hasNext()) {
    subtractOperator.getNext();
}
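
For completeness, here is a rough sketch of what the chained operators above might look like internally ( a simplified illustration, not the exact code used in the benchmark ):

import java.util.ArrayList;

class SourceOperator extends Operator {
    private final ArrayList<Integer> nums;
    private int index = 0;

    SourceOperator(ArrayList<Integer> nums) { this.nums = nums; }

    public boolean hasNext() { return index < nums.size(); }
    public int getNext()     { return nums.get(index++); }
}

class AddOperator extends Operator {
    private final Operator source;
    private final int n;

    AddOperator(Operator source, int n) { this.source = source; this.n = n; }

    public boolean hasNext() { return source.hasNext(); }
    // Pull one value from the underlying operator and add n to it
    public int getNext()     { return source.getNext() + n; }
}

// SubtractOperator is analogous and returns source.getNext() - m.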

In this methodology, we can clearly see that every operation follows an iterator kind of model, wherein each operator returns its results to the subsequent operator one by one through its getNext() function. Though this iterator model is highly extensible to all kinds of queries, it comes with its own set of problems. Just to give a glimpse of the issue with this approach, let us write a really naive version of this code which may not be modular or readable for that matter.

for (int i = 0; i < arrayList.size(); i++) {
    int num = ((arrayList.get(i) + n) - m);
}

This code seems ok at first and may not be the most readable, but it serves the same purpose as the other piece of code implementing the “iterator model”. Let us compare the performance of both implementations.

  • Methodology 1 with Operators Chaining has throughput of around ~666 ops/second
  • Methodology 2 with Inlined Operations has throughput of around ~1000 ops/second

In the benchmark, it is clearly visible that the naive implementation with inlined operations is far more performant than the modular “iterator model” approach. But why is that?

This is because of the overhead associated with
1) virtual function calls, and
2) the inability to inline those calls ( see this link ),
which in turn results in a bloated set of instructions being executed on the processor.

So what if we could generate this inlined, naive executable code for each query? That should obviously enhance query performance compared to the current approach. This can be achieved by code generation, which is used by many modern databases and query engines; in fact, many databases describe how code generation brought major performance improvements. Code generation is just another term for generating this custom executable code for a query. There are many libraries which do this custom code generation and return natively compiled code for a given query ( read about LLVM ).

In the next section, we are going to talk about the Janino compiler, which does this code generation in Java land and is used prominently by Spark.

How to use Janino?

Janino is a super-fast Java compiler which can be used to translate Java expressions or code blocks into Java bytecode at runtime. It embeds easily in your application. Here is an example of how and when to use the Janino compiler in your application.
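
Before the full example, here is a minimal sketch of embedding Janino using its ExpressionEvaluator ( the expression and parameter names here are just placeholders ):

import org.codehaus.janino.ExpressionEvaluator;

public class JaninoQuickstart {
    public static void main(String[] args) throws Exception {
        ExpressionEvaluator ee = new ExpressionEvaluator();

        // Declare the parameters and the result type of the expression.
        ee.setParameters(new String[] { "a", "b" }, new Class[] { int.class, int.class });
        ee.setExpressionType(int.class);

        // Compile the expression; normally this string would be generated by the application.
        ee.cook("(a + 10) * b");

        // Evaluate the compiled expression against concrete arguments.
        Object result = ee.evaluate(new Object[] { 5, 2 });
        System.out.println(result); // prints 30
    }
}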

Suppose we have a query which wants to:

  • Add 10 to each of the numbers
  • Filter all the numbers which are less than 40
  • Multiply all the numbers by 5 and return the list of numbers

With the iterator model, we would construct three operators, one for each of the three stages, chain them together, and then iterate through the chained operator to get the final list of numbers. But with the Janino compiler, we can afford to create a custom execution plan for the query at runtime and run it against the data to get the final list.

  • Suppose we were somehow able to generate this custom execution code for this query and write it down in some text file.
// FileName: Generated.txt

public ArrayList<Integer> returnResults(ArrayList<Integer> arrayList) {
    ArrayList<Integer> results = new ArrayList<Integer>();
    for(int i = 0; i < arrayList.size(); i++) {
        int num = ((Integer) arrayList.get(i)) + 10;
        if (num > 40) {
            results.add(num * 5);
        }
    }
    return results;
}
  • Now after generating this custom code, we need to compile it with the Janino compiler into an executable form.
public GeneratedOperator init() throws Exception {
  // Read the generated source ( i.e. Generated.txt ) and compile it as a class body.
  Scanner scanner = new Scanner("Generated.txt");
  ClassBodyEvaluator cbe = new ClassBodyEvaluator();
  cbe.setClassName("JaninoClass");
  cbe.setExtendedClass(GeneratedOperator.class);
  cbe.cook(scanner);
  Class c = cbe.getClazz();
  return (GeneratedOperator) c.newInstance();
}
  • After compiling the generated code ( i.e. Generated.txt ), we can easily use the compiled code and pass it a list of numbers to get the final list.
public static void main(String args[]) throws CompileException, InstantiationException, IllegalAccessException, IOException {
    
    // init method compiles the code and returns a GeneratedOperator
    // instance which exposes the generated method returnResults

    GeneratedOperator generatedOperator = new CompiledCodeExample().init();
    ArrayList<Integer> arrayList = new ArrayList<Integer>();
    arrayList.add(100);
    arrayList.add(1);
    arrayList.add(2);
    arrayList.add(13);
    arrayList.add(40);
    ArrayList<Integer> returnList = generatedOperator.returnResults(arrayList);
    for (int i = 0; i < returnList.size(); i++) {
        System.out.println(returnList.get(i));
    }
}
Output is as expected:
550
250

Note: Janino is responsible for compiling this generated string into a Java method, but the string itself still has to be constructed by your own application logic.
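
Just to illustrate, the application logic that assembles the source string could be as simple as a StringBuilder walking over the query’s operations ( a hypothetical helper, not the exact code used in these experiments ):

public static String generateSource(int addBy, int filterGreaterThan, int multiplyBy) {
    StringBuilder sb = new StringBuilder();
    sb.append("public ArrayList<Integer> returnResults(ArrayList<Integer> arrayList) {\n");
    sb.append("    ArrayList<Integer> results = new ArrayList<Integer>();\n");
    sb.append("    for (int i = 0; i < arrayList.size(); i++) {\n");
    sb.append("        int num = ((Integer) arrayList.get(i)) + ").append(addBy).append(";\n");
    sb.append("        if (num > ").append(filterGreaterThan).append(") {\n");
    sb.append("            results.add(num * ").append(multiplyBy).append(");\n");
    sb.append("        }\n");
    sb.append("    }\n");
    sb.append("    return results;\n");
    sb.append("}\n");
    return sb.toString();
}

Calling generateSource(10, 40, 5) produces exactly the kind of class body shown in Generated.txt above.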

Understanding behavior of Janino Compiled Classes with JIT

Now we know how to generate custom optimized bytecode for a query and execute it. Let’s understand how these Janino-compiled classes behave with JIT.

In this experiment, we will take the same query as defined above and compare two execution models:

  • Iterator Model ( i.e. Chaining of Operators )
  • Custom Code Generation via Janino

[Image: JMH results – Iterator Model vs Code Generation]
Note: The results clearly indicate that the Code Generation model outperformed the Iterator model.

Execution via Iterator Model

This experiment has been performed with JIT enabled which essentially means that JIT must have inlined and compiled the different operators ( i.e. iterators ) into a single native function.

public static ArrayList<Integer> experimentOperators(ArrayList<Integer> arrayList) {
    ArrayList<Integer> arrayList1 = new ArrayList<Integer>();
    SourceOperator sourceOperator = new SourceOperator(arrayList);
    AddOperator addOperator = new AddOperator(sourceOperator, 10);
    FilterOperator filterOperator = new FilterOperator(addOperator, 40);
    MultiplyOperator multiplyOperator = new MultiplyOperator(filterOperator, 5);
    while (multiplyOperator.hasNext()) {
        arrayList1.add(multiplyOperator.getNext());
    }
    return arrayList1;
}

[Image: JIT compilation logs for the Iterator Model experiment]

In this screenshot, we can clearly see:

  • Inlining of different operators into a single inlined function
  • Compilation of this single inlined function by C2 compiler

Execution via Custom Code Generation

In this execution strategy, we generate the custom executable bytecode only once for the entire experiment duration, which means this bytecode should become eligible to be JITed ( after enough iterations ) into native instructions, improving query performance even more.

// We are doing the initialization of this custom code generated
// method only once and using the same generated method over and 
// over again in the experiments.

GeneratedOperator generatedOperator = new CompiledCodeExample().init();

public GeneratedOperator init() throws IOException, CompileException, IllegalAccessException, InstantiationException {
 ClassBodyEvaluator cbe = new ClassBodyEvaluator();
 cbe.setExtendedClass(GeneratedOperator.class);
 cbe.setClassName("GeneratedClass");
 String[] strings = new String[2];
 strings[0] = HashMap.class.getName();
 strings[1] = ArrayList.class.getName();
 cbe.setDefaultImports(strings);
 // 'content' holds the generated class-body source ( e.g. the contents of Generated.txt )
 cbe.cook(content);
 Class c = cbe.getClazz();
 return (GeneratedOperator) c.newInstance();
}

@Benchmark
public void experimentCodeGeneration(Blackhole blackhole) throws IllegalAccessException, InstantiationException, IOException, CompileException {
 blackhole.consume(generatedOperator.returnResults(arrayList));
}

 

[Image: JIT compilation logs showing returnResults compiled by C2]
In this screenshot, we can clearly see that the method “returnResults” gets compiled by the C2 compiler.

Note:
This method gets compiled only because we generate the custom Janino-compiled code for the method once for the entire experiment duration. If instead we generate this custom bytecode on every invocation of the JMH benchmark, JIT will not compile the method. This is because each invocation loads a freshly generated class, so from JIT's point of view it is a brand new method every time, and it never gets hot enough to be compiled across JMH invocations.

//  In this experiment we are compiling the custom code again 
// and again for each invocation of JMH benchmark and hence this
// code would not get JITed across benchmark invocations because 
// JIT has no way of knowing whether it is the same method 
// which was compiled before in the previous invocation as well.

@Benchmark
public void experimentCodeGeneration(Blackhole blackhole) throws IllegalAccessException, InstantiationException, IOException, CompileException {
    generatedOperator = new CompiledCodeExample().init();
    blackhole.consume(generatedOperator.returnResults(arrayList));
}

 

[Image: JIT compilation logs for per-invocation code generation]
Here, the “returnResults” method is nowhere to be seen, which means it was not JIT compiled.

So if the same query hits over and over again and we regenerate the bytecode for every query with the Janino code generation methodology, the generated code path will never get JIT compiled, whereas with the Iterator model the operator methods are already compiled and inlined. The JMH performance numbers suggest the same.

[Image: JMH results – Iterator Model vs per-query Code Generation]

Note: In this, we can clearly see that the Iterator model outperforms the Janino code generation model. This is because of the reasoning specified above: in the code generation methodology we are doing code generation again and again for each query, so JIT is not able to compile the methods across queries. Some modern databases use an execution plan cache to overcome this problem and make sure that if the same query keeps arriving, they reuse the same generated code; a small sketch of such a cache follows.
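
One way to picture such a cache ( a minimal sketch with hypothetical names like PlanCache and generateAndCompile; GeneratedOperator is the type used in the examples above ):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PlanCache {

    // Generated operators keyed by a canonical form of the query, so repeated
    // queries reuse ( and keep hot ) the same generated code.
    private final Map<String, GeneratedOperator> cache = new ConcurrentHashMap<>();

    public GeneratedOperator getOrCompile(String canonicalQuery) {
        // Compile the plan only on the first occurrence of the query; subsequent
        // hits return the already compiled ( and eventually JITed ) operator.
        return cache.computeIfAbsent(canonicalQuery, q -> generateAndCompile(q));
    }

    private GeneratedOperator generateAndCompile(String canonicalQuery) {
        // Generate the source string for the query and compile it with Janino,
        // as in the init() example earlier. Left unimplemented in this sketch.
        throw new UnsupportedOperationException("sketch only");
    }
}

With such a cache in place, the generated returnResults method stays the same method across invocations, so it can get hot and be JIT compiled just like the hand-written operators.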


JIT Optimizations – Method Inlining

In this blog post, we are going to understand the impact of function calls in an application and what JIT does to reduce it.

A function call is a relatively expensive operation, but JIT makes sure our application does not suffer performance-wise due to a large number of function calls: it inlines functions so that the call overhead is minimized. We will also learn how to debug an application and see which functions are getting inlined.

Are function calls expensive?

At first, a function call may seem trivial, but it involves a lot of instructions executing under the hood, which can hurt application performance if your application makes a lot of function calls.

Let us consider a simple function call and see the native-level work involved:

private static void subtract(int num) {
    int r = 20 - num;
    return;
}

private int getNum() {
    int a1 = 10;
    subtract(12);
    return 0;
}

The following things happen when we call the subtract function from the getNum() function:

[Image: stack operations involved in calling subtract from getNum]

  • The argument i.e. “12” is pushed onto the stack
  • The return address of the caller i.e. getNum() is pushed onto the stack
  • The frame pointer of the caller i.e. getNum() is also pushed onto the stack ( the frame pointer marks the base of the caller’s stack frame )
  • The call instruction transfers control to the callee, and the callee’s instructions start executing
  • The body of the callee is executed
  • The ret instruction transfers control back to the caller, with the frame pointer restored to the caller’s original frame pointer
  • The arguments originally passed to the callee i.e. “12” are popped from the stack

So we can see that a whole lot of instructions get executed even when we call a simple function like subtract, which makes function calls relatively expensive.

To add to this, modern programming style highly recommends writing smaller functions to improve code readability, which increases the function call overhead even more.

Method Inlining

This is another important performance optimization used by JIT. Function inlining greatly influences the performance of an application.

Let’s check out the performance boost an application gets from function inlining with this example:

We have to apply these three operations to a number:

  • Multiply the number by a constant x
  • Subtract a constant x from the number
  • Add a constant x to the number

There are two ways to do this:

  • Methodology 1: Inline all the operations in a single method
public static ArrayList experimentFunctionInlining(ArrayList arrayList) {
    for (int i = 0; i < 10000; i++) {
        int num2 = 10 * i;
        int num1 = 10 - num2;
        int num = 10 + num1;
        arrayList.add(num);
    }
    return arrayList;
}
  • Methodology 2: Write all the operations in different methods and call them one by one after each operation
private static int add(int num) {
 return 10 + num;
}

private static int subtract(int num) {
 return 10 - num;
}

private static int multiply(int num) {
 return 10 * num;
}

public static ArrayList experimentFunctionCalling(ArrayList arrayList) {
    for (int i = 0; i < 10000; i++) {
        int num2 = multiply(i);
        int num1 = subtract(num2);
        int num = add(num1);
        arrayList.add(num);
    }
    return arrayList;
}

Note: These two tests have been benchmarked with JMH with JIT disabled, so as to understand the impact of function inlining.
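
One way to run such a comparison with JIT disabled ( the exact harness used here is not shown, so treat this as one possible setup ) is to fork the JMH benchmark JVM in interpreter-only mode via -Xint:

import java.util.ArrayList;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// Assumes experimentFunctionInlining / experimentFunctionCalling ( shown above )
// are defined in this benchmark class.
public class InliningBenchmark {

    // -Xint forces interpreter-only execution, so JIT compilation
    // ( and hence JIT inlining ) never kicks in for the forked JVM.
    @Fork(value = 1, jvmArgsAppend = {"-Xint"})
    @Benchmark
    public ArrayList measureManuallyInlined() {
        return experimentFunctionInlining(new ArrayList());
    }

    @Fork(value = 1, jvmArgsAppend = {"-Xint"})
    @Benchmark
    public ArrayList measureSeparateCalls() {
        return experimentFunctionCalling(new ArrayList());
    }
}

With JIT disabled, the benchmark results were: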

  • Methodology 1 of inlining all the operations in a single method performs at ~135 ops/second
  • Methodology 2 of writing all operations in separate functions and calling them one by one performs at ~98 ops/second

This shows that function inlining has a huge impact on application performance.

But writing code as in Methodology 1 is not always possible, for the sake of readability. JIT comes in handy in such situations: it figures out the hot code paths in an application and tries to inline the methods lying on them. Now let’s run the same benchmark with JIT enabled and see if there is any performance difference between the two methodologies. Our hypothesis is that JIT should inline the methods in Methodology 2, so the performance numbers should be more or less the same.

And voila, yes they are

  • Methodology 1 with JIT enabled performs at ~ 9000 ops/second
  • Methodology 2 with JIT enabled also performs at ~ 9000 ops/second

So with JIT, the performance of the JIT-inlined methods is in the same ballpark as the manually inlined method. Apart from reducing function call overhead, another important benefit of inlining is that the inlined code gives the compiler more context, which it can use to make many other optimizations, as the small sketch below illustrates.
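
A small conceptual illustration ( not taken from the benchmark code ) of how inlining exposes further optimizations such as constant folding:

// Before inlining, the compiler only sees an opaque call:
static int addTen(int num) {
    return 10 + num;
}

static int caller() {
    return addTen(32);   // the constant argument is hidden behind the call
}

// After inlining, the callee's body is visible at the call site ...
static int callerInlined() {
    return 10 + 32;      // ... so constant folding can kick in ...
}

// ... and the code effectively becomes:
static int callerFolded() {
    return 42;
}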

Debug your application

JIT has certain limitations when it comes to inlining methods on a hot code path. Method inlining depends on factors like these:

  • JIT can inline methods only up to a particular depth ( see the flags sketch after this list )
  • JIT inlines methods only up to a particular bytecode size
  • The type of the method to be inlined
    • JIT can easily inline static ( and other statically bound ) methods
    • For inlining virtual functions, it needs to know the class type of the object on which the function is called so as to resolve the actual target method
  • Many others …
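
For reference, HotSpot exposes a few tuning flags around these limits ( defaults differ between JVM versions, so treat the descriptions as illustrative ):

-XX:MaxInlineLevel   ( maximum depth of nested inlining )
-XX:MaxInlineSize    ( maximum bytecode size of a method inlined at a cold call site )
-XX:FreqInlineSize   ( maximum bytecode size of a hot, frequently executed method to inline )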

A few terminologies to understand beforehand. Here are some sample JIT inlining logs:

( Method 1 ) @ 4 com.test.experiments.operators.OperatorPipelineEmulationExperiment::experimentVirtual (45 bytes) inline (hot)
( Method 2 )   @ 4 com.test.experiments.operators.BufferedOperator::<init> (21 bytes) inline (hot)
( Method 3 )      @ 1 com.test.experiments.operators.Operator::<init> (5 bytes) inline (hot)
( Method 4 )   @ 15 com.test.experiments.operators.AddOperator::<init> (25 bytes) inline (hot)
  • The @ annotation denotes a bytecode index ( bci ) within the enclosing method. In the inlining logs it marks the call site being inlined, and in OSR compilations it marks the bci that triggered the compilation request ( i.e. osr_bci ). In the example above, the call at bytecode index 4 of Method 1 is where Method 2 gets inlined.
  • The indentation shows the method inlining hierarchy. In this example, we can clearly see that
    • Method 1 inlines Method 2 and Method 4.
    • Method 2 inlines Method 3
  • TypeProfile refers to profiling done by JIT that is used when inlining virtual functions. Inlining where polymorphism is involved is difficult for the simple reason that the call might dispatch to different target methods depending on the class type of the object on which the method is called. So JIT profiles the receiver types observed at a call site, and if the calls keep going to a single target, it inlines that target after taking enough samples ( a small illustration follows this list ).
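
A minimal sketch ( with hypothetical classes, not from the benchmark code ) of the situation type profiling deals with:

abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    double r = 2;
    double area() { return Math.PI * r * r; }
}

class Square extends Shape {
    double s = 3;
    double area() { return s * s; }
}

class TypeProfileDemo {
    // If profiling only ever sees Circle at this call site, JIT can inline
    // Circle.area() and guard it with a cheap type check.
    static double totalAreaOfCircles(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();   // always a Circle in practice
        return sum;
    }

    // If many distinct receiver types show up here, the call site becomes
    // megamorphic and JIT typically keeps the virtual dispatch instead.
    static double totalArea(Shape[] mixedShapes) {
        double sum = 0;
        for (Shape s : mixedShapes) sum += s.area();
        return sum;
    }
}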

Let’s understand the logs for method Inlining in JIT. We will use this sample application for testing purposes.

public static ArrayList experimentVirtual(ArrayList arrayList) {
    BufferedOperator bufferedOperator = new BufferedOperator(); // Line 1
    AddOperator addOperator = new AddOperator(bufferedOperator, 10); // Line 2
    SourceOperator sourceOperator = new SourceOperator(addOperator, true); // Line 3
    sourceOperator.setArrayList(arrayList); // Line 4
    sourceOperator.get(1); // Line 5
    return bufferedOperator.arrayList;
}

Note: For more code details see this link

Here are the JVM flags used to obtain the JIT inlining logs for the application:

-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+PrintCompilation

[Image: JIT inlining logs for experimentVirtual]

  • JIT inlines the call sites involved in the first 3 lines of the method ( i.e. experimentVirtual ), which is visible in Sections 1, 2 and 3.
    BufferedOperator bufferedOperator = new BufferedOperator();
    AddOperator addOperator = new AddOperator(bufferedOperator, 10);
    SourceOperator sourceOperator = new SourceOperator(addOperator, true);
  • In Section 4, we can clearly see that JIT is trying to inline all the call sites involved in line number 5 (i.e. sourceOperator.get(1))

    sourceOperator.get(1);
    • In Section 4, we can see that it first tries to inline the source code of the .get() implementation in sourceOperator:
      for (int i = 0; i < nums.size(); i++) {
          int p = nums.get(i);
          if (enableFlush && (i % flushNumber == 0)) {
              underlyingOperator.get(FLUSH_CODE);
          }
          underlyingOperator.get(p);
      }
    • Also, with the help of the type profile, it figures out the actual target of underlyingOperator.get() and inlines that as well, i.e. AddOperator.get().


JIT Optimizations – Method Compilations

JIT ( Just In Time compilation ) is certainly one of the most interesting features of the JVM. It makes sure that we are able to run our code with machine-level optimizations. JIT does tons of optimizations under the hood which are absolutely necessary for running latency-sensitive applications.

Impact of JIT

Let’s take this code as an example to study how JIT affects the performance of our application.

This piece of code loosely follows the Volcano design paradigm, in which every operator does one task and the operators are bound together, exchanging data through a common operator interface so that they collectively do a bigger task. In this case, the operators are bound together to add a particular number to all the elements in the array.

public static ArrayList<Integer> experimentVirtual(ArrayList<Integer> arrayList) {
    BufferedOperator bufferedOperator = new BufferedOperator();
    AddOperator addOperator = new AddOperator(bufferedOperator, 10);
    SourceOperator sourceOperator = new SourceOperator(addOperator, true);
    sourceOperator.setArrayList(arrayList);
    sourceOperator.get(1);
    return bufferedOperator.arrayList;
}

See this Github link for more code details.

We ran this code with and without JIT optimizations. There was a huge difference in the throughputs between these two runs.

  • With JIT Disabled we got throughput of around ~3 operations per second
  • With JIT Enabled we got throughput of around ~290 operations per second

So JIT made the code faster by around 100x. Understanding the internal workings of JIT and asking “what can we do to make JIT’s life easier” is key if you want to improve the performance of your application.

In this series, we will talk about these optimizations in detail and also learn how to read our application’s JIT logs. This particular blog post deals with one of the most important features of JIT: code compilation.

Code Compilation

This is one of the most important functionalities of JIT. JIT is responsible for compiling Java bytecode to native code instructions at runtime to boost the application performance. JIT figures out the hot code paths via profiling and then compiles those methods into native machine instructions to improve the performance of those hot paths.

How does method compilation happen?

Currently, JIT supports these 5 levels of compilation

The system supports 5 execution levels:
  • level 0 - interpreter
  • level 1 - C1 with full optimization (no profiling)
  • level 2 - C1 with invocation and backedge counters
  • level 3 - C1 with full profiling (level 2 + MDO)
  • level 4 - C2

A method has to go through some of these compilation levels to reach its final optimized version. The lifecycle of a method is as follows:

  • All methods start executing in interpreted mode. During this phase, the JVM determines whether a method is hot, mostly with the help of method invocation and backedge counters. If a method crosses certain thresholds for invocations and/or backedges, it becomes eligible for compilation at the higher levels ( a few relevant JVM flags are sketched after this list ). See this.
  • Once a method is declared hot, it is compiled at level 3 by the C1 compiler, aka the client compiler. This compiler does the following things:
    • In a short time it applies the obvious optimizations that improve application performance. The time is kept short because this compiler is latency-sensitive and wants the application to be running reasonably optimized code as quickly as possible.
    • It profiles the methods adequately so that the profiling information can be used by the higher-level compiler to make more contextual optimizations which would otherwise be hard.
  • After a method is compiled by the C1 compiler, it keeps executing and gathering metrics, and based on these metrics it is decided whether it needs to be compiled again by the C2 compiler.
    • The C2 compiler focuses on the best possible optimizations for the method, which might affect latencies initially but eventually results in higher application throughput.
    • The C2 compiler uses the richer metrics gathered for those methods to make more contextual optimizations, e.g. virtual function inlining. We will explain this later with the help of examples.
  • Apart from this usual flow of method compilation, i.e. level 0 -> level 3 -> level 4, there are other flows in which methods follow a different compile path. For reading about those, have a look at this link.
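
A few JVM flags that are commonly used to observe or constrain this lifecycle ( behaviour and thresholds vary between JVM versions, so treat this list as illustrative ):

-XX:+TieredCompilation     ( enables the tiered level 0 -> 3 -> 4 pipeline; on by default in modern JVMs )
-XX:TieredStopAtLevel=1    ( stop at C1, which is handy for isolating C2's contribution )
-XX:+PrintCompilation      ( log compilation events, as used in the next section )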

Benefits of Method Compilation

Method compilation is one of the most important ingredients of performance optimization in modern runtimes. C/C++ was fast compared to legacy Java for the simple reason that C/C++ is a compiled language whereas Java was an interpreted one. Let’s first understand why Java is interpreted even though we know that interpreted languages are inherently slower than compiled languages.

Java was built on a compile-once, run-everywhere ( on any architecture ) model. Source code is compiled once, and this deployable compiled version of the source code can be run anywhere, on any platform. This solves the problem of writing code for each and every architecture and making sure it runs smoothly on all of them: with Java, we write the code once, compile it into a deployable, and use that deployable across all platforms. This greatly simplified application development.

But with this architecture ( write once, deploy everywhere ) came a serious problem of non-performant applications. As these deployables were interpreted into native instructions at runtime, applications were painfully slow. JIT came into existence to solve this problem by compiling methods at runtime.

With JIT we got the superpower of compiling methods to their native instructions during the runtime of the application, hugely improving its performance. In some benchmarks, the performance of Java with JIT even crosses over that of C/C++ code. This is mainly because JIT has runtime information, which it can use to make further contextual improvements to the code.

Now let’s understand how method compilation can affect the performance of an application. We have a performance benchmark which we ran twice: once with compilation disabled for some of the methods, and once with nothing disabled.

  • With compilation of some of the methods of the application disabled, we achieved a throughput of around ~ 30 ops/second
  • With compilation enabled for all the methods of the application, we achieved a throughput of around ~600 ops/second

So we can see that compilation of the methods has a huge performance impact on the application. We therefore need a basic idea of how the methods in our application get compiled, to spot potential bottlenecks and improvements.

To know which methods are getting compiled and which are not, you need to add extra JVM flags while starting up your application.

java -XX:+UnlockDiagnosticVMOptions 
     -XX:+PrintCompilation 
     -jar benchmarks.jar

This prints logs in the following format:

ts  denotes timestamp
cid denotes compile_id
l   denotes compile_level

ts   cid  l        methodAffected
498  46   3    com.test.experiments.CodeOptimizedBenchmark::<clinit> (46 bytes)
498  46   3    com.test.experiments.CodeOptimizedBenchmark::<clinit> (46 bytes)
531  47   3    com.test.experiments.operators.AddOperator::get (52 bytes)
531  48   3    com.test.experiments.operators.BufferedOperator::get (42 bytes)
538  49   4    com.test.experiments.operators.AddOperator::get (52 bytes)
538  50   4    com.test.experiments.operators.BufferedOperator::get (42 bytes)
540  48   3    com.test.experiments.operators.BufferedOperator::get (42 bytes) made not entrant
541  47   3    com.test.experiments.operators.AddOperator::get (52 bytes) made not entrant
613  51%  3    com.test.experiments.operators.SourceOperator::get @ 2 (79 bytes)
614  52   3    com.test.experiments.operators.SourceOperator::get (79 bytes)
665  49   4    com.test.experiments.operators.AddOperator::get (52 bytes) made not entrant
666  50   4    com.test.experiments.operators.BufferedOperator::get (42 bytes) made not entrant
666  54   3    com.test.experiments.operators.AddOperator::get (52 bytes)
666  53   3    com.test.experiments.operators.BufferedOperator::get (42 bytes)
668  55%  4    com.test.experiments.operators.SourceOperator::get @ 2 (79 bytes)
674  56   4    com.test.experiments.operators.BufferedOperator::get (42 bytes)
674  51%  3    com.test.experiments.operators.SourceOperator::get @ -2 (79 bytes) made not entrant
674  57   4    com.test.experiments.operators.AddOperator::get (52 bytes)
675  53   3    com.test.experiments.operators.BufferedOperator::get (42 bytes) made not entrant
676  54   3    com.test.experiments.operators.AddOperator::get (52 bytes) made not entrant
679  58   4    com.test.experiments.operators.SourceOperator::get (79 bytes)
685  52   3    com.test.experiments.operators.SourceOperator::get (79 bytes) made not entrant

( We are showing just a subset of the compilation logs; compilation also happens for many other JDK and third-party library methods. For more details about these logs see this link. )

So in the logs, we can clearly see the different methods getting compiled at different times. Different aspects of these logs are as follows:

  • A compiled method goes through different phases, e.g. when a method is compiled it is assigned a compile_id, and when this method is later deoptimized, or in other words made not entrant, that event is reported against the same compile_id.
  • The compile_id is sometimes followed by %. This symbol indicates that the compilation was done via OSR ( on-stack replacement ). This happens when a method contains a big loop: instead of waiting for the next invocation of the method, the running interpreted frame is replaced with the compiled version during a later iteration of the loop.
  • As already mentioned, methods get compiled and deoptimized often, and this deoptimization can happen for a variety of reasons. It is denoted via made not entrant next to the method name in the compilation logs. See this link for more details on the various reasons for deoptimization.

So now we understand the performance impact JIT brings to the table and how compiling functions or methods to native machine instructions benefits application performance.

In the next section, we will talk about JIT method inlining optimization and its impact on the application performance.
