Java Unsafe APIs

The Java Unsafe API provides various low-level methods which programmers can use for things related to:

  1. Off-Heap Memory Management
    • Off-heap memory allocation
    • Reading and writing ints, longs, chars, shorts, bytes, floats and doubles in off-heap memory
  2. Manipulating Class Elements
    • Unsafe provides various methods for retrieving or manipulating different class elements such as methods and variables.
  3. Manipulating Object/Instance Members
    • Unsafe also provides various methods for retrieving and changing the state of the different fields/members of an object.
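
As a quick illustration of the off-heap capabilities, here is a minimal sketch (obtaining the Unsafe singleton via reflection, the same trick used later in this post) that allocates native memory, writes and reads a long, and frees the memory again:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapExample {

    public static void main(String[] args) throws Exception {
        // Unsafe cannot be instantiated directly; grab the singleton via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // allocate 16 bytes of off-heap (native) memory
        long address = unsafe.allocateMemory(16);

        // write and read a long at that address
        unsafe.putLong(address, 42L);
        System.out.println(unsafe.getLong(address)); // prints 42

        // off-heap memory is not garbage collected, so free it explicitly
        unsafe.freeMemory(address);
    }
}

Nothing in this sketch is bounds-checked: reading far past the allocated region, as in the next example, is exactly what produces the SIGSEGV shown below.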

Why is it called unsafe?

  1. Traditional Java APIs give us a strong guarantee that using them will not result in a SIGSEGV and hence will not get the Java process killed. This is mainly because, for most of those APIs, Java has internal checks which make sure we are not accessing any memory location we are not supposed to access; and if we do something we are not supposed to do, Java blankets our wrongdoing and throws an exception to notify us. So essentially, with the traditional Java APIs we never reach a state where our processing leads to a core dump or SIGSEGV. With the Unsafe APIs there is no such guarantee anymore, because these are mostly low-level APIs, some of which Java itself uses internally with higher-level checks on top. With these APIs we can potentially access an inaccessible memory location, at which point the OS might complain and eventually kill our application. Hence they are unsafe. To illustrate this with an example, let's try to access an out-of-range memory location using unsafe.getAddress:
    public static void main(String args[]) {
        long address = unsafe.allocateMemory(1000);
        unsafe.getAddress(address + 100000000L);
    }

    In this example, we get this error and our JVM process is killed

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x000000010c781048, pid=87901, tid=0x0000000000001c03
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode bsd-amd64 compressed oops)
    # Problematic frame:
    # V [libjvm.dylib+0x581048] Unsafe_GetNativeAddress+0x31
  2. Another very interesting usage of Unsafe is to access and even modify the private members of an instance, which are otherwise not accessible via normal routes. This functionality feels like a backdoor entry into someone's code and so should be used only when strictly required, hence the name unsafe 😀 😀
    public class A {
        private int p = 10;

        public A() {
        }
    }

    public static void main(String args[]) throws Exception {
        Field f = A.class.getDeclaredField("p");
        A a = new A();
        f.setAccessible(true);
        f.set(a, 1);
        System.out.println("Value of a is " + f.get(a));
    }
    // Value of a is 1

    Note: Field.set itself uses Unsafe methods under the hood for
    accessing/modifying the state of the members of a class.

Performance of Unsafe

One of the more important use cases of the Unsafe API is accessing and/or modifying off-heap memory. This helps when we have a heavy cache which we want to keep in memory to reduce latencies, but which we cannot keep on the heap because of GC issues/latencies. Our best bet then is storing the cache off-heap and relying on Unsafe APIs to access it.

Let's first look at the performance numbers for on-heap and off-heap memory access.

import java.lang.reflect.Field;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

import sun.misc.Unsafe;

public class MyBenchmark {

    public static Unsafe unsafe;
    public static long initialAddress;

    static {
        try {
            unsafe = getUnsafe();
            initialAddress = unsafe.allocateMemory(100000);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            e.printStackTrace();
        }
    }

    @SuppressWarnings("restriction")
    private static Unsafe getUnsafe() throws NoSuchFieldException, IllegalAccessException {
        // Unsafe cannot be instantiated directly; grab the singleton via reflection
        Field singletonInstanceField = Unsafe.class.getDeclaredField("theUnsafe");
        singletonInstanceField.setAccessible(true);
        return (Unsafe) singletonInstanceField.get(null);
    }

    @Benchmark
    public void onHeapStorage(Blackhole blackhole) {
        OnHeap onHeap = new OnHeap();
        onHeap.setField(100);
        blackhole.consume(onHeap.getField());
    }

    @Benchmark
    public void offHeapStorage(Blackhole blackhole) {
        OffHeap offHeap = new OffHeap(initialAddress);
        offHeap.setField(unsafe, 100);
        blackhole.consume(offHeap.getField(unsafe));
    }
}

See this link for more code details.
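
The OnHeap and OffHeap helper classes live in the linked repository and are not reproduced in the post; a minimal version consistent with how the benchmark uses them could look roughly like this (the single int field and the pre-allocated off-heap region are assumptions):

import sun.misc.Unsafe;

class OnHeap {
    private int field;

    public void setField(int value) { this.field = value; }
    public int getField()           { return this.field; }
}

class OffHeap {
    // address of a pre-allocated off-heap region (allocated once via unsafe.allocateMemory)
    private final long address;

    public OffHeap(long address) { this.address = address; }

    public void setField(Unsafe unsafe, int value) { unsafe.putInt(address, value); }
    public int getField(Unsafe unsafe)             { return unsafe.getInt(address); }
}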

Benchmark Results

Benchmark       Mode   Samples  Score           Score error   Units
onHeapStorage   thrpt  200      1798771909.287  8732367.558   ops/s
offHeapStorage  thrpt  200      993357583.727   66354831.212  ops/s

OffHeapStorage using Unsafe APIs
Result: ~1 billion operations/second   [Average]

OnHeapStorage
Result: ~1.8 billion operations/second [Average]

We can clearly see that on-heap storage outperforms off-heap storage using Unsafe APIs. This is mainly because of the following facts:

  1. We are copying the value 100 from the heap to off-heap memory, which means some extra instructions. This happens when we call offHeap.setField(unsafe, 100), which copies the value 100 to the off-heap memory location.
  2. We are also copying the value 100 from off-heap memory back into the heap, which again means some extra instructions. This happens when we call offHeap.getField(unsafe), which copies the value 100 from the off-heap memory location back into the heap.

As we can see, using Unsafe APIs for off-heap storage adds some overhead, so we need to be careful about when to use them.

Use Cases of Unsafe APIs

  1. As already discussed, using Unsafe APIs for off-heap storage comes with some performance penalty compared to keeping the objects on the heap. But there are times when we can benefit from this approach, mostly when we have a large cache to manage and we don't have the liberty of keeping it on the heap, because the resulting long GC pauses would affect the application latencies.
  2. Another use case of the Unsafe APIs is that they provide a way to access any field of an object, even a private one, given we know the offset of that field. (The offsets in the snippet below assume a typical 64-bit HotSpot object layout; the sketch after the snippet shows how to obtain them instead of hard-coding.)
    public class NormalObject {
      public long a1 = 10;  // Offset = 16L
      private long a2 = 20; // Offset = 24L
      public long a3 = 30;  // Offset = 32L
    }
    
    NormalObject normalObject = new NormalObject();
    unsafe.getLong(normalObject, 16L); // returns 10
    unsafe.getLong(normalObject, 24L); // returns 20
    unsafe.getLong(normalObject, 32L); // returns 30
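
The hard-coded offsets above are layout-dependent (they assume a 64-bit HotSpot JVM with a 12-byte object header and 8-byte field alignment), so a safer sketch is to ask Unsafe for them; `unsafe` here is assumed to have been obtained via the "theUnsafe" reflection trick shown earlier:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

long a1Offset = unsafe.objectFieldOffset(NormalObject.class.getDeclaredField("a1"));
long a2Offset = unsafe.objectFieldOffset(NormalObject.class.getDeclaredField("a2"));
long a3Offset = unsafe.objectFieldOffset(NormalObject.class.getDeclaredField("a3"));

NormalObject normalObject = new NormalObject();
unsafe.getLong(normalObject, a1Offset); // 10
unsafe.getLong(normalObject, a2Offset); // 20, even though a2 is private
unsafe.getLong(normalObject, a3Offset); // 30

unsafe.putLong(normalObject, a2Offset, 99L); // we can also write to the private field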


Thought Paper – What if we combine compression with encryption ??

This thought paper tries to reason about how we can combine encryption and compression into a single operation and do away with the traditional way of first compressing and then encrypting the compressed data to produce the final encrypted and compressed output.

Problem

Currently, we have huge volumes of data and cheap external storage options, so more often than not we have the use case of transferring data from those external storage options onto our compute machines at runtime, which means that we have to:

  1. Use an HTTPS connection, because we cannot risk third parties having a sneak peek at this data.
    • Although this could also be achieved via our own encryption mechanism:
      • Encrypting the data with a symmetric key A
      • Uploading the encrypted data
      • Downloading the encrypted data from the external storage
      • Decrypting the downloaded data with the same symmetric key
  2. Then uncompress the downloaded data. We compress in the first place because we want to save $$$: these external data stores charge for the data stored, so the better compressed the data, the lower the cost.

So essentially a lot of CPU cycles get spent in these two operations. What if we could combine these two somewhat orthogonal operations into a single operation, the only motivation being that we save some CPU cycles or instructions?

Traditionally, encryption and compression have been considered orthogonal to each other, but both operations have one thing in common: each takes a payload and transforms it into a different form.

Let us understand this statement with a simple example:

Compression

Payload (10101010100010010010) + CA ( ) = Result (11011001010)

where
Payload is the uncompressed data
CA () is the compression algorithm
Result is the compressed data


Encryption

Payload (10101010100010010010) + ECA ( Salt ) = Result (10100101)

where
Payload is the unencrypted data
ECA ( Salt ) is the encryption algorithm, which takes a salt as input
Result is the encrypted data


Now let’s save some CPU Cycles

We will try to construct a function which compresses as well as encrypts the data in fewer CPU cycles than the traditional way of compression + encryption.

We will first take a deep dive into the internals of GZIP compression. Then we will figure out a way of encrypting that GZIP-compressed data which helps us save CPU cycles.


GZIP

“GZIP provides a lossless compression, that is, we can recover the original data when decompressing it. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.”

Source: http://blog.servergrove.com/2014/04/14/gzip-compression-works/

LZ77

LZ77 is an algorithm in which we replace subsequent occurrences of a substring T' with a reference to the original substring T.

Original text: "ServerGrove, the PHP hosting company, provides hosting solutions for PHP projects" (81 bytes)
LZ77: "ServerGrove, the PHP hosting company, pidessolutions forjects" (73 bytes)

Source: http://blog.servergrove.com/2014/04/14/gzip-compression-works/


Huffman Encoding

The same logic can be applied to Huffman encoding as well in which we have different encodings for every character.

“Huffman coding is a variable-length coding method that assigns shorter codes to more frequent “characters”.”

Quoted from: http://blog.servergrove.com/2014/04/14/gzip-compression-works/


Deflate Algorithm

In the DEFLATE algorithm, we combine LZ77 and Huffman encoding to achieve better compression. We divide the uncompressed data into chunks and then apply compression to each chunk: we first apply the LZ77 technique to replace repeated pieces of strings with pointers to the original string, and then encode the literals in the chunk with Huffman trees. These Huffman trees are in turn appended to the compressed chunk and transferred along with the data.


Suggested Change to the current compression technique

We already know that our compressed data with GZIP ( deflate algorithm ) consists of chunks of compressed data along with their Huffman trees.

So here is the suggestion: what if we encrypt only the Huffman trees of the compressed chunks? This would give us encryption with very few instructions, because of the very limited size of the Huffman trees. The Huffman trees are the basis for decompressing the chunks, so if we have encrypted the Huffman trees of the compressed chunks, then there is no way to recover the decompressed chunks without first decrypting the Huffman trees.

So with this approach, we reap the benefits of compression as well as encryption with fewer instructions, or in other words, lower CPU consumption.
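
To make the idea concrete, here is a rough, illustrative Java sketch. It GZIP-compresses a payload and then encrypts only a small prefix of the compressed stream with AES-CTR (CTR is used only because it preserves length, so the prefix can be swapped in place). Treating "the first N bytes" as "the Huffman tables" is a simplification: in a real DEFLATE stream the code-length tables sit inside each block header, so a production version would have to parse the block structure; N, the key and the IV below are all made-up placeholders, not a vetted cryptographic design.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CompressThenEncryptHead {

    // Illustrative only: pretend the first 64 bytes cover the Huffman/code-length tables.
    private static final int HEAD = 64;
    private static final byte[] KEY = new byte[16]; // placeholder key
    private static final byte[] IV  = new byte[16]; // placeholder IV

    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] data) throws Exception {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    // Encrypt (or decrypt; CTR is symmetric) only the first HEAD bytes, in place.
    static byte[] cryptHead(byte[] compressed, int mode) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(mode, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        int head = Math.min(HEAD, compressed.length);
        byte[] out = Arrays.copyOf(compressed, compressed.length);
        byte[] transformed = cipher.doFinal(compressed, 0, head);
        System.arraycopy(transformed, 0, out, 0, head);
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = "some very repetitive payload payload payload payload".getBytes("UTF-8");

        byte[] protectedBytes = cryptHead(gzip(payload), Cipher.ENCRYPT_MODE);
        // without decrypting the head, gunzip(protectedBytes) fails immediately
        byte[] restored = gunzip(cryptHead(protectedBytes, Cipher.DECRYPT_MODE));

        System.out.println(new String(restored, "UTF-8"));
    }
}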


So how many CPU cycles can we save ??

Now each of these Huffman trees inside each chunk of text contains the mapping for the literals to their bit codes.

However, the Huffman code table itself is pretty big. It has to include codes for all 255 possible literal bytes, as well as the back pointers themselves. As a result, the lengths of the codes are themselves Huffman encoded! To avoid conceptual infinite recursion, the lengths of those codes are not Huffman encoded, but instead given as fixed three-bit fields. This means that there can only be at most 8 unique length codes for the Huffman codes that represent the compressed data itself. To summarize, then, the Deflate format is illustrated in figure 4:

Quoted from: http://www.infinitepartitions.com/art001.html

Let’s understand this with an example.
Suppose these are the Huffman codes corresponding to the characters.

0: ‘a’
1: ‘i’
10: ‘e’
110: ‘t’
111: ‘s’
1101: ‘j’

We will instead encode only the code lengths, in this way:

1: ‘a’
1: ‘i’
2: ‘e’
3: ‘t’
3: ‘s’
4: ‘j’

We encode in this way because we want to further reduce the space taken by Huffman codes.  For further understanding read this.

Now, assuming this compressed Huffman code table amounts to around 20% of the compressed data, we could save around 80% of the CPU cycles spent on encryption, since we now encrypt only the table instead of the whole compressed payload.


Conclusion

So far we have talked about how we can save CPU cycles during encryption by taking advantage of the layout of the compressed data in the case of GZIP. Apart from GZIP, there are many other algorithms in the market which use Huffman codes for better compression of literals, so those as well can be compressed + encrypted with fewer CPU cycles using this same technique.

With any compression algorithm, we can achieve lower CPU consumption for encryption, given we encrypt only the dictionary of the compressed data and not the whole compressed data.


References

[1] http://blog.servergrove.com/2014/04/14/gzip-compression-works/
[2] https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
[3] https://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression#Adaptive_dictionary_algorithms
[4] http://www.infinitepartitions.com/art001.html
[5] https://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/

Understanding Linux Internals for Data Transfer – Part 3

In this blog post, we will try to get a better understanding of the current downloading scheme with the AWS S3 client or HttpClient by looking into the strace of the downloading client, and see how we can improve on it.

We already know from the first blog post the essential system calls which are necessary for the download to happen.

Now let's see the system calls involved in the download calls issued using each of the clients:

AWS S3 Client STrace for a 100 KB Payload Size

% time   seconds    usecs/call   calls   syscall
 55.32   0.028953   7            4023    poll
 26.85   0.01405    59           239     futex
 11.41   0.00597    0            16018   gettimeofday
  3.37   0.001765   0            3927    recvfrom

[Figure: all-CPU utilisation while downloading 100 KB with the AWS S3 client]

In this we can clearly see the system calls used by the AWS S3 client for downloading data.

Conclusions:

  • Most of the time is being spent in non-blocking IO ( a combination of poll and recvfrom system calls )
  • High usage of gettimeofday calls, which are supposed to be slow on AWS platforms
  • Some system time is also being consumed by the futex system calls, which mostly means time spent waiting for a child thread to complete ( as we are tracing a single thread, or a process in Linux terms ).
  • Using S3 Client
    • CPU User = ~7% 
    • CPU System = ~2.5%

HttpClient STrace for a 100 KB Payload Size

% time   seconds    usecs/call   calls    syscall
 76.12   0.162679   1            239703   read
  9.19   0.019631   5            4089     recvfrom
  8.35   0.017847   45           399      futex
  4.54   0.009708   98           99       connect
  1.4    0.002996   0            12232    gettimeofday

[Figure: all-CPU utilisation while downloading 100 KB with HttpClient]

In this we can see why HttpClient is less performant than the AWS S3 client: the read() system calls clearly dominate the system time of this download.

Conclusions:

  • Most of the system time is being spent in read system calls.
  • Other system calls involved in the downloading process are the typical ones we expect ( aka recvfrom, connect ).
  • The multiple connect system calls are because we were making many download calls with HttpClient.
  • Using Http Client
    • CPU User = ~7% 
    • CPU System = ~3%

Note: We can definitely remove the need for these read system calls with some tweaking of the HttpClient settings. The AWS S3 client itself uses HttpClient for communication with the S3 service.


Simple Client STrace for a 100 KB Payload Size

Now that we understand the basic things required for downloading data, let's write our own client to download data from S3.

We will use this simple piece of code to download data from S3 via a URI.

Socket s = new Socket(host, 80);
PrintWriter wtr = new PrintWriter(s.getOutputStream());
wtr.println("GET " + url + " HTTP/1.1\r\nHost: " + host + "\r\n");
wtr.flush();

InputStream inputStream = s.getInputStream();
// readHeaders consumes the HTTP response headers and returns the Content-Length
int payloadSize = readHeaders(inputStream);
int read = 0;
int ret = 0;
byte[] buffer = new byte[8192];
while (read != payloadSize) {
    ret = inputStream.read(buffer, 0, buffer.length);
    if (ret == -1) return;
    read += ret;
}
wtr.close();
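
The readHeaders helper is not shown in the post; one plausible implementation (an assumption, not the original code), which consumes the response headers byte by byte and pulls out the Content-Length, is sketched below:

// Reads the status line and headers up to the blank line, returns Content-Length.
// Assumed implementation; java.io.InputStream / java.io.IOException imports required.
private static int readHeaders(InputStream in) throws IOException {
    StringBuilder headers = new StringBuilder();
    int c;
    while ((c = in.read()) != -1) {
        headers.append((char) c);
        if (headers.length() >= 4
                && headers.substring(headers.length() - 4).equals("\r\n\r\n")) {
            break; // end of headers
        }
    }
    for (String line : headers.toString().split("\r\n")) {
        if (line.toLowerCase().startsWith("content-length:")) {
            return Integer.parseInt(line.substring("content-length:".length()).trim());
        }
    }
    return -1; // no Content-Length header (chunked encoding is not handled here)
}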

Let's dig into the strace for this code while downloading data.

% time   seconds    usecs/call   calls   syscall
 95.06   0.026071   1            37609   recvfrom
  1.36   0.000372   4            98      connect
  1.11   0.000305   1            431     gettimeofday
  0.57   0.000157   2            98      dup2
  0.35   0.000095   1            97      sendto

[Figure: all-CPU utilisation while downloading 100 KB with the simple client]


Conclusions:

  • In this particular client implementation, we can clearly see that most of the system time is being spent in recvfrom system calls, which means it is spent copying data from kernel space to user space, or waiting for data to arrive in the kernel buffers.
  • CPU consumption is close to that of the AWS S3 client.
  • Using Simple Client
    • CPU User = ~7% 
    • CPU System = ~2.5%


Now let's compare the download time ( in ms ) of our new simple client with the AWS S3 client and HttpClient for a 100 KB payload size.

Percentile   S3 Client   HttpClient   Simple Client
10th Pct     538         631          390
20th Pct     561         693          399
30th Pct     591         711          417
40th Pct     598         763          435
50th Pct     605         772          446
60th Pct     610         778          457
70th Pct     643         786          461
80th Pct     671         791          486
90th Pct     702         805          498

Conclusions:

  • The simple client outperforms the S3 client by around 30%
  • The simple client outperforms HttpClient by around 45%


Now let's compare these numbers for every client with a bigger payload size, i.e. 50 MB, and compare the CPU consumption of every client.

Percentile   S3 Client   HttpClient   Simple Client
10th Pct     4081        3742         3463
20th Pct     4207        4035         3845
30th Pct     4405        4107         4105
40th Pct     4699        4352         4412
50th Pct     6098        4634         4748
60th Pct     6457        6288         5407
70th Pct     7458        7074         5647
80th Pct     7911        9079         5847
90th Pct     11491       11531        6022

[Figure: download time comparison of the three clients for the 50 MB payload]

Conclusions:

  • The simple client outperforms the S3 client by around 15% at lower percentiles, and at higher percentiles the gap grows to around 20%.
  • The simple client outperforms HttpClient by around 10% at lower percentiles and around 15% at higher percentiles.
  • CPU consumption is also lower for the simple client, staying around 7.5% to 8%, compared to the S3 client or HttpClient whose CPU consumption lingers around 12.5% to 15%. So that's a win-win situation.


Note:
The simple client performs better with lower resource consumption, but in the current implementation / experiments we are using this client over a plain HTTP connection, which means we need to be really cognisant of when we want to use it.

Understanding Linux Internals for Data Transfer – Part 2

We have already discussed, in part 1 of this series, the basic system calls required for communicating with a server.

In this blog post our main goal is to understand the differences in downloading data from S3 with and without using the AWS SDK.

Experimental Setup:

  • I3.xlarge machine
  • Different Payload Sizes
    • 100KB
    • 500KB
    • 1MB
    • 5MB
    • 10MB
    • 50MB
  • Concurrency
    • 1 Thread
    • 10 Threads
  • Downloading Client
    • HttpClient
    • AWS-SDK
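
For reference, the two download paths being compared look roughly like this (the bucket, key and URL are placeholders; the calls assume the AWS Java SDK v1 and Apache HttpClient 4.x, which is what a 2017-era setup would typically use):

import java.io.InputStream;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class DownloadClients {

    // Download via the AWS SDK
    static void downloadWithAwsSdk() throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        S3Object object = s3.getObject("my-bucket", "my-key"); // placeholder bucket/key
        try (InputStream in = object.getObjectContent()) {
            drain(in);
        }
    }

    // Download the same object via a URL using Apache HttpClient
    static void downloadWithHttpClient() throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            InputStream in = client.execute(new HttpGet("https://my-bucket.s3.amazonaws.com/my-key"))
                                   .getEntity().getContent();
            drain(in);
        }
    }

    private static void drain(InputStream in) throws Exception {
        byte[] buffer = new byte[8192];
        while (in.read(buffer) != -1) {
            // discard the data; we only measure download time
        }
    }
}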

Metrics to Monitor:

  • Download Speed ( Mb/s )
  • CPU Utilisation
  • Network Percentage Utilisation

Experiment Results:

Download Time ( in ms )

  • 100 KB Payload
Concurrency   HttpClient   AWS SDK
10 Threads    721 ms       615 ms
1 Thread      734 ms       611 ms

  • 500 KB Payload
Concurrency   HttpClient   AWS SDK
10 Threads    1116 ms      970 ms
1 Thread      1152 ms      1023 ms

  • 1 MB Payload
Concurrency   HttpClient   AWS SDK
10 Threads    1334 ms      1221 ms
1 Thread      1362 ms      1268 ms

  • 5 MB Payload
Concurrency   HttpClient   AWS SDK
10 Threads    1932 ms      1788 ms
1 Thread      1991 ms      1870 ms

  • 10 MB Payload
Concurrency   HttpClient   AWS SDK
10 Threads    2294 ms      2156 ms
1 Thread      2247 ms      2177 ms

  • 50 MB Payload
Concurrency   HttpClient   AWS SDK
10 Threads    4448 ms      4096 ms
1 Thread      4223 ms      4100 ms

CPU Statistics

  • With 10 Concurrency

  • With 1 Concurrency


Network Statistics

  • With 10 Concurrency
  • With 1 Concurrency


Conclusions

  • AWS SDK outperforms HttpClient for all the payload sizes.
  • CPU Utilisation while using AWS SDK and HttpClient is comparable.
  • Network throughput while using the AWS SDK and HttpClient is also comparable.


In the next blog post , we will go into the details of these downloads and see if we can improve upon the download time.


Understanding Linux Internals for Data Transfer – Part 1

This would be a series of 3 blog posts in which we are going to talk about:

  1. Basics of Networking and System calls involved in client-server communication
  2. Understanding and benchmarking different data download techniques from AWS-S3
  3. Potential Bottlenecks in the current downloading scheme from AWS-S3 and how to mitigate those

In this blog we are specifically going to talk about how Linux currently implements and exposes APIs or system calls for transferring data from one system to another, with the focus being on downloading, and how we can improve the download time.

In this modern era, with the advent of big data and distributed systems, it's becoming more and more difficult to cache all the data you have on production machines. So for storing this vast amount of data, we have options like Amazon S3 or Google Cloud Storage. But with this approach comes another set of problems: how can we quickly download this data and make the user experience better?

But first we need a good understanding of how this data gets downloaded onto our machine. To understand this, we will have a look at a simple piece of code which makes a GET request to a specific URL on a server.

This piece of code makes a GET request for the / URI on google.com:80:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(int argc , char *argv[])
{
  int socket_desc;
  struct sockaddr_in server;
  char *message , server_reply[6000];

  //Create socket
  socket_desc = socket(AF_INET , SOCK_STREAM , 0);

  //ip address of www.google.com (get by doing a ping www.google.com at terminal)
  server.sin_addr.s_addr = inet_addr("216.58.220.164");
  server.sin_family = AF_INET;
  server.sin_port = htons( 80 );

  //Connect to remote server
  connect(socket_desc , (struct sockaddr *)&server , sizeof(server));

  //Send some data
  message = "GET / HTTP/1.1\r\nHost: google.com:80\r\n\r\n";
  send(socket_desc , message , strlen(message) , 0);

  //Receive a reply from the server and print it
  recv(socket_desc , server_reply , 6000 , 0);
  puts(server_reply);

  close(socket_desc);
  return 0;
}

Let's understand each of the system calls used in this snippet.

Socket

This system call is used to initialise the data structures which are essential for carrying out the communication between a client and a server. It returns a file descriptor on which we can use generic file-descriptor system calls like read and write, which internally get mapped onto send or recvfrom; this mapping is done by the kernel itself.

int socket(int domain, int type, int protocol)

where

domain: Specifies the Protocol Family which would be used 
        for creating the socket.
        e.g LocalCommunication , IPv4 Internet protocols
type: Specifies the communication semantics which needs to be 
      followed while creating the socket.
      e.g. SOCK_STREAM, SOCK_DGRAM
protocol: Particular protocol to be used with the socket. Normally 
          there is only a single protocol which can be used 
          with a given socket.
          e.g. Most of the time 0, as there is only a single protocol

It returns the file descriptor of the socket on which subsequent 
system calls have to be made.

Connect

This system call is responsible for establishing the connection between a client and a server. It is used by clients to connect to a server running on a particular port. For a TCP connection, connect is responsible for performing the 3-way handshake and for changing the state of the kernel-level data structures to reflect the connection state change, i.e. ESTABLISHED.

This is a generic flow for a connect system call in case of TCP connection.

[Figure: connect system call flow for a TCP connection]

int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen)

where

sockfd: File descriptor representing a socket
addr: Address structure which contains the details of the server to which we
      make the connection
addrlen: Size of the socket address data structure

If the connection is successfully established 0 is returned otherwise
-1 is returned.

Recv

This is the generic flow for a recv system call:

[Figure: recv system call flow]

ssize_t recv(int sockfd, void *buf, size_t len, int flags);

where

sockfd: File Descriptor
buf: Pointer of User space Memory Buffer for received data
len: Amount of data to be received

It returns the amount of data or bytes which are actually returned
or copied in the user space buffer

But there is a single caveat in this diagram: the recvfrom system call does not always result in the process getting blocked. If there is already data to read in the socket buffer, it is copied straight into the user-space buffer of the process and the call returns immediately.

Send

This is the generic flow for a send system call:

[Figure: send system call flow]

ssize_t send(int sockfd, const void *buf, size_t len, int flags);

where

sockfd: File Descriptor
buf: Pointer of User space Memory Buffer in which data is contained
len: Amount of data to be sent

It returns the number of bytes which were actually sent.

As with recv, in this system call we copy the data out to the socket buffer; the call might block us if the socket buffer is full ( unless the socket is in non-blocking mode ).

More on this you can read out here:

  1. TCP IP Network Stack
  2. Linux TCP IP Presentation
  3. Tcp System Calls


Linux Block MQ – simple walkthrough

In this article we are going to talk about Linux Block MQ and its potential benefits.


What is Block MQ ?

So Block MQ is just a different queueing mechanism for the block layer. As of now, we have a single queue which resides in the block layer and is responsible for accumulating all the BIO requests, which eventually get sent out to the device drivers. With Block MQ we plan to change this single-queue behaviour to multi-queue, which results in increased throughput ( discussed later ).

NVM based SSDs vs Legacy SSDs

With NVMe-based SSDs becoming popular day by day, it is becoming more and more obvious that the device is not the bottleneck anymore. Thanks to the close to a million IOPS provided by these devices, courtesy of their high internal parallelism, we have come to realise the bottlenecks in our kernel / software layer, as you may call it.

History of NVM based SSDs

Legacy interfaces like SATA or SCSI were used predominantly to communicate with storage devices. These interfaces were built at a time when rotating devices were the main source of storage. But with the arrival of SSDs and the huge internal parallelism offered by these devices, these old storage interfaces proved to be a bottleneck. So to get rid of these bottlenecks, the PCIe communication interface was designed, keeping in mind that the interface should be able to harness the full capacity of the underlying devices and should not itself prove to be a bottleneck.

With PCIe based SSDs we also get huge benefits:

  1. Better bandwidth
  2. Direct connection to the CPU, so less overhead.

Furthermore, PCIe offers less latency than a SATA connection due to how PCs are designed: it allows for a direct connection to the CPU. Traditionally, the SATA controller is connected to a chipset which is then connected to the CPU, whereas the PCIe links travel directly to the CPU. Therefore, there is no need to bridge back and forth between intermediate technologies in data communication, allowing for faster data flow and processing.

Now, with the advent of PCIe-based SSDs, we needed a different interface for communicating between the host and the device. Each PCIe SSD vendor was creating its own driver, without following any standardisation for communication with the block layer, and hence it became very difficult to manage all these device drivers. So, to enable faster adoption of PCIe SSDs, the NVMe protocol was developed.

The NVMe protocol was developed keeping in mind that the host should be able to fully utilise the parallelism of the underlying storage device with minimal latency / high performance.

The NVMe protocol over the PCIe storage interface was able to achieve these goals with some basic architectural changes:

  • High Parallelism for the IO requests
    • With NVMe-based SSDs we have multiple hardware queues which are manipulated by the device driver as well as the disk controller. The device driver enqueues requests into a hardware queue ( aka the submission queue ) and the controller dequeues the requests from that queue and, after completing the IO, submits the result into another queue ( aka the completion queue ).
  • Better PCIe Interface capabilities
    • This has already been discussed previously

Still we have some bottlenecks in the block layer 🙁 🙁

Even after NVMe-based SSDs took the market by storm, providing millions of IOPS, there were still some bottlenecks in the kernel block layer, which meant that there were huge benefits to be discovered just around the corner.

Let's discuss those issues / overheads.

  1. The request queue lock proved to be a single point of contention in the block layer. A lot of CPU cycles get wasted just spinning / contending for this lock, which is responsible for synchronising shared access to the block layer request queue and its data structures. The request queue offers a huge number of operations, and the request queue lock must be acquired before executing any of them, which means all other threads need to wait for some amount of time before they can acquire the lock and perform their corresponding operation. Just to give you a taste:
    struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
    {
    // Some Code
    	spin_lock_irq(q->queue_lock);
    	if (gfp_mask & __GFP_WAIT) {
    		rq = get_request_wait(q, rw, NULL);
    	} else {
    		rq = get_request(q, rw, NULL, gfp_mask);
    		if (!rq)
    			spin_unlock_irq(q->queue_lock);
    	}
    
    // Some Code	
    }

    This code is responsible for getting a block IO request from the request queue for the device driver. We can see that the executing thread has to acquire the lock, execute some critical-section code and then release the lock.


  2. Cache Coherency Problems
    Currently we have a single request queue into which we submit all the IO requests. Irrespective of which CPU issued which IO request, they all end up in the same request queue. These IO requests are transferred to the block device driver queue, which internally maintains a number of hardware queues ( which map onto the number of parallel tasks supported by the underlying storage device ). So this mutable data structure, the request queue, is shared across different CPUs ( or cores ), which in turn demands a lot of cache-coherence traffic, because most of the time the shared request queue data structures will be present in the caches due to temporal locality.

Block Multi Queue to the rescue 🙂 🙂

Block multi-queue solves these problems and bottlenecks by making some very simple changes to the architecture of the block layer queueing mechanism.
As of now, we have a single queue per device, which leads to the problems we discussed, like lock contention and cache coherence. All of these problems have one thing in common: they arise because multiple execution threads run simultaneously and use the same data structures.

Isn’t it ?

What if we segregate the request queue per CPU ( or core )?
Now, as we have a request queue per CPU, only one execution thread can run at a time on a given CPU ( or core ), which automatically solves the problem of lock contention, because we no longer need to worry about other simultaneously executing threads ( running on other cores ) having access to the same data structures or the same request queue.

This request-queue-per-CPU approach also minimises cache-coherence traffic. This is because we no longer have common mutable data structures being shared across different CPU cores, which means no cache invalidations for shared data structures, and hence multi-queue is more performant than the single-queue implementation.

We propose a new design for IO management within the block layer. Our design relies on multiple IO submission/completion queues to minimize cache coherence across CPU cores. The main idea of our design is to introduce two levels of queues within the block layer: (i) software queues that manage the IOs submitted from a given CPU core (e.g., the block layer running on a CPU with 8 cores will be equipped with 8 software queues), and (ii) hardware queues mapped on the underlying SSD driver submission queue.

Quoted from Linux Block IO[1]

Architecture Changes in Block Multi Queue

Few changes which were introduced in Block Multi Queue were:

  1. Software Staging Queues per CPU core
    In the previous implementation we used to have a single queue per device, but with block multi-queue a request queue per core was introduced. This results in negligible lock contention across the different request queues.
  2. Hardware Dispatch Queues
    Just before the requests are handed over to the device driver by the software staging queues, a new queueing layer was introduced, namely the hardware dispatch queues. These queues act as mediators before the requests get handed over to the device driver for final execution: they control the submission rate to the underlying device driver buffers to prevent them from getting overwhelmed.

[Figure: block multi-queue architecture with software staging queues and hardware dispatch queues]

Now lets get our hands dirty !!!

So we are going to do benchmarking of block multi-queue over NVMe based SSDs.

Setup:

AWS I3.xLarge machine

  • 4 CPU Cores
  • This instance has two instance store volumes: nvme0n1 and nvme1n1. We are going to benchmark only nvme0n1 for analysis' sake.
  • dd to make read requests, writing the data to /dev/null
  • blktrace and btt to understand the time spent in each stage of the request lifecycle

Experiment:

dd if=/dev/nvme0n1 bs=4k of=/dev/null count=10000000 iflag=direct
// This command would keep on making IO requests 

blktrace -d /dev/nvme0n1 -o nvmOutput -w 30
// BLKTrace does the work of profiling the request lifecycle
// It shows the result as where does the request spent what time
// It generates the result per CPU core because it is CPU who 
// is spending the time ( queuing and merging and plugging ) 
// for the request until it gets dispatched

btt -i nvmOutput.blktrace.2 
// Btt shows the result from blktrace in human readable format
// We are analysing the results from all the requests on a 
// single core i.e. 2nd core

Results:

==================== All Devices ====================

ALL             MIN            AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q              0.000026726   0.000042731   1.488062947   702073
Q2G              0.000000122   0.000000218   0.000024321   702074
G2I              0.000000432   0.000000573   0.000023750   702074
I2D              0.000000270   0.000000384   0.000037557   702074
D2C              0.000019678   0.000025327   0.001429022   702073
Q2C              0.000020829   0.000026503   0.001430629   702073

To understand each of the individual terms, have a look at this link.

Quoting from the btt guide

Q2Q which measures the time between queue traces in the system. 
This provides some idea as to how quickly IOs are being handed 
to the block IO layer.

Q2C which measures the times for the complete life cycle of 
IO's during the run

So we can ignore Q2Q and Q2C for the time being.

But apart from that, most of the time for a request is being spent in the driver + the underlying NVMe SSD, which is denoted by D2C. That is a good sign for block multi-queue.

References

[1] Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
[2] BTT Guide
[3] NVM Express
[4] Elixir For Linux Code Documentation
[5] Performance Analysis of NVMe SSDs and their Implication on Real World Databases

Uneven Distribution of Requests on Large Scale Clusters

In this post, we are going to talk about one of the issues that we faced in our production environment with partitioning.

Some Basic Terminologies:

  • Task
    • It is a basic unit of work which gets executed on a node.
    • Tasks are assigned to the nodes based on hashing.
      1. Each Task has a metadata field which is a string.
      2. We take the hashCode of the string and, based on this hashCode and the cluster size, we route the task to one of the nodes of the cluster (see the sketch below).
  • A Request is nothing but a collection of these tasks.
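
As a rough sketch of the routing just described (the method name and cluster size are illustrative, not our actual code):

// Route a task to one of `clusterSize` nodes based on the hashCode of its metadata string.
// Math.floorMod keeps the index non-negative even when hashCode() is negative.
static int pickNode(String taskMetadata, int clusterSize) {
    return Math.floorMod(taskMetadata.hashCode(), clusterSize);
}

// e.g. pickNode(task.getMetadata(), 8) -> a node index in [0, 8)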

Let us take an example:

  • Request
    • R( T1, T2, T3, ….. , Tn)
  • Sub-Requests
    • R(TI1, …..TI2) for Node1
    • R(TI3, …..TI4) for Node2
    • R(TI5, …..TI6) for Node3
    • ….

[Figure: a request split into sub-requests routed to the nodes of the main cluster]

Now, as we are a cloud-based company, our nodes become unavailable more often than not. In scenarios where a node of the cluster is unavailable, we route the tasks for that node to an auxiliary cluster.

  • So suppose we had assigned tasks Tk1, ….. Tk2 to Node6.
  • Node6 becomes unavailable due to some intermittent issue.
  • Then we route the tasks Tk1, ….. Tk2 to the auxiliary nodes.
  • The distribution of tasks on the auxiliary cluster is decided by the same hashing technique which we used for routing the tasks to the main cluster.


[Figure: tasks of the unavailable node re-routed to the auxiliary cluster]

Problem:
We started seeing an uneven distribution of tasks on those auxiliary nodes whenever any of the nodes in the main tier was unserviceable. Because of this, we were constantly keeping our auxiliary clusters over-provisioned, which led to huge increases in cost and operational overhead.

Explanation:

So this uneven distribution of the tasks on the auxiliary cluster was due to the simple fact that the tasks which were assigned to Node6 had one special property: all their hashCodes followed this format:

HashCode ( Tasks Assigned to Node6 ) = 8*k + 6

So now all these tasks, when routed to the auxiliary cluster, would only be routed to specific nodes.

HashCode ( Tasks Assigned to Node6 ) % 4 = i = ith node of the auxiliary cluster
As HashCode ( Tasks Assigned to Node6 ) = 8*k + 6

Hence,
(8*k + 6) % 4 = 2 ( an even number )
So the ith node of the auxiliary cluster is always even-numbered (in this example, in fact, always node 2).

So all of these tasks would get routed only to even-numbered auxiliary nodes, and hence the uneven distribution.

Solution:

This uneven distribution on the auxiliary cluster came from the fact that both of the hash functions ( for the main cluster as well as the auxiliary cluster ) were the same and used the same hashCode of the tasks.

We want these two hash functions to be independent of each other.
By independent, I mean that if we have any information about HashFunctionMain, we shouldn't be able to make any prediction regarding HashFunctionAux.

We tried to solve the problem by using HashCodeBuilder, which is supposed to provide a different way of computing the hash of an object ( a string in our case ). A minimal usage sketch is shown below.
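
HashCodeBuilder comes from Apache Commons Lang; a minimal sketch of how it is typically used for a string (the 17/37 seeds are the library's documented defaults, and the method name here is illustrative):

import org.apache.commons.lang3.builder.HashCodeBuilder;

// Hash the task metadata with Commons Lang's HashCodeBuilder instead of String.hashCode()
static int auxiliaryHash(String taskMetadata) {
    return new HashCodeBuilder(17, 37)
            .append(taskMetadata)
            .toHashCode();
}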

Still the Problem Persists:

Even after using HashCodeBuilder for finding the hash of the object, we were still seeing an uneven distribution on our auxiliary nodes. We started digging deeper into the codebase of HashCodeBuilder and, to our surprise, we found that every string or char hashCode reduces to this generic formula:

HashCodeBuilder's HashCode Implementation

s[0]*m^(n-1) + s[1]*m^(n-2) + ... + s[n-1]
where 
  m: Prime Number
  n: Characters in the string
  s[0], s[1], ... s[n]: Characters in the string


Java's HashCode Implementation

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
where 
  n: Characters in the string
  s[0], s[1], ... s[n]: Characters in the string

So we can clearly see that there is a clear dependency between these two hashCodes.
The dependency being that if s[0] + s[1] + ... + s[n-1] is even,

  • Java’s HashCode would also be even
  • HashCodeBuilder’s HashCode would also be even

As we can see, there is some relation between the two hash functions, so we cannot use HashCodeBuilder for computing the hashCode used for routing to the auxiliary nodes.

So finally, we arrived at two solutions.

  1. Using SHA Hash
    • More Time Consuming
    • Takes more CPU cycles
  2. Using the hashCode of the first (n-1) characters
    • We have to show that, given the hashCode of n characters (which is essentially a constant), the hashCode of the first (n-1) characters is still uniformly distributed. This follows from a simple fact:
Proof:

C:        hashCode of the n characters
s[n-1]:   nth character
H(n-1):   hashCode of the first (n-1) characters

Now, from Java's hashCode formula we know

C = 31 * H(n-1) + s[n-1]
so ( C - s[n-1] ) / 31 = H(n-1)

Now this C is a constant and s[n-1] has a (roughly) uniform distribution,
taking values in [0, 255).

Hence combining a uniformly distributed term, i.e. s[n-1], with a
constant, i.e. C in our case, keeps the resultant H(n-1) uniformly
distributed over [0, x).

We wrote a small program to test the hypothesis as well

package routing // the protected[routing] modifier below assumes this package

import org.apache.commons.lang3.builder.HashCodeBuilder

import scala.util.Random

object HashingTest {
  protected[routing] def getHash(task: String): Int = {
    task.hashCode
  }

  private def mod(i: Int, modulus: Int): Int = {
    val result = i % modulus
    if (result >= 0) {
      result
    } else {
      result + modulus
    }
  }

  private def pickInitialNode(task: String): Int = {
    val nodeIndex = mod(getHash(task), 18)
    nodeIndex
  }

  private def pickAuxiliaryNode(task: String): Int = {
    // hash only the first (n-1) characters for the auxiliary routing
    val hashCodeBuilder = new HashCodeBuilder
    hashCodeBuilder.append(task.substring(0, task.length - 1))
    val hashCode = hashCodeBuilder.toHashCode
    mod(hashCode, 18)
  }

  def main(args: Array[String]): Unit = {
    // for every random task that lands on node 6, record which auxiliary node it maps to
    val tasks = (1 to 100000).map(elem => {
      val task = Random.nextString(31)
      val a = pickInitialNode(task)
      if (a == 6) {
        pickAuxiliaryNode(task)
      } else {
        -1
      }
    })
    tasks.filter(elem => elem != -1).toSet.toList.sorted.map(println)
  }
}
Generated Output

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

In the above output we can see that the tasks were getting routed to all the nodes of the auxiliary cluster.
Also, to show a performance comparison between SHA hashing and our home-made hashing technique:

Hashing for 10000 task strings took:

SHA Hash
pct90 47 ms
pct50 25 ms

HashCode (n-1) Characters
pct90 38 ms
pct50 21 ms

So in this experiment we can clearly see that our hashing technique for routing to the auxiliary nodes performs better than SHA hashing. We were also able to save $$$ after we rolled this out, thanks to the uniform distribution across the clusters.


Understanding Long GC Pauses with JVMs

We recently started seeing high GC pauses with our in-house, in-memory cache.
This memory cache has:

  1. All of the metadata required for response is cached in memory
  2. Low Response rates.

Configuration of the metadata-caching application that we were using:

  • -Xmx: 44 Gb
  • Young Generation: 15Gb
  • Old Generation: 30 Gb
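
The post does not list the exact JVM flags, but the GC log format further down (PSYoungGen / ParOldGen / PSPermGen) points at the Parallel collector, so a flag combination along these lines, with the sizes from the list above, should reproduce the setup (the jar name is just a placeholder):

java -Xms44g -Xmx44g -Xmn15g \
     -XX:+UseParallelOldGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar metadata-cache.jar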

Now we started investigating those high GC pauses in our application and, interestingly enough, we couldn't find any minor GC cycles in our logs. On looking further, we could see the application was doing full GC at regular intervals.

Sample log lines:

[Full GC [PSYoungGen: 13417472K->1003667K(14090240K)] [ParOldGen: 30037496K->30037470K(30037504K)] 43454968K->31041137K(44127744K) [PSPermGen: 82734K->82734K(82944K)], 11.6876010 secs] [Times: user=91.27 sys=0.00, real=11.69 secs]
[Full GC [PSYoungGen: 13417472K->1021218K(14090240K)] [ParOldGen: 30037096K->30037496K(30037504K)] 43454568K->31058714K(44127744K) [PSPermGen: 82740K->82734K(82944K)], 15.4744130 secs] [Times: user=118.96 sys=0.00, real=15.47 secs]
[Full GC [PSYoungGen: 13417472K->993065K(14090240K)] [ParOldGen: 30037131K->30037096K(30037504K)] 43454603K->31030161K(44127744K) [PSPermGen: 82734K->82734K(82944K)], 17.5521800 secs] [Times: user=134.98 sys=0.00, real=17.55 secs]

In these log lines, we can clearly see that during each full GC cycle only the young generation has been garbage collected while the old generation stays more or less the same, which means there were plenty of short-lived objects that got collected only when the full GC was triggered.

This was not at all what we were expecting; our expectation was that the young generation filling up should never trigger a full GC. In fact, we were under the impression that whenever we have a failure to allocate new objects in the young generation, a minor GC should kick in and collect all the short-lived objects, as we had plenty of short-lived objects in this case. But it seemed from the logs that every time we had an allocation failure in the young generation, a full GC was kicking in, which caused huge pauses in our application.

(This is because we had a 30 GB old generation, so even though it held only strongly referenced objects, it took the GC threads a huge amount of time just to traverse the 30 GB of heap space.)

[Figure: jstat output during the full GC cycle]

We can clearly see the jstat output in this snapshot attached.

According to the snapshot, we can see that the full GC cycle started at 13383165.2 and ended at 13383177.3. During this time our application was not able to serve any requests.

This full GC cycle was triggered because there was an object allocation failure in the young generation. Interestingly enough, the survivor regions could have been used to deal with the allocation failure easily, but instead the JVM triggered a full GC and paid the penalty of traversing the whole old generation.

Now I tried reproducing this behaviour with this small piece of code:

object GCTest {
 val str = new scala.util.Random(500)
 def main(args: Array[String]): Unit = {
   var l: List[String] = List.empty
   while(true) {
     val a = str.nextString(50)
     l = l :+ a
     Thread.sleep(10)
   }
  }
}

The GC cycles were as expected:

[Figure: jstat output for the reproduction program]

Few Observations from the GC logs:

  1. Application is doing FULL GC cycles
  2. Old Generation is fully occupied by the long lived objects – objects strongly referenced by the List.
  3. No Minor GC cycles
  4. Lots of non-live objects in the young generation which get cleared in the next GC cycle

We can clearly see in the above GC logs that, in spite of the survivor regions being empty and there being many non-live objects in the young generation (observation 4), the JVM still does full GC cycles instead of the expected minor GC cycles, which would have collected these non-live objects and reduced the memory pressure on the young generation.

So, at last, my learning from this whole fiasco is that the JVM behaves oddly when used with heaps in which we have to maintain a lot of long-lived objects. All of these eventually get pushed to the old generation, and once the old generation is full, instead of minor GCs our application does full GC cycles, which means much lower throughput.

Note: All of the above observations are from OpenJDK Java version 1.7.0_151.