Linux Block MQ – simple walkthrough

In this article we are going to talk about Linux Block MQ and its potential benefits.


What is Block MQ?

Block MQ is simply a different queueing mechanism for the block layer. Today we have a single queue that resides in the block layer and is responsible for accumulating all the BIO requests, which eventually get sent out to the device drivers. With Block MQ this single-queue behaviour changes to multi-queue, which results in increased throughput (discussed later).

 

NVM based SSDs vs Legacy SSDs

With NVMe-based SSDs becoming popular day by day, it is becoming more and more obvious that the device is no longer the bottleneck. With close to a million IOPS provided by these devices, courtesy of their high internal parallelism, we have come to realise that the bottlenecks now lie in the kernel, or the software layer as you may call it.

History of NVM based SSDs

Legacy interfaces like SATA or SCSI were predominantly used to communicate with storage devices. These interfaces were built at a time when rotating devices were the main form of storage. But with the arrival of SSDs and the huge internal parallelism offered by these devices, these old storage interfaces proved to be a bottleneck. To get rid of these bottlenecks, the PCIe interface was adopted for storage, with the goal that the interface should harness the full capacity of the underlying devices and not itself become a bottleneck.

With PCIe based SSDs we also get huge benefits:

  1. Better bandwidth
  2. Direct connection to the CPU so less overhead.
Furthermore, PCIe offers lower latency than a SATA connection due to how PCs are designed: it allows for a direct connection to the CPU. Traditionally, the SATA controller is connected to a chipset which is then connected to the CPU, whereas PCIe links travel directly to the CPU. Therefore, there is no need to bridge back and forth between intermediate technologies in data communication, allowing for faster data flow and processing.

Now, with the advent of PCIe-based SSDs, we needed a different interface for communicating between the host and the device. Each PCIe SSD vendor was creating its own driver, without any standardisation in how it communicated with the block layer, and hence it became very difficult to manage all these device drivers. So, to enable faster adoption of PCIe SSDs, the NVMe protocol was developed.

The NVMe protocol was developed keeping in mind that the host should be able to fully utilise the parallelism of the underlying storage device, with minimal latency and high performance.

The NVMe protocol over the PCIe storage interface was able to achieve these goals with some basic architectural changes:

  • High Parallelism for the IO requests
    • With NVM-based SSDs we have multiple hardware queues, manipulated by the device driver as well as the disk controller. The device driver enqueues requests into a hardware queue (the submission queue), the controller dequeues the requests from that queue and, after completing the IO, posts the result into another queue (the completion queue). A simplified sketch of such a queue pair follows this list.
  • Better PCIe Interface capabilities
    • This has already been discussed previously
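
To make the submission/completion queue idea concrete, here is a very rough, self-contained sketch in plain C. All names here (nvme_queue_pair, submit_io, the field layouts) are invented for illustration and do not match the actual NVMe specification or kernel structures; in real hardware the queues live in DMA-able memory and the controller is notified through per-queue doorbell registers.

    /*
     * Illustrative sketch only: a host-side view of one NVMe-style
     * submission/completion queue pair.  Names and layouts are made up
     * for clarity and do not follow the NVMe specification.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define QUEUE_DEPTH 1024

    struct sq_entry {               /* command the host submits */
        uint8_t  opcode;            /* e.g. read or write */
        uint16_t command_id;        /* echoed back in the completion */
        uint64_t lba;
        uint16_t nblocks;
    };

    struct cq_entry {               /* completion posted by the controller */
        uint16_t command_id;
        uint16_t status;
    };

    struct nvme_queue_pair {
        struct sq_entry sq[QUEUE_DEPTH];   /* host -> controller */
        struct cq_entry cq[QUEUE_DEPTH];   /* controller -> host */
        uint16_t sq_tail;                  /* host produces here */
        uint16_t cq_head;                  /* host consumes here */
    };

    /* The device exposes many such pairs, typically one per CPU core,
     * which is where the high internal parallelism comes from. */
    static struct nvme_queue_pair queue_pairs[4];

    /* Host side: place a command in the submission queue.  A real driver
     * would then write sq_tail to the device's submission-queue doorbell
     * register so the controller knows there is work to do. */
    static void submit_io(struct nvme_queue_pair *qp, struct sq_entry cmd)
    {
        qp->sq[qp->sq_tail] = cmd;
        qp->sq_tail = (qp->sq_tail + 1) % QUEUE_DEPTH;
    }

    int main(void)
    {
        struct sq_entry cmd = { .opcode = 0x02 /* read */, .command_id = 1,
                                .lba = 0, .nblocks = 8 };

        /* each core submits on its own queue pair, independently */
        submit_io(&queue_pairs[0], cmd);
        printf("queued command %u, sq_tail is now %u\n",
               cmd.command_id, queue_pairs[0].sq_tail);
        return 0;
    }

The completion path works the same way in reverse: the controller posts a cq_entry carrying the matching command_id, and the driver advances cq_head as it reaps completions from the completion queue.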

Still we have some bottlenecks in the block layer 😦 😦

Even after NVM-based SSDs took the market by storm and started providing millions of IOPS, there were still bottlenecks in the kernel block layer, which meant that huge benefits were waiting just around the corner.

Let's discuss those issues and overheads.

  1. Request Queue Lock proved to be a single point of contention in the block layer. A lot of CPU cycles are wasted just spinning on, and contending for, this lock, which is responsible for synchronising shared access to the block layer's request queue and its data structures. The request queue offers a huge number of operations, and the request queue lock must be acquired before executing any of them, which means every other thread has to wait before it can acquire the lock and perform its own operation. Just to give you a taste:
    struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
    {
    // Some Code
    	spin_lock_irq(q->queue_lock);
    	if (gfp_mask & __GFP_WAIT) {
    		rq = get_request_wait(q, rw, NULL);
    	} else {
    		rq = get_request(q, rw, NULL, gfp_mask);
    		if (!rq)
    			spin_unlock_irq(q->queue_lock);
    	}
    
    // Some Code	
    }

    This code is responsible for getting a block IO request from the request queue for the device driver. We can see that the executing thread has to acquire the lock, execute the critical-section code, and then release the lock.

     

  2. Cache Coherency Problems
    Currently we have a single request queue into which we submit all IO requests. Irrespective of which CPU issued an IO request, all of them end up in the same request queue. These IO requests are then transferred to the block device driver, which internally maintains a number of hardware queues (mapping onto the number of parallel tasks supported by the underlying storage device). So this mutable data structure, the request queue, is shared across different CPUs (or cores), which in turn generates heavy cache-coherence traffic: the shared request queue data structures tend to sit in each core's cache due to temporal locality, so every update by one core invalidates the copies held by the others. The toy program below illustrates the same effect.
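
To see that effect outside the kernel, here is a toy user-space program (nothing to do with the block layer; all names are made up for illustration). Two threads hammering one shared counter keep stealing the cache line from each other, much the way cores contend for the shared request queue, while giving each thread its own cache-line-padded counter, the analogue of a per-CPU queue, avoids that traffic.

    /*
     * Toy illustration of cache-line bouncing.
     * Build with: gcc -O2 -pthread cacheline_demo.c -o cacheline_demo
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 20000000UL

    static atomic_ulong shared_counter;        /* one line, fought over by both threads */

    struct padded_counter {
        atomic_ulong val;
        char pad[64 - sizeof(atomic_ulong)];   /* keep each counter on its own cache line */
    };
    static _Alignas(64) struct padded_counter per_thread[2];

    static void *bump_shared(void *arg)
    {
        (void)arg;
        for (unsigned long i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared_counter, 1);
        return NULL;
    }

    static void *bump_private(void *arg)
    {
        struct padded_counter *c = arg;
        for (unsigned long i = 0; i < ITERS; i++)
            atomic_fetch_add(&c->val, 1);
        return NULL;
    }

    /* run fn on two threads and return the elapsed wall-clock time */
    static double run_pair(void *(*fn)(void *), void *a0, void *a1)
    {
        struct timespec t0, t1;
        pthread_t t[2];

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&t[0], NULL, fn, a0);
        pthread_create(&t[1], NULL, fn, a1);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("shared counter    : %.2f s\n", run_pair(bump_shared, NULL, NULL));
        printf("per-thread counter: %.2f s\n",
               run_pair(bump_private, &per_thread[0], &per_thread[1]));
        return 0;
    }

On a multi-core machine the shared-counter run typically takes noticeably longer even though both runs do exactly the same amount of work; the difference is purely the cache-coherence traffic.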

Block Multi Queue to the rescue 🙂 🙂

Block Multi Queue solves these problems and bottlenecks by making some very simple changes to the architecture of the Block Layer Queueing Mechanism.
Until now we had this single queue per device, which led to the various problems we discussed: lock contention and cache-coherence overhead. All of these problems have one thing in common: they arise because multiple execution threads run simultaneously and use the same data structures.

Right?

What if we gave every CPU (or core) its own request queue?
Since only one execution thread runs at a time on a given CPU (or core), a per-CPU request queue automatically solves the problem of lock contention: we no longer need to worry that threads running simultaneously on other cores might be accessing the same data structures or the same request queue.

This request-queue-per-CPU approach also minimises cache-coherence overhead. This time there are no common mutable data structures shared across different CPU cores, which means no cache invalidations for shared data, and hence the multi-queue design is more performant than the single-queue implementation.

We propose a new design for IO management within the block layer. 
Our design relies on multiple IO submission/completion queues to 
minimize cache coherence across CPU cores. The main idea of our 
design is to introduce two levels of queues within the block layer: 
(i) software queues that manage the IOs submitted from a given CPU 
core (e.g., the block layer running on a CPU with 8 cores will be 
equipped with 8 software queues), and (ii) hardware queues mapped 
on the underlying SSD driver submission queue.

Quoted from Linux Block IO[1]

Architecture Changes in Block Multi Queue

A few of the changes introduced in Block Multi Queue were:

  1. Software Staging Queues per CPU core
    In the previous implementation we used to have a single queue per device, but with block multi-queue a request queue per core was introduced. This results in negligible lock contention, since each core works on its own staging queue.
  2. Hardware Dispatch Queues
    Just before the requests are handed over to the device driver by the software staging queues, a new queueing mechanism was introduced, namely hardware dispatch queues. These hardware dispatch queues act as a mediator before the requests get handed over to the device driver for final execution, and they control the submission rate to the underlying device driver buffers to prevent them from getting overwhelmed. A rough sketch of how a driver plugs into this machinery follows this list.
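
To give a rough idea of what this two-level design looks like from a driver's point of view, here is a heavily trimmed sketch of how a block driver hooks into blk-mq. The my_* identifiers are invented for this example; the blk_mq_* structures and calls are the real kernel API, but exact field names and signatures differ between kernel versions, so treat this as an outline rather than working driver code.

    /* Sketch of blk-mq driver registration (kernel-side C). */
    #include <linux/blk-mq.h>

    /* Called by blk-mq when a request moves from a software staging queue
     * to a hardware dispatch queue (hctx) and is ready for the device. */
    static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                    const struct blk_mq_queue_data *bd)
    {
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);
        /* ... translate rq into a device command and place it on the
         * hardware submission queue backing this hctx ... */
        return BLK_STS_OK;
    }

    static const struct blk_mq_ops my_mq_ops = {
        .queue_rq = my_queue_rq,
    };

    static struct blk_mq_tag_set my_tag_set;

    static int my_init_queues(void)
    {
        int ret;

        my_tag_set.ops          = &my_mq_ops;
        my_tag_set.nr_hw_queues = 4;      /* e.g. one per core / NVMe queue pair */
        my_tag_set.queue_depth  = 1024;
        my_tag_set.numa_node    = NUMA_NO_NODE;

        ret = blk_mq_alloc_tag_set(&my_tag_set);
        if (ret)
            return ret;

        /* blk_mq_init_queue() would then create the per-CPU software
         * staging queues and map them onto the nr_hw_queues hardware
         * dispatch queues declared above. */
        return 0;
    }

The NVMe driver used in the experiment below is one such blk-mq driver, with a hardware dispatch queue backed by an NVMe submission/completion queue pair.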

Now let's get our hands dirty!!!

So we are going to benchmark block multi-queue on NVMe-based SSDs.

Setup:

AWS I3.xLarge machine

  • 4 CPU Cores
  • This instance has two instance store volumes: nvme0n1 and nvme1n1. We are going to benchmark only nvme0n1 for the sake of analysis.
  • dd to issue read requests, writing the data to /dev/null
  • blktrace and btt to understand the time spent in the request lifecycle

Experiment:

dd if=/dev/nvme0n1 bs=4k of=/dev/null count=10000000 iflag=direct
// This command would keep on making IO requests 

blktrace -d /dev/nvme0n1 -o nvmOutput -w 30
// blktrace profiles the request lifecycle
// It shows where each request spent its time
// It generates one result file per CPU core, because it is the CPU
// that spends the time (queueing, merging and plugging) on the
// request until it gets dispatched

btt -i nvmOutput.blktrace.2 
// btt shows the result from blktrace in a human-readable format
// We are analysing the results for all the requests handled on a
// single core, i.e. CPU 2

Results:

==================== All Devices ====================

ALL             MIN            AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q              0.000026726   0.000042731   1.488062947   702073
Q2G              0.000000122   0.000000218   0.000024321   702074
G2I              0.000000432   0.000000573   0.000023750   702074
I2D              0.000000270   0.000000384   0.000037557   702074
D2C              0.000019678   0.000025327   0.001429022   702073
Q2C              0.000020829   0.000026503   0.001430629   702073

To understand each of the individual terms, have a look at the btt guide [2]. All times reported by btt are in seconds.

Quoting from the btt guide [2]:

Q2Q which measures the time between queue traces in the system. 
This provides some idea as to how quickly IOs are being handed 
to the block IO layer.

Q2C which measures the times for the complete life cycle of 
IO's during the run

So we can ignore Q2Q and Q2C for the time being.

But apart from that, most of a request's time is spent in the driver plus the underlying NVMe SSD, which is denoted by D2C: on average about 25.3 µs of the 26.5 µs total Q2C, i.e. roughly 95%, while the block-layer stages Q2G, G2I and I2D together add up to only around a microsecond. That is a good sign for block multi-queue.

References

[1] Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
[2] BTT Guide
[3] NVM Express
[4] Elixir For Linux Code Documentation
[5] Performance Analysis of NVMe SSDs and their Implication on Real World Databases
