Most people reading this article have probably asked this question at least once: “What is the optimum number of threads for my service??”. Mind you, in this blog post we are not going to hand you the golden number of threads your service needs :P. However, we will give you a brief idea of how increasing or decreasing the number of threads can impact the latencies of a service.
So before going into further details, let me show you the results of an experiment we conducted to understand the latencies of a program with a varying number of threads.
We did some benchmarking in which we read 1000 files, each of 1000 bytes, with a varying number of threads. The conclusion: there is a sweet spot for the number of threads we need in our application to minimize latency / maximize throughput.
| Number of Threads | Latency |
| ----------------- | ------- |
| 1 | 1747 |
| 2 | 1012 |
| 4 | 759 |
| 8 | 828 |
| 16 | 983 |
| 32 | 1523 |
| 64 | 2067 |
| 128 | 3355 |
| 256 | 5780 |
| 512 | 12718 |
| 1024 | 28649 |
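For reference, a minimal sketch of what such a benchmark could look like is below. This is an illustration rather than the exact code we ran; the file names (file_0 … file_999), the fixed 1000-byte reads, and the even split of files across threads are all assumptions.

```c
/* Sketch of a file-read benchmark: NUM_FILES small files are split across
 * NUM_THREADS worker threads, and we measure the total wall-clock time. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_FILES 1000

typedef struct {
    int start;  /* first file index handled by this thread */
    int end;    /* one past the last file index */
} work_t;

static void *worker(void *arg)
{
    work_t *w = (work_t *)arg;
    char name[64], buf[1000];
    for (int i = w->start; i < w->end; i++) {
        snprintf(name, sizeof(name), "file_%d", i);  /* assumed file naming scheme */
        FILE *f = fopen(name, "rb");
        if (f) {
            fread(buf, 1, sizeof(buf), f);           /* read up to 1000 bytes */
            fclose(f);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int num_threads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t threads[num_threads];
    work_t work[num_threads];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    int per_thread = (NUM_FILES + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; t++) {
        work[t].start = t * per_thread;
        work[t].end   = (t + 1) * per_thread;
        if (work[t].end > NUM_FILES) work[t].end = NUM_FILES;
        pthread_create(&threads[t], NULL, worker, &work[t]);
    }
    for (int t = 0; t < num_threads; t++)
        pthread_join(threads[t], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_nsec - t0.tv_nsec) / 1000000;
    printf("threads=%d latency=%ld ms\n", num_threads, ms);
    return 0;
}
```

Compile with `gcc -O2 -pthread bench.c` and pass the thread count as the first argument.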

So we can clearly see that this workload has a sweet spot for the number of threads, at which throughput is maximized. From Wikipedia,
Throughput is the maximum rate of production or the maximum rate at which something can be processed.
To answer why there is a sweet spot for the number of threads, we need to understand two questions:
- How can throughput increase with an increase in the number of threads?
- How can throughput decrease with an increase in the number of threads?
Because both of these effects exist, we see a sweet spot for the number of threads in an application, rather than a monotonic threads-vs-latency curve.

Now let's try to explain both of these properties with respect to the number of threads.
Increase in throughput because of the increase in the number of threads
Around 99% of the time, the programs/services we write involve some I/O operations. This is because we rarely have a program running in complete isolation, barring cases like cracking an encrypted password or mining bitcoins. Whenever we write a production-ready service/program, more often than not it needs to read some file or exchange data packets with other services, whether running within the same infrastructure or a different one.
So if most programs/services involve some I/O operations, then it is only logical that the threads performing those I/O operations will get blocked on them and hence will be descheduled from the processor.

Now let's assume we have a single-core machine with a single thread running on it. During the time that single thread is blocked, the processor lies IDLE with no useful work to do. But if, during that time, we had many more threads, say N >>> 1, the processor could schedule other RUNNABLE threads and start executing their instructions. In this way, with more threads, we could have achieved better concurrency and better overlap of the CPU core with the other resources on the machine.
Also, with an increasing number of threads, the probability that at least one thread is in the RUNNABLE state whenever the current thread blocks goes up. Hence, with more threads the processor stays busier and is almost always executing some instruction relevant to the program/service.
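Here is a toy sketch of that overlap (an illustration only: the “I/O” is simulated with usleep and the “CPU work” with a busy loop). Even on a single core, running the two tasks in two threads lets the CPU work of one overlap with the I/O wait of the other, so the wall-clock time drops.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void do_task(void)
{
    usleep(100 * 1000);                  /* pretend to block on I/O for 100 ms */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 50UL * 1000 * 1000; i++)
        x += i;                          /* pretend to do some CPU work */
}

static void *task_thread(void *arg)
{
    (void)arg;
    do_task();
    return NULL;
}

static long elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000 + (b.tv_nsec - a.tv_nsec) / 1000000;
}

int main(void)
{
    struct timespec t0, t1;

    /* Sequential: the core sits idle during each simulated I/O wait. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do_task();
    do_task();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sequential:  %ld ms\n", elapsed_ms(t0, t1));

    /* Two threads: while one is blocked, the other can use the core. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_t a, b;
    pthread_create(&a, NULL, task_thread, NULL);
    pthread_create(&b, NULL, task_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("two threads: %ld ms\n", elapsed_ms(t0, t1));
    return 0;
}
```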
But had this been the only effect, we could have lived with an ever-increasing number of threads.
Decrease in throughput because of the increase in the number of threads
There are many issues with having a higher number of threads in a program. Let's go through them one by one:
- An increased number of threads results in poor memory utilization
- The overhead of context switches
Let's first talk about how an increased number of threads can result in poor memory utilization.
From the Linux man page of pthread_create:
Under the NPTL threading implementation, if the RLIMIT_STACK soft resource limit at the time the program started has any value other than "unlimited", then it determines the default stack size of new threads. Using pthread_attr_setstacksize(3), the stack size attribute can be explicitly set in the attr argument used to create a thread, in order to obtain a stack size other than the default. If the RLIMIT_STACK resource limit is set to "unlimited", a per-architecture value is used for the stack size.
So essentially, every thread that gets created in our system is allocated a stack of its own: by default its size is the RLIMIT_STACK soft limit (commonly 8 MB on Linux distributions), or a per-architecture value such as 2 MB on x86-64 when that limit is "unlimited". This memory is used by the thread to hold local variables, function arguments, and return addresses as it makes function calls. This means that if we keep creating more and more threads, a growing share of the memory in our system gets assigned to these thread stacks. Fewer free memory pages in the system means the disk (page) cache cannot be used efficiently by the application, so the application has to fall back to disk for file I/O more often, resulting in reduced application performance.
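If an application genuinely needs a large number of threads, the per-thread stack can be shrunk at creation time with pthread_attr_setstacksize(3). A minimal sketch follows; the 256 KB figure is just an illustrative value, and it must be at least PTHREAD_STACK_MIN and large enough for the thread's deepest call chain.

```c
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* ... thread work that is known to need only a shallow stack ... */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* 256 KB instead of the multi-MB default (illustrative value). */
    pthread_attr_setstacksize(&attr, 256 * 1024);

    pthread_t tid;
    if (pthread_create(&tid, &attr, worker, NULL) != 0)
        perror("pthread_create");
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}
```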
Now let's talk about the second issue: the overhead of context switches.
For the context of this blog, we will be talking about the overhead of a context switch between the threads of an application. A thread context switch involves various steps:
- The register values of the currently executing thread are pushed onto its kernel stack.
- The scheduler figures out the next thread to run from the runnable queue, which is also maintained by the kernel.
- Once the scheduler has picked the next thread, the kernel pops that thread's saved register values, i.e. the state of the registers at the time the thread was halted, including the program counter pointing at the instruction where it stopped.
- The CPU starts executing the instructions of this new thread.
All of these steps happen during a context switch, so there is some overhead. But the direct cost of these steps is on the order of nanoseconds to a few microseconds, so in general they do not constitute the major portion of the cost of context switching.
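One simple way to see how often the kernel is switching your process out is getrusage(2), which reports voluntary context switches (the thread blocked, e.g. on I/O) and involuntary ones (the thread was preempted). A minimal sketch:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    /* Counters accumulated for this process since it started. */
    printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
    printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    return 0;
}
```

Tools like `pidstat -w` report the same counters per second for a running process.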
The major performance degradation from context switching comes from the cache misses the newly scheduled thread suffers in its early phase, which increases the latency of the program.
How can increased context switches cause high CPU cache misses??
We all know how crucial CPU caching is to the overall execution of a program. The L1/L2 caches provide a high cache-hit ratio, which essentially means that most of the time we are able to serve load/store instructions from the cache itself. The reason for such a high hit ratio is the spatial and temporal locality of the memory locations being used by the application. But with an increased amount of context switching, the spatial and temporal locality of the memory addresses being accessed no longer holds, which results in a lower cache-hit ratio and hence higher latencies.
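As an aside (this is not from our benchmark), the effect of spatial locality alone is easy to demonstrate: summing a large matrix row by row walks memory sequentially and stays cache friendly, while summing it column by column strides across cache lines and misses far more often.

```c
#include <stdio.h>
#include <time.h>

#define N 4096

static int m[N][N];                      /* ~64 MB, far larger than the L1/L2 caches */

static long elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000 + (b.tv_nsec - a.tv_nsec) / 1000000;
}

int main(void)
{
    struct timespec t0, t1;
    long long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)          /* row-major: sequential, cache-friendly accesses */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("row-major:    %ld ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)          /* column-major: strided accesses, frequent misses */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("column-major: %ld ms\n", elapsed_ms(t0, t1));

    printf("checksum: %lld\n", sum);     /* use the sum so the loops are not optimized away */
    return 0;
}
```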

One of the interesting things I learned a few days back was how an increased number of context switches can cause high user CPU. This may sound surprising at first, but the reason is the same: the CPU cache-hit ratio decreases with increased context switching. Because of this, the CPU has to do more work, fetching data from RAM instead of from the CPU cache, and this results in increased CPU consumption.
[Graph: User CPU and context switches with respect to the number of threads]
As we can clearly see in the above graph, as the number of threads increases, the number of context switches increases, and the user CPU rises along with it.
References
- https://medium.com/@adamzerner/spatial-and-temporal-locality-for-dummies-b080f2799dd
- https://www.extremetech.com/extreme/
- https://en.wikipedia.org/wiki/Instruction_cycle
- http://www.linfo.org/context_switch.html
- https://en.wikibooks.org/wiki/Operating_System_Design/Processes/Context_Switch
- http://man7.org/linux/man-pages/man3/pthread_create.3.html