In this blog post, we will discuss the balancing act between the number of pages a team receives and the reliability of the system the team owns.
I have been working in the software industry for 9 years now, and one of the most common problems I have encountered in every team I have worked with is how to balance these two opposing forces:
- Decreasing the number of pages a team receives to improve developer productivity.
- Making sure the system is always operational and running reliably.
In all honesty, finding the balance between these two forces is quite challenging, and the solution varies across teams depending on the safety margin each team is comfortable with. This post walks through the thought process and the questions a team should ask itself to improve this balance and keep the pager volume manageable.
Before moving ahead with the suggestions and the conclusion, one needs to understand the broad categories of alerts a team handles. These broad categories are:
- KPI Alerts
- Application Alerts
- System Alerts
For any system running in production, KPIs are essential for figuring out exactly how the system is doing. Every application or microservice running in the production environment needs to have defined KPIs.
If a KPI is breached, it means the service is not doing well. KPI-based alerting is therefore almost always necessary.
In the above example of the defined system, we would have the following KPI-based alerts:
- API Latency: Response time of the API that gets the telephone number for a username. The 95th percentile of the latency across all API responses should be within 500 ms.
- API Response: Number of times a 5xx error code was returned as the API response. The number of errors in a 10-minute window should not exceed 100.
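The two KPI checks above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the nearest-rank percentile method, and the shape of the input (one list of latencies and one list of status codes per 10-minute window) are hypothetical and not tied to any particular monitoring tool.

```python
import math

P95_LATENCY_LIMIT_MS = 500   # latency KPI from the list above
MAX_5XX_PER_WINDOW = 100     # error KPI from the list above (10-minute window)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_kpis(latencies_ms, status_codes):
    """Return the names of the KPIs breached in one 10-minute window."""
    breached = []
    if latencies_ms and percentile(latencies_ms, 95) > P95_LATENCY_LIMIT_MS:
        breached.append("api_latency_p95")
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    if server_errors > MAX_5XX_PER_WINDOW:
        breached.append("api_5xx_count")
    return breached
```

A window in which 6% of requests take 900 ms would breach the latency KPI here, while a window with exactly 100 server errors would not breach the error KPI, because the limit is "should not exceed 100".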
Any system running in production has its own set of application metrics that tell us how the application is doing internally. Alerts such as ThreadPool rejections or DynamoDB throughput exceptions are specific to the application.
These alerts tell developers that something is wrong inside the application, but they do not necessarily mean that the KPIs of the system are impacted.
Developers generally should solve application-level issues such as ThreadPool rejections and DynamoDB throughput-exceeded exceptions in the design phase. If the application is built with the assumption that thread pool rejections might happen, then the application or service needs to be designed so that it handles those exceptions gracefully when they occur.
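As a rough illustration of handling rejections gracefully, the sketch below wraps Python's `concurrent.futures.ThreadPoolExecutor` so that saturation produces an explicit rejection, which the request handler turns into a retryable 503 instead of an unhandled failure. `BoundedExecutor`, `try_submit`, `handle_request`, and the 503 response are hypothetical names and choices made for this example, not part of any specific framework.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Thread pool that rejects new work once too many tasks are in flight."""

    def __init__(self, max_workers, max_pending):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # One slot per running or queued task.
        self._slots = threading.BoundedSemaphore(max_workers + max_pending)

    def try_submit(self, fn, *args):
        """Return a Future, or None if the pool is saturated."""
        if not self._slots.acquire(blocking=False):
            return None

        def run():
            try:
                return fn(*args)
            finally:
                # Free the slot before the Future is marked as done.
                self._slots.release()

        try:
            return self._pool.submit(run)
        except RuntimeError:
            # The pool was already shut down; give the slot back.
            self._slots.release()
            raise

    def shutdown(self):
        self._pool.shutdown(wait=True)


def handle_request(executor, lookup, username):
    """Serve a lookup, degrading to a 503 when the pool rejects the request."""
    future = executor.try_submit(lookup, username)
    if future is None:
        return 503, "busy, retry later"
    return 200, future.result()
```

With this kind of design, a rejection is an expected and bounded outcome. Whether it deserves a page then depends on whether the rejections push the 5xx KPI over its limit, not on the rejection itself.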
All in all, application-based alerts are useful, but they tend to point to a problem in the design of the application rather than to a breach of a KPI.
In the above example of the defined system, the team could have the following application alerts:
- API Request Rejections: The number of API requests rejected by the application in any 10-minute window should not exceed 100.
- Postgres Connections Used: The number of Postgres connections in use by the application should always be less than the maximum limit of 200 connections.
- Queue Length of the ThreadpoolExecutor: The number of pending API requests waiting in the thread pool for a thread to serve them should not exceed 50 at any time.
- …. It can go on and on
These applications run on machines with finite CPU, memory, and disk, so running them will sooner or later lead to a scarcity of resources on those machines.
For those cases, some teams add alerts on the scarcity of machine resources, e.g.:
- High CPU on the machine for the past 10 minutes
- Low memory on the machine for the past 10 minutes
- Disk usage exceeds 95% for the past 30 minutes
These alerts tend to point either towards provisioning the application cluster better or, sometimes, towards a systemic bug that needs to be looked into ASAP; otherwise the performance of the overall system degrades to an unbearable extent.
In the above example of the defined system, the team could have the following system alerts:
- CPU Usage: CPU usage of the system serving the requests should not exceed 90% in a given 10-minute window.
- Memory Usage: Memory usage of the system should not exceed 90% of the overall memory in the system.
- CPU Usage of the Postgres Instance: CPU usage of the Postgres instance should not exceed 90% in a given 10-minute window.
- …. It can go on and on
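System alerts of this kind are usually phrased as "above a threshold for a whole window". A minimal sketch of that check follows, assuming samples arrive as `(unix_timestamp, value)` pairs and that "sustained" means every sample in the trailing window is above the threshold; the function name and the `min_samples` guard are assumptions made for the example.

```python
def sustained_breach(samples, threshold, window_seconds, now, min_samples=2):
    """True if every sample in the trailing window exceeds the threshold.

    samples: iterable of (unix_timestamp, value) pairs, in any order.
    min_samples: guard so that a single stray data point cannot fire the alert.
    """
    recent = [
        value for ts, value in samples
        if now - window_seconds <= ts <= now
    ]
    if len(recent) < min_samples:
        return False
    return all(value > threshold for value in recent)
```

For the CPU alert above this would be called with `threshold=90.0` and `window_seconds=600`. Requiring the whole window to be hot, rather than a single sample, is one way to keep short spikes from paging anyone.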
General Setup vs Ideal Setup
As the diagram above shows, teams usually receive more alert volume from system alerts and application alerts combined than from KPI alerts.
Why is this not the ideal state?
As the left diagram shows, most of the alerts reaching the team are either application alerts or system alerts. By their very definition, these alerts do not represent an issue with the application KPIs. Any issue that actually affects the application would be captured by KPI-based alerts rather than by application alerts or system alerts. Hence we should try to minimize the alerts generated by application issues or system issues and rely more and more on KPI-based alerts.
Top Reasons Teams Rely on Application Alerts and System Alerts
Mean Time to Resolve the Alert Reduces
Teams usually configure application alerts and system alerts because these make it easier to root-cause the exact issue affecting the system. Alerting only on slower API responses leads to longer resolution times than also alerting on, say, high CPU caused by GC.
Alerting on a system alert, high CPU in this case, reduces the time to reach the root cause of the increased API latency.
Of course, the issue with this approach is that, more often than not, these system alerts fire without an accompanying KPI alert, in which case there is no need to act on them.
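One way to keep the diagnostic value of system and application alerts without paying for them in pages is to route them by whether a KPI alert is currently firing. The sketch below is a hypothetical routing rule; the alert kinds and the action names (`page`, `attach_to_incident`, `log_only`) are invented for illustration.

```python
def route_alert(alert_kind, kpi_alert_active):
    """Decide what to do with an incoming alert.

    alert_kind: "kpi", "application", or "system".
    kpi_alert_active: True if a KPI alert is currently firing.
    """
    if alert_kind == "kpi":
        return "page"
    if alert_kind in ("application", "system"):
        # Only useful as root-cause context while a KPI is actually breached.
        return "attach_to_incident" if kpi_alert_active else "log_only"
    raise ValueError(f"unknown alert kind: {alert_kind!r}")
```

Under this rule a high-CPU alert on its own never wakes anyone up, but it is already attached to the incident when the latency KPI pages the team.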
Leading Indicator for the KPI Alert
By the time a KPI alert is raised for an underlying issue, it is usually late for the team to react. If the team has also configured alerts for application issues or system issues, there is a high chance it will be notified much earlier, which leads to quicker resolution of issues.
However, there are other ways to get quicker and better leading indicators of underlying issues, such as alerting on the rate of change of a KPI rather than only on the KPI itself.
With that in mind, here are the suggestions:
- Have a better alerting framework that relies mostly on alerting on KPIs and not on systemic issues and application issues.
- Have a better application architecture so that systemic issues and application issues do not impact application KPIs.
- Have better monitoring dashboards in place so that whenever a KPI-based alert reaches the team, they can look at a single dashboard to identify which application issues or systemic issues correlate with the KPI alert. See Sumo Logic.
- Alert on the rate of change of KPIs so that the team learns about the issue at hand quickly.
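The last suggestion, alerting on the rate of change of a KPI, can be sketched as follows. This assumes the KPI (for example, p95 latency in ms) is sampled periodically as `(timestamp_seconds, value)` pairs in time order; the simple first-to-last slope, the function names, and the returned labels are assumptions for illustration, and a real setup might smooth the series first.

```python
def rate_of_change_per_minute(samples):
    """Slope between the first and last (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) / ((t1 - t0) / 60)


def check_kpi_trend(samples, limit, max_rise_per_minute):
    """Classify a KPI series: breached, rising fast, or None if healthy."""
    latest = samples[-1][1]
    if latest > limit:
        return "kpi_breached"
    if len(samples) >= 2 and rate_of_change_per_minute(samples) > max_rise_per_minute:
        # Still within the limit, but climbing quickly: an early warning.
        return "kpi_rising_fast"
    return None
```

For example, a p95 latency that climbs from 200 ms to 380 ms over three minutes is still under the 500 ms limit, but with `max_rise_per_minute=30` it would be flagged as rising fast, which gives the team the leading indicator without relying on CPU or memory alerts.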
In conclusion, striking a balance between pager volume and system reliability is an ongoing challenge for teams in the software industry. By focusing on KPI-based alerts and implementing better monitoring and alerting practices, teams can work towards reducing pager fatigue while ensuring system reliability. This balance will ultimately lead to improved developer productivity and a more efficient system overall.