12 March 2020
Metrics in the infrastructure optimization process: Finding application bottlenecks (1/4)
Infrastructure optimization is absolutely essential for high-traffic apps. For that, you need metrics. But what if the app you have to work with is a big mess full of useless/legacy code and you are not even provided with proper documentation or domain knowledge? You have your hands full, sure. But the situation is far from hopeless. Let’s tackle metrics-driven infrastructure optimization for an app just like that.
Have you ever taken over a project you knew very little about? Does the application struggle with lots of traffic? Do you want to reduce the application’s costs? At The Software House, developers get to work with all kinds of projects. Some of them are long-running applications with interesting stories.
There are many ways to tackle these issues and all of them have to do with metrics. That’s why today we want to show you a quick way to get a lot of valuable knowledge about your app’s performance, traffic and… whatever you would like to learn about it!
Why metrics? The software optimization process is all about knowledge
Recently, one of our clients challenged us to optimize an application without being able to ask anyone about the domain (features, related documentation, etc.), traffic statistics or anything else related to it. The only thing we knew was that the application was a long-running project with many users.
Of course, at first we didn’t think it was possible… but we kept thinking – are we really not able to do it? Finally, we decided to accept this challenge.
We spent quite a lot of time just thinking about how to best approach this optimization process. Getting started proved the most difficult as we didn’t really see any clear path forward.
First, we decided to check the source code to estimate how hard it would be to work with. Unfortunately for us, the application was a long-running project. Some of the code was written 8 years ago!
The project had a complicated history: many developers, leftover code from a different app that had been split off from the main one at some point, and a new product owner without basic domain knowledge about it.
The application also had a lot of legacy code, which behaved strangely by today’s standards. We couldn’t always easily pinpoint which code was responsible for that behavior.
We decided to ask people on the client’s side about the domain and the way the app works… without much success. To make things worse, we didn’t have a clue how to learn anything about the traffic and bottlenecks.
Eventually, we came up with an idea – if we lack knowledge about the application, why don’t we simply ask the application how it works? Sounds strange?
You may be thinking right now: “how does one ask an application about anything?” The answer is easy – logs. Logs are an essential part of development. And they never lie. Usually, it’s easy to find a place where something is logged, and the messages are easy to customize.
We decided to log everything, but two big problems quickly emerged. We received thousands of logs per hour – it wasn’t possible to read them all and reach meaningful conclusions. Adding logs in every place was also hard and time-consuming. We needed a better approach, so we decided to separate the logged information into two categories:
- Application flow – needs to be analyzed on a case-by-case basis; a topic for another article.
- Statistics – that’s where we decided to really use metrics.
But why do we really want to measure anything at all? That’s the most important question if you want to make your measurements count for business.
Define why you want to measure
In this case, we split our metrics into two types and we defined why we needed them:
Server-side
We needed to know how our infrastructure behaved and whether it needed more resources. We were told that the client had a problem with “write operations” in the database, but we didn’t know why. The environment was scalable, but since we didn’t know whether its settings were optimal for the application’s traffic, we couldn’t tell if it was efficient.
As a first step, we decided to check the application’s health. For example:
- Response statuses, to check if the server is able to handle the traffic. Based on that, we were able to send notifications at crucial moments.
- Request count and average response time.
- How many pods are running at a given moment.
- CPU, memory usage.
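Jumping a little ahead to the tooling we describe below, such health checks eventually boil down to a handful of queries. The examples here are only a sketch: app_http_requests_total and app_http_request_duration_seconds are illustrative names for a counter and a histogram exposed by the application itself.

```
# Rate of server errors over the last 5 minutes - the basis for alerting.
sum(rate(app_http_requests_total{status=~"5.."}[5m]))

# Overall request rate and average response time.
sum(rate(app_http_requests_total[5m]))
sum(rate(app_http_request_duration_seconds_sum[5m]))
  / sum(rate(app_http_request_duration_seconds_count[5m]))
```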
Onto the next step.
Application
These statistics let developers find bottlenecks in the code and the places most worth optimizing, with the overall quality in mind. Thanks to them, we identified the most frequently called and the slowest endpoints – the key targets of any optimization effort.
Finally, we decided to measure:
- Execution time and count of requests, sub-requests, commands and database queries.
- Database query types (insert/update/delete/select).
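To make the last point a bit more concrete: measuring a query means timing it and deriving its type from the SQL itself. The wrapper below is only a hypothetical sketch (in the real project this logic sat in the database layer), and the $record callback stands for whichever metrics backend you use – we pick ours in the next section.

```php
<?php

// Hypothetical helper: times a query and derives its type from the SQL.
function measureQuery(string $sql, callable $runQuery, callable $record)
{
    $start = microtime(true);
    try {
        return $runQuery();
    } finally {
        $duration = microtime(true) - $start;
        // First keyword of the statement: SELECT / INSERT / UPDATE / DELETE.
        $type = strtoupper(strtok(ltrim($sql), " \n\t"));
        // Hand the measurement off to the metrics backend of your choice.
        $record($type, $duration);
    }
}
```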
What about the tools?
Let’s choose the metrics optimization technology!
Our main challenge was to handle all the metrics and display them with a single tool.
We had several languages and a lot of logging destinations to handle. We also already had Grafana integrated into the system, so we needed a tool that would be easy to connect to it.
There are many metrics optimization tools out there. Our main goal was to implement one as quickly as possible and without much effort. Following an investigation and a discussion with our DevOps team, we decided to use a tool called Prometheus.
From the original documentation: “Prometheus is an open-source system monitoring and alerting toolkit”.
It’s meant to be an “all in one” solution. You know the saying: “when something is for everything then it’s for nothing”? Well, luckily it doesn’t apply to Prometheus in the slightest.
Below is a list of the most important Prometheus features for the task at hand:
- Counters, histograms, gauges, summaries – the collector types, i.e. the ways of gathering data. Different cases call for different collectors; we only needed two of them (see the sketch after this list):
- Counter – a collector for measuring quantity, e.g. the total number of requests.
- Histogram – a collector for measuring durations and distributions of values across predefined buckets.
- Data filtering. This is a small disadvantage – you need to learn PromQL, Prometheus’ query language, to work with the metrics. Luckily, it’s quite easy to pick up.
- Integration with Grafana. Some systems already use Grafana to display statistics, and Prometheus is easy to plug into it. Also, Grafana’s default dashboards already cover most of the infrastructure metrics we needed.
- Support for multiple languages. We had scripts collecting data in PHP and Go. Prometheus has client libraries for several languages and technologies.
- Alerts. The best way to know if and when the application starts responding with 500 errors.
- Labels. Thanks to labels, it was very simple to filter the data and get the information we were interested in.
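To show what the two collector types and the labels look like in code, here is a minimal sketch based on one of the community PHP clients (promphp/prometheus_client_php). The metric names are our own, and the in-memory storage used here is only good for a quick local test – more on storage in a moment.

```php
<?php

use Prometheus\CollectorRegistry;
use Prometheus\Storage\InMemory;

$registry = new CollectorRegistry(new InMemory());

// Counter: how many requests we handled, labelled by endpoint and status.
$requests = $registry->getOrRegisterCounter(
    'app', 'http_requests_total', 'Total HTTP requests', ['endpoint', 'status']
);
$requests->inc(['/orders', '200']);

// Histogram: how long those requests took, in buckets we define ourselves.
$duration = $registry->getOrRegisterHistogram(
    'app', 'http_request_duration_seconds', 'HTTP request duration in seconds',
    ['endpoint'], [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
);
$duration->observe(0.42, ['/orders']);
```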
It’s worth remembering that Prometheus client libraries keep the collected values in the application’s memory. If you use a language like PHP, where nothing persists in memory between requests, you need shared storage such as Redis.
Storing data only “in memory” is not a recommended solution anyway. You should have separate storage if you want to keep the data for a longer period. We will discuss it later.
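For the PHP client mentioned above, switching to Redis is a small change; the connection details below are illustrative.

```php
<?php

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

// Keep the collected samples in Redis so they survive between PHP requests.
Redis::setDefaultOptions(['host' => 'redis', 'port' => 6379]);
$registry = new CollectorRegistry(new Redis());
```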
Integrating with the application
Time for the seemingly hardest part – integrating a working application with metrics. We opened the documentation, started reading and… it turned out it’s not difficult at all!
Prometheus has many libraries that help with the integration process.
We integrated the application (PHP + Symfony), the infrastructure (Kubernetes) and the visualization layer (Grafana) without any significant problems.
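To give you a taste of it, the heart of the application-side integration is little more than an endpoint that renders the registry in the text format Prometheus scrapes. The controller below is a simplified sketch (service wiring and routing configuration omitted), not the exact code from the project:

```php
<?php

namespace App\Controller;

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Annotation\Route;

class MetricsController
{
    private $registry;

    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

    /**
     * Exposes all collected samples for Prometheus to scrape.
     *
     * @Route("/metrics", name="app_metrics")
     */
    public function __invoke(): Response
    {
        $renderer = new RenderTextFormat();
        $body = $renderer->render($this->registry->getMetricFamilySamples());

        return new Response($body, 200, ['Content-Type' => RenderTextFormat::MIME_TYPE]);
    }
}
```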
Especially for this article, we created a small tutorial and repository to show you exactly how we did it. If you’re interested in how it works or you want to integrate your own application, you should definitely read the follow-up articles in this series, which we will soon publish on this blog.
We also prepared Git repositories for you to run in a local environment.
Use the collected information to improve the application
The last step before starting the actual optimization process was to understand the data we had collected. A natural next step was to prepare an easy-to-understand report for the business, showing the application’s health, the optimization progress and the areas that need improvement.
We focused on three main goals:
- Reducing infrastructure costs.
- Finding the most important features to optimize based on their usage, and determining the best optimization directions.
- Establishing benchmarks and current results so that we could compare the before and after of our efforts.
To achieve the first goal, we checked the metrics for our pods (CPU and RAM usage). We used the default “Kubernetes / Compute Resources / Node (Pods)” dashboard. It provides all the necessary information about our PHP pods. We noticed that we had reserved more resources than the application needs.
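Under the hood, that dashboard is built on standard cAdvisor series. Queries roughly like the ones below (the namespace label value is ours) let us compare actual usage against what the pods had reserved.

```
# CPU usage and memory working set of the application pods.
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="app"}[5m]))
sum by (pod) (container_memory_working_set_bytes{namespace="app"})
```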
The next step was to observe the database. Thanks to the application metrics, we noticed that some endpoints generated a surprisingly large number of database queries.
We compared that with the traffic and prepared a list of endpoints to be optimized as soon as possible.
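In query terms, the suspicious endpoints were simply the ones issuing the most database queries per request – something along these lines, assuming counters like the ones sketched earlier (the metric names are illustrative):

```
# Database queries issued per request, broken down by endpoint.
sum by (endpoint) (rate(app_db_queries_total[5m]))
  / sum by (endpoint) (rate(app_http_requests_total[5m]))
```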
Believe us – this optimization saved us during a big event which, a year earlier, had turned into a big… disaster. For the business, the information about feature usage was the most important.
Thanks to the metrics, we created a list which helped plan business goals for the coming months. The business was able to prioritize improvements for the most used features instead of doing the same work for all of them, including the unpopular or abandoned ones.
Infrastructure metrics optimization – summary
Gaining knowledge about an application is one of the hardest parts of the whole development process. Sometimes it takes months before a developer understands how certain functionalities work. It’s important to improve this process and shorten the time it takes to build that knowledge. This time, we achieved our goal.
Metrics saved us. We found the places to optimize and did it in time for a big event (and the traffic related to it). The business reduced costs and defined new goals for the future.
If you’re interested in how we optimized the application once we got the necessary information from the metrics, we wrote an article on high-traffic app optimization.
After all the work, the most satisfying part was the (newly acquired) ability to observe metrics during big events and react to changes almost instantly. In moments like these, it’s hard to even imagine working without metrics!
Now that we know why and when we should use metrics, it’s high time we got to the point – how to integrate Prometheus with your application. You can read all about it in the second part of the series, or you can jump straight to the third part about data collection with Prometheus, or to the final installment about Grafana custom dashboards.