16 January 2020
How to optimize and upgrade a high-traffic application? A case study
Do you wish to make your high-traffic web or mobile app faster and more resistant to sudden spikes in usage? There are many things you can do to achieve that. Most importantly, you have to view this as part of a bigger process of analyzing the state of your app (why it turned out like this in the first place) and making sure that these problems won’t happen again. Let’s see how it works based on a real-life example of a product that consists of both mobile and web components.
If your app has severe performance issues and you start wondering if you’ve chosen the right technology stack – you’re not approaching it the right way. It’s not really important whether it’s a Node.js app, a PHP app or a Java/Swift-based mobile app. It comes down to the quality of the code and of the processes that guide its ongoing development.
To tackle performance issues holistically and make sure they don’t happen again, you need to begin by analyzing the codebase, the development process and the business context. Then, you improve the process.
The Software House developers have recently been working on a suite of high-traffic apps with the goal of optimizing their performance. SPOILER alert – it went well. 🙂 But what exactly did we do leading up to this outcome? Let’s find out.
The “Before” – what we got to work with
At its core, the product that we’re talking about makes it possible for end users to order and pay for food from a restaurant and pick it up on the go. It consists of a web app for restaurants and a mobile app for individual consumers. The web app has various admin panels for different classes of users – workers, restaurant managers and superadmins that run the whole system and add new places.
The app was once part of a bigger system, before it was separated. This, among other things, resulted in a number of performance challenges we’re going to bring up later.
The Software House’s developers participated in developing new views for the admin panels and in rewriting the web app in React as a single-page application (SPA) backed by a REST API. To handle the complex business logic, our developers decided to base it on the CQRS (Command Query Responsibility Segregation) pattern, which makes the architecture easier to maintain in the future.
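To make the CQRS idea concrete, here is a minimal sketch in Python. It is not the project's actual code: the names `PlaceOrder`, `GetOrdersForRestaurant` and `OrderStore` are hypothetical, and an in-memory dictionary stands in for the real persistence layer. The point is only that commands change state and return nothing, while queries read state and change nothing.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlaceOrder:
    """Command: expresses an intent to change state and returns no data."""
    order_id: str
    restaurant_id: str
    items: tuple[str, ...]


@dataclass(frozen=True)
class GetOrdersForRestaurant:
    """Query: reads state and never changes it."""
    restaurant_id: str


@dataclass
class OrderStore:
    """Hypothetical in-memory stand-in for the real database."""
    orders: dict[str, dict] = field(default_factory=dict)


class PlaceOrderHandler:
    """Write side: validates the command and mutates the store."""

    def __init__(self, store: OrderStore) -> None:
        self.store = store

    def handle(self, command: PlaceOrder) -> None:
        if not command.items:
            raise ValueError("an order needs at least one item")
        if command.order_id in self.store.orders:
            raise ValueError("order already exists")
        self.store.orders[command.order_id] = {
            "restaurant_id": command.restaurant_id,
            "items": list(command.items),
            "status": "placed",
        }


class GetOrdersForRestaurantHandler:
    """Read side: returns sorted order ids without touching the data."""

    def __init__(self, store: OrderStore) -> None:
        self.store = store

    def handle(self, query: GetOrdersForRestaurant) -> list[str]:
        return sorted(
            order_id
            for order_id, order in self.store.orders.items()
            if order["restaurant_id"] == query.restaurant_id
        )
```

Because the read and write paths are separate classes, each can be tested, optimized or cached on its own, which is where the maintainability benefit comes from.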
But the real problem was the performance.
High-traffic app optimization – goals and challenges
As we have said before, the app was once part of a bigger system and was later separated from it. As a result, more than half of its codebase was virtually redundant and not necessary for it to function. Many changes to the development teams and rapid growth resulted in the codebase getting a bit out of hand. To put it in the simplest words, nobody knew exactly how it worked. Therefore, our first challenge was to thoroughly analyze the codebase and understand its intricacies.
Actual optimization and refactoring
The poor quality of the original codebase made it initially difficult to make sense of it all. However, once we understood the code, we could refactor it – remove the needless parts, get rid of redundant requests, combine some other requests into a single one and more. Remember the advice from the beginning? Before improving the code, you must understand the business context first.
As is often the case with what’s essentially a shopping app, the traffic is not just high – it also has a tendency to change drastically over time. Luckily, it’s usually easy to predict the specific hours and days of the year when that happens. By doing stress testing, we can measure how much traffic an app can handle and design the best server scaling strategy to respond to predictable spikes (sharp increases in the number of users). It also makes it easy to see whether the optimization process bore fruit and to prepare charts that are easy to understand for business people.
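As a rough illustration of how a stress-test result can feed a scaling strategy, the sketch below turns a measured per-instance throughput and a forecast peak into an instance count. The function name, the headroom default and any numbers you pass in are assumptions made for the example, not figures from the project.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve `peak_rps` requests per second.

    `per_instance_rps` is what one instance sustained in a stress test;
    `headroom` is the fraction of that capacity deliberately left unused.
    """
    if peak_rps < 0 or per_instance_rps <= 0 or not 0 <= headroom < 1:
        raise ValueError("invalid capacity inputs")
    usable_rps = per_instance_rps * (1 - headroom)
    # Always keep at least one instance running, even with no traffic.
    return max(1, math.ceil(peak_rps / usable_rps))
```

For example, if one instance sustained 25 requests per second in testing and half of that is kept as headroom, `instances_needed(100, 25, 0.5)` returns 8. Running the same calculation for each predictable peak gives a schedule of when to scale up ahead of time.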
Future-proofing the app
Besides that, the remaining work consisted of bug fixing and adding some new features, which was challenging due to the fact that it often had to be done at the same time as getting the app back on track. To make sure these problems won’t come back, we aimed for three goals simultaneously:
- Architecture efficiency – thorough optimization of the whole codebase to improve its performance.
- Faster development process – code that is easy to read and free of redundancies, so that future development goes faster. One of the reasons we wrote so many tests was also to flatten the steep learning curve that new devs faced when getting into the app.
- Better communication – easier communication with the customer center to shorten bug reporting and fixing, and easy-to-understand metrics that simplify locating the sources of problems.
That way, we could best combine development and business goals in order to turn the whole system into an efficient machine that is ready to work and grow around the clock.
The nitty gritty of high-traffic app optimization
Our optimization work involved many considerations. Some of the most important aspects included:
- Reducing the number of database queries – the fewer, the better.
- Rewriting the largest and most costly queries so that they are faster and simpler, reducing the amount of data that needs to be loaded.
- Moving some of the data to Redis (more on that later).
- Adding an efficient caching mechanism for data that was previously loaded many times.
- Getting rid of unnecessary and unused code.
- Low-level refactoring (e.g. reducing the number of loops).
- Adding indexes to database columns – with an index in place, the database can find matching rows without scanning the whole table, which speeds up many frequently performed queries. It’s an easy way to speed up querying.
- Removing unnecessary column indexes.
There are a couple of issues here that we should take a closer look at.
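Two of the points from the list, reducing the number of queries and adding indexes, can be illustrated with a small self-contained sketch. It uses Python's built-in `sqlite3` module with a made-up `restaurants`/`orders` schema, so the table names, columns and data are assumptions and not the project's real database. The first pair of functions contrasts one query per restaurant with a single JOIN; the second part shows how the query plan changes once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        restaurant_id INTEGER NOT NULL REFERENCES restaurants(id),
        status TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO restaurants (id, name) VALUES (?, ?)",
                 [(1, "Bistro"), (2, "Diner"), (3, "Cafe")])
conn.executemany("INSERT INTO orders (restaurant_id, status) VALUES (?, ?)",
                 [(1, "placed"), (1, "ready"), (2, "placed")])


def order_counts_n_plus_one(conn):
    """One query for the list plus one per restaurant: N + 1 round trips."""
    counts = {}
    restaurants = conn.execute("SELECT id, name FROM restaurants").fetchall()
    for restaurant_id, name in restaurants:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE restaurant_id = ?",
            (restaurant_id,),
        ).fetchone()
        counts[name] = count
    return counts


def order_counts_single_query(conn):
    """Same result in a single round trip using a JOIN and GROUP BY."""
    rows = conn.execute("""
        SELECT r.name, COUNT(o.id)
        FROM restaurants AS r
        LEFT JOIN orders AS o ON o.restaurant_id = r.id
        GROUP BY r.id, r.name
    """)
    return dict(rows.fetchall())


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT id FROM orders WHERE restaurant_id = ?"
plan_before = query_plan(conn, lookup, (1,))   # full scan of orders
conn.execute("CREATE INDEX idx_orders_restaurant_id ON orders (restaurant_id)")
plan_after = query_plan(conn, lookup, (1,))    # search via the new index
```

Printing `plan_before` and `plan_after` shows the switch from scanning the table to searching the index. The exact wording of the plan depends on the SQLite version, and other databases expose the same information through their own `EXPLAIN` variants.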
The case for using a non-relational database
Redis is an open-source in-memory data store that can be used either as a database or as a caching mechanism. As part of our optimization efforts, we decided to get rid of some queries that were made over and over again and to serve that data from a Redis cache instead. Redis supports many kinds of data structures, which makes it very flexible for such operations.
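The usual shape of this kind of caching is the cache-aside pattern, sketched below. To keep the example runnable without a server, `InMemoryRedis` is a tiny stand-in that mimics only the `get` and `setex` methods of the redis-py client; the key format, the TTL and the `loader` callback are assumptions made for illustration. Note that a real redis-py client returns bytes unless it is created with `decode_responses=True`.

```python
import json
import time


class InMemoryRedis:
    """Stand-in for a Redis client: supports only get() and setex()."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        entry = self._data.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave as if they were never stored.
            del self._data[name]
            return None
        return value

    def setex(self, name, time_seconds, value):
        self._data[name] = (value, time.monotonic() + time_seconds)


def get_menu(cache, restaurant_id, loader, ttl_seconds=60):
    """Cache-aside read: try the cache first, fall back to `loader`.

    `loader` is a hypothetical function that runs the expensive
    database query and returns JSON-serializable data.
    """
    key = f"menu:{restaurant_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    menu = loader(restaurant_id)
    # Store the fresh result so later reads within the TTL skip the database.
    cache.setex(key, ttl_seconds, json.dumps(menu))
    return menu
```

The TTL bounds how stale a cached menu can get. Data that must be fresh immediately after a change would additionally need the cache key deleted on write.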
In addition to that, we also used Redis successfully to store various temporary data (e.g. single-use tokens), including session data.
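Single-use tokens map naturally onto two Redis features: keys that expire on their own and an atomic "read and delete" operation (`GETDEL`, available since Redis 6.2). The sketch below shows the idea under stated assumptions: `InMemoryRedis` is again a stand-in that mimics only `set(..., ex=...)` and `getdel` from redis-py, and the key prefix, TTL and function names are hypothetical.

```python
import secrets
import time


class InMemoryRedis:
    """Stand-in for a Redis client: supports only set(ex=...) and getdel()."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, name, value, ex=None):
        expires_at = None if ex is None else self._clock() + ex
        self._data[name] = (value, expires_at)
        return True

    def getdel(self, name):
        # Remove the key and return its value, or None if missing or expired.
        value, expires_at = self._data.pop(name, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            return None
        return value


def issue_token(store, user_id, ttl_seconds=300):
    """Create a random token that is valid once, for `ttl_seconds`."""
    token = secrets.token_urlsafe(32)
    store.set(f"token:{token}", user_id, ex=ttl_seconds)
    return token


def redeem_token(store, token):
    """Return the user id on first use; any later attempt gets None."""
    return store.getdel(f"token:{token}")
```

Because reading and deleting happen in one step, two concurrent requests cannot both redeem the same token, and tokens nobody uses disappear without a cleanup job.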
Test and analyze
To get the clearest possible view of the app’s original state, we employed various tools.
A perfect tool for getting a bird’s eye view of the app. In particular, you can learn what kind of scripts/libraries are used and how much, find specific areas in need of code optimization and get useful alerts for all kinds of traffic-related issues.
Stress testing made easy. With Artillery, we could measure exactly how much traffic our app can handle as well as the performance of the app before and after the optimization. As you optimize your app in an iterative way, Artillery provides a great way to quickly assess your progress or prove it to your stakeholders.
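To show what a stress test measures, here is a deliberately simplified load generator in Python. It is not Artillery and does not use its API or configuration format: `run_load` and the `handler` callback are hypothetical names, and the handler stands in for a real HTTP request. It fires a fixed number of calls with limited concurrency and reports the error count, the 95th-percentile latency and the throughput, which are the kinds of numbers worth comparing before and after an optimization.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(handler, total_requests, concurrency):
    """Call `handler(i)` `total_requests` times using `concurrency` threads."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("need at least one request and one worker")

    def timed_call(i):
        start = time.perf_counter()
        try:
            handler(i)
            ok = True
        except Exception:
            # A failing request is counted as an error, not a crash.
            ok = False
        return ok, time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - started

    latencies = sorted(duration for _, duration in results)
    # Nearest-rank 95th percentile of the observed latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": len(results),
        "errors": sum(1 for ok, _ in results if not ok),
        "p95_seconds": latencies[p95_index],
        "requests_per_second":
            len(results) / elapsed if elapsed > 0 else float("inf"),
    }
```

Running the same scenario against the old and the new build, with the same request count and concurrency, gives two reports that can be put side by side for stakeholders.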
The Software House’s own open-source framework for writing E2E testing scenarios. The app’s original test coverage was quite poor, and automated testing allowed us to quickly turn it all around.
To further improve testing and the overall stability of the app, we implemented functional tests with Behat. A big advantage of this solution is that it is very easy to get into and start writing new test cases.
- Other tools
The tools above are just some highlights of the many we employed during the process. The list includes (but is not limited to) Splunk, Kibana, Grafana, Elasticsearch, Helm, Vault and Jenkins.
The “After” – what we did and what we have learned
As we have said before, the optimization process was an outstanding success. Its true test came on one of the days with the highest number of orders, which took place during a week of nationwide charity activities. Back in 2018, the app crashed during the spikes in traffic. On the very same day in 2019, the traffic increased by a further 20 percent. Despite that, the refurbished app worked smoothly all day.
On the chart below, you can see the number of users in each 2-minute period during the day. During the highest spike, it reached around 48,000, with roughly 13.3 orders per second. During the highest 5-minute spike, the order rate was between 10 and 12 per second. The system responded well to these sudden spikes, appropriately increasing and decreasing the number of instances and jobs required to handle the volume.
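To put those rates into absolute numbers, the short calculation below converts them into orders per window. It is a back-of-the-envelope sketch that assumes the quoted rate held steady for the whole window, and the final ratio additionally assumes that each order came from a different user.

```python
def orders_in_window(orders_per_second: float, window_seconds: int) -> float:
    """Total orders placed if the rate stays constant for the whole window."""
    return orders_per_second * window_seconds


# Figures quoted in the article: 13.3 orders/s at the 2-minute peak,
# 10 to 12 orders/s during the busiest 5 minutes, about 48,000 users.
peak_two_minutes = orders_in_window(13.3, 120)   # roughly 1,596 orders
five_minute_low = orders_in_window(10, 300)      # 3,000 orders
five_minute_high = orders_in_window(12, 300)     # 3,600 orders

# Orders relative to users in the busiest 2-minute window: roughly 0.033.
share_ordering = peak_two_minutes / 48_000
```

In other words, under these assumptions roughly 3 percent of the users active in the busiest 2-minute window placed an order within it, and the busiest 5 minutes saw between 3,000 and 3,600 orders.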
How do you go from the initial state to such results? As we have confirmed during the project, it’s important to:
- Look at the processes holistically – not as a collection of bugs or problems but as a development process that went wrong and needs permanent reforms.
- The app may have some very simple shortcomings. Don’t assume that even the simplest things are definitely done right. There may be many easy optimization wins to score.
- Measure the improvements before and after by using specialized DevOps/QA analytics tools.
We hope that this article will make it easier for you to understand what exactly is needed to truly optimize a high-traffic app. If you still have doubts, or if you would like TSH developers to help optimize your software, do not hesitate to write to us.