16 November 2023
The Observability Strategy Guide – turn your raw data into actionable insights in 3 steps
Every business knows how much revenue it needs to make to improve. That’s because it knows how much it made in the past. Wouldn’t you like to keep score in the same way for metrics such as developer productivity, system reliability, or UX? Our experts developed a generalizable observability strategy plan to do just that! Get your observability strategy blueprint right now.
A client turned to us with a request: to produce a productivity audit. They had grown fast, but they thought their developers could do better. This audit was part of a strategy to add more standardization to their development process.
We quickly learned that there was not much data to go around. The client’s opinion about their developers’ productivity was subjective. It’s not that they were wrong, but there was no data to prove it. They didn’t even know what they wanted to measure.
We needed to start from scratch:
- pick the metrics,
- set up measurements,
- give it at least half a year.
They thought they wanted a productivity audit, but what they needed first was an observability audit.
The Observability Strategy Guide – what you’ll learn
Recently, we collected a lot of our observability strategy know-how. We put it together to create a one-of-a-kind test.
The experts who worked on the test also authored this guide. Our CTO Marek Gajda and Head of DevOps Wojciech Wójcik join forces to give you:
- A practical explanation of the observability strategy concept.
- A detailed, 3-step process of implementing an observability strategy in your company.
- Further considerations that let you contextualize the process and adjust it to your company’s profile (and if you have more questions, you can always contact us).
Before Marek and Wojciech enter the stage, we invite you to take a look at a selection of report data that in our view really drives home the necessity of having an observability strategy.
Can you afford NOT to implement an observability strategy?
According to a 2022 report by Flexera, cloud waste accounted for about 30 percent of cloud budgets. The next year, it went up to 32 percent.
The State of Cloud Cost Report by Anodot confirms that almost 50 percent of IT leaders find it difficult to get cloud costs under control even as 60 percent of them plan to move ever more workloads to the cloud.
The last example tells us that a great number of businesses are not in full control of their IT architecture. They fail to detect redundancies that could be eliminated to pay less, or roadblocks that could be removed.
It’s not for a lack of trying
It’s not that companies aren’t aware that their distributed cloud-based systems are complex and difficult to fully grasp, with data flowing in all directions in ways that only an elite group of engineers and scientists can understand.
A Splunk report shows that 86 percent of surveyed IT leaders think that it’s important to have a flexible observability solution. Yet the majority of the surveyed companies are still in the early stages of implementing one: 33 percent of them are classified as observability beginners and 37 percent as emerging.
Such a big share of observability newcomers might mean that those that try give up rather quickly.
Our experience tells us that our clients’ dissatisfaction with early results comes from two major factors:
- a lack of patience (remember the productivity audit request?),
- a lack of an organized approach to observability (i.e. an actual observability strategy).
Benefits of an observability strategy
It’s too bad because once you invest time into a proper strategy, you’re bound to see some results.
- Downtime prevention – The same Splunk report mentioned above shows that observability pros are 4 times as likely to resolve unplanned downtime in minutes when compared to companies that don’t invest in it at all.
We saw the benefits of observability in our own projects too:
- Failure detection – Xpate used observability to ensure that their third-party integrations always function correctly and to find out immediately when that’s not the case.
- Improved productivity – in a project that involved implementing a data lake-type repository, we were able to measurably improve the efficiency of data scientists by taking advantage of Amazon Athena’s observability capabilities.
- Improved scalability – as part of the data migration project for Pet Media Group, we’re implementing observability processes to help sustain the high pace of growth the migration causes (400% revenue increase!).
Do you think that you have the patience to implement an actual observability strategy to reap all of the benefits that it could bring? Then, we invite you to listen to our experts.
Observability strategy defined by a CTO
Marek Gajda: Observability strategy is to observability what SEO strategy is to SEO.
Just like SEO strategy plans how to bring more traffic to your website, an observability strategy tells you how to get more out of your system’s data to improve development and business efficiency and avoid technical problems.
Since we’re talking about strategy, we target long-term results. We want to take time to make changes and see changes, then analyze them, rinse and repeat.
The lesson of the productivity audit story mentioned above illustrates exactly what an observability strategy should be. If you want to succeed in the data department:
- you need to know what you want to measure,
- you need to know how to measure it.
The first of these two needs concerns your business objectives, the other concerns how you will implement it technologically.
As you can see, observability is not a goal in and of itself. It is something that helps you determine and quantify your goals. In the case of the productivity audit, the goal would be to improve the efficiency of developers.
We’re about to delve deeper into crafting an observability strategy. It might be the right moment to check where you stand right now. Take the test – it only takes a couple of minutes!
The Sensible Observability Score test estimates your ability to measure and analyze your system’s data. In a few minutes, you’ll be able to get the first measurable review of your company’s data capabilities.
The three pillars of an observability strategy
The main problem with a lack of observability is the inability to tell why something works or doesn’t work as well as you want it to.
Productivity
Earlier in the article, we mentioned an example of a company that wanted to improve the productivity of its developers only to find out that it needed an observability audit first.
This example points to something that seems obvious at first: to assess productivity in a large organization, you need historical data that can serve as a benchmark. In my experience, this is surprisingly difficult for a lot of companies to understand. Most of the time, industry benchmarks will not suffice, as the productivity of developers and teams seems to vary a lot between companies. Productivity metrics also tend to be less consistent than metrics related to reliability and marketability.
Some of your choices for productivity metrics may include:
- Total Cost of Workforce (TCOW) – the sum of money spent by an organization on its workforce in a given period of time.
- Planned-to-Done ratio – it allows you to assess how much of the work assigned was completed by each teammate.
- Revenue per employee – measures how much revenue the company generates per employee.
- Mean ticket resolution time – the average time an employee takes to resolve a customer’s issue.
- Defect escape ratio – the percentage of defects that your testers miss and that are only found in production – the lower, the better.
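Several of the metrics above reduce to simple arithmetic once the raw numbers are collected. The Python sketch below is illustrative only: the `PeriodData` fields and the function names are hypothetical, and the defect escape ratio is assumed here to mean the share of all known defects that were found only in production.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class PeriodData:
    """Raw numbers for one reporting period (all field names are hypothetical)."""
    planned_tasks: int
    done_tasks: int
    revenue: float
    employees: int
    ticket_resolution_times: list[timedelta]
    defects_found_in_testing: int
    defects_found_in_production: int


def planned_to_done_ratio(d: PeriodData) -> float:
    # Share of the planned work that was actually completed.
    return d.done_tasks / d.planned_tasks


def revenue_per_employee(d: PeriodData) -> float:
    return d.revenue / d.employees


def mean_ticket_resolution_time(d: PeriodData) -> timedelta:
    total = sum(d.ticket_resolution_times, timedelta())
    return total / len(d.ticket_resolution_times)


def defect_escape_ratio(d: PeriodData) -> float:
    # Assumed definition: defects that reached production / all known defects.
    total = d.defects_found_in_testing + d.defects_found_in_production
    return d.defects_found_in_production / total
```

The point of keeping these as plain functions over one record per period is that the same calculation can be repeated month after month, which is what produces the historical benchmark discussed above.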
Of course, there are a lot more metrics like this. But stick only to those that you really need to assess the efficiency of your organization. I’ll talk more about the problem of information or metric overload later.
Another issue to consider is how your employees will react to metrics-based productivity audits. If you want to find out how you can introduce such measures without affecting their morale, make sure to read up on making a case for observability in the Further Considerations section.
Reliability
If you don’t implement a way to measure the reliability of your system and its individual parts, such as microservices or a third-party integration, you risk that your users will realize something doesn’t work before you do. Worse yet, you may expose their sensitive data and face grave consequences, not limited to damaging your brand.
If you believe that a customer angrily calling you to say that they can’t purchase your service qualifies as a risk to your business, you know exactly what I mean. Information about errors should come from the early warning mechanisms of tools such as Prometheus rather than from your customers!
An obvious application of reliability is preventing outages and downtimes. Reservix, one of our clients, runs a major ticket platform in Germany. When an event happens, traffic may go up substantially. Your system should estimate how many instances it needs beforehand.
This is called autoscaling. But it doesn’t come without risks either. Your platform may face a DDoS (Distributed Denial of Service) attack, in which your server is flooded with malicious requests coming from many different sources. The goal of the requests is either to block your server or to rack up costs when autoscaling kicks in. If you don’t have defense mechanisms in place, your costs may go up greatly!
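One simple way to picture the trade-off is an instance estimate with a hard upper bound. The sketch below is a simplified illustration, not how any particular cloud provider implements autoscaling; the function name, the parameters, and the numbers are all assumptions.

```python
import math


def desired_instances(
    expected_rps: float,
    rps_per_instance: float,
    min_instances: int = 1,
    max_instances: int = 10,
) -> int:
    """Estimate how many instances are needed for the expected traffic.

    The result is clamped between min_instances and max_instances, so a
    flood of requests cannot scale the platform (and the bill) without limit.
    """
    needed = math.ceil(expected_rps / rps_per_instance)
    return max(min_instances, min(needed, max_instances))
```

With 100 requests per second per instance, an expected 450 requests per second yields 5 instances, while 50,000 requests per second still yields only the cap of 10. A cap like this limits cost but not the attack itself, so it complements other defenses rather than replacing them.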
A system’s reliability is often put to the test in portions of the software that aren’t often used, such as URLs viewed by only a handful of customers or information that needs an update once a year. I remember a situation where one of our new clients wanted us to get familiar with some information. They sent us a link to their app. I clicked it and… it didn’t work. They sent it themselves and still didn’t know it was faulty!
Another time, we created a tiny app designed to check from time to time whether certain pieces of information are up to date. At some point, this health check failed, and nobody noticed that the information was no longer being updated until someone stumbled upon it by chance. The problem was the lack of failure detection in the system.
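A common guard against this kind of silent failure is to monitor the checker itself: record when it last succeeded and raise a flag when that moment is too far in the past. The sketch below assumes such a timestamp is stored somewhere; the function name and the status strings are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_freshness(
    last_success: Optional[datetime],
    max_age: timedelta,
    now: datetime,
) -> str:
    """Classify a periodic job by the time of its last successful run."""
    if last_success is None:
        # The job has never reported success at all.
        return "never_ran"
    if now - last_success > max_age:
        # The job stopped reporting: treat the silence itself as a failure.
        return "stale"
    return "ok"
```

Any result other than `"ok"` would then trigger an alert, so a check that stops running is detected just like a check that reports an error.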
Another client offered two ways to create an account in their system: via email or via Facebook. Almost everyone used the former, so when the latter failed, it took them weeks to find out!
It’s not that nobody used it for such a long period of time. It’s just that since there were two methods, nobody sent a complaint when the Facebook registration malfunctioned. They just used the other method. But I’m sure that they remembered the situation, and that it definitely didn’t reflect well on their perception of the brand.
Marketability
I won’t go into detail about this one, because this aspect seems to be the most familiar one for most businesses, although it is not always viewed through the lens of observability.
Marketability is about measuring / tracking user behavior, testing, or micro optimizations. Ecommerce businesses excel in that because for obvious reasons, it is the easiest for them to link test results to an actual increase in revenue.
You can take marketability a step further when you know how to use your system’s data.
I know of a company that feeds their automated ad campaigns in social media with system data to improve their targeting continuously and get more impressions and engagements for their money. It also takes patience because you need to gather data over a longer period of time.
The deployment process and its impact on marketability is yet another interesting observability problem. It’s not always a big deal. For some businesses, it doesn’t really matter if deployment takes 5 minutes or half an hour. But there are exceptions.
One such exception happens when payment gateways are used. You can minimize problems by completing deployments at times when the traffic is at its lowest (e.g. in the middle of the night). But what if you need to fix a bug as soon as possible? Can you wait 30 minutes for your deployment to complete?
Platforms such as Uber or Booking.com are highly aware of this. Fintech companies should pay a lot of attention to it as well.
Now, our Head of DevOps Wojciech Wójcik will introduce you to a basic observability strategy plan.
The observability strategy plan
1. Aligning observability KPIs with business objectives
Wojciech Wójcik: The process starts with figuring out what exactly you want to measure. The metrics you observe should help reach your business objectives.
1.1 Specialists needed
- Business management (CEO and CCO in particular).
- Any technical employees familiar with and responsible for the company’s tech stack and software development life cycle (SDLC).
1.2 Walkthrough
You need to determine business outcomes that are associated with observability KPIs. These may include Mean Time to Repair (MTTR), false positive ratio, peak load, response time, or latency. Start with just a few such associations. They are the key to making your data meaningful for the growth of your business.
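As an illustration of how two of these KPIs can be derived from raw records, here is a minimal sketch. It assumes each incident is stored as a pair of detection and resolution timestamps, and it treats the false positive ratio as the share of fired alerts that turned out to be false alarms; both the data shapes and the function names are assumptions.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time between detecting an incident and resolving it."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def false_positive_ratio(total_alerts: int, false_alerts: int) -> float:
    """Share of fired alerts that did not correspond to a real problem."""
    return false_alerts / total_alerts
```

For two incidents that took 30 and 90 minutes to resolve, `mean_time_to_repair` returns one hour. Tracking that value per month is what lets you tie it to a business outcome such as lost orders during downtime.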
There are things you can do to find the measurements that directly affect your business goals. For example, declining website performance can affect user satisfaction, which in turn can increase bounce rate and decrease conversion.
Comparing your trace data (a recording of a request’s journey through your system) against traffic and performance data can reveal which parts of the system you should focus on to improve performance, reliability, and marketability.
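To make the idea concrete, the sketch below ranks services by the total time requests spend in them. It assumes trace spans have already been exported as plain dictionaries with hypothetical `service` and `duration_ms` keys; real tracing tools use richer formats.

```python
from collections import defaultdict
from statistics import mean


def slowest_services(spans: list[dict], top: int = 3) -> list[tuple[str, float, int]]:
    """Rank services by total time spent, as (service, mean_ms, call_count)."""
    durations_by_service = defaultdict(list)
    for span in spans:
        durations_by_service[span["service"]].append(span["duration_ms"])
    # Total time combines how slow a service is with how often it is called.
    ranked = sorted(
        durations_by_service.items(),
        key=lambda item: sum(item[1]),
        reverse=True,
    )
    return [(name, mean(values), len(values)) for name, values in ranked[:top]]
```

Sorting by total time rather than by mean latency is a design choice: a moderately slow service that handles most of the traffic often matters more to users than a very slow one that is rarely called.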
Remember, whatever you prepare in the initial phase with your IT teams and other stakeholders is not the final version. Your observability strategy will continue to evolve over time.
1.3 How much time it takes
This first step shouldn’t take too long if you understand it’s subject to continuous improvement. Depending on the size and profile of your business, it can take a few days or up to a few weeks.
1.4 Definition of done
Clearly defined base technical metrics associated with measurable business goals you pursue.
2. Collecting and storing your data
When you’re just starting out with your observability strategy, you might not realize just how important it is to think through this one. Your collection of data will grow in size and versatility and if you don’t manage it properly, you might soon not even know exactly what you have.
2.1 Specialists needed
- Developers in charge of implementing measurements technically.
- DevOps engineers familiar with observability tools.
2.2 Walkthrough
Now that you know what to measure and what you want to achieve with it, the next step is to determine how you will collect and store data.
This is the time when you want to define what tools you’d like to use. The choice will depend on a number of factors including your budget and the size of your underlying infrastructure. Some of the popular choices for data collection and storage include:
- third-party tools such as Datadog,
- cloud-hosted setups using Grafana, Prometheus, Loki, or New Relic,
- self-hosted setups that also use Grafana and Prometheus as well as Splunk or OpenSearch.
And that’s just the data collection! Beyond that, you will also have to take care of delivering your data to a monitoring solution. OpenTelemetry will help you with telemetry data. Within your own environment, you’ll be able to make use of in-built extensions of monitoring tools such as Prometheus.
2.3 How much time it takes
Several days up to a couple of months.
2.4 Definition of done
A fully implemented toolset for data collection, storage, and visualization coupled with documentation, diagrams, and flows designed to maintain a stable observability solution.
3. Defining actions for when values leave the agreed range (when a threshold is crossed)
The log data, key metrics, and traces you collect have no purpose unless you take an action whenever their value crosses your defined thresholds. The threshold refers to a minimum or maximum acceptable value before an action needs to be taken.
The job here is to define what those actions should be.
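A threshold in this sense can be modeled as an optional minimum and an optional maximum per metric. The following sketch is one possible representation; the `Threshold` class, the metric names, and the limits are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one metric; either bound may be left open."""
    metric: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def is_crossed(self, value: float) -> bool:
        if self.min_value is not None and value < self.min_value:
            return True
        if self.max_value is not None and value > self.max_value:
            return True
        return False


def crossed(thresholds: list[Threshold], readings: dict[str, float]) -> list[str]:
    """Return the names of the metrics whose current reading is out of range."""
    return [
        t.metric
        for t in thresholds
        if t.metric in readings and t.is_crossed(readings[t.metric])
    ]
```

For example, with a maximum of 800 for `p95_latency_ms` and a minimum of 0.97 for `checkout_success_rate`, readings of 950 and 0.99 would report only the latency metric as crossed.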
3.1 Specialists needed
- A diverse team of DevOps engineers, developers and business-minded Product Owners.
- The technical experts are responsible for defining actions and alerts as well as for their practical implementation.
- The non-technical stakeholders are to assist in defining procedures for actions that affect business outcomes.
3.2 Walkthrough
System alerts inform you of any unusual and quantifiable events.
Remember that you should only set up alerts for events that have a direct or indirect impact on your business outcomes. It’s easy to set up too many alerts and face alert fatigue – a situation when an overabundance of semi-relevant alerts makes finding the important information harder. Make sure to consult with both your developers and business to ensure an optimal choice of alerts that contribute to your productivity, reliability, and marketability.
The next step is automation.
Whenever possible, your system should respond automatically and immediately to certain events, protecting you from threats such as third-party integration failure or downtime.
Automated responses can handle issues with no further help from the team. You should handle non-automated responses according to a standardized process defined in a playbook.
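The split between automated and non-automated responses can be sketched as a simple lookup. Everything in the example below is hypothetical: the alert names, the `restart_worker` action, and the playbook path merely illustrate the routing idea.

```python
from typing import Callable


def restart_worker() -> str:
    # Placeholder for a real remediation step (hypothetical).
    return "worker restarted"


# Alerts the system can resolve on its own.
AUTOMATED_ACTIONS: dict[str, Callable[[], str]] = {
    "worker_down": restart_worker,
}

# Alerts that need a person, mapped to the relevant playbook entry.
PLAYBOOK: dict[str, str] = {
    "payment_gateway_errors": "playbook/payment-gateway.md",
}


def handle_alert(alert_name: str) -> str:
    """Route an alert to an automated action or to a documented manual procedure."""
    if alert_name in AUTOMATED_ACTIONS:
        return "automated: " + AUTOMATED_ACTIONS[alert_name]()
    if alert_name in PLAYBOOK:
        return "manual: follow " + PLAYBOOK[alert_name]
    # No procedure exists yet, which is itself worth surfacing.
    return "manual: no procedure defined, escalate to on-call"
```

The last branch matters as much as the first two: an alert without either an automated action or a playbook entry is a gap in the strategy that should be closed.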
That’s the standard approach. Beyond that, you should pay attention to modern AI tools that can take your automation of repetitive tasks, including quantitative decision-making, to the next level. Many of the tools mentioned earlier already offer AI capabilities.
3.3 How much time it takes
Based on our experience, the average implementation takes about three months, but it may vary a lot project to project.
3.4 Definition of done
Alerts set up in a way that maximizes business outcomes. Clear procedures and a division of responsibilities regarding reaction to alerts are established.
First revolution, then evolution
That’s the basic observability strategy process. But the work doesn’t end there. What comes afterwards is… an evolution. That’s right. You need to continue to analyze and refine your data collection, alerting and visualization efforts and the way they relate to business outcomes.
Many companies that struggle with their observability platform at the beginning simply give up. If they remained patient and continued to refine it, they would soon join the ranks of companies that are four times as likely as those that don’t invest in observability to resolve unplanned downtime in minutes!
As you continue to customize your observability strategies, you will come across new challenges unique to your business. Marek will tell you more about them.
Are you ready to find out where you stand with your current observability efforts?
Try our Sensible Observability Score test, which is centered around the three pillars of observability: productivity, reliability, and marketability. It only takes a few minutes to complete!
Further considerations for an observability strategy
Making a case for introducing observability
Marek Gajda: If you want to introduce observability in an organization that never paid much attention to their data, you’re going to have to gain some internal buy-in.
First, you need to understand that some changes might be uncomfortable to your potential allies. You need to sell your observability initiative as a way to help them get even better.
Should you start making your case from the top or bottom of the organization? It doesn’t matter. Try to get a feel of where there are more people who could find observability appealing and start there.
When you do manage to gather some allies, it’s time to start crafting the draft of your observability strategy.
Establishing an observability culture
Observability has to be a continuous process so that new employees embrace it and pass it forward too. Then, it becomes a part of the culture.
I’d say that the first thing to do when you want to establish an observability culture is to start talking about it across your entire organization.
Talking will inevitably lead to a number of small everyday actions such as evaluating quantifiable goals, developing metrics, checking out reports, and drawing conclusions from them – that’s the foundation for your culture of observability.
But talking and small actions won’t be enough. What more can you do?
From my experience, meeting regularly to discuss and report on metrics really makes a difference.
One of our clients regularly invited us to such meetings. They gathered all kinds of data with a focus on development process metrics. Because of these meetings, all developers involved in the project took great care to log their work hours. They also took a closer look at their own performance metrics, because they knew that in a month, they were going to talk about them during that meeting.
The same goes for marketability. Another client held a KPI session every quarter, another every six months. Every stakeholder prepared for such sessions by gathering insights, commenting on work progress, and analyzing failures and the viability of future KPIs.
When it comes to observability culture, consistency is key. Create a routine that introduces a metric-based approach to development.
New project vs an existing one
Here’s the thing – if you go to a non-technical board as a CTO and start selling them an observability strategy out of the blue, they will probably ask you: “… So we don’t have anything like that?”.
You’re basically telling them you don’t have full control over the system! You’re admitting to a mistake: you’re trying to regain control of things. That’s a tough spot to be in.
It’s easier when you start with a clean slate. It might be a whole new company or system, or a company hiring a CTO for the first time. Then, the CTO can come up with an initiative to run an observability audit and review the metrics.
So what should you do if you don’t have the luxury of having a clean slate? You need some kind of a trigger.
A trigger could be a major downtime incident or a failure to detect problems in an app that an observability strategy can prevent the next time.
You might still be held responsible for the incident, but it gives you a chance to come up with a detailed plan to prevent it in the future. Well, if you don’t have any observability measures set up, an incident like this is bound to happen sooner or later.
Once you get to the point when you are in a position to talk about observability, you need to convince your board of one thing – that going through with the observability strategy plan is better than doing nothing.
Data quality
If you really want to get ahead of your competitors in terms of data quality in today’s world, you should invest in AI-based solutions.
A lot of data quality issues have to do with formatting. A poorly managed repository full of manually inputted data is hard to analyze. But if you go for a data lake backed by machine learning, your system will be able to resolve a lot of these quality issues on its own – by skipping the unreadable parts, sorting out others, guessing when necessary etc.
The more data volume you have, the more difference machine learning makes for data quality.
Data volume
Speaking of data volume, all I can say is that it’s pretty much always better to collect data rather than not to do it. After all, today just storing data is very cheap. You will definitely not go bankrupt by purchasing some Amazon S3 storage for that.
You never know when you will need to turn some of the stored data into observability data. Just think about the company mentioned earlier – they surely would have loved to have that performance data when they finally decided they wanted to take a closer look at their team’s efficiency. It would have saved them months of work. Don’t miss out on a successful observability strategy because you didn’t want to spend a little more money.
Data visualization
Some observability beginners believe that once you know what you want to measure and how, data visualization is easy. That couldn’t be further from the truth.
If you want to visualize your metrics in a way that makes sense even to people who aren’t intimately involved in the project, you need a strong background in data analytics.
You need that kind of dashboard for your business intelligence teams.
And then you’ve got an entirely different kind of visualization – custom-made dashboards for the select few, typically C-level executives. These convey the essence of the company and system’s situation in a way that makes most sense for a particular person.
They don’t need to be useful for everyone, but they need to be tremendously and uniquely useful for the intended audience.
What’s next? Act now!
Do you feel like you’re ready to establish or upgrade your observability strategy? The task ahead of you is a big one, but you can get there one step at a time:
- Find a reason to begin a conversation about observability systems. Try showing how they can solve real problems. Make sure not to intimidate or scare off anyone!
- Determine worthy and quantifiable objectives and design viable ways to measure them. That’s the beginning of your observability system!
- Engage both development and business in gathering your insights.
- Propose regular meetings to analyze your system performance and make a case for an observability strategy.
And if you haven’t yet, try our Sensible Observability Score test. You’ll quickly know where you stand in any of the big three areas: productivity, reliability, and marketability.