20 March 2024
“The goal is to have a self-optimizing system” – Picnic’s CTO Daniel Gebler explains observability-driven development
A report by Splunk says that 86% of companies believe observability is important, but most are perpetual beginners in that area. Daniel Gebler, the CTO of Picnic, believes only some companies truly prioritize observability. In this interview, he presents his vision for an observability-first system that prevents incidents, predicts patterns, and continuously improves.
The CTO vs Status Quo series studies how CTOs challenge the current state of affairs at their company to push it toward a new height … or to save it from doom.
Change your mindset to create a self-optimizing system
Everyone knows the definition of observability: the ability to learn about a system's internal state by analyzing its external outputs. And yet, so many leaders continue to think of observability strictly in terms of monitoring.
Reacting to threats and anomalies is essential. But if you focus on preventing them with observability, you can create a system so easy to read that most incidents never materialize.
And if you could power your development process with machine learning, the system’s self-healing potential would only grow from there.
That’s what Daniel Gebler, Picnic’s CTO, believes. He’ll tell you about:
- a strong shift left in observability and its implications for software development,
- reasons why most companies fail to realize the potential of observability,
- why a reliable system changes your organization in ways you couldn’t imagine,
- the latest trends and predictions for future observable systems and how they intersect with AI, ML, and security.
Say hello to Daniel, and let's get started!
About Daniel & Picnic
Bio
With a Ph.D. in Computer Science from VU Amsterdam and an M.B.A. from the Dresden University of Technology, Daniel's interests have revolved around the intersection of business and technology for a long time. At Fredhopper, he worked as a Software Architect, Product and Development Manager, and Director of R&D. In 2015, he co-founded Picnic. As its CTO, he oversees the development of a stable and scalable infrastructure that supports his company's mission and ambitious expansion plan.
Expertise
Entrepreneurship, startups, scale-ups, Scrum, Agile, software development, venture capital
Hobbies
Sport, travel, piano, reading, experiencing the unknown, challenging yourself
Picnic
Headquartered in Amsterdam, Picnic is an online supermarket. The company has its own distribution centers and a fleet of hundreds of electric trucks. Every day, they deliver products directly to the customers who use Picnic's app for grocery shopping. Picnic came to be in 2015, when online channels accounted for no more than 1.5% of food sales. Today, it's at the forefront of a revolution that aims to reinvent the food supply chain, delivering products directly from producers to consumers.
Picnic’s vision
Sławomir Turkiewicz: Hello Daniel. I like to bring attention to the successes of the companies our guests represent. I can't help but mention the 355 million euro funding Picnic got from the Bill and Melinda Gates Foundation, among other sources. Congrats! What piqued their interest in your company?
Thanks a lot. It’s really amazing to find somebody who believes in the long-term mission and ambition of Picnic.
The Gates Foundation was part of a group of long-term investors we talked to regarding the next round. This was the second investment they made. We got 600 million in 2021, and a couple of weeks ago, we closed another 355 million round.
I think they liked our technology-driven approach to improving the food system. They also realized that we are on a long journey.
With this kind of money, expectations also rise. Hence, one should only raise the amount that is really needed to realize the vision. Ours is to build the best technology-powered milkman on earth.
It seems that things are going great at Picnic.
We've recently launched our second automated fulfillment center. We're also doing a lot of development work on our consumer proposition. We are getting closer to our goal of being the leading online food retailer in Europe.
Recently, there was a big worldwide outage of Google Sheets and Scripts. Suddenly, all the proprietary sheets and scripts we had built around our core systems stopped working. It was a very good stress test of how resilient our system landscape is and how much we actually depend on that ecosystem.
It sounds like an interesting challenge from the observability perspective, which is today's topic. But before we get to that specifically, I wanted to ask you about Picnic's approach to data management.
It seems you’re pretty busy in that regard. For example, your colleagues Tob Steenbergen and Giorgia Tandoi recently took the stage at PyData Amsterdam to talk about Picnic’s machine-learning capabilities.
We have invested in data capabilities, including machine learning, from the start.
However, it wasn’t until recently that we crossed the turning point when we had enough data in a triple sense – volume, velocity, and variety – to train deep learning models in a meaningful way.
We set up our tech stack so that every product is both a software product and a data product.
We implement hard-coded business rules and requirements as software modules. Then, we implement everything that needs to be learned from data to self-optimize in Python-based ML solutions. Hence, there are both software and data components within every separate Picnic product.
The performance of a dual software-data product improves all the time. Every order the customer makes, every journey the customer has in the app, every item we pick up in a warehouse, every delivery that we make to a customer – all of it gives us a bit more data to train our models. Based on those improved models, the prediction and execution are a bit better every day. The goal is to have a self-optimizing system.
Observability migration
Let’s try to put it in the context of Picnic’s observability strategy.
At Picnic, you want to make sure that people who use your app daily can do their shopping without any problems. I assume that the ability to learn as early as possible about any potential malfunctions of your infrastructure must be crucial to a business like that?
There are two reasons why you should think of setting up an observability strategy. One is to achieve a stable product operation.
But what's even more critical, you can only learn from a product if it is stable for an extended amount of time. Only then do you get enough good-quality uptime. Hence, the second reason is to enable fast feedback loops.
When most companies think of observability, they think of monitoring, alerting, and reacting when something doesn't work. They fix the issue, deploy the fix, and call it a day. They repeat this until they achieve a running operation again.
The real power of observability is not in a quick resolution of a product incident but in the prevention of one. There are many predictors that hint that a problem will happen in the near future. Our observability strategy focuses on identifying incidents before they build up.
Consider a simple example. If you run out of memory, there must have been a moment beforehand when you were at 99%, 95%, or 90% memory occupation. You should be able to take action at the 90% occupation level rather than when you're already out of memory.
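As a toy illustration of that preventive threshold – a hard-coded sketch, not Picnic's actual alerting setup – such a rule could be as simple as:

```python
# A minimal sketch of preventive thresholding. The readings are
# hard-coded for illustration; in practice they would come from an
# observability agent.

WARN_LEVEL = 0.90   # act here, long before memory is exhausted
CRIT_LEVEL = 0.95

def check_memory(occupation: float) -> str:
    """Classify a memory occupation reading in the range 0.0-1.0."""
    if occupation >= CRIT_LEVEL:
        return "critical: scale out or restart now"
    if occupation >= WARN_LEVEL:
        return "warning: schedule preventive action"
    return "ok"

for reading in (0.62, 0.87, 0.91, 0.96):
    print(f"{reading:.0%} -> {check_memory(reading)}")
```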
You can apply the same type of logic of preventive incident management to more complex systems. If you can do that, you’ll find the holy grail – a truly scalable system.
Is this why you changed your observability solution recently? I read a more technical analysis of that process. Could you give us a more business-oriented overview of why you decided to go for Datadog?
There were a couple of reasons.
First, we wanted to have a broader observability solution, which became crucial for us when we started an on-premise operation in our automated fulfillment center. We needed an observability tool that worked for systems that had a physical component in the warehouse as well as a digital one in the cloud.
Another goal was to find a strong observability tool for our machine-learning stack.
Adding observability capabilities to an ML system is totally different from doing it for software products. In the case of the latter, you still look a lot at classical metrics: you check whether the system is healthy or whether it has enough memory. Data science, on the other hand, requires you to look at metrics that capture things such as model drift.
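To make "model drift" more concrete, here is a minimal sketch of one common drift metric, the Population Stability Index (PSI). The data is synthetic, and this is not a claim about how Picnic or any particular tool measures drift:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples.

    Values above ~0.2 are commonly read as significant drift.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) on empty buckets.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(7)
train_scores = rng.normal(0.5, 0.1, 10_000)   # score distribution at training time
live_scores = rng.normal(0.58, 0.12, 10_000)  # live traffic has shifted slightly
print(f"PSI = {psi(train_scores, live_scores):.3f}")
```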
Our customer base has also grown significantly, which increases the impact observability can have on our product – the more data, the better. We needed a tool that could help us realize the potential of that data.
So, it seems the migration wasn’t just about gaining new capabilities but a result of a significant change in your observability strategy?
Definitely. At first, we looked at observability in our organization as an almost purely technical area.
At some point, we decided to establish a stronger link between technical metrics and business metrics. We needed to understand how a degradation of a technical metric implies a degradation of a business metric, so that it can drive our decision-making. We found a tool that allows us to do just that.
Barriers
Implementation problems are of special interest to me because I discussed observability with people at The Software House. I found out that a lot of companies struggle to begin the conversation about observability. Some stakeholders aren’t sure if they can justify the cost and effort.
How did you help your colleagues understand the importance of making systems observable?
The vast majority of companies struggle with what you’re describing.
That’s because, to most people, observability is an afterthought. It’s something they apply when their system is already unstable. They want to understand why the solution is unstable and how to change it. They make a big effort to set up observability, complete with strategy, configuration, implementation, and processes. This approach is becoming obsolete now.
The new trend is to shift left from post-deployment observability to built-in observability and observability by design. In this new approach, observability is the first thing you think about. You ask yourself what you would observe to determine whether the system is healthy or not. Once that is defined, you add it to your implementation.
More generally, it is the logical extension of test-driven development – you first specify observations, then define tests, and only later develop the implementation.
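A rough sketch of what "observations first" could look like in code. All names and thresholds here are hypothetical, not taken from Picnic's stack:

```python
from dataclasses import dataclass
from time import perf_counter

@dataclass
class Observation:
    """A health signal declared before any implementation exists."""
    name: str
    unit: str
    healthy_below: float

# Step 1: specify what "healthy" means for the feature about to be built.
CHECKOUT_OBSERVATIONS = [
    Observation("checkout.latency", "seconds", healthy_below=0.5),
    Observation("checkout.error_rate", "ratio", healthy_below=0.01),
]

# Step 2: only then write the implementation, emitting exactly those signals.
def checkout(cart: list[str], emit) -> None:
    start = perf_counter()
    errors = 0
    # ... the actual checkout logic would live here ...
    emit("checkout.latency", perf_counter() - start)
    emit("checkout.error_rate", errors / max(len(cart), 1))

checkout(["milk", "bread"], emit=lambda name, value: print(name, value))
```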
The shift left is well-known in the world of QA and security. Now this cutting-edge trend is also being applied to observability to create truly scalable systems.
What about the implementation? What are the biggest obstacles? Is the technology still challenging to use and make sense of? Is it due to a lack of talent with the necessary skills to process the data?
The biggest obstacle here for most organizations is the inability to include a business team in defining observability.
Ideally, you want the business stakeholder to define a business SLA for observability. It includes a set of business metrics coupled with a driver tree, which connects business metrics to technical observability metrics and SLAs.
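Daniel doesn't spell out a concrete schema, but the driver tree idea can be sketched very simply. The metric names below are invented for illustration:

```python
# A toy driver tree: one business SLA metric at the root, technical
# observability metrics at the leaves.
driver_tree = {
    "orders_completed_per_hour": {          # business SLA metric
        "checkout_success_rate": {          # product-level driver
            "api_error_rate": {},           # technical metrics (leaves)
            "p99_request_latency": {},
        },
        "app_session_availability": {
            "cdn_availability": {},
            "backend_uptime": {},
        },
    },
}

def technical_drivers(tree: dict) -> list[str]:
    """Collect the leaf (technical) metrics under a business metric."""
    leaves = []
    for name, children in tree.items():
        leaves.extend(technical_drivers(children) if children else [name])
    return leaves

print(technical_drivers(driver_tree["orders_completed_per_hour"]))
# ['api_error_rate', 'p99_request_latency', 'cdn_availability', 'backend_uptime']
```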
The issue of not involving the business team has many angles. It has to do with how you organize your company. If your tech and business teams are entirely independent and struggle to communicate, the involvement will probably be limited.
Instead, you should think of business and tech as two sides of the same coin. This mindset pushes you to have them cooperate and align constantly.
Productivity benefits
With the barriers out of the way, let's convince more CTOs to give observability a try by talking about all the ways it can help a business. Recently, my colleagues at TSH created the Observability Guide and a survey to go with it.
The guide mentions the three pillars of observability or three areas a company can improve in. Let’s talk about them, starting with productivity.
Do you use observability-derived data to improve the productivity of your teams? Developers can potentially avoid some extra work.
If you run a DevOps process, you know that everything a team does on the operational side distracts them from development.
Ideally, you want as little manual operational work as possible. This allows you to develop more every single week, which in turn increases product value creation.
Observability should lead to less operational work and less incident management. That way, our observability strategy helps our team be productive and do more development.
Reliability is actually the starting point of observability, and productivity is a by-product of reliability. This is why we don't look directly at productivity. We've observed that when productivity rises, it's a result of improved reliability.
What about the automation of specific tasks? Have you already tried AI-powered observability tools?
Observability is a young space. It has been developed over the last ten years. All the available observability tools have some kind of AI capabilities. However, none of them have satisfactory AI support that you can use at scale. There are a few reasons for that.
Most AI tools need a large volume of data. Most organizations simply don’t have enough data points.
The so-called cleanliness of the available data needs to be improved, too. As a result, AI tools don't learn fast enough because of the quality of the data they receive.
To experiment a bit with AI, we exported data from our observability tools and put it in classical machine learning and deep learning tools.
Marketability benefits
Observability can also help a company learn more about users and their needs. Do you use logs, metrics, and traces for that purpose?
To some extent – but not very extensively. We have focused more on reliability and stability for now.
However, creating product value with observability is something we will probably look into more at some point.
If the product is available in a reliable way to customers, the feedback you get through analytics is more meaningful than what you’d get from software full of defects. A glitchy product will not pass the data to your system properly. Even if it did, the feedback would be affected by technical difficulties users endured. It wouldn’t be useful for making business decisions.
So, in the end, it is all about reliability. Reliability is the foundation of both productivity and marketability.
What about predicting increased or seasonal traffic to scale or strengthen the infrastructure?
A part of the predictive observability strategy is also the identification of seasonal patterns or expected deviations from typical user behavior. But you need to take some considerations into account.
You will only be able to identify exceptions or deviations if they have happened in the past. So, for seasonal patterns, you must have some past data concerning such deviations.
Another consideration is whether an anomaly is expected or not. Let me give you an example:
If I ran a marketing campaign, my email system would probably experience much higher traffic. Ideally, your observability solution should know that the traffic is safe and expected rather than a sign of an attack on the email system. It isn't easy. There is no system that does it reliably at this point. So what I'd like to see is a system that can identify whether an anomaly is expected or not.
The holy grail of observability would be a tool that could classify an anomaly as an expected or unexpected behavior even if it has never seen this behavior in the past.
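A heavily simplified sketch of that distinction: before raising an alert, the detector checks a calendar of expected events. All data here is invented:

```python
from datetime import date

# Known causes of legitimate traffic spikes.
expected_events = {date(2024, 3, 20): "email marketing campaign"}

def classify_traffic(day: date, traffic: float, baseline: float) -> str:
    """Label a traffic reading as normal, expected, or unexpected."""
    if traffic <= 3 * baseline:
        return "normal"
    if day in expected_events:
        return f"expected anomaly ({expected_events[day]})"
    return "unexpected anomaly: investigate a possible attack"

print(classify_traffic(date(2024, 3, 20), traffic=90_000, baseline=10_000))
print(classify_traffic(date(2024, 3, 21), traffic=95_000, baseline=10_000))
```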
I noticed that business-related aspects of observability fly under the radar in professional reports, which tend to focus on purely technical metrics. The same is true for observability tools. But you can also extract a lot of useful marketing insights from your system, can't you?
This is actually a very important point.
What you can see right now is that the space of business analytics – with tools such as Google Analytics – and the space of observability are kept separate. In fact, they are two sides of the same coin.
It’s easy to see why this is the case. Business analytics and observability have different stakeholders. For the analytics tool, it is the marketing or growth team. For observability, it is the platform or infra team. They are the ones to make purchasing decisions in their respective area.
Data software providers can target business and infrastructure individually and sell companies two tools instead of one. From a commercial perspective, it makes sense for them.
However, I expect that in a few years, somebody will build a combined solution for marketing analytics and technical observability. These two will then evolve together and combine their insights to achieve a synergy that will benefit everyone.
Reliability benefits
Since reliability is the most critical aspect of observability for Picnic, let’s discuss deployment.
Each deployment is a sensitive moment for a technological company. Teams should release often, but each deployment can leave something undesirable in the codebase. How are you monitoring that?
We use observability to analyze blue/green deployment.
Typically, observability is about defining what good or bad behavior is. When you compare nodes against each other, that's not necessary. You simply ensure that the app behaves exactly as it did before the deployment.
We start by rolling out the new version to a few nodes in a partial deployment. The system compares the behavior of the app on the updated nodes with the behavior of the remaining nodes.
From a system design perspective, this approach requires some preparation. You need a stateless application. Otherwise, you can’t compare observations objectively. But if you make all the preparations, you can use this method to ensure that the behavior of the app remains unchanged following a deployment.
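One plausible way to implement that comparison step – purely a sketch that assumes latency is the metric being compared, not a description of Picnic's deployment tooling – is a two-sample statistical test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Latency samples collected from nodes running the old and new versions.
old_nodes_latency = rng.lognormal(mean=-2.0, sigma=0.3, size=5_000)
new_nodes_latency = rng.lognormal(mean=-2.0, sigma=0.3, size=500)

# Two-sample Kolmogorov-Smirnov test: do the two versions behave the same?
result = ks_2samp(new_nodes_latency, old_nodes_latency)
if result.pvalue < 0.01:
    print("behavior changed -> roll back the partial deployment")
else:
    print("no detectable change -> continue the rollout")
```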
It sounds like something that can also help you prevent downtimes – a crucial challenge for a company that sells products directly through its apps. If the system was unavailable, it would have an immediate negative impact on your bottom line.
Yes. That’s another reason why reliability is our number one observability objective.
Going back to the subject of Artificial Intelligence, we have built our own AI solution for predictive incident management. It’s a more advanced version of the concept I mentioned earlier – it’s all about the ability to act before an incident happens based on indicators that typically precede the occurrence of such an incident.
The AI-powered system also provides incident response suggestions – it recommends the next best action based on similar events from the past. For example, it is capable of showing how we solved a given problem at different times and how much effort it took depending on the method used. Using that data, we can pick the best course of action more easily. So, it supports our ability to learn from and act on our experience.
It’s something we have custom-built around our observability tool because it’s just not available yet out of the box in any tool we tried.
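As a toy version of that "similar past incidents" lookup – the feature encoding and the data are made up, since the interview only describes the concept:

```python
past_incidents = [
    # (error_rate, latency_ms, queue_depth), resolution, effort in hours
    ((0.05, 800, 1200), "restart consumer group", 0.5),
    ((0.20, 2500, 9000), "roll back last deployment", 2.0),
    ((0.02, 300, 15000), "scale out workers", 1.0),
]

def suggest_actions(current: tuple, k: int = 2) -> list[str]:
    """Rank past incidents by similarity and surface their resolutions.

    A real system would normalize the features; this sketch skips that.
    """
    def distance(a: tuple, b: tuple) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    ranked = sorted(past_incidents, key=lambda inc: distance(inc[0], current))
    return [f"{action} (took {hours}h last time)" for _, action, hours in ranked[:k]]

print(suggest_actions((0.04, 700, 1400)))
```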
Security-related issues are a special category of critical issues. What do you think about analyzing observable data for that purpose?
Few people today know that observability started in the security space. That's why the security community has been thinking very structurally about it, and for good reason.
Functional incidents in a typical system may occur every day, week, or month. Regardless of the frequency, it’s fair to say that incidents of any given class take place quite often.
Non-functional incidents, such as those related to security, are far rarer. The data you have is sparse – especially by observability standards. You don't have enough data points to describe exactly what bad behavior looks like. The solution is to identify an attack early. Most security attacks are sequences of actions. If a sequence has 20 steps, you should try to identify it at step 5 or 10. The earlier, the better.
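A bare-bones sketch of that early-detection idea: match the actions observed so far against the prefixes of known attack sequences. The patterns below are invented:

```python
known_attacks = {
    "credential stuffing": ["login_fail", "login_fail", "login_fail",
                            "login_ok", "export_data"],
    "enumeration": ["list_users", "probe_endpoint", "probe_endpoint",
                    "download_dump"],
}

def matching_attacks(observed: list[str]) -> list[str]:
    """Return attacks whose known pattern starts with the observed actions."""
    return [
        name for name, pattern in known_attacks.items()
        if pattern[: len(observed)] == observed
    ]

# Flag at step 2 of a 5-step attack rather than after the data is gone.
print(matching_attacks(["login_fail", "login_fail"]))  # -> ['credential stuffing']
```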
At Picnic, we also try to collect a lot of security metrics to identify early on that somebody is trying to attack our infrastructure.
So far, I've only talked about prevention rather than resolution. That's because the security world still hasn't completely cracked the problem of resolving incidents that have already happened.
Resources
Thanks for all the insights, Daniel! You've shown me that observability is a much broader subject than I thought. It has an impact on pretty much every aspect of software development.
Are there any resources you'd recommend to CTOs who want to deepen their observability knowledge? I heard you're an avid podcast listener.
This isn’t observability-related, but I can definitely recommend some resources that keep me up-to-date with technology and technical leadership insights.
Pragmatic Engineer, TLDR, and LeadDev are all newsletters that enjoy great popularity for a good reason.
But everyone knows these. If I were to mention something less mainstream, I would warmly recommend Alphalist – this newsletter has a lot of useful content written by or based on the work of practitioners from the world of IT and business.
I also follow a number of podcasts for tech leaders.
What’s next? Three actions for CTOs to take
The future of observability looks bright with leaders such as Daniel Gebler shaping its adoption, doesn't it?
If you’d like to try his approach in your organization, follow Daniel’s tactics.
- Make observability an inherent part of your development from the start – at Picnic, each software product is also a data product.
- Involve business in a conversation about observability – Daniel believes that business people should help shape observability-related SLAs to tie technical metrics with business requirements.
- Prioritize system reliability – this is a key focus of Daniel. By preventing incidents, he makes the team more productive and the product more fun to use and easier to analyze.
Make data the driver of your product’s growth.