19 June 2024
Implementing Active Observability methods for graphical computing saves over $720 per instance per month
Active Observability methods for graphical computing streamlined operations for a large retail company and saved over $720 monthly per instance by automating incident remediation and preventing unnecessary cloud costs.
Business Challenge
A British company we collaborate with faced a clear need for observability methodologies to ensure smooth operations. Utilizing Datadog as their service provider, they aimed to consolidate metrics, logs and traces into a single source of truth.
However, managing incidents caused by multiple simultaneous failures proved challenging.
This prompted us to explore automation options within the Datadog and PagerDuty ecosystem.
Solution
As part of our DevOps strategy, we analyzed common issues in graphical computing, a key process for the client, and found two major issues that triggered monitor alerts:
- instances not stopping properly after finishing graphical compute-related tasks
- unresponsive Jenkins workers during concurrent builds with high resource usage
We opted for Datadog’s Workflow Automation feature to automate remediation.
Automated instance shutdown
In our first use case, we automatically shut down instances used for rendering VR images. Rendering typically takes around 15 minutes, and the instance should terminate itself once the image has been generated. However, the shell script responsible for this process could hang for various reasons, leaving expensive g5-type instances running and incurring unnecessary cloud costs.
To mitigate this risk, we introduced a dedicated monitor that identifies every g5-type instance that has been running for more than 3600 seconds (one hour). When triggered, the monitor activates a handler connected to a workflow automation. The workflow is configured to extract metadata about the instance captured by the monitor and execute a pre-configured AWS action.
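The condition the monitor evaluates can be illustrated with a short Python sketch. This is not the Datadog monitor itself, only the same rule applied to an in-memory list; the dictionary keys, instance IDs and instance sizes below are hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the text: flag instances running for more than 3600 seconds.
MAX_RUNTIME = timedelta(seconds=3600)


def find_overdue_instances(instances, now, family="g5", max_runtime=MAX_RUNTIME):
    """Return IDs of running instances of the given family that exceed max_runtime.

    Each instance is a dict with hypothetical keys: 'id', 'type', 'state'
    and 'launch_time' (a timezone-aware datetime).
    """
    overdue = []
    for inst in instances:
        in_family = inst["type"].split(".")[0] == family
        is_running = inst["state"] == "running"
        if in_family and is_running and now - inst["launch_time"] > max_runtime:
            overdue.append(inst["id"])
    return overdue


now = datetime(2024, 6, 19, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"id": "i-render-1", "type": "g5.xlarge", "state": "running",
     "launch_time": now - timedelta(minutes=14)},   # normal render, about 15 minutes
    {"id": "i-render-2", "type": "g5.xlarge", "state": "running",
     "launch_time": now - timedelta(hours=2)},      # stuck shell script
    {"id": "i-build-1", "type": "c5.large", "state": "running",
     "launch_time": now - timedelta(hours=5)},      # long-running, but not g5
]
print(find_overdue_instances(fleet, now))  # ['i-render-2']
```

Only the g5 instance that has outlived the one-hour threshold is reported; the healthy render and the non-g5 machine are left alone.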
JavaScript code was used to extract and transform the relevant information into a format Datadog understands. This transformed data was then used by the pre-configured ‘Terminate EC2 instance’ action to automatically shut down the instance, thus preventing prolonged usage and reducing unnecessary cloud costs.
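The kind of transformation involved might look like the following, sketched here in Python rather than JavaScript purely for illustration. The alert payload shape, the tag names and the output keys ('instanceId', 'region') are all assumptions, not the actual Datadog schema or the code used in the project.

```python
def tags_to_dict(tags):
    """Turn ['key:value', ...] into {'key': 'value', ...}; tags without ':' are skipped."""
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:
            parsed[key] = value
    return parsed


def build_terminate_input(alert):
    """Map a (hypothetical) monitor alert payload to the inputs of a terminate step."""
    tags = tags_to_dict(alert["tags"])
    missing = [k for k in ("instance-id", "region") if k not in tags]
    if missing:
        raise ValueError(f"alert is missing required tags: {missing}")
    # Output keys are illustrative; a real action defines its own input names.
    return {"instanceId": tags["instance-id"], "region": tags["region"]}


alert = {
    "monitor_name": "g5 instance running too long",
    "tags": ["instance-id:i-0123456789abcdef0", "region:eu-west-2",
             "instance-type:g5.xlarge"],
}
print(build_terminate_input(alert))
# {'instanceId': 'i-0123456789abcdef0', 'region': 'eu-west-2'}
```

Failing loudly when a required tag is missing is a deliberate choice in this sketch: a termination step should never run against a guessed or empty instance ID.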
Jenkins worker stability
Our second use case is directly relevant to the overall health of CI/CD systems. Using Jenkins as a CI tool for creating Unity builds can be challenging because resource usage, particularly CPU and memory, changes rapidly. This becomes even more complex when multiple builds run concurrently on the same worker, as in our project.
In such scenarios, we often lost the connection to EC2 workers because memory exhaustion on the instance prevented the SSH agent from responding promptly. Consequently, we lost track of build statuses, which hindered further scheduling and, in the worst case, disrupted builds that were already running. These builds typically take 1-3 hours each. Due to architectural limitations, we couldn’t simply scale a static instance up or down.
In such cases, our manual procedure involved rebooting the machine, resolving SSH agent connectivity issues, and reallocating already processed builds to make better use of the worker’s resources.
To streamline this process, we implemented a workflow automation based on a monitor that checks the ‘jenkins.node_status.up’ metric. This metric indicates whether the Jenkins master is connected to the worker’s agent.
The workflow automation restarts the machine whenever the metric reports no connection for more than 5 minutes, a window chosen to filter out brief, intermittent drops.
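The "down for more than 5 minutes" rule can be sketched in Python as follows. This is a simplified stand-in for the monitor's evaluation, assuming the metric arrives as time-ordered (timestamp, value) samples where 1 means connected and 0 means disconnected; the function name and sample data are hypothetical.

```python
# Threshold from the text: the worker must be disconnected for over 5 minutes.
DOWN_THRESHOLD_SECONDS = 5 * 60


def should_restart(samples, now, threshold=DOWN_THRESHOLD_SECONDS):
    """Return True if the worker has been continuously down for more than `threshold`.

    `samples` is a time-ordered list of (unix_timestamp, value) pairs,
    where value 1 means connected and 0 means disconnected.
    """
    down_since = None
    for timestamp, value in samples:
        if value >= 1:
            down_since = None          # any "up" sample resets the outage
        elif down_since is None:
            down_since = timestamp     # first "down" sample of the current outage
    return down_since is not None and now - down_since > threshold


# A 60-second blip that recovers: no restart.
blip = [(0, 1), (60, 0), (120, 1), (180, 1)]
# Down from t=60 onwards, evaluated at t=420 (360 s of outage): restart.
outage = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 0)]
print(should_restart(blip, now=240), should_restart(outage, now=420))  # False True
```

Resetting the outage on every "up" sample is what keeps short drops from triggering a reboot, which matters when a restart would interrupt builds that take hours.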
Active Observability saves over $720 per instance per month
Implementing Active Observability methods to automate incident remediation steps has proven highly beneficial.
Cost Savings
With this system in place, we’ve achieved substantial cost savings by preventing stuck instances from running unnoticed, saving over $720 per instance per month.
Over a year, our workflow has successfully terminated more than 20 instances, leading to significant cost reductions.
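To put the monthly figure in perspective, the arithmetic below derives the hourly cost it implies. It assumes a 30-day month with the instance running around the clock; neither the hourly rate nor the instance size is stated in this article, so the result is only a rough, illustrative estimate.

```python
HOURS_PER_MONTH = 24 * 30          # assumption: a 30-day month, running non-stop
MONTHLY_SAVING_USD = 720           # figure quoted in the text

# Hourly cost implied by the quoted figure under the assumption above.
implied_hourly_rate = MONTHLY_SAVING_USD / HOURS_PER_MONTH


def avoided_cost(hours_left_running, hourly_rate=implied_hourly_rate):
    """Cost avoided by terminating an instance that would otherwise idle this long."""
    return hours_left_running * hourly_rate


print(implied_hourly_rate)         # 1.0
print(avoided_cost(48))            # 48.0
```

Under these assumptions, a stuck instance costs roughly a dollar for every hour it goes unnoticed, so one left running over a weekend (48 hours) would cost about $48.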
Time Efficiency
By automating incident resolution, our partner saves considerable time that would have otherwise been spent on failed builds. With each build taking approximately 1.5 hours, avoiding failures means maximizing their development time and speeding up time to market for new improvements.
Struggling with growing complexity, system overload or business-crushing downtimes?
Get ahead of the competition: 70% of companies still lag behind in observability (The State of Observability 2023). Our DevOps team helps determine and quantify your goals. Their workshop will show you how Observability immediately recovers your systems and responds to any incident. Contact us to find out more.