Our Journey Into Cutting Kubernetes Costs by 40%

By Victor Hernando - DECEMBER 19, 2022

As companies start their Kubernetes and cloud-native journey, cloud infrastructures and services grow at a rapid pace. This happens all too often as organizations shift left without thorough controls, which can lead to overallocating and overspending on their Kubernetes environments.

Organizations running workloads in the cloud can put budgets at risk when they lack information about key facts, like:

  • How many resources applications are using
  • The maximum and the minimum resources applications need
  • How consumption is trending over time

Our experience has taught us that if you fail to monitor, control, and analyze your resource usage, you may end up paying huge Kubernetes bills at the end of the month. The impact depends on the size of the company, but unplanned costs like those from an out-of-control Kubernetes bill can cause serious headaches, especially in tough economic conditions.

Are you spending a lot of time reviewing numbers in spreadsheets to figure out where your Kubernetes and cloud costs are going? If that’s the case, keep reading and discover how our cost reduction experience at Sysdig can help you along your cost optimization journey.

At Sysdig, hundreds of developers build and continuously deploy hundreds of microservices every day into 50+ clusters hosted on many different cloud providers. All our workloads, both stateful and stateless, run entirely on Kubernetes, in both managed and self-hosted clusters. In addition to reliability and observability management, cost management is an important focus area for such a huge footprint of clusters.

We started our journey a few months ago, and have accomplished a variety of optimizations resulting in over 40% reduction in cost and HUGE savings across all the cloud providers.

Sounds good, right?

This is the first post in a blog series where we plan to cover different optimization techniques and how they helped us achieve significant cost savings.

Visibility into current cost and usage

The first thing to think about while doing cost optimization is to decide who should own this charter, as cost optimizations are never once-and-done activities. We formed a small dedicated team of CloudOps and infra engineers to own this charter. They started with the Monitor, Analyze, and Optimize approach.

Let’s talk briefly about these three principles, and how they helped us get a better understanding of the current cost of our Kubernetes and cloud infrastructure.

Monitor

The Sysdig Infrastructure engineering team makes heavy use of Sysdig Monitor to gain deep visibility across containers, Kubernetes, and cloud. The out-of-the-box visibility into Kubernetes and the easy-to-build dashboards in Sysdig Monitor allowed us to bring metrics for all key cost contributors into one tool. We identified the areas in our stack where we spent the most money, and built dashboards to provide granular visibility into load, utilization, and capacity.

Analyze

Once we had the usage data, we brought in our cost data from the cloud providers to analyze it and understand which areas to focus on for the best ROI. The dynamic nature of Kubernetes makes it really hard to understand cost at a workload level.

Sysdig Monitor provides an out-of-the-box extended label set and metrics enrichment, making it much easier to integrate and analyze costs. Kubernetes metadata, infrastructure metadata, and application context are used to enrich all metrics automatically, without the need to instrument additional labels or add any extra component to your environment. Combined with cost data from the cloud providers, these enriched metrics translate into real Kubernetes and cloud cost figures.

In terms of resource usage, imagine being able to identify, all in a few seconds, the highest-spending workloads, which namespaces requested the most resources, and which of those namespaces are underutilizing their requests.
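To make the idea concrete, here is a minimal sketch of ranking namespaces by how much requested CPU goes unused. It does not use any Sysdig API; the `NamespaceUsage` type, the `rank_by_unused` helper, and the sample numbers are all hypothetical and only illustrate the arithmetic behind that kind of view.

```python
from dataclasses import dataclass


@dataclass
class NamespaceUsage:
    """Hypothetical per-namespace summary (not a Sysdig data structure)."""
    name: str
    cpu_requested: float  # CPU cores requested by all pods in the namespace
    cpu_used: float       # CPU cores actually used, averaged over some window


def rank_by_unused(namespaces):
    """Return (name, unused_cores, utilization) tuples, most wasteful first."""
    rows = []
    for ns in namespaces:
        # Requested-but-unused capacity; never negative.
        unused = max(ns.cpu_requested - ns.cpu_used, 0.0)
        # Fraction of the request that is actually consumed.
        utilization = ns.cpu_used / ns.cpu_requested if ns.cpu_requested else 0.0
        rows.append((ns.name, unused, utilization))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data for illustration only.
sample = [
    NamespaceUsage("search", cpu_requested=16.0, cpu_used=12.0),
    NamespaceUsage("checkout", cpu_requested=40.0, cpu_used=10.0),
    NamespaceUsage("batch", cpu_requested=8.0, cpu_used=7.5),
]

ranking = rank_by_unused(sample)
for name, unused, utilization in ranking:
    print(f"{name}: {unused:.1f} unused cores, {utilization:.0%} utilization")
```

In this made-up data, the namespace with the largest request is also the one wasting the most, so it would be the first right-sizing candidate.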

Optimize to save costs

Once all the key data about resource usage, performance, sizing, etc. is analyzed, it’s time to make decisions, propose right sizings, and finally apply these changes.

Some of the top opportunities that were recommended by the Cost Optimization team are:

  • Top Workloads and datastores right sizing
  • Workload optimizations and tuning
  • Cluster autoscaler tuning
  • Instance reshaping
  • Leveraging ARM processors
  • Network traffic optimizations and Inter AZ (Availability Zone) traffic reduction

We plan to cover each of these optimizations in upcoming posts in this series. In the meantime, the two dashboards below show the outcome of workload right sizing.

This dashboard represents the CPU right sizing of a group of workloads within a namespace. CPU was reduced based on unused-but-requested resources metrics.

This time, right sizing was applied to memory, which was shrunk down to the recommended levels based on the unused-but-requested memory dashboard.

As you can see in both images, unused requested CPU and memory were reduced significantly, saving resources at run time and as a consequence, reducing costs.
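The right-sizing logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method Sysdig used: the `recommend_request` helper, the 95th-percentile baseline, and the 20% headroom are hypothetical choices, and the usage samples are made up.

```python
import math


def recommend_request(usage_samples, current_request, headroom=0.2, percentile=0.95):
    """Suggest a smaller CPU request (in millicores) from observed usage.

    Assumed policy: take a high percentile of observed usage, add headroom,
    and never recommend more than the current request.
    """
    if not usage_samples:
        # No data: leave the request untouched.
        return current_request
    ordered = sorted(usage_samples)
    # Nearest-rank percentile: smallest sample covering `percentile` of the data.
    rank = max(math.ceil(percentile * len(ordered)), 1)
    baseline = ordered[rank - 1]
    recommended = round(baseline * (1 + headroom))
    return min(recommended, current_request)


# Made-up usage samples (millicores) for a workload requesting 1000m.
samples = [200, 220, 250, 240, 230, 300, 210, 260, 270, 280]
current = 1000
suggested = recommend_request(samples, current)
print(f"current={current}m suggested={suggested}m freed={current - suggested}m")
```

Because the result is capped at the current request, this sketch only ever shrinks a request. Workloads that are throttled or running close to their request need a separate check.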

Optimizing is the last step of a typical cost savings process. Monitoring and analyzing steps have been key for identifying which workloads need to be right sized, eventually allowing us to reduce our wasted spending.

Promoting FinOps best practices

In the FinOps Foundation’s own words: “FinOps is an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology, and business teams to collaborate on data-driven spending decisions.”

Nowadays, the vast majority of companies are in the middle of their cloud-native and Kubernetes journey. In most cases, that means changing the application architecture, moving to a microservices paradigm, and using Kubernetes and cloud providers to host applications and services. This new approach provides tons of benefits, like scalability, high availability, accessibility, and a simpler or smaller on-premises infrastructure, among others. There are drawbacks, though: a lack of cost observability can cause huge overspending.

It’s important to promote and implement a culture of cost discipline, where every department within a company is able to identify and quantify how many resources it uses and what that usage costs.

Here are some FinOps best practices that we use at Sysdig that can help you implement your own FinOps strategy. Following and implementing these best practices was key for our success during our cost optimization journey.

  • Improve or implement cost observability. It is vital to properly monitor and analyze your Kubernetes and cloud infrastructure. As soon as you can correlate resource usage data with cost information, you will get a better understanding of which users, applications, and services consume the most, and where the biggest overspending is.
  • Identify the most urgent overspending areas. If you spot certain areas that need to be addressed first, go ahead and fix them to avoid a big problem with your Kubernetes and cloud bill at the end of the month. Don’t forget to also tackle this from a more general, top-down point of view; it will help you identify and rank spending data across your organization.
  • Create your own reports or feed your chargeback. For the sake of simplicity, it is always a good idea to build your own cost reports. They help the company associate and assign costs to each department, and eventually assist with budget planning and allocation.
  • Design and execute cost saving strategies. To reduce wasted spending, define your own strategies, execute them, and measure how effective they are. Right size appropriately, evaluate other instance types, and adapt or redesign your workloads to fit your infrastructure.
  • Make people accountable for resource usage and cloud spending. Promoting a culture of cost discipline means (among many other things) ensuring the different personas in a company are accountable for the correct and appropriate usage of resources. Wasting resources can also mean consuming more electricity, hardware, and equipment in general, which doesn’t help with sustainability at all.
  • Design and/or adapt your workloads to the cloud. Be sure your workloads are well designed and right sized for your Kubernetes and cloud environment. Poor application design most likely means poor performance, and eventually suboptimal usage of resources.
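As a rough illustration of the reporting and chargeback practice above, the sketch below splits a cluster bill across teams in proportion to the CPU each team requested. The `chargeback` helper, the team names, and the figures are hypothetical, and the proportional-to-requests cost model is an assumption; a real bill also includes memory, storage, and network charges.

```python
from collections import defaultdict


def chargeback(cluster_cost, workloads):
    """Split `cluster_cost` across teams by share of requested CPU cores.

    `workloads` is a list of (team, requested_cores) pairs. This is an
    assumed, simplified allocation model, not a billing standard.
    """
    total_cores = sum(cores for _, cores in workloads)
    if total_cores == 0:
        return {}
    per_team = defaultdict(float)
    for team, cores in workloads:
        per_team[team] += cluster_cost * cores / total_cores
    return dict(per_team)


# Made-up monthly bill and workload requests for illustration only.
workloads = [("payments", 30.0), ("search", 50.0), ("payments", 20.0)]
report = chargeback(10_000.0, workloads)
for team, cost in sorted(report.items()):
    print(f"{team}: ${cost:,.2f}")
```

Allocating by requests rather than by usage is a deliberate choice in this sketch: teams pay for what they reserve, which rewards right sizing.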

Conclusion

Industries are accelerating their cloud journey from on-premises environments to plug-and-play multi-cloud at a fast pace. While cloud services and Kubernetes give dev teams agility like never before, that agility comes at a cost. At some point in time and scale, every company has to take a closer look at its infrastructure and optimize it for cost. The sooner you start on this journey and build a culture of cost-conscious teams, the better off you are. To be successful, you need all the data required for complete visibility into your infrastructure, and a team to plan and execute. We will go into the details of every kind of optimization we performed in future blog posts.

The new Cost Advisor feature in Sysdig Monitor automates many of these best practices and can help you to reduce wasted spending by 40%. Sign up for a 30-day trial of Sysdig Monitor. You’ll have access to all the features, and there is no payment required!