Organizations are turning in droves to Prometheus to monitor their container and microservice estates, but larger companies often run headlong into a wall: scaling challenges emerge once they move beyond a handful of applications.
Containers Have Complicated the Picture
Monitoring monolithic environments used to be relatively straightforward. You had a certain number of static physical servers and virtual machines and a finite number of metrics to watch. Today the number of entities to track is exploding because of containers and the migration to microservice architectures.
If servers sitting in data centers were pets (as they have been described) requiring our constant attention, and cloud instances are more like cattle (you don’t care about any single one because you have a lot of them), then containers are more akin to locusts. There are a lot of them, sometimes hundreds per machine, new ones appear all the time, and when used in conjunction with an orchestrator like Kubernetes, their lifetime can be very short. This makes it much harder to keep track of them, and if you’re not careful, they can cause a lot of damage.
As the complexity and distribution of environments increases, so does the number of entities you need to monitor. Additionally, you might want to monitor more attributes to ensure you have an accurate picture of what is going on, or, in the case of troubleshooting or incident response, what was going on. The latter is particularly problematic in these ephemeral environments because by the time you want to understand the root cause of a problem, often the resources in question have already been decommissioned, meaning the monitoring solution has to provide a way to store enough history for forensics.
Increasingly, teams in need of cloud monitoring are turning to Prometheus, an open source CNCF project. Prometheus has become the go-to monitoring tool developers use to collect and make sense of metrics in cloud-native environments. It is supported by a large community, with 6,300 contributors from more than 700 companies, 13,500 code commits, and 7,200 pull requests.
A typical cloud-native application stack (Kubernetes, NGINX, MongoDB, Kafka, Go, and so on) exposes Prometheus metrics by default. Prometheus is a Go program designed to scale vertically, and it is easy to deploy as, say, a single container on a single host. This means it’s very easy to get started with Prometheus to gain visibility into your first Kubernetes cluster. But it also means that as your infrastructure grows, you will hit its limits.
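Getting that first cluster scraped takes very little configuration. Here is a minimal sketch of a single-instance setup; the job names are illustrative, and the pod-discovery job assumes Prometheus is running inside the cluster:

```yaml
# prometheus.yml -- minimal single-instance configuration (illustrative)
global:
  scrape_interval: 15s        # how often to scrape each target

scrape_configs:
  # Prometheus scrapes its own /metrics endpoint
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # discover and scrape Kubernetes pods via built-in service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
```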
The Scale Problem
As your environment grows, the number of time series you need to track skyrockets, and at a certain point a single Prometheus instance won’t be able to keep up. The straightforward option is to run a fleet of Prometheus servers across the enterprise, but this comes with several challenges. For example, managing and federating data across tens or hundreds of Prometheus servers is not easy. Similarly, enterprise workflows, single sign-on, role-based access control, and adherence to SLAs or compliance requirements are not easy problems either. As applications grow, operating an all-encompassing monitoring solution without disrupting developer work becomes a huge manageability and reliability issue.
To deal with that, companies have adopted a few approaches.
A simple first step is to run a separate Prometheus server for every namespace or every cluster. This approach is clearly harder to scale beyond a certain point, and it has the added disadvantage of creating a large number of disconnected data silos. That makes troubleshooting cumbersome, because most issues span multiple services, teams, and clusters. Not only is it hard to find the same metric in each environment, you then have to stitch the data together to try to understand what is happening.
Another common approach is to use open source tools such as Cortex or Thanos to federate multiple Prometheus servers. These are powerful tools that let you query servers in a centralized way, collect the data, and share it in a single dashboard. However, as with any data-intensive distributed system, they require substantial skills and resources to operate.
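One common way such backends are wired up is Prometheus’s built-in remote_write feature, which streams a copy of every sample to a central store such as Cortex or Thanos Receive (Thanos is also often deployed as a sidecar instead). A minimal sketch, with a placeholder URL:

```yaml
# Fragment of prometheus.yml: stream samples to a central backend.
# The URL is a placeholder for your own Cortex or Thanos Receive endpoint.
remote_write:
  - url: "http://cortex.example.internal/api/v1/push"
```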
Six Factors to Consider
For companies that start with Prometheus and then look for a commercial solution to provide a holistic view, it is important not to lose the development work already done standardizing on Prometheus: dashboards, alerts, exporters, and so on. However, that is not the only consideration. If you go this route, insist on support for these six core criteria:
1. Full ingestion compatibility that supports all Prometheus features
Your vendor, tool, or SaaS solution needs to be able to consume data from any entity that produces Prometheus metrics, whether that’s Kubernetes running on-premises or a managed cloud service. Consuming Prometheus metrics is relatively trivial, but don’t overlook the little things, such as being able to relabel metrics as you ingest them into storage, or to augment the data so it makes more sense for your environment. These things add up and make a big difference in your ability to use the mountains of data collected.
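For reference, Prometheus itself supports rewriting or dropping series at ingestion time via metric_relabel_configs in the scrape configuration, and a compatible solution should honor the same behavior. A sketch, with illustrative target, label, and metric names:

```yaml
# Fragment of a scrape config: rewrite labels as samples are ingested
# (target address, label names, and metric names are illustrative)
scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["app.example.internal:8080"]
    metric_relabel_configs:
      # copy the pod label into a friendlier "instance_name" label
      - source_labels: [pod]
        target_label: instance_name
      # drop a high-cardinality debug metric before it reaches storage
      - source_labels: [__name__]
        regex: "myapp_debug_.*"
        action: drop
```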
2. PromQL compatibility
The Prometheus Query Language was invented by the creators of Prometheus to extract information stored by Prometheus. PromQL enables you to ask for metrics on, for example, specific services or specific users. It also enables you to aggregate or segment data. For example, you can use it to show CPU utilization on an app-by-app basis across all of your containers. Or to show only data for Cassandra containers and show it as a single value for each cluster. PromQL unlocks the real value of Prometheus; therefore, ingesting Prometheus metrics into a product that doesn’t fully support PromQL defeats the whole purpose of using Prometheus.
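The two queries described above might look like the following in PromQL, assuming cAdvisor-style container metrics and "app" and "cluster" labels (your label names may differ):

```promql
# CPU usage per app across all containers
sum by (app) (rate(container_cpu_usage_seconds_total[5m]))

# the same metric restricted to Cassandra containers,
# rolled up to a single value per cluster
sum by (cluster) (rate(container_cpu_usage_seconds_total{app="cassandra"}[5m]))
```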
3. Compatibility with Grafana
To be truly compatible with Prometheus, the solution has to be hot-swappable: it must work with your existing dashboards, alerts, and scripts. Many companies that use Prometheus, for example, use Grafana for dashboards. This open source tool is tightly integrated with Prometheus, including at the query level, and can be used to produce a range of useful charts and dashboards. Commercial offerings that claim Prometheus compatibility should, therefore, be compatible with tools like Grafana. It isn’t enough to say the solution lets you see a number in Grafana. You need to be able to import existing Grafana dashboards as they are, without any changes, and point them at the data stored in the commercial solution.
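A concrete test of that hot-swappability: Grafana datasources are often provisioned from a small YAML file, and swapping in a PromQL-compatible backend should require changing nothing but the URL. A sketch, with a placeholder URL:

```yaml
# Grafana datasource provisioning file, e.g. placed under
# /etc/grafana/provisioning/datasources/ (URL is a placeholder)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.internal:9090
    isDefault: true
```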
4. Access controls
Access controls are another security issue to consider when evaluating tools. The ability to secure user authentication with industry-standard protocols, including LDAP, Google OAuth, SAML, and OpenID, enables companies to isolate and secure resources with service-based access control.
5. Access to contextual, real-time data
Kubernetes simplifies the deployment, scaling, and management of containerized applications and microservices. This helps keep services up and running, but to identify and resolve underlying problems, such as slow performance, failed deployments, and connection errors, you need the ability to gather and visualize in-depth infrastructure, application, and performance data from across your environment. Without access to both real-time information and contextual data, it is nearly impossible to correlate the metrics in your environment and solve problems quickly.
6. Compatibility with existing alerts
Finally, if you’re looking for a commercial answer to the Prometheus scalability problem, make sure it supports all levels of alerting. The key to achieving this is full support for Alertmanager functionality, which in turn requires 100% ingestion and PromQL compatibility.
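For example, a compatible backend should be able to evaluate a standard Prometheus alerting rule like the one below unchanged and route the resulting alert through Alertmanager (the expression, threshold, and labels are illustrative):

```yaml
# Example Prometheus alerting rule file (illustrative names and threshold)
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 1
        for: 10m              # must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "Sustained 5xx error rate above 1 req/s"
```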
If you find a commercial tool that meets these criteria, you should be able to swap it into existing Prometheus integrations with a minimum of fuss and sidestep the scalability problems companies are running into. Developers adore Prometheus, with good reason, and due diligence now will help you to ensure that they can still use the metrics they love.