In this article, we’ll cover the three main challenges you may face when maintaining your own Prometheus LTS solution.
From the beginning, Prometheus made clear that it was not a long-term metrics store; the expectation was that somebody would eventually build that long-term storage (LTS) for Prometheus metrics.
Today, several open-source projects provide long-term storage for Prometheus metrics (Prometheus LTS). Three community projects lead the pack: Cortex, Thanos, and M3.
Starting with Prometheus is fairly easy. It has several mechanisms, like service discovery, designed to work out of the box with little effort. But as your infrastructure grows, you'll soon face some challenges.
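For instance, Kubernetes service discovery needs only a few lines of configuration. A minimal sketch of a `prometheus.yml` scrape job (the job name and the annotation-based filtering convention are illustrative):

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'   # illustrative job name
    kubernetes_sd_configs:
      - role: pod                 # discover every pod in the cluster
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

With this in place, Prometheus picks up new pods automatically as they are scheduled, with no per-target configuration.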
Challenge 1: Prometheus LTS know-how
Once you set up Prometheus in a cluster and start creating alerts and rules (or adding resources from promcat.io) and visualizing metrics, your cloud-native infrastructure usually grows fast.
Suddenly, you may have several clusters in different regions, with different applications running on them. And that is when the challenges with your Prometheus solution start.
Managing Prometheus is not just a matter of using out-of-the-box materials: you also need to take care of details like cardinality, performance, and PromQL optimization.
You need good control of features, like:
- Relabeling, including changing, dropping, and adding labels.
- Target configuration.
- Different types of aggregations and rollups.
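As a sketch, the relabeling operations listed above (changing, dropping, and adding labels) look like this in a scrape job's `relabel_configs` (the label names and values are illustrative):

```yaml
relabel_configs:
  # Change: copy the discovered node name into a shorter label
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node
    action: replace
  # Drop: discard targets from a namespace we don't monitor
  - source_labels: [__meta_kubernetes_namespace]
    regex: kube-system
    action: drop
  # Add: attach a static label identifying this cluster
  - target_label: cluster
    replacement: prod-eu-west   # illustrative value
    action: replace
```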
On top of this, moving to centralized long-term storage requires strong knowledge in several areas, like:
The architecture of the Prometheus LTS solution.
Deploying and maintaining a Prometheus LTS is not trivial. Although the different solutions have come a long way in reducing operational complexity, there is still a fair amount of research and learning you need to do before you can start using them.
The architecture of the monitoring structure.
You still need a good knowledge of different strategies to create efficient labels that let you slice and dice your metrics.
You also need a good aggregation strategy: a set of labels that lets you segment the information, and recording rules that improve read-path efficiency. On top of that, you will probably have to deal with peculiarities of your own infrastructure design that no general guide can anticipate.
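As an example of the read-path optimizations mentioned above, a recording rule precomputes an expensive aggregation so that dashboards query one cheap, pre-aggregated series per service instead of thousands of raw ones (the group name, rule name, and metric are illustrative):

```yaml
groups:
  - name: aggregation-rules        # illustrative group name
    interval: 1m
    rules:
      # Precompute the per-service request rate so dashboards
      # don't aggregate the raw series on every page load
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
```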
This knowledge must be spread across all the teams in the company that build apps and rely on these metrics to monitor and troubleshoot them. That requires a good amount of training, and misuse by untrained teams can cause issues in the monitoring platform.
Challenge 2: Scale Prometheus LTS
We have previously talked about the challenges of scaling Prometheus.
Even though deploying a solution can be affordable, there are several things to keep in mind to properly manage the Prometheus LTS system. One of the most important is scale, which plays a role along several dimensions:
- Cardinality: The number of series in your system can change drastically depending on your custom metrics, labels, and the scale of the apps you are running at any given time. You will need to resize the provisioned storage, or be ready to use a scalable backend like object storage (most Prometheus LTS solutions now support it).
- Read path: The number of dashboards, and of users loading them, tends to grow over time as you instrument your apps and introduce DevOps teams to Prometheus. The number of alerts can grow too. Recording rules and other optimizations (such as query caching) can become necessary as the system scales.
- Intracluster collection: You need to be ready for the growth of your clusters. At the beginning, one Prometheus per cluster is enough to collect all the metrics, but over time you may need to switch to different strategies, like one metric collector per node or other sharding strategies.
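One common sharding strategy is hash-based target sharding: each Prometheus replica keeps only the targets whose address hashes into its shard. A sketch using `hashmod` relabeling (the shard count and shard number are illustrative, and would normally be templated per replica):

```yaml
relabel_configs:
  # Hash each target's address into one of 3 buckets
  - source_labels: [__address__]
    modulus: 3                 # total number of shards (illustrative)
    target_label: __tmp_hash
    action: hashmod
  # This replica keeps only the targets in shard 0
  - source_labels: [__tmp_hash]
    regex: "0"                 # this replica's shard number
    action: keep
```

Each replica then scrapes roughly a third of the targets, and the LTS layer reassembles the global view.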
Challenge 3: Infrastructure optimization (costs)
Operating a Prometheus LTS requires a fair amount of resources to be able to collect, store, and serve the metrics.
Depending on the architecture of your infrastructure, there are almost infinite variants of on-prem, cloud, and hybrid resources. Controlling the cost and the proper provision of resources to run the service can be a challenge.
The cloud provider bill is one of the biggest costs associated with any IT infrastructure. The outcome of the first two challenges directly affects both the cost of your infrastructure and the complexity associated with it. Optimizing cloud provider costs takes a lot of knowledge, and in this case it is directly tied to how much optimization you can squeeze out of the Prometheus LTS application of your choice.
To sum up
In this article, you've read about the three main challenges of maintaining a Prometheus LTS solution on-premises. Now, you can dig deeper into one of the scariest ones, scaling, in this article.