The Prometheus monitoring system and time series database.

Prometheus

Visit prometheus.io for the full documentation, examples and guides.

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

The features that distinguish Prometheus from other metrics and monitoring systems are:

  • A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
  • PromQL, a powerful and flexible query language to leverage this dimensionality
  • No dependency on distributed storage; single server nodes are autonomous
  • An HTTP pull model for time series collection
  • Pushing time series is supported via an intermediary gateway for batch jobs
  • Targets are discovered via service discovery or static configuration
  • Multiple modes of graphing and dashboarding support
  • Support for hierarchical and horizontal federation

Architecture overview

Install

There are various ways of installing Prometheus.

Precompiled binaries

Precompiled binaries for released versions are available in the download section on prometheus.io. Using the latest production release binary is the recommended way of installing Prometheus. See the Installing chapter in the documentation for all the details.

Docker images

Docker images are available on Quay.io or Docker Hub.

You can launch a Prometheus container to try it out with:

$ docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus

Prometheus will now be reachable at http://localhost:9090/.

Building from source

To build Prometheus from source code, first ensure that you have a working Go environment with version 1.14 or greater installed. You also need Node.js and npm installed in order to build the frontend assets.

You can directly use the go tool to download and install the prometheus and promtool binaries into your GOPATH:

$ GO111MODULE=on go install github.com/prometheus/prometheus/cmd/...
$ prometheus --config.file=your_config.yml

However, when using go install to build Prometheus, Prometheus will expect to be able to read its web assets from local filesystem directories under web/ui/static and web/ui/templates. In order for these assets to be found, you will have to run Prometheus from the root of the cloned repository. Note also that these directories do not include the new experimental React UI unless it has been built explicitly using make assets or make build.

An example of the above configuration file can be found here.
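In case the linked example is unavailable, a minimal configuration along these lines is enough to try things out. This is a sketch (the scrape interval and target are placeholders), with Prometheus scraping its own /metrics endpoint:

```yaml
global:
  scrape_interval: 15s # how often to scrape targets by default

scrape_configs:
  # Scrape Prometheus itself on the port it serves on.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```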

You can also clone the repository yourself and build using make build, which will compile in the web assets so that Prometheus can be run from anywhere:

$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --config.file=your_config.yml

The Makefile provides several targets:

  • build: build the prometheus and promtool binaries (includes building and compiling in web assets)
  • test: run the tests
  • test-short: run the short tests
  • format: format the source code
  • vet: check the source code for common errors
  • assets: build the new experimental React UI

Building the Docker image

The make docker target is designed for use in our CI system. You can build a Docker image locally with the following commands:

$ make promu
$ promu crossbuild -p linux/amd64
$ make npm_licenses
$ make common-docker-amd64

NB: if you are on a Mac, you will need gnu-tar.

React UI Development

For more information on building, running, and developing on the new React-based UI, see the React app's README.md.

More information

Contributing

Refer to CONTRIBUTING.md

License

Apache License 2.0, see LICENSE.

Comments
  • TSDB data import tool for OpenMetrics format.

    Created a tool to import data formatted according to the Prometheus exposition format. The tool can be accessed via the TSDB CLI.

    closes prometheus/prometheus#535

    Signed-off-by: Dipack P Panjabi [email protected]

    (Port of https://github.com/prometheus/tsdb/pull/671)

    opened by dipack95 126
  • Add mechanism to perform bulk imports

    Currently the only way to bulk-import data is a hacky one involving client-side timestamps and scrapes with multiple samples per time series. We should offer an API for bulk import. This relies on https://github.com/prometheus/prometheus/issues/481.

    EDIT: It probably won't be an web-based API in Prometheus, but a command-line tool.

    kind/enhancement priority/P2 component/tsdb 
    opened by juliusv 112
  • Create a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein.

    RUNBOOK was added in a hurry in #843 for an internal demo of one of our users, which didn't give it enough time to be fully discussed. The demo has been done, so we can reconsider this.

    I think we should revert this change, and remove RUNBOOK:

    • Our general policy is that if it can be done with labels, do it with labels
    • All notification methods in the alertmanager will need extra code to deal with this
    • In future, all alertmanager notification templates will need extra code to deal with this
    • In general, all user code touching the alertmanager will need extra code to deal with this
    • This presumes a certain workflow in that you have something called a "runbook" (and not any other name - playbook is also common) and that you have exactly one of them

    Runbooks are not a fundamental aspect of an alert, are not in use by all of our users and thus I don't believe they meet the bar for first-class support within prometheus. This is especially true considering that they don't add anything that isn't already possible with labels.

    opened by brian-brazil 102
  • Implement strategies to limit memory usage.

    Currently, Prometheus simply limits the chunks in memory to a fixed number.

    However, this number doesn't directly imply the total memory usage as many other things take memory as well.

    Prometheus could measure its own memory consumption and (optionally) evict chunks early if it needs too much memory.

    It's non-trivial to measure "actual" memory consumption in a platform independent way.
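A rough Go sketch of the self-measurement idea, using the runtime's own heap statistics; the limit and the eviction decision are hypothetical illustrations, not Prometheus code:

```go
package main

import (
	"fmt"
	"runtime"
)

// shouldEvictEarly reports whether the process's measured heap usage exceeds
// a configured soft limit. In the idea above, crossing the limit would
// trigger early chunk eviction. Note that HeapAlloc is a Go-runtime view of
// memory, not the OS-level RSS, which is part of why "actual" consumption is
// hard to measure portably.
func shouldEvictEarly(limitBytes uint64) bool {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapAlloc > limitBytes
}

func main() {
	// With an absurdly high 1 TiB limit, no eviction is needed.
	fmt.Println(shouldEvictEarly(1 << 40))
}
```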

    kind/enhancement 
    opened by beorn7 90
  • '@ <timestamp>' modifier

    This PR implements @ <timestamp> modifier as per this design doc.

    An example query:

    rate(process_cpu_seconds_total[1m]) 
      and
    topk(7, rate(process_cpu_seconds_total[1h] @ 1234))
    

    which ranks based on last 1h rate and w.r.t. unix timestamp 1234 but actually plots the 1m rate.

    Closes #7903

    This PR is to be followed up with an easier way to represent the start, end, range of a query in PromQL so that we could do @ <end>, metric[<range>] easily.

    opened by codesome 88
  • Port isolation from old TSDB PR

    The original PR was https://github.com/prometheus/tsdb/pull/306 .

    I tried to carefully adjust to the new world order, but please give this a very careful review, especially around iterator reuse (marked with a TODO).

    On the bright side, I definitely found and fixed a bug in txRing.

    prombench 
    opened by beorn7 78
  • 2.3.0 significant memory usage increase.

    Bug Report

    What did you do? Upgraded to 2.3.0

    What did you expect to see? General improvements.

    What did you see instead? Under which circumstances? Memory usage, possibly driven by queries, has increased considerably. The upgrade was at 09:27; the memory usage drops on the graph after that are from container restarts due to OOM.

    container_memory_usage_bytes


    Environment

    Prometheus in kubernetes 1.9

    • System information: Standard docker containers, on docker kubelet on linux.

    • Prometheus version: 2.3.0

    kind/bug 
    opened by tcolgate 77
  • Support for environment variable substitution in configuration file

    I think it would be a good idea to substitute environment variables in the configuration file.

    That could be done really easily by calling os.ExpandEnv on the configuration string when loading it.

    It would be much better to substitute environment variables only in configuration values. go-ini provides a valueMapper, but yaml.v2 doesn't have such a mechanism.

    opened by dopuskh3 72
  • React UI: Implement more sophisticated autocomplete

    It would be great to have more sophisticated expression field autocompletion in the new React UI.

    Currently it only autocompletes metric names, and only when the expression field doesn't contain any other sub-expressions yet.

    Things that would be nice to autocomplete:

    • metric names anywhere within an expression
    • label names
    • label values
    • function names
    • etc.

    For autocomplete functionality not to annoy users, it needs to be as performant, correct, and unobtrusive as possible. Grafana does many things right here already, but they also have a few really annoying bugs, like inserting closing parentheses in incorrect locations of an expression.

    Currently @slrtbtfs has indicated interest in building a language-server-based autocomplete implementation.

    component/ui priority/P3 kind/feature 
    opened by juliusv 69
  • Benchmark tsdb master

    DO NOT MERGE

    Benchmark 1

    Benchmark the following PRs against 2.11.1

    1. For queries: https://github.com/prometheus/tsdb/pull/642
    2. For compaction: https://github.com/prometheus/tsdb/pull/643 https://github.com/prometheus/tsdb/pull/654 https://github.com/prometheus/tsdb/pull/653
    3. Opening block: https://github.com/prometheus/tsdb/pull/645

    Results

    Did not test compaction from on-disk blocks. Could not really see the allocation optimizations in compaction; that might be because the savings are mostly in the number of allocations and not the size of allocations (size is what is shown in the dashboards). That would mean CPU is saved, but it didn't make a huge difference, apart from a slight increase in gap during compaction.

    The gains looked good in

    1. Allocations
    2. CPU (because of allocations?)
    3. RSS was also lower (up to 10 GiB lower! ~60 vs ~70).
    4. Also a small improvement in query inner_eval times.
    5. Compaction time (this should help the increase in compaction time that https://github.com/prometheus/tsdb/pull/627 is going to bring).
    6. System load.

    And bad in

    1. result_sort for the queries. Not sure why.

    Benchmark 2

    Benchmark https://github.com/prometheus/tsdb/pull/627 (which includes all the PRs from above Benchmark 1) against 2.11.1

    opened by codesome 65
  • M-map full chunks of Head from disk

    TL;DR description of the PR from @krasi-georgiev:


    When appending to the head and a chunk is full, it is flushed to disk and m-mapped (memory-mapped) to free up memory.

    Prometheus startup now happens in these stages:

    • Iterate the m-mapped chunks from disk and keep a map of series reference to its slice of m-mapped chunks.
    • Iterate the WAL as usual. Whenever we create a new series, look for its m-mapped chunks in the map created before and add them to that series.

    If a head chunk is corrupted, the corrupted one and all chunks after it are deleted, and the data after the corruption is recovered from the existing WAL, which means that a corruption in m-mapped files results in NO data loss.

    M-mapped chunks format: the main difference is that a chunk for m-mapping now also includes the series reference, because there is no index mapping series to chunks. Block chunks are accessed via the index, which includes the offsets of the chunks in the chunks file (for example, the chunks of a series have offsets 200, 500, etc. in the chunk files). For m-mapped chunks, the offsets are stored in memory and accessed from there. During WAL replay, these offsets are restored by iterating all m-mapped chunks as stated above, matching the series ID present in the chunk header with the offset of that chunk in that file.
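The series-reference-in-header idea can be sketched as a tiny encode/decode roundtrip in Go. The field choice, sizes, and ordering here are assumptions for illustration, not the actual on-disk chunk format:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// chunkHeader illustrates the point above: an m-mapped head chunk carries its
// own series reference, since there is no block index mapping series to
// chunk offsets. The layout is hypothetical.
type chunkHeader struct {
	SeriesRef uint64 // which series this chunk belongs to
	MinTime   int64  // first sample timestamp in the chunk
	MaxTime   int64  // last sample timestamp in the chunk
}

func (h chunkHeader) encode() []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.BigEndian, h) // fixed-size struct, so this works
	return buf.Bytes()
}

func decodeChunkHeader(b []byte) (chunkHeader, error) {
	var h chunkHeader
	err := binary.Read(bytes.NewReader(b), binary.BigEndian, &h)
	return h, err
}

func main() {
	h := chunkHeader{SeriesRef: 42, MinTime: 1000, MaxTime: 2000}
	got, _ := decodeChunkHeader(h.encode())
	fmt.Printf("%+v\n", got)
}
```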

    Prombench results

    WAL Replay

    • 1h WAL replay: 30% less replay time (4m31s vs 3m36s)
    • 2h WAL replay: 20% less replay time (8m16s vs 7m)

    Memory During WAL Replay

    • High churn: 10-15% less RAM (32 GB vs 28 GB); 20% less RAM after compaction (34 GB vs 27 GB)
    • No churn: 20-30% less RAM (23 GB vs 18 GB); 40% less RAM after compaction (32.5 GB vs 20 GB)

    Screenshots are in this comment


    Prerequisite: https://github.com/prometheus/prometheus/pull/6830 (Merged)

    Closes https://github.com/prometheus/prometheus/issues/6377. More info in the linked issue and the doc in that issue and the doc inside that doc inside that issue :)

    • [x] Add tests
    • [x] Explore possible ways to get rid of new globals added in head.go
    • [x] Wait for https://github.com/prometheus/prometheus/pull/6830 to be merged
    • [x] Fix windows tests
    prombench 
    opened by codesome 64
  • histogram: Remove code replication via generics

    This is only for the sparsehistogram branch!

    I was hoping more of the iterator code could be deduplicated. Turns out, the meat is usually in the Next method, and it is actually rather different for each iterator type. I could still extract most of the At implementations, generify Bucket and the BucketIterator interface and do a few other cleanups.

    This is neither urgent nor critical, and I would not have done it if I had known the outcome wouldn't be more spectacular. Still, now it is done, and I think it helps a tiny bit to make the code more consistent.
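The generified Bucket/BucketIterator shape described above can be sketched like this; the constraint and fields are simplified for illustration and are not the actual sparsehistogram code:

```go
package main

import "fmt"

// Bucket is a generic histogram bucket whose count type varies between
// integer and float histograms, the kind of duplication generics can remove.
type Bucket[C int64 | float64] struct {
	Lower, Upper float64 // bucket boundaries
	Count        C       // observations in the bucket
}

// BucketIterator is the generified iterator interface; the type-specific
// logic stays in each Next implementation, as noted above.
type BucketIterator[C int64 | float64] interface {
	Next() bool
	At() Bucket[C]
}

// sliceIter is a trivial iterator over a pre-built slice of buckets.
type sliceIter[C int64 | float64] struct {
	buckets []Bucket[C]
	idx     int
}

func (it *sliceIter[C]) Next() bool    { it.idx++; return it.idx <= len(it.buckets) }
func (it *sliceIter[C]) At() Bucket[C] { return it.buckets[it.idx-1] }

func main() {
	var it BucketIterator[int64] = &sliceIter[int64]{buckets: []Bucket[int64]{{0, 1, 5}}}
	for it.Next() {
		fmt.Println(it.At().Count)
	}
}
```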

    opened by beorn7 2
  • RemoteWrite to aws with just role_arn doesn't work

    What did you do?

    Configured a remote write to aws

    remote_write:
      - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/wsID/api/v1/remote_write"
        sigv4:
          region: us-east-1
          role_arn:

    This doesn't seem to work

    What did you expect to see?

    No response

    What did you see instead? Under which circumstances?

    caller=main.go:1203 level=error msg="Failed to apply configuration" err="could not get SigV4 credentials: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"

    System information

    No response

    Prometheus version

    No response

    Prometheus configuration file

    No response

    Alertmanager version

    No response

    Alertmanager configuration file

    No response

    Logs

    No response

    opened by KavyaShree25 0
  • Prometheus in agent mode fails to send data to Thanos Receiver via remote_write way too often.

    What did you do?

    Running Prometheus in k8s with agent mode enabled, with a remote_write endpoint configured to receive metrics from Prometheus. Running Prometheus v2.36.2 with the below remote_write configuration.


    What did you expect to see?

    Prometheus to send data without any gap to remote_write endpoint.

    What did you see instead? Under which circumstances?

    Metrics are being dropped after a certain interval of time, with no log messages in Prometheus. But it recovers by itself after a short interval.

    System information

    No response

    Prometheus version

    prometheus v2.36.2
    

    Prometheus configuration file

    prometheus.yml: |-
        global:
          external_labels:
            monitor: prometheus
            replica: '${HOSTNAME}'
            pod: '${HOSTNAME}'
          scrape_interval: 15s
        remote_write:
          - url: 'http://<thanos-receive-endpoint>:19291/api/v1/receive'
            queue_config:
              capacity: 6000

              # Maximum number of shards, i.e. amount of concurrency.
              max_shards: 1500

              # Minimum number of shards, i.e. amount of concurrency.
              min_shards: 1

              # Maximum number of samples per send.
              max_samples_per_send: 2000

              # Maximum time a sample will wait in buffer.
              batch_send_deadline: 5s

              # Initial retry delay. Gets doubled for every retry.
              min_backoff: 30ms

              # Maximum retry delay.
              max_backoff: 5s
    

    Alertmanager version

    No response

    Alertmanager configuration file

    No response

    Logs

    No response

    opened by apoorva-marisomaradhya 0
  • Revisit making time zone configurable

    Proposal

    Use case. Why is this important?

    Prometheus currently logs timestamps in UTC, like ts=2022-09-20T09:00:37.982Z. There is no way to make the timestamp include a time zone offset, i.e. to replace Z with e.g. +01:00 to log in a specific local time, possibly set via the TZ variable, which follows the IANA Time Zone Database.

    This is important because some organizations with a widespread amount of microservices/applications require a consistent logging format with regard to the time zone. A predominant logging format on Unix is the syslog protocol, defined in https://www.rfc-editor.org/rfc/rfc5424.html. If organizations require logs to follow it, there is no way to achieve this in Prometheus.

    Considerations

    Both the FAQ and the previous issue https://github.com/prometheus/prometheus/issues/500 regarding this were closed without action. But this is a very valid feature request and is therefore brought up again.

    It is also understood that logging can be normalized in aggregation services when inserted into databases (such as OpenSearch or similar), but the question here is about the actual logs going to stdout/file directly, since that might still be a place where applications want to follow the syslog format, logging in a particular time zone with a time zone offset.

    Please do not close this as an invalid request as there are strong use cases in organizations wanting to follow the syslog specification.

    opened by thernstig 0
  • remote/read_handler: pool input to Marshal()

    Use a sync.Pool to reuse byte slices between calls to Marshal() in the remote read handler.
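The idea can be sketched as follows; the marshalling here is a stand-in for the remote read handler's Marshal() call, not the actual change:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses buffers between marshalling calls so each request doesn't
// allocate a fresh byte slice, which is the point of the change above.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// marshalWithPool marshals into a pooled buffer and returns a private copy,
// since the buffer's memory is handed back to the pool afterwards.
func marshalWithPool(msg string) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.WriteString(msg) // stand-in for marshalling the response
	out := append([]byte(nil), buf.Bytes()...)
	bufPool.Put(buf) // hand the buffer back for the next call
	return out
}

func main() {
	fmt.Println(string(marshalWithPool("remote read response")))
}
```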

    Fixes #11232.

    Signed-off-by: Giedrius Statkevičius [email protected]

    opened by GiedriusS 1
  • wrap api error on get series/labels on `returnAPIError` function

    Fix: https://github.com/prometheus/prometheus/issues/11355

    GetLabels and GetSeries currently return 422 for all errors. Instead, we should wrap the error with returnAPIError, as is done in the query API:

    See: https://github.com/prometheus/prometheus/blob/734772f82824db11344ea3c39a166449d0e7e468/web/api/v1/api.go#L416-L418

    opened by alanprot 0
Latest release: v2.37.1