---
title: Alerting and recording rules
menuTitle: Alert
description: Learn how the ruler evaluates queries for alerting.
aliases:
- /docs/loki/latest/rules/
- docs/loki/latest/alert/
- docs/loki/latest/alerting/
weight: 850
keywords:
- loki
- alert
- alerting
- ruler
---

# Alerting and recording rules

Grafana Loki includes a component called the ruler. The ruler is responsible for continually evaluating a set of configurable queries and performing an action based on the result.

This example configuration sources rules from a local disk.

[Ruler storage](#ruler-storage) provides further details.

```yaml
ruler:
  storage:
    type: local
    local:
      directory: /tmp/rules
  rule_path: /tmp/scratch
  alertmanager_url: http://localhost
  ring:
    kvstore:
      store: inmemory
  enable_api: true
```

We support two kinds of rules: [alerting](#alerting-rules) rules and [recording](#recording-rules) rules.

## Alerting Rules

We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) alerting rules. From Prometheus' documentation:

> Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.

Loki alerting rules are exactly the same, except they use LogQL for their expressions.

### Example

A complete example of a rules file:

```yaml
groups:
  - name: should_fire
    rules:
      - alert: HighPercentageError
        expr: |
          sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
            /
          sum(rate({app="foo", env="production"}[5m])) by (job)
            > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
  - name: credentials_leak
    rules:
      - alert: http-credentials-leaked
        annotations:
          message: "{{ $labels.job }} is leaking http basic auth credentials."
        expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
        for: 10m
        labels:
          severity: critical
```

## Recording Rules

We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules) recording rules. From Prometheus' documentation:

> Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.
>
> Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.

Loki allows you to run [metric queries]({{<relref "../logql/metric_queries">}}) over your logs, which means that you can derive a numeric aggregation from your logs, like calculating the number of requests over time from your NGINX access log.

### Example

```yaml
name: NginxRules
interval: 1m
rules:
  - record: nginx:requests:rate1m
    expr: |
      sum(
        rate({container="nginx"}[1m])
      )
    labels:
      cluster: "us-central1"
```

This query (`expr`) will be executed every 1 minute (`interval`), and its result will be stored in the metric name we have defined (`record`). This metric, named `nginx:requests:rate1m`, can now be sent to Prometheus, where it will be stored just like any other metric.
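
Once the recorded metric has been remote-written (see the Remote-Write section below), it can be queried and alerted on like any other Prometheus series. Here is a minimal sketch of a Prometheus-side alerting rule that consumes it; the group name, alert name, and the 100 requests-per-second threshold are illustrative assumptions, not recommendations:

```yaml
# Prometheus (not Loki) rule file: a hypothetical alert on the remote-written series.
groups:
  - name: nginx_traffic
    rules:
      - alert: NginxHighRequestRate
        # 100 req/s is an assumed threshold for illustration only.
        expr: nginx:requests:rate1m > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: NGINX request rate is unusually high
```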

### Remote-Write

With recording rules, you can run these metric queries continually on an interval, and have the resulting metrics written to a Prometheus-compatible remote-write endpoint. They produce Prometheus metrics from log entries.

At the time of writing, these are the compatible backends that support this:

- [Prometheus](https://prometheus.io/docs/prometheus/latest/disabled_features/#remote-write-receiver) (`>=v2.25.0`):
  Prometheus is generally a pull-based system, but since `v2.25.0` it has allowed metrics to be written directly to it as well.
- [Grafana Mimir](/docs/mimir/latest/operators-guide/reference-http-api/#remote-write)
- [Thanos (`Receiver`)](https://thanos.io/tip/components/receive.md/)

Here is an example remote-write configuration for sending to a local Prometheus instance:

```yaml
ruler:
  ... other settings ...

  remote_write:
    enabled: true
    client:
      url: http://localhost:9090/api/v1/write
```

Further configuration options can be found under [ruler]({{<relref "../configuration#ruler">}}).
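
If the receiving endpoint requires authentication, the `client` block mirrors the Prometheus remote-write client configuration, so options such as `basic_auth` should be available; check the [ruler]({{<relref "../configuration#ruler">}}) reference for the authoritative list. A minimal sketch, with a placeholder URL and credentials:

```yaml
ruler:
  remote_write:
    enabled: true
    client:
      # Hypothetical endpoint and credentials for illustration only.
      url: https://prometheus.example.com/api/v1/write
      basic_auth:
        username: loki-ruler
        password: <secret>
```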

### Operations

Please refer to the [Recording Rules]({{<relref "../operations/recording-rules">}}) page.

## Use cases

The Ruler's Prometheus compatibility further accentuates the marriage between metrics and logs. For those looking to get started with metrics and alerts based on logs, or wondering why this might be useful, here are a few use cases we think fit very well.

### Black box monitoring

We don't always control the source code of applications we run. Load balancers and a myriad of other components, both open source and closed third-party, support our applications while they don't expose the metrics we want. Some don't expose any metrics at all. Loki's alerting and recording rules can produce metrics and alert on the state of the system, bringing the components into our observability stack by using the logs. This is an incredibly powerful way to introduce advanced observability into legacy architectures.
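
For instance, a recording rule can derive a metric from the logs of a component we cannot instrument. This is a minimal sketch only, assuming a hypothetical load balancer whose logs are shipped with a `container="haproxy"` label and contain the HTTP status code in the log line:

```yaml
name: HaproxyRules
interval: 1m
rules:
  # Hypothetical example: derive a 5xx-response rate from the load balancer's logs.
  - record: haproxy:5xx:rate1m
    expr: |
      sum(
        rate({container="haproxy"} |~ " 5[0-9]{2} " [1m])
      )
```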

### Event alerting

Sometimes you want to know whether _any_ instance of something has occurred. Alerting based on logs can be a great way to handle this, such as finding examples of leaked authentication credentials:

```yaml
- name: credentials_leak
  rules:
    - alert: http-credentials-leaked
      annotations:
        message: "{{ $labels.job }} is leaking http basic auth credentials."
      expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
      for: 10m
      labels:
        severity: critical
```

### Alerting on high-cardinality sources

Another great use case is alerting on high-cardinality sources. These are things which are difficult or expensive to record as metrics because the potential label set is huge. A great example of this is per-tenant alerting in multi-tenanted systems like Loki. It's a common balancing act between the desire to have per-tenant metrics and the cardinality explosion that ensues (adding a single _tenant_ label to an existing Prometheus metric would increase its cardinality by the number of tenants).

Creating these alerts in LogQL is attractive because these metrics can be extracted at _query time_, meaning we don't suffer the cardinality explosion in our metrics store.

> **Note** As an example, we can use LogQL v2 to help Loki to monitor _itself_, alerting us when specific tenants have queries that take longer than 10s to complete! To do so, we'd use the following query: `sum by (org_id) (rate({job="loki-prod/query-frontend"} |= "metrics.go" | logfmt | duration > 10s [1m]))`
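
Wrapped in a rule group, that query can drive a per-tenant alert directly. A minimal sketch reusing the query above; the group name, `for` duration, severity, and annotation text are illustrative assumptions:

```yaml
groups:
  - name: loki_tenant_slow_queries
    rules:
      - alert: TenantSlowQueries
        # Fires per tenant (org_id) while queries slower than 10s keep appearing in the logs.
        expr: |
          sum by (org_id) (rate({job="loki-prod/query-frontend"} |= "metrics.go" | logfmt | duration > 10s [1m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          message: "Tenant {{ $labels.org_id }} is running queries slower than 10s."
```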

## Interacting with the Ruler

Because the rule files are identical to Prometheus rule files, we can interact with the Loki Ruler via [`cortextool`](https://github.com/grafana/cortex-tools#rules). The CLI is in early development, but it works with both Loki and Cortex. Pass the `--backend=loki` option when using it with Loki.

> **Note:** Not all commands in cortextool currently support Loki.

> **Note:** cortextool was intended to run against multi-tenant Loki, so commands need an `--id=` flag set to the Loki instance ID, or you can set the environment variable `CORTEX_TENANT_ID`. If Loki is running in single-tenant mode, the required ID is `fake` (yes, we know this might seem alarming, but it's totally fine; no, it can't be changed).

An example workflow is included below:

```sh
# lint the rules.yaml file ensuring it's valid and reformatting it if necessary
cortextool rules lint --backend=loki ./output/rules.yaml

# diff rules against the currently managed ruleset in Loki
cortextool rules diff --rule-dirs=./output --backend=loki

# ensure the remote ruleset matches your local ruleset, creating/updating/deleting remote rules which differ from your local specification.
cortextool rules sync --rule-dirs=./output --backend=loki

# print the remote ruleset
cortextool rules print --backend=loki
```

There is also a [GitHub Action](https://github.com/grafana/cortex-rules-action) available for `cortextool`, so you can add it into your CI/CD pipelines!

For instance, you can sync rules on master builds via:

```yaml
name: sync-cortex-rules-and-alerts
on:
  push:
    branches:
      - master
env:
  CORTEX_ADDRESS: '<fill me in>'
  CORTEX_TENANT_ID: '<fill me in>'
  CORTEX_API_KEY: ${{ secrets.API_KEY }}
  RULES_DIR: 'output/'
jobs:
  sync-loki-alerts:
    runs-on: ubuntu-18.04
    steps:
      - name: Lint Rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'lint'
        with:
          args: --backend=loki
      - name: Diff rules
        # The id is required so the sync step below can read this step's output.
        id: diff-rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'diff'
        with:
          args: --backend=loki
      - name: Sync rules
        if: ${{ !contains(steps.diff-rules.outputs.detailed, 'no changes detected') }}
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'sync'
        with:
          args: --backend=loki
      - name: Print rules
        uses: grafana/cortex-rules-action@v0.4.0
        env:
          ACTION: 'print'
```

## Scheduling and best practices

One option to scale the Ruler is by scaling it horizontally. However, with multiple Ruler instances running, they will need to coordinate to determine which instance will evaluate which rule. Similar to the ingesters, the Rulers establish a hash ring to divide up the responsibilities of evaluating rules.

The possible configurations are listed fully in the [configuration documentation]({{<relref "../configuration">}}), but in order to shard rules across multiple Rulers, the rules API must be enabled via flag (`-ruler.enable-api`) or config file parameter. Secondly, the Ruler requires its own ring to be configured. From there the Rulers will shard and handle the division of rules automatically. Unlike ingesters, Rulers do not hand over responsibility: all rules are re-sharded randomly every time a Ruler is added to or removed from the ring.

A full sharding-enabled Ruler example is:

```yaml
ruler:
  alertmanager_url: <alertmanager_endpoint>
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      consul:
        host: consul.loki-dev.svc.cluster.local:8500
      store: consul
  rule_path: /tmp/rules
  storage:
    gcs:
      bucket_name: <loki-rules-bucket>
```

## Ruler storage

The Ruler supports five kinds of storage: azure, gcs, s3, swift, and local. Most kinds of storage work with the sharded Ruler configuration in an obvious way, i.e. configure all Rulers to use the same backend.
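
For example, an object-store backend might be configured like the following sketch. The bucket name and region are placeholders, and the exact field names should be verified against the [ruler]({{<relref "../configuration#ruler">}}) configuration reference:

```yaml
ruler:
  storage:
    type: s3
    s3:
      # Placeholder values for illustration; substitute your own bucket and region.
      bucketnames: <loki-ruler-rules-bucket>
      region: us-east-1
```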

The local implementation reads the rule files off of the local filesystem. This is a read-only backend that does not support the creation and deletion of rules through the [Ruler API]({{<relref "../api/#ruler">}}). Despite the fact that it reads the local filesystem, this method can still be used in a sharded Ruler configuration if the operator takes care to load the same rules to every Ruler. For instance, this could be accomplished by mounting a [Kubernetes ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) onto every Ruler pod.

A typical local configuration might look something like:

```
-ruler.storage.type=local
-ruler.storage.local.directory=/tmp/loki/rules
```

With the above configuration, the Ruler would expect the following layout:

```
/tmp/loki/rules/<tenant id>/rules1.yaml
                           /rules2.yaml
```

YAML files are expected to be [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) but include LogQL expressions, as described at the beginning of this document.
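
As a sketch of the ConfigMap approach mentioned above (the resource name and namespace are hypothetical), a per-tenant rule file can be packaged like this and mounted so that it lands in the layout shown:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical names; adjust to your deployment.
  name: loki-ruler-rules
  namespace: loki
data:
  rules1.yaml: |
    groups:
      - name: credentials_leak
        rules:
          - alert: http-credentials-leaked
            annotations:
              message: "{{ $labels.job }} is leaking http basic auth credentials."
            expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
            for: 10m
            labels:
              severity: critical
```

Mounted at `/tmp/loki/rules/fake/` (using the `fake` tenant ID for single-tenant mode), this matches the layout above.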

## Future improvements

There are a few things coming to increase the robustness of this service. In no particular order:

- A WAL for recording rules.
- Backend metric store adapters for generated alert rule data.

## Misc Details: Metrics backends vs in-memory

Currently the Loki Ruler is decoupled from a backing Prometheus store. Generally, the result of evaluating rules as well as the history of an alert's state are stored as a time series. Loki does not store or retrieve these itself, which allows it to run independently of a metrics store such as Prometheus. As a workaround, Loki keeps a small in-memory store whose purpose is to lazily load past evaluations when rescheduling or resharding Rulers. In the future, Loki will support optional metrics backends, allowing storage of these metrics for auditing and performance benefits.