The following concepts are key to your understanding of how Grafana Alerting works.
### Alert rules
An [alert rule][alert-rules] consists of one or more queries and expressions that select the data you want to measure. It also contains a condition, which is the threshold that an alert rule must meet or exceed to fire.
Add labels to uniquely identify your alert rule and configure alert routing. Labels link alert rules to notification policies, so you can easily manage which policy should handle which alerts and who gets notified.
After alert rules are created, they go through various states and transitions.
### Alert instances
Each alert rule can produce multiple alert instances (also known as alerts): one alert instance for each time series. This is exceptionally powerful as it allows you to observe multiple series in a single expression.
```promql
sum by(cpu) (
  rate(node_cpu_seconds_total{mode!="idle"}[1m])
)
```
A rule using the PromQL expression above creates as many alert instances as the number of CPUs observed after the first evaluation, enabling a single rule to report the status of each CPU.
{{< figure src="/static/img/docs/alerting/unified/multi-dimensional-alert.png" caption="Multiple alert instances from a single alert rule" >}}
### Groups
The rules within a group are run sequentially at a regular interval, in order of appearance, which means no two rules are evaluated at the same time. The default interval is one (1) minute. You can rename Grafana Mimir or Loki rule namespaces and groups, and edit group evaluation intervals.
> **Note** If you want rules to be evaluated concurrently and with different intervals, consider storing them in different groups.
In the pending period, you select the period in which an alert rule can be in breach of the condition before it fires.
Imagine you have an alert rule evaluation interval set at every 30 seconds and the pending period to 90 seconds.
Evaluation occurs as follows:
[00:30] First evaluation - condition not met.
If the alert rule has a condition that needs to be in breach for a certain amount of time:
- The rule stays in the "pending" state until the condition has been broken for the required amount of time (the pending period).
- After the required time has passed, the rule goes into a "firing" state.
- If the condition is no longer broken during the pending period, the rule goes back to its normal state.
**Note:**
If you want to skip the pending state, set the pending period to 0. This effectively skips the pending period, and your alert rule starts firing as soon as the condition is breached.
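To make these transitions concrete, here is a minimal sketch in Go (not Grafana's implementation) that replays the example above: a 30-second evaluation interval, a 90-second pending period, and a hypothetical series of condition results in which the condition is first breached at 01:00.

```go
package main

import "fmt"

// State models the simplified alert life cycle described above.
type State string

const (
	Normal  State = "Normal"
	Pending State = "Pending"
	Firing  State = "Firing"
)

// nextState applies the pending-period rule: the rule only starts firing once
// the condition has been in breach for at least the pending period.
func nextState(conditionMet bool, breachedFor, pendingPeriod int) State {
	switch {
	case !conditionMet:
		return Normal // condition not met (or recovered): back to Normal
	case breachedFor >= pendingPeriod:
		return Firing // in breach long enough: start firing
	default:
		return Pending // in breach, but not yet for the whole pending period
	}
}

func main() {
	const evalInterval, pendingPeriod = 30, 90 // seconds, as in the example above

	// Hypothetical condition results at 00:30, 01:00, 01:30, 02:00, and 02:30.
	results := []bool{false, true, true, true, true}

	firstBreach := -1 // second at which the condition first went into breach
	for i, met := range results {
		t := (i + 1) * evalInterval
		if !met {
			firstBreach = -1
		} else if firstBreach < 0 {
			firstBreach = t
		}
		breachedFor := 0
		if firstBreach >= 0 {
			breachedFor = t - firstBreach
		}
		fmt.Printf("[%02d:%02d] condition met=%-5t -> %s\n",
			t/60, t%60, met, nextState(met, breachedFor, pendingPeriod))
	}
}
```

With these inputs the rule is Normal at 00:30, Pending from 01:00 to 02:00, and Firing at 02:30, once the condition has been in breach for the full 90-second pending period. Setting `pendingPeriod` to 0 makes it fire at the first breached evaluation, matching the note above.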
When an alert rule fires, alert instances are produced, which are then sent to the Alertmanager.
An alert rule can be in either of the following states:
| State       | Description                                                                  |
| ----------- | ---------------------------------------------------------------------------- |
| **Pending** | At least one alert instance returned by the evaluation engine is `Pending`.  |
| **Firing**  | At least one alert instance returned by the evaluation engine is `Firing`.   |
The alert rule state is determined by the “worst case” state of the alert instances produced. For example, if one alert instance is firing, the alert rule state is also firing.
{{% admonition type="note" %}}
Alerts transition first to `pending` and then to `firing`, so it takes at least two evaluation cycles before an alert is fired.
{{% /admonition %}}
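Conceptually, the alert rule state is just the most severe state among its alert instances. The following is a minimal sketch of that aggregation, assuming a simple Normal < Pending < Firing ordering; it is not Grafana's implementation.

```go
package main

import "fmt"

// severity ranks the instance states so the "worst" one can be picked.
// Illustrative only; not Grafana's implementation.
var severity = map[string]int{"Normal": 0, "Pending": 1, "Firing": 2}

// ruleState returns the worst-case state across all alert instances.
func ruleState(instances []string) string {
	worst := "Normal"
	for _, s := range instances {
		if severity[s] > severity[worst] {
			worst = s
		}
	}
	return worst
}

func main() {
	fmt.Println(ruleState([]string{"Normal", "Pending", "Normal"})) // Pending
	fmt.Println(ruleState([]string{"Normal", "Firing", "Pending"})) // Firing
}
```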
## Alert instance state
An alert instance can be in either of the following states:
## Keep last state
An alert rule can be configured to keep the last state when a `NoData` and/or `Error` state is encountered. This prevents alerts both from firing, and from resolving and re-firing. Just like normal evaluation, the alert rule transitions from `Pending` to `Firing` after the pending period has elapsed.
## Alert rule health
An alert rule can have one of the following health statuses:
## Special alerts for `NoData` and `Error`
When evaluation of an alert rule produces state `NoData` or `Error`, Grafana Alerting generates alert instances that have the following additional labels:
Choosing how, when, and where to send your alert notifications is an important part of setting up your alerting system. These decisions have a direct impact on your ability to resolve issues quickly and not miss anything important.
As a first step, define your contact points: where to send your alert notifications. A contact point is a set of one or more integrations that are used to deliver notifications. Add notification templates to contact points for reuse and consistent messaging in your notifications.
All notification templates are written in [Go's templating language](https://pkg.go.dev/text/template).
## Silences
You can use silences to mute notifications from one or more firing rules. Silences do not stop alerts from firing or being resolved, or hide firing alerts in the user interface. A silence lasts as long as its duration, which can be configured in minutes, hours, days, months, or years.
In Grafana, you can use the Cloud Alertmanager, Grafana Alertmanager, or an external Alertmanager. You can also run multiple Alertmanagers; your decision depends on your setup and where your alerts are being generated.
- **Grafana Alertmanager** is an internal Alertmanager that is pre-configured and available for selection by default if you run Grafana on-premises or open source.
The Grafana Alertmanager can receive alerts from Grafana, but it cannot receive alerts from outside Grafana, for example, from Mimir or Loki. Note that inhibition rules are not supported.
- **External Alertmanager** can receive all your Grafana, Loki, Mimir, and Prometheus alerts. External Alertmanagers can be configured and administered from within Grafana itself.
Here are two examples of when you may want to [add your own external Alertmanager][configure-alertmanager] and send your alerts there instead of the Grafana Alertmanager:
1. You may already have Alertmanagers on-premises in your own Cloud infrastructure that you have set up and still want to use, because you have other alert generators, such as Prometheus.
Contact points contain the configuration for sending notifications. A contact point is a list of integrations, each of which sends a notification to a particular email address, service, or URL. Contact points can have multiple integrations of the same kind, or a combination of integrations of different kinds. For example, a contact point could contain a Pagerduty integration; an email and Slack integration; or a Pagerduty integration, a Slack integration, and two email integrations. You can also configure a contact point with no integrations, in which case no notifications are sent.
A contact point cannot send notifications until it has been added to a notification policy. A notification policy can only send alerts to one contact point, but a contact point can be added to a number of notification policies at the same time. When an alert matches a notification policy, the alert is sent to the contact point in that notification policy, which then sends a notification to each integration in its configuration.
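As a rough mental model, a contact point is nothing more than a named list of integrations that are all notified together. The sketch below illustrates this, using hypothetical integration kinds and targets; it is not Grafana's data model or API.

```go
package main

import "fmt"

// Illustrative sketch of the contact point model described above, not
// Grafana's actual data structures or API.
type Integration struct {
	Kind   string // e.g. "email", "slack", "pagerduty"
	Target string // address, channel, or service key (hypothetical values below)
}

type ContactPoint struct {
	Name         string
	Integrations []Integration
}

// notify sends the message through every integration in the contact point.
// A contact point with no integrations sends nothing.
func (cp ContactPoint) notify(message string) {
	for _, i := range cp.Integrations {
		fmt.Printf("[%s -> %s] %s\n", i.Kind, i.Target, message)
	}
}

func main() {
	onCall := ContactPoint{
		Name: "on-call",
		Integrations: []Integration{
			{Kind: "pagerduty", Target: "ops-service"},
			{Kind: "slack", Target: "#alerts"},
			{Kind: "email", Target: "oncall@example.com"},
		},
	}
	onCall.notify("Disk usage above 80% on host-1")

	// No integrations configured: no notifications are sent.
	ContactPoint{Name: "black-hole"}.notify("this goes nowhere")
}
```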
Notification policies are _not_ a list, but rather are structured according to a tree.
Each policy consists of a set of label matchers (0 or more) that specify which labels it is or isn't interested in handling.
For more information on label matching, refer to [how label matching works][labels-and-label-matchers].
{{% admonition type="note" %}}
If you haven't configured any label matchers for your notification policy, your notification policy matches _all_ alert instances. This may prevent child policies from being evaluated unless you have enabled **Continue matching siblings** on the notification policy.
{{% /admonition %}}
## Routing
To determine which notification policy handles which alert instances, start by looking at the existing set of notification policies, beginning with the default notification policy.
If no policies other than the default policy are configured, the default policy handles the alert instance.
If policies other than the default policy are defined, they are evaluated in the order they are displayed.
If a notification policy has label matchers that match the labels of the alert instance, it descends into its child policies and, if there are any, continues to look for child policies whose label matchers further narrow down the set of labels, and so on until no more child policies are found.
If no child policies are defined in a notification policy or if none of the child policies have any label matchers that match the alert instance's labels, the default notification policy is used.
Here's a breakdown of how these policies are selected:
**Pod stuck in CrashLoop** does not have a `severity` label, so none of its child policies are matched. It does have a `team=operations` label, so the first policy is matched.
The `team=security` policy is not evaluated because a match was already found and **Continue matching siblings** was not configured for that policy.
**Disk Usage – 80%** has both a `team` and `severity` label, and matches a child policy of the operations team.
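The routing behaviour described above can be sketched as a walk down the policy tree. The example below is illustrative only: it uses equality matchers, made-up contact point names, and a hypothetical `severity=critical` child policy under the operations team; it is not Grafana's implementation.

```go
package main

import "fmt"

// Policy is a pared-down notification policy: equality label matchers, a
// contact point, an optional "Continue matching siblings" flag, and child
// policies. Illustrative sketch only, not Grafana's implementation.
type Policy struct {
	Matchers         map[string]string
	ContactPoint     string
	ContinueMatching bool
	Children         []*Policy
}

// matches reports whether all of the policy's matchers are satisfied by the
// alert instance's labels. A policy with no matchers matches everything.
func (p *Policy) matches(labels map[string]string) bool {
	for k, v := range p.Matchers {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// route walks the tree: it descends into the first matching child (and its
// children, and so on). If no child matches, the current policy handles the
// alert itself; at the top level that is the default policy.
func route(p *Policy, labels map[string]string) []string {
	var out []string
	for _, child := range p.Children {
		if !child.matches(labels) {
			continue
		}
		out = append(out, route(child, labels)...)
		if !child.ContinueMatching {
			break // stop at the first match unless "Continue matching siblings" is set
		}
	}
	if len(out) == 0 {
		out = append(out, p.ContactPoint)
	}
	return out
}

func main() {
	// A hypothetical tree loosely based on the example above.
	root := &Policy{
		ContactPoint: "default-email",
		Children: []*Policy{
			{
				Matchers:     map[string]string{"team": "operations"},
				ContactPoint: "ops-slack",
				Children: []*Policy{
					{Matchers: map[string]string{"severity": "critical"}, ContactPoint: "ops-pagerduty"},
				},
			},
			{Matchers: map[string]string{"team": "security"}, ContactPoint: "security-email"},
		},
	}

	// team=operations with no severity label: handled by the operations policy.
	fmt.Println(route(root, map[string]string{"team": "operations"})) // [ops-slack]
	// team=operations and severity=critical: handled by the deeper child policy.
	fmt.Println(route(root, map[string]string{"team": "operations", "severity": "critical"})) // [ops-pagerduty]
	// No matching team label: handled by the default policy.
	fmt.Println(route(root, map[string]string{"app": "checkout"})) // [default-email]
}
```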
The following properties are inherited by child policies:
- Timing options
- Mute timings
Each of these properties can be overridden by an individual policy if you don't want it to use the inherited values.
To inherit a contact point from the parent policy, leave it blank. To override the inherited grouping options, enable **Override grouping**. To override the inherited timing options, enable **Override general timings**.
### Inheritance example
The example below shows how the notification policy tree from the previous example allows the child policies of the `team=operations` policy to inherit its contact point.
In this way, you can avoid having to specify the same contact point multiple times for each child policy.
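A minimal sketch of that inheritance rule: a policy with a blank contact point walks up the tree until it finds an ancestor that has one set. The types and names are hypothetical, not Grafana's implementation.

```go
package main

import "fmt"

// Illustrative sketch of contact point inheritance, not Grafana's
// implementation: a policy with no contact point of its own uses the nearest
// ancestor that has one set.
type policy struct {
	contactPoint string
	parent       *policy
}

func (p *policy) effectiveContactPoint() string {
	for n := p; n != nil; n = n.parent {
		if n.contactPoint != "" {
			return n.contactPoint
		}
	}
	return "" // only reached if no ancestor (including the default policy) sets one
}

func main() {
	ops := &policy{contactPoint: "ops-slack"}
	critical := &policy{parent: ops} // contact point left blank, so it is inherited

	fmt.Println(critical.effectiveContactPoint()) // ops-slack
}
```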
Grouping is an important feature of Grafana Alerting as it allows you to batch relevant alerts together into a smaller number of notifications. This is particularly important if notifications are delivered to first-responders, such as engineers on-call, where receiving lots of notifications in a short period of time can be overwhelming and in some cases can negatively impact a first-responder's ability to respond to an incident. For example, consider a large outage where many of your systems are down. In this case, grouping can be the difference between receiving 1 phone call and 100 phone calls.
Choose how alerts are grouped together using the Group by option in a notification policy. By default, notification policies in Grafana group alerts together by alert rule using the `alertname` and `grafana_folder` labels (since alert names are not unique across multiple folders). If you want to group alerts by something other than the alert rule, change the grouping to any other combination of labels.
#### Disable grouping
If you want to receive every alert as a separate notification, you can do so by grouping by a special label called `...`. This is useful when your alerts are being delivered to an automated system instead of a first-responder.
#### A single group for all alerts
If you want to receive all alerts together in a single notification, you can do so by leaving Group by empty.
### Timing options
The timing options decide how often notifications are sent for each group of alerts.
#### Group wait
Group wait is the amount of time Grafana waits before sending the first notification for a new group of alerts. The longer the Group wait, the more time other alerts have to arrive. The shorter the Group wait, the earlier the first notification is sent, but at the risk of sending incomplete notifications. Always choose a Group wait that makes the most sense for your use case.
**Default** 30 seconds
#### Group interval
Once the first notification has been sent for a new group of alerts, Grafana starts the Group interval timer. This is the amount of time Grafana waits before sending notifications about changes to the group. For example, another firing alert might have just been added to the group while an existing alert might have resolved. If an alert was too late to be included in the first notification due to Group wait, it is included in subsequent notifications after Group interval. Once Group interval has elapsed, Grafana resets the Group interval timer. This repeats until there are no more alerts in the group, after which the group is deleted.
**Default** 5 minutes
#### Repeat interval
Repeat interval decides how often notifications are repeated if the group has not changed since the last notification. You can think of these as reminders that some alerts are still firing. Repeat interval is closely related to Group interval, which means your Repeat interval must not only be greater than or equal to Group interval, but must also be a multiple of Group interval. If Repeat interval is not a multiple of Group interval, it is coerced into one. For example, if your Group interval is 5 minutes and your Repeat interval is 9 minutes, the Repeat interval is rounded up to the nearest multiple of 5, which is 10 minutes.
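The coercion described above is a simple round-up to the next multiple of Group interval, as in this small sketch (illustrative only, not Grafana's exact code):

```go
package main

import "fmt"

// coerceRepeatInterval rounds Repeat interval up to the nearest multiple of
// Group interval, as described above. Both values are in minutes here.
func coerceRepeatInterval(repeat, group int) int {
	if repeat%group == 0 {
		return repeat
	}
	return (repeat/group + 1) * group
}

func main() {
	// Group interval 5 minutes, Repeat interval 9 minutes -> coerced to 10 minutes.
	fmt.Println(coerceRepeatInterval(9, 5)) // 10
}
```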