Distributor: add ingester append timeouts error (#10456)

**What this PR does / why we need it**:
Failing to send samples to ingesters because the request exceeded its
timeout is a very clear signal that ingesters are unable to keep up with
demand. In an incident today, we saw ingesters' push latencies increase
sharply because an expensive regex query was starving other goroutines of
CPU time.

The new `loki_distributor_ingester_append_timeouts_total` counter gives us a
high-signal metric which we can use for alerting.
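
As a rough illustration of what this enables (not part of the PR's diff), here is a minimal Go sketch of registering a per-ingester timeout counter of this shape and incrementing it; the `recordTimeout` helper, the example address label, and the alert expression in the trailing comment are made up for the example:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// appendTimeouts mirrors the shape of the new metric: a counter vector
// labelled by ingester address, registered on the default registerer.
var appendTimeouts = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "loki",
	Name:      "distributor_ingester_append_timeouts_total",
	Help:      "The total number of failed batch appends sent to ingesters due to timeouts.",
}, []string{"ingester"})

// recordTimeout is a hypothetical helper: increment the counter for the
// ingester whose push exceeded its deadline.
func recordTimeout(ingesterAddr string) {
	appendTimeouts.WithLabelValues(ingesterAddr).Inc()
}

func main() {
	// Example only: the label value is whatever identifies the ingester.
	recordTimeout("ingester-1:9095")

	// An alert could then key off an expression along the lines of:
	//   sum by (ingester) (rate(loki_distributor_ingester_append_timeouts_total[5m])) > 0
	// with the rate window and threshold tuned to the environment.
}
```

Because the counter is labelled by `ingester`, an alert built on it can distinguish a single struggling instance from a fleet-wide problem.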
Danny Kopping, 2 years ago, committed by GitHub
parent 2cc80c59c4
commit 2c84959901
Files changed:

1. CHANGELOG.md (1 change)
2. docs/sources/setup/upgrade/_index.md (5 changes)
3. pkg/distributor/distributor.go (17 changes)

@@ -68,6 +68,7 @@
* [10378](https://github.com/grafana/loki/pull/10378) **shantanualsi** Remove deprecated `ruler.wal-cleaer.period`
* [10380](https://github.com/grafana/loki/pull/10380) **shantanualsi** Remove `experimental.ruler.enable-api` in favour of `ruler.enable-api`
* [10395](https://github.com/grafana/loki/pull/10395/) **shantanualshi** Remove deprecated `split_queries_by_interval` and `forward_headers_list` configuration options in the `query_range` section
+* [10456](https://github.com/grafana/loki/pull/10456) **dannykopping** Add `loki_distributor_ingester_append_timeouts_total` metric, remove `loki_distributor_ingester_append_failures_total` metric
##### Fixes

@@ -105,6 +105,11 @@ You can use `--keep-empty` flag to retain them.
6. `split_queries_by_interval` is removed from `query_range` YAML section. You can instead configure it in [Limits Config](/docs/loki/latest/configuration/#limits_config).
7. `frontend.forward-headers-list` CLI flag and its corresponding YAML setting are removed.
+#### Distributor metric changes
+The `loki_distributor_ingester_append_failures_total` metric has been removed in favour of `loki_distributor_ingester_append_timeouts_total`.
+This new metric will provide a clearer signal that there is an issue with ingesters, and this metric can be used for high-signal alerting.
### Jsonnet
##### Deprecated PodDisruptionBudget definition has been removed

@@ -13,7 +13,9 @@ import (
	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
+	"github.com/gogo/status"
	"github.com/prometheus/prometheus/model/labels"
+	"google.golang.org/grpc/codes"
	"github.com/grafana/dskit/httpgrpc"
	"github.com/grafana/dskit/kv"
@@ -117,7 +119,7 @@ type Distributor struct {
	// metrics
	ingesterAppends *prometheus.CounterVec
-	ingesterAppendFailures *prometheus.CounterVec
+	ingesterAppendTimeouts *prometheus.CounterVec
	replicationFactor prometheus.Gauge
	streamShardCount prometheus.Counter
}
@@ -179,10 +181,10 @@ func New(
			Name: "distributor_ingester_appends_total",
			Help: "The total number of batch appends sent to ingesters.",
		}, []string{"ingester"}),
-		ingesterAppendFailures: promauto.With(registerer).NewCounterVec(prometheus.CounterOpts{
+		ingesterAppendTimeouts: promauto.With(registerer).NewCounterVec(prometheus.CounterOpts{
			Namespace: "loki",
-			Name: "distributor_ingester_append_failures_total",
-			Help: "The total number of failed batch appends sent to ingesters.",
+			Name: "distributor_ingester_append_timeouts_total",
+			Help: "The total number of failed batch appends sent to ingesters due to timeouts.",
		}, []string{"ingester"}),
		replicationFactor: promauto.With(registerer).NewGauge(prometheus.GaugeOpts{
			Namespace: "loki",
@@ -645,7 +647,12 @@ func (d *Distributor) sendStreamsErr(ctx context.Context, ingester ring.Instance
	_, err = c.(logproto.PusherClient).Push(ctx, req)
	d.ingesterAppends.WithLabelValues(ingester.Addr).Inc()
	if err != nil {
-		d.ingesterAppendFailures.WithLabelValues(ingester.Addr).Inc()
+		if e, ok := status.FromError(err); ok {
+			switch e.Code() {
+			case codes.DeadlineExceeded:
+				d.ingesterAppendTimeouts.WithLabelValues(ingester.Addr).Inc()
+			}
+		}
	}
	return err
}
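
For readers less familiar with gRPC status handling, the snippet below is a self-contained sketch of the classification this new branch performs: `status.FromError` recovers the gRPC status, and only `codes.DeadlineExceeded`, i.e. a push abandoned because the request exceeded its timeout, would bump the new counter. The `isTimeout` helper and the sample errors are invented for the example:

```go
package main

import (
	"fmt"

	"github.com/gogo/status"
	"google.golang.org/grpc/codes"
)

// isTimeout reports whether err is a gRPC status error with code
// DeadlineExceeded, i.e. the request was not answered within its timeout.
func isTimeout(err error) bool {
	if e, ok := status.FromError(err); ok {
		return e.Code() == codes.DeadlineExceeded
	}
	return false
}

func main() {
	// Simulate the error a gRPC client surfaces when the context deadline
	// passes before the ingester responds.
	err := status.Error(codes.DeadlineExceeded, "context deadline exceeded")
	fmt.Println(isTimeout(err)) // true

	// Other failures (e.g. an unavailable ingester) are not counted as timeouts.
	fmt.Println(isTimeout(status.Error(codes.Unavailable, "connection refused"))) // false
}
```

Switching on the status code, as the diff does, also leaves room to classify additional codes later without restructuring the call site.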
