mirror of https://github.com/grafana/loki
Tree: a99c73dd97

Branches:
2023-03-16-new-query-limits
56quarters/vendor-updates
7139-json-properties-in-log-line-is-not-sorted
Alex3k-patch-1
Alex3k-patch-2
Alex3k-patch-3
Alex3k-patch-5
Alex3k-patch-6
add-10055-to-release-notes
add-10193-to-release-notes
add-10213-to-release-notes
add-10281-to-release-notes
add-10417-to-release-notes
add-12403-to-release-notes
add-9063-to-release-notes
add-9484-to-release-notes
add-9568-to-release-notes
add-9704-to-release-notes
add-9857-to-release-notes
add-bucket-name-to-objclient-metric
add-containerSecurityContext-to-statefulset-backend-sidecar
add-max-flushes-retries
add-page-count-to-dataobj-inspect
add-per-scope-limits
add-time-snap-middleware
add_metrics_namespace_setting
add_series_chunk_filter_test
add_vector_to_lokitool_tests
added-hints-to-try-explore-logs
adeverteuil-patch-1
aengusrooneygrafana-update-doc-pack-md
akhilanarayanan/dountilquorum
akhilanarayanan/query-escaping
akhilanarayanan/replace-do-with-dountilquorum2
andrewthomas92-patch-1
andrii/fix_default_value_for_sasl_auth
arrow-engine/stitch-store-and-engine
ashwanth/remove-unordered-writes-config
ashwanth/restructure-query-section
ashwanth/skip-tsdb-load-on-err
attempt-count-streams-per-query
auto-remove-unhealthy-distributors
auto-triager
automated-helm-chart-update/2023-02-01-05-30-47
automated-helm-chart-update/2023-04-05-19-46-39
automated-helm-chart-update/2023-04-24-20-56-21
automated-helm-chart-update/2023-04-24-22-40-04
automated-helm-chart-update/2023-09-07-18-09-02
automated-helm-chart-update/2023-09-14-16-23-44
automated-helm-chart-update/2023-10-16-14-20-07
automated-helm-chart-update/2023-10-18-10-10-52
automated-helm-chart-update/2023-10-18-13-14-43
automated-helm-chart-update/2024-01-24-16-05-59
automated-helm-chart-update/2024-04-08-19-24-50
backport-10090-to-k160
backport-10101-to-release-2.9.x
backport-10221-to-release-2.8.x
backport-10318-to-k163
backport-10687-to-release-2.9.x
backport-11251-to-k175
backport-11827-to-k186
backport-13116-to-release-3.2.x
backport-13116-to-release-3.3.x
backport-13225-to-main
backport-14221-to-release-3.2.x
backport-14780-to-release-3.2.x
backport-15483-to-release-3.3.x
backport-16045-to-k239
backport-16203-to-k242
backport-16954-to-main
backport-17054-to-k249
backport-8893-to-release-2.6.x
backport-8971-to-release-2.7.x
backport-9176-to-release-2.8.x
backport-9757-to-release-2.8.x
backport-9978-to-k158
backport-9978-to-k159
backport-b57d260dd
benclive/fix-mem-leak-in-iterator
benclive/fix-some-data-races
benton/loki-mixin-updates
benton/loki-mixin-v2
blockbuilder-timespan
blockscheduler-track-commits
bloom-compactor/debugging-issues-in-mergeBuilder
bound-parallelism-slicefor
buffered-kafka-reads
build-samples-based-on-num-chunks-size
callum-builder-basemap-lock
callum-explainer-hack
callum-hackathon-explainer
callum-iterator-arrow-record
callum-k136-jsonnet-fix
callum-lambda-promtail-test
callum-parallelize-first-last
callum-pipeline-sanitize-sm-values
callum-prob-step-eval
callum-quantile-inner-child
callum-query-limits-validation
callum-querylimit-pointers
callum-remove-epool
callum-ruler-local-warn
callum-s3-prefix-metric
callum-shard-last
callum-snappy-exp
callum-stream_limit-insights
callum-track-max-labels
chaudum/batch-log-enqueue-dequeue
chaudum/benchmark-reassign-queriers
chaudum/bloomfilter-e2e-parallel-requests
chaudum/bloomfilter-jsonnet
chaudum/bloomgateway-client-tracing
chaudum/bloomgateway-testing
chaudum/bloomstore-cache-test
chaudum/bloomstore-fetch-blocks
chaudum/bump-helm-4.4.3
chaudum/canary-actor
chaudum/chaudum/query-execution-pull-iterators
chaudum/chunk-compression-read-benchmark
chaudum/cleanup-ingester
chaudum/cmp-fix
chaudum/compactor-list-objects
chaudum/cri-config
chaudum/day-chunks-iter-test
chaudum/debug-skipped
chaudum/distributor-healthcheck
chaudum/dockerfmt
chaudum/fix-flaky-multitenant-e2e-test
chaudum/fix-max-query-range-limit
chaudum/fix-predicate-from-matcher
chaudum/fixed-size-memory-ringbuffer
chaudum/hackathon-analyze-pipelines
chaudum/hackathon-analyze-pipelines-v2
chaudum/hackathon-analyze-pipelines-v3
chaudum/helm-remove-image-override-for-gel
chaudum/improve-git-fetch-makefile
chaudum/improve-timestamp-parsing
chaudum/index-gateway-instrumentation-k204
chaudum/integration-test-startup-timeout
chaudum/k204-index-gateway
chaudum/linked-map
chaudum/literals
chaudum/local-index-query
chaudum/logcli-load-multiple-schemaconfig
chaudum/loki-query-engine-ui
chaudum/make-bloomfilter-task-cancelable
chaudum/metastore-caching
chaudum/native-docker-builds
chaudum/new-engine-sharding
chaudum/physical-plan-optimizer-visitor-pattern
chaudum/querier-worker-cpu-affinity
chaudum/query-execution
chaudum/query-executor-4
chaudum/query-skip-factor
chaudum/rewrite-runtime-config
chaudum/seek-panic
chaudum/shard-by-sections
chaudum/syslog-udp-cleanup-idle-streams
check-inverse-postings
cherrypick-9484-k151
chunk-inspect-read-corrupt
chunk-query
chunks-inspect-v4-read-corrupt
chunks_compaction_research
chunkv5
cle_updates
cleanup-campsite/removing-deprecations
cleanup-migrate
codeowners-mixins-20240925
context-cause-usage
correct-kafka-metric-names
correctly-propagates-ctx
custom-headers
dannykopping/groupcache-instrument
dannykopping/memcached-slab-allocator
dannykopping/remove-cache-stats
danstadler-pdx-patch-1
danstadler-pdx-patch-2
data-race-fix-01
dataobj
dataobj-compression-ratio-and-final-size
dataobj-comsumer-metastore-orig
dataobj-log-batches
dataobj-logs-sort
dataobj-logs-sortorder
dataobj-querier-logger
dataobj-reader-stats
dataobj-store-sort-order
debug-bloomgateway
dedup-only-partitions
dependabot/go_modules/github.com/containerd/containerd/v2-2.0.5
dependabot/go_modules/operator/api/loki/golang.org/x/net-0.38.0
deprecatable-metrics-example
deps-update/main-cloud.google.comgostorage
deps-update/main-github.comapachearrow-gov18
deps-update/main-github.cominfluxdatatelegraf
deps-update/main-github.comprometheuscommon
deps-update/main-github.comprometheusprometheus
deps-update/main-github.comtwmbfranz-go
deps-update/main-go-github.com-containerd-containerd-v2-vulnerability
deps-update/main-go-golang.org-x-net-vulnerability
deps-update/main-go.opentelemetry.iocollectorpdata
deps-update/main-google.golang.orgapi
deps-update/main-google.golang.orggrpc
deps-update/release-2.9.x-go-golang.org-x-net-vulnerability
deps-update/release-3.3.x-go-golang.org-x-net-vulnerability
deps-update/release-3.4.x-go-golang.org-x-net-vulnerability
deps-update/release-3.5.x-go-github.com-containerd-containerd-v2-vulnerability
deps-update/release-3.5.x-go-golang.org-x-net-vulnerability
detected-labels-add-limits-param
detected-labels-from-store
detected-labels-minor-enhancements
dev-rel-workshop
dfinnegan-fgh-patch-1
digitalemil-patch-1
digitalemil-patch-2
digitalemil-patch-3
digitalemil-patch-4
dimitarvdimitrov-patch-1
distributed-helm-chart
distributed-helm-demo
distributors-exp-avg
do-not-retry-enforced-labels-error
do-until-quorom-wip
doanbutar-patch-1
doanbutar-patch-2
docs-ipv6
docs-logql
docs-nvdh-gcp-helm
dodson/admonitions
dont-log-every-indexset-call-
ej25a-patch-1
emit-events-without-debuggnig
enable-hedging-on-ingester-requests
enable-limitedpusherrorslogging-by-default
enable-stream-sharding
enforce-sharding-of-approx-topk-queries
exceeds-rate-limit-check
explore-logs-fallback-query-path
faster-cleanupexpired
faster-truncate-log-lines
fcjack/ci-test
fcjack/image-workflows
feat/drain-format
feat/pattern-pattern-mining
feat/syslog-rfc3164-defaultyear
feat/usage-tracker
fix-2.8-references
fix-headers
fix-helm-enterprise-values
fix-helmchart
fix-igw-job
fix-image-tag-script
fix-legacy-panels
fix-orphan-spans
fix-promtail-cves
fix-release-lib-shellcheck
fix/pattern-merge
fix_more_dashboards
fix_windowsserver_version
fmt-jsonnet-fix
force-loki-helm-publish
get-marked-for-deletions
gh-action-labeler-fix
gh-readonly-queue/main/pr-11793-215b5fd2fd71574e454529b1b620a295f1323dac
grafana-dylan-patch-1
grobinson/failover-to-other-zones
grobinson/k251-disable-autocommit
grobinson/k251-disable-writing-metadata
grobinson/kafka-client-v2
grobinson/should-use-quartz
grobinson/use-new-evictor
groupcache
guard-againts-non-scheduler-request
guard-ingester-detected-field-errors
hackathon-2023-08-events-in-graphite-proxy
hackathon/demo
hackathon/hackathon-2023-12-arrow-engine
handle-errors-per-category
hedge-index-gateway
hedge-index-gateway-220
helm-5.47.3
helm-5.48
helm-chart-tagged-6.20.0
helm-chart-tagged-6.26.0
helm-chart-tagged-6.27.0
helm-chart-tagged-6.28.0
helm-chart-tagged-6.30.0
helm-chart-weekly-6.24.0-weekly.233
helm-chart-weekly-6.25.0-weekly.234
helm-chart-weekly-6.25.0-weekly.235
helm-chart-weekly-6.25.0-weekly.236
helm-chart-weekly-6.25.0-weekly.237
helm-chart-weekly-6.26.0
helm-chart-weekly-6.26.0-weekly.238
helm-chart-weekly-6.26.0-weekly.239
helm-chart-weekly-6.26.0-weekly.240
helm-chart-weekly-6.26.0-weekly.241
helm-chart-weekly-6.28.0-weekly.242
helm-chart-weekly-6.28.0-weekly.243
helm-chart-weekly-6.28.0-weekly.244
helm-chart-weekly-6.29.0-weekly.245
helm-chart-weekly-6.29.0-weekly.246
helm-chart-weekly-6.29.0-weekly.247
helm-chart-weekly-6.30.0
helm-chart-weekly-6.31.0
helm-loki-values-backend-target
ignore-yaml-errors
implement-approx-topk-on-querier
improve-cleanup-stats
improve-distributor-latency
index-gateways/reduce-goroutines
index-stats
ingest-limits-active-window
ingest-pipelines
inline-tsdb-on-cache
integrate-laser
intentional-failure
is-this-qfs-cure
jdb/2022-10-enterprise-logs-content-reuse
jdb/2023-03-update-doc.mk
jdb/2025-05/add-docs-license
jsonnet-update/2023-01-31-10-09-02
k100
k101
k102
k103
k104
k105
k106
k107
k108
k109
k110
k111
k112
k113
k114
k115
k116
k117
k118
k119
k12
k120
k121
k122
k123
k124
k125
k126
k127
k128
k129
k13
k130
k131
k131-no-validate-matchers-labels
k132
k133
k135
k135-sharding-hotfix
k136
k137
k138
k139
k14
k140
k141
k142
k143
k144
k145
k146
k146-with-chunk-logging
k147
k148
k149
k15
k150
k150-merge-itr-fix
k151
k152
k153
k154
k155
k156
k157
k158
k159
k16
k160
k161
k162
k163
k164
k165
k166
k167
k168
k168-ewelch-concurrency-limits
k169
k17
k170
k171
k171-with-retry
k172
k173
k174
k174-fixes2
k175
k176
k177
k178
k179
k18
k180
k181
k182
k183
k183-quantile-patch
k184
k185
k185-fix-previous-tsdb
k186
k187
k188
k189
k19
k190
k191
k192
k193
k194
k195
k195-backup
k196
k197
k198
k199
k199-debug
k20
k200
k201
k202
k203
k203-with-samples
k204
k204-separate-download
k205
k205-with-samples
k206
k207
k207-ingester-profiling-2
k208
k209
k209-ewelch-idx-gateway-hedging
k21
k210
k210-ewelch-idx-gateway-hedge
k210-ewelch-shard-limited
k211
k211-ewelch-congestion-control
k211-ewelch-datasample
k211-ewelch-test-frontend-changes
k212
k213
k213-ewelch
k214
k215
k216
k217
k217-alloy-v1.7-fork
k217-without-promlog
k218
k219
k22
k220
k220-index-sync
k220-move-detected-fields-logic-to-qf
k220-with-detected-fields-guard
k221
k221-index-sync-fixes
k221-with-stream-logging
k222
k222-shard-volume-queries
k228
k229
k23
k230
k231
k232
k233
k234
k235
k236
k236-with-agg-metric-payload-fix
k237
k238
k239
k24
k240
k241
k242
k243
k244
k245
k246
k246-with-per-tenant-ruler-wal-replay
k247
k248
k248-distributor-lvl-detection
k248-level-detection-debugging
k248-levels-as-index
k249
k25
k250
k251
k252
k253
k254
k255
k26
k27
k28
k29
k30
k31
k32
k33
k34
k35
k36
k37
k38
k39
k40
k41
k42
k43
k44
k45
k46
k47
k48
k49
k50
k51
k52
k53
k54
k55
k56
k57
k58
k59
k60
k61
k62
k63
k64
k65
k66
k67
k68
k69
k70
k71
k72
k73
k74
k75
k76
k77
k78
k79
k80
k81
k82
k83
k84
k85
k86
k87
k88
k89
k90
k91
k92
k93
k94
k95
k96
k97
k98
k99
kadjoudi-patch-1
kafka-usage-wip
kafka-wal-block
karsten/dedup-overlapping-chunks
karsten/first-over-time
karsten/fix-grpc-error
karsten/protos-query-request
karsten/test-ops
kaviraj/changelog-logql-bug
kaviraj/memcached-backup-tmp
kaviraj/single-gomod
kavirajk/backport-10319-release-2.9.x
kavirajk/bug-fix-memcached-multi-fetch
kavirajk/cache-instant-queries
kavirajk/cache-test
kavirajk/experiment-instant-query-bug
kavirajk/fix-engine-literalevaluator
kavirajk/linefilte-path-on-top-of-k196
kavirajk/memcache-cancellation-bug-fix
kavirajk/metadata-cache-with-k183
kavirajk/promtail-use-inotify
kavirajk/script-to-update-example
kavirajk/update-go-version-gomod
kavirajk/upgrade-prometheus-0.46
kavirajk/url-encode-aws-url
label-filter-predicate-pushdown
lambda-promtail-generic-s3
leizor/latest-produce-ts
limit-streams-chunks-subquery
logcli_object_store_failure_logging
loki-bench-tool
loki-mixin-parallel-read-path
loki-streaming-query-api
lru-symbols-cache
lru-symbols-cache-w-conn-limits
main
map-streams-to-ingestion-scope
marinnedea-patch-1
mdsgrafana-patch-1
mess-with-multiplegrpcconfigs
meta-monitoring-v2-p2
metadata-decoder-corrections
metastore-bootstrap
metastore-experiments
more-date-functions
more-details-tracing-for-distributors
more-release-testing
multi-variant-multiple-sample-extractor
multi-zone-topology-support
new-index-spans
no-extents-no-problem
nvdh/query
operator-loki-v3
otlp-severity-detection
owen-d/fix/nil-ptr-due-to-empty-resp
pablo/lambda-promtail-event-bridge-setup
pablo/promtail-wal-support
pablo/refactor-client-manager
pablo/refactor-http-targets
panic-if-builder-fails-to-init
panic_query_frontend_test
parser-backtick-regexp-error
parser-hints/bug
paul1r/corrupted_wal_repair
paul1r/republish_lambda_promtail
persist-patterns-as-aggs
pooling-decode-buffers-dataobj
poyzannur/add-pdb-idx-gws
poyzannur/fix-blooms-checksum-bug
poyzannur/fix-compactor-starting-indexshipper-in-RW-mode
poyzannur/fix-errors-introduced-by-10748
poyzannur/fix-flaky-test
pr_11086
prepare-2.8-changelog
promtail-go-gelf
ptodev/reset-promtail-metrics-archive-23-april-2024
ptodev/update-win-eventlog
pub-sub-cancel
query-limits-validation
query-splitting-api
query-timestamp-validation
rbrady/16330-fix-rolebinding-provisioner
rbrady/17614-update-provisioner
read-corrupt-blocks
read-path-improvement-wal
reenable-ipv6-for-memberlist
refactor-extractors-multiple-samples-2
release-2.0.1
release-2.2
release-2.2.1
release-2.3
release-2.4
release-2.5.x
release-2.6.x
release-2.7.x
release-2.8.x
release-2.8.x-fix-failing-test
release-2.9.x
release-3.0.x
release-3.1.x
release-3.2.x
release-3.3.x
release-3.4.x
release-3.5.x
release-notes-appender
release-please--branches--add-major-release-workflow
release-please--branches--fix-vuln-scanning
release-please--branches--k195
release-please--branches--k196
release-please--branches--k197
release-please--branches--k198
release-please--branches--k199
release-please--branches--k200
release-please--branches--k201
release-please--branches--k202
release-please--branches--k203
release-please--branches--k204
release-please--branches--k205
release-please--branches--k206
release-please--branches--k208
release-please--branches--k209
release-please--branches--k210
release-please--branches--k211
release-please--branches--k212
release-please--branches--k215
release-please--branches--k216
release-please--branches--k221
release-please--branches--k222
release-please--branches--k228
release-please--branches--k234
release-please--branches--k235
release-please--branches--k236
release-please--branches--k237
release-please--branches--k238
release-please--branches--k239
release-please--branches--k240
release-please--branches--k241
release-please--branches--k242
release-please--branches--k243
release-please--branches--k244
release-please--branches--k246
release-please--branches--k247
release-please--branches--k249
release-please--branches--k250
release-please--branches--k251
release-please--branches--k253
release-please--branches--k254
release-please--branches--k255
release-please--branches--main
release-please--branches--main--components--operator
release-please--branches--release-3.0.x
release-please--branches--release-3.1.x
release-please--branches--release-3.2.x
release-please--branches--release-3.3.x
release-please--branches--release-3.4.x
release-please--branches--release-3.5.x
release-please--branches--update-release-pipeline
remove-early-eof
remove-override
remove_lokitool_binary
retry-limits-middleware
reuse-server-index
revert-15950-deps-update/main-github.comprometheusprometheus
revert-7179-azure_service_principal_auth
revert-8662
revert-map-pooling
rgnvldr-patch-1
rk/update-helm-docs
salvacorts/2.9.12/fix-vulns
salvacorts/backport-3.4.x
salvacorts/compator-deletes-acache
samu6851-patch-1
samu6851-patch-2
scope-usage
shantanu/add-to-release-notes
shantanu/fix-scalar-timestamp
shantanu/remove-ruler-configs
shard-parsing
shard-volume-queries
shipper/skip-notready-on-sync
simulate-retention-endpoint
singleflight
snyk-monitor-workflow
sp/logged_trace_id
split-rules-into-more-groups
split-tests-by-package
split-with-header
steven_2_8_docs
stop-using-retry-flag
store-aggregated-metrics-in-loki
store-aggregated-metrics-in-loki-3
stream-generator-split-send-loops
stripe-lock-ctx-cancelation
structured-metadata-indexing
svennergr/structured-metadata-api
tch/bestBranchEvverrrrrrrrrr
temp-fluentbit-change
temp-proto-fix
test-docker-plugin-publish
test-failcheck
test-gateway
test-helm-release
test-release
test_PR
test_branch
testing-drain-params
testing-drain-params-2
tpatterson/cache-json-label-values
tpatterson/chunk-iterator
tpatterson/expose-partition-ring
tpatterson/generate-drone-yaml
tpatterson/label-matcher-optimizations
tpatterson/reporder-filters
tpatterson/revert-async-store-change
tpatterson/size-based-compaction-with-latest
tpatterson/space-compaction
tpatterson/stats-estimate
trace-labels-in-distributor
transform_mixin
trevorwhitney/detect-only-no-parser
trevorwhitney/how-to-make-a-pr
trevorwhitney/index-stats-perf-improvement
trevorwhitney/logcli-client-test
trevorwhitney/refactor-nix-folder
trevorwhitney/respect-tsdb-version-in-compactor
trevorwhitney/series-volume-fix
trevorwhitney/upgrade-dskit
trevorwhitney/use-tsdb-version-from-schema-config
trevorwhitney/volume-memory-fix-k160
trigger-ci
try-new-span-chagnes
try-reverting-pr9404
tsdb-benchmark-setup
tulmah-patch-1
undelete
update-docs-Running-Promtail-on-AWS-EC2-tutorial
updateCHANGELOG
upgrade-golang-jwt-2.9
upgrade33
usage-poc-combined
use-worker-pool-for-kafka-push
use-worker-pool-kafka-push
use_constant_for_loki_prefix
use_go_120_6
validate-retention-api
wip-stringlabels
wrap-downloading-file-errors
x160-ewelch-cache
x161-ewelch-l2-cache
x162-ewelch-memcached-connect-timeout
yinkagr-patch-1
Tags:

2.8.3
helm-loki-3.0.0
helm-loki-3.0.1
helm-loki-3.0.2
helm-loki-3.0.3
helm-loki-3.0.4
helm-loki-3.0.5
helm-loki-3.0.6
helm-loki-3.0.7
helm-loki-3.0.8
helm-loki-3.0.9
helm-loki-3.1.0
helm-loki-3.10.0
helm-loki-3.2.0
helm-loki-3.2.1
helm-loki-3.2.2
helm-loki-3.3.0
helm-loki-3.3.1
helm-loki-3.3.2
helm-loki-3.3.3
helm-loki-3.3.4
helm-loki-3.4.0
helm-loki-3.4.1
helm-loki-3.4.2
helm-loki-3.4.3
helm-loki-3.5.0
helm-loki-3.6.0
helm-loki-3.6.1
helm-loki-3.7.0
helm-loki-3.8.0
helm-loki-3.8.1
helm-loki-3.8.2
helm-loki-3.9.0
helm-loki-4.0.0
helm-loki-4.1.0
helm-loki-4.10.0
helm-loki-4.2.0
helm-loki-4.3.0
helm-loki-4.4.0
helm-loki-4.4.1
helm-loki-4.4.2
helm-loki-4.5.0
helm-loki-4.5.1
helm-loki-4.6.0
helm-loki-4.6.1
helm-loki-4.6.2
helm-loki-4.7.0
helm-loki-4.8.0
helm-loki-4.9.0
helm-loki-5.0.0
helm-loki-5.1.0
helm-loki-5.10.0
helm-loki-5.11.0
helm-loki-5.12.0
helm-loki-5.13.0
helm-loki-5.14.0
helm-loki-5.14.1
helm-loki-5.15.0
helm-loki-5.17.0
helm-loki-5.18.0
helm-loki-5.18.1
helm-loki-5.19.0
helm-loki-5.2.0
helm-loki-5.20.0
helm-loki-5.21.0
helm-loki-5.22.0
helm-loki-5.22.1
helm-loki-5.22.2
helm-loki-5.23.0
helm-loki-5.23.1
helm-loki-5.24.0
helm-loki-5.25.0
helm-loki-5.26.0
helm-loki-5.27.0
helm-loki-5.28.0
helm-loki-5.29.0
helm-loki-5.3.0
helm-loki-5.3.1
helm-loki-5.30.0
helm-loki-5.31.0
helm-loki-5.32.0
helm-loki-5.33.0
helm-loki-5.34.0
helm-loki-5.35.0
helm-loki-5.36.0
helm-loki-5.36.1
helm-loki-5.36.2
helm-loki-5.36.3
helm-loki-5.37.0
helm-loki-5.38.0
helm-loki-5.39.0
helm-loki-5.4.0
helm-loki-5.40.1
helm-loki-5.41.0
helm-loki-5.41.1
helm-loki-5.41.2
helm-loki-5.41.3
helm-loki-5.41.4
helm-loki-5.41.5
helm-loki-5.41.6
helm-loki-5.41.7
helm-loki-5.41.8
helm-loki-5.41.9-distributed
helm-loki-5.41.9-distributed-rc2
helm-loki-5.42.0
helm-loki-5.42.1
helm-loki-5.42.2
helm-loki-5.42.3
helm-loki-5.43.0
helm-loki-5.43.1
helm-loki-5.43.2
helm-loki-5.43.3
helm-loki-5.43.4
helm-loki-5.43.5
helm-loki-5.43.6
helm-loki-5.43.7
helm-loki-5.44.0
helm-loki-5.44.1
helm-loki-5.44.2
helm-loki-5.44.3
helm-loki-5.44.4
helm-loki-5.45.0
helm-loki-5.46.0
helm-loki-5.47.0
helm-loki-5.47.1
helm-loki-5.47.2
helm-loki-5.48.0
helm-loki-5.5.0
helm-loki-5.5.1
helm-loki-5.5.10
helm-loki-5.5.11
helm-loki-5.5.12
helm-loki-5.5.2
helm-loki-5.5.3
helm-loki-5.5.4
helm-loki-5.5.5
helm-loki-5.5.6
helm-loki-5.5.7
helm-loki-5.5.8
helm-loki-5.5.9
helm-loki-5.6.0
helm-loki-5.6.1
helm-loki-5.6.2
helm-loki-5.6.3
helm-loki-5.6.4
helm-loki-5.7.1
helm-loki-5.8.0
helm-loki-5.8.1
helm-loki-5.8.10
helm-loki-5.8.11
helm-loki-5.8.2
helm-loki-5.8.3
helm-loki-5.8.4
helm-loki-5.8.5
helm-loki-5.8.6
helm-loki-5.8.7
helm-loki-5.8.8
helm-loki-5.8.9
helm-loki-5.9.0
helm-loki-5.9.1
helm-loki-5.9.2
helm-loki-6.0.0
helm-loki-6.1.0
helm-loki-6.10.0
helm-loki-6.10.1
helm-loki-6.10.2
helm-loki-6.11.0
helm-loki-6.12.0
helm-loki-6.15.0
helm-loki-6.16.0
helm-loki-6.18.0
helm-loki-6.19.0
helm-loki-6.19.0-weekly.227
helm-loki-6.2.0
helm-loki-6.2.1
helm-loki-6.2.2
helm-loki-6.2.3
helm-loki-6.2.4
helm-loki-6.2.5
helm-loki-6.20.0
helm-loki-6.20.0-weekly.229
helm-loki-6.21.0
helm-loki-6.22.0
helm-loki-6.22.0-weekly.230
helm-loki-6.23.0
helm-loki-6.23.0-weekly.231
helm-loki-6.24.0
helm-loki-6.24.0-weekly.232
helm-loki-6.24.1
helm-loki-6.25.0
helm-loki-6.25.1
helm-loki-6.26.0
helm-loki-6.27.0
helm-loki-6.28.0
helm-loki-6.29.0
helm-loki-6.3.0
helm-loki-6.3.1
helm-loki-6.3.2
helm-loki-6.3.3
helm-loki-6.3.4
helm-loki-6.30.0
helm-loki-6.4.0
helm-loki-6.4.1
helm-loki-6.4.2
helm-loki-6.5.0
helm-loki-6.5.1
helm-loki-6.5.2
helm-loki-6.6.0
helm-loki-6.6.1
helm-loki-6.6.2
helm-loki-6.6.3
helm-loki-6.6.4
helm-loki-6.6.5
helm-loki-6.6.6
helm-loki-6.7.0
helm-loki-6.7.1
helm-loki-6.7.2
helm-loki-6.7.3
helm-loki-6.7.4
helm-loki-6.8.0
helm-loki-6.9.0
operator/v0.4.0
operator/v0.5.0
operator/v0.6.0
operator/v0.6.1
operator/v0.6.2
operator/v0.7.0
operator/v0.7.1
operator/v0.8.0
pkg/logql/syntax/v0.0.1
v0.1.0
v0.2.0
v0.3.0
v0.4.0
v1.0.0
v1.0.1
v1.0.2
v1.1.0
v1.2.0
v1.3.0
v1.4.0
v1.4.1
v1.5.0
v1.6.0
v1.6.1
v2.0.0
v2.0.1
v2.1.0
v2.2.0
v2.2.1
v2.3.0
v2.4.0
v2.4.1
v2.4.2
v2.5.0
v2.6.0
v2.6.1
v2.7.0
v2.7.1
v2.7.2
v2.7.3
v2.7.4
v2.7.5
v2.7.6
v2.7.7
v2.8.0
v2.8.1
v2.8.10
v2.8.11
v2.8.2
v2.8.3
v2.8.4
v2.8.5
v2.8.6
v2.8.7
v2.8.8
v2.8.9
v2.9.0
v2.9.1
v2.9.10
v2.9.11
v2.9.12
v2.9.13
v2.9.14
v2.9.2
v2.9.3
v2.9.4
v2.9.5
v2.9.6
v2.9.7
v2.9.8
v2.9.9
v3.0.0
v3.0.1
v3.1.0
v3.1.1
v3.1.2
v3.2.0
v3.2.1
v3.2.2
v3.3.0
v3.3.1
v3.3.2
v3.3.3
v3.3.4
v3.4.0
v3.4.1
v3.4.2
v3.4.3
v3.5.0
v3.5.1
192 Commits (a99c73dd97bf55d912d391339a7b82acccabf915)
SHA1 | Message | Date |
---|---|---|
8cf921a145 |
Pass engine opts down to middlewares (#9130)
**What this PR does / why we need it**: The following middlewares in the query frontend use a downstream engine: - `NewQuerySizeLimiterMiddleware` and `NewQuerierSizeLimiterMiddleware` - `NewQueryShardMiddleware` - `NewSplitByRangeMiddleware` These were all creating the downstream engine as follows: ```go logql.NewDownstreamEngine(logql.EngineOpts{LogExecutingQuery: false}, DownstreamHandler{next: next, limits: limits}, limits, logger), ``` As can be seen, the [engine options configured in Loki][1] were not being used at all. In the case of `NewQuerySizeLimiterMiddleware`, `NewQuerierSizeLimiterMiddleware` and `NewQueryShardMiddleware`, the downstream engine was created to get the `MaxLookBackPeriod`. When creating a new Downstream Engine as above, the `MaxLookBackPeriod` [would always be the default][2] (30 seconds). This PR fixes this by passing down the engine config to these middlewares, so this config is used to create the new downstream engines. **Which issue(s) this PR fixes**: Addresses some pending tasks from https://github.com/grafana/loki/pull/8670#issuecomment-1507031976. **Special notes for your reviewer**: **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: |
2 years ago |
c587b538ed |
Fail through to next middleware when querySizeLimit cannot be applied (#9050)
**What this PR does / why we need it**: When the query size limiter can't limit the query, fail through to the next middleware instead of erroring. This can happen, for example, when a query spans schemas, which is still a valid query case, so we want to make sure to fall back to existing behavior. --------- Co-authored-by: Owen Diehl <ow.diehl@gmail.com> |
2 years ago |
acb40ed40e |
Eager stream merge (#8968)
This PR introduces a specialized heap-based data structure to merge incoming log results in the frontend. Recently we've experienced an increase in OOMs on frontends due to logs queries which match lots of data. Sharded requests in Loki split based on the amount of data we expect and some queries see thousands of sub requests. For log queries, we'll fetch up to the `limit` from each shard, return them to the frontend, and merge. High shard counts * limit log lines, especially combined with large log lines (in byte terms), are accumulated on the frontend. Once they all are received, the frontend merges them. This creates an opportunity for OOMs as it can hold onto a lot of memory. This PR addresses one of these problems by eagerly accumulating responses as they're received and only retaining a total `limit` number of entries. There's still OOM potential due to race conditions between sub requests returning to the query-frontend and the query-frontend merging other sub requests, but this definitely improves the situation. I've been able to consistently run large limited queries that touch TBs of data (i.e. `{cluster=~".+"} |= "a"`) that previously OOMed frontends. --------- Signed-off-by: Owen Diehl <ow.diehl@gmail.com> |
2 years ago |
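
To illustrate the bounded-accumulation idea described in the commit above, here is a minimal Go sketch (the `entry`, `entryHeap`, and `accumulator` types are hypothetical, not Loki's actual implementation): responses are merged as they arrive, and a min-heap keyed by timestamp keeps at most `limit` of the newest entries in memory.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// entry is a single log line; entryHeap is a min-heap ordered by timestamp,
// so the oldest retained entry is always at the root.
type entry struct {
	ts   time.Time
	line string
}

type entryHeap []entry

func (h entryHeap) Len() int           { return len(h) }
func (h entryHeap) Less(i, j int) bool { return h[i].ts.Before(h[j].ts) }
func (h entryHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *entryHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *entryHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// accumulator eagerly merges responses as they are received, keeping at most
// `limit` entries instead of buffering every sub-response before merging.
type accumulator struct {
	limit int
	h     entryHeap
}

func (a *accumulator) add(entries []entry) {
	for _, e := range entries {
		heap.Push(&a.h, e)
		if a.h.Len() > a.limit {
			heap.Pop(&a.h) // evict the oldest entry once over the limit
		}
	}
}

func main() {
	acc := accumulator{limit: 2}
	now := time.Now()
	acc.add([]entry{{now, "a"}, {now.Add(time.Second), "b"}, {now.Add(2 * time.Second), "c"}})
	fmt.Println(acc.h.Len()) // 2: only the newest `limit` entries are retained
}
```
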
62403350a5 |
remove redundant splitby middleware (#8996)
Found this double-copied line, which was a mistake. This PR removes one of them, which won't change behavior (besides removing duplicate spans/etc). |
2 years ago |
b892cade6a |
Loki: Fixes incorrect query result when querying with start time == end time (#8979)
**What this PR does / why we need it**: In several places within Loki we need to determine if a query is a `range query` or `instant query`; this is done by checking whether the start and end time are equal **and** `step=0`. The downstream handler was not checking for `step=0` and thus it incorrectly mapped a range query to an instant query when a query has a start time equal to the end time. There are a few other things at play here, mainly that we should really error anytime someone tries to run an instant query for logs which would have exposed this error much more easily. But that's something I'd like to handle in a different PR as it will be considered a breaking change depending on how we do it. This PR uses an existing function we have for testing the query type and addresses the issue found in #8885 **Which issue(s) this PR fixes**: Fixes #8885 **Special notes for your reviewer**: **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [ ] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Edward Welch <edward.welch@grafana.com> |
2 years ago |
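
A minimal Go sketch of the classification rule described in the commit above (the `isInstantQuery` helper is hypothetical, not Loki's actual function): start == end alone is not enough, the step must also be zero.

```go
package main

import (
	"fmt"
	"time"
)

// isInstantQuery reports whether a request should be treated as an instant
// query: the start and end times are equal AND the step is zero.
func isInstantQuery(start, end time.Time, step time.Duration) bool {
	return start.Equal(end) && step == 0
}

func main() {
	t := time.Now()
	fmt.Println(isInstantQuery(t, t, 0))              // true: instant query
	fmt.Println(isInstantQuery(t, t, 30*time.Second)) // false: range query with start == end
}
```
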
edc6b0bff7 |
Loki: Add a limit for the [range] value on range queries (#8343)
Signed-off-by: Edward Welch <edward.welch@grafana.com> **What this PR does / why we need it**: Loki does not currently split queries by time to a value smaller than what's in the [range] of a range query. Example ``` sum(rate({job="foo"}[2d])) ``` Imagine now this query being executed over a longer window of a few days with a step of something like 30m. Every step evaluation would query the last [2d] of data. There are use cases where this is desired, specifically if you force the step to match the value in the range, however what is more common is someone accidentally uses `[$__range]` in here instead of `[$__interval]` within Grafana and then sets the query time selector to a large value like 7 days. This PR adds a limit which will fail queries that set the [range] value higher than the configured limit. It's disabled by default. In the future it may be possible for Loki to perform splits within the [range] and remove the need for this limit, but until then this can be an important safeguard in clusters with a lot of data. **Which issue(s) this PR fixes**: Fixes #8746 **Special notes for your reviewer**: **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Edward Welch <edward.welch@grafana.com> Co-authored-by: Karsten Jeschkies <karsten.jeschkies@grafana.com> Co-authored-by: Vladyslav Diachenko <82767850+vlad-diachenko@users.noreply.github.com> |
2 years ago |
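
To make the limit described above concrete, here is a hedged Go sketch (the `validateMaxQueryRange` helper and its error text are hypothetical; the sketch assumes, as is common for Loki limits, that a value of 0 disables the check):

```go
package main

import (
	"fmt"
	"time"
)

// validateMaxQueryRange rejects queries whose range vector selector (e.g. [2d])
// exceeds the configured limit; a limit of 0 disables the check.
func validateMaxQueryRange(queryRange, limit time.Duration) error {
	if limit > 0 && queryRange > limit {
		return fmt.Errorf("[range] value %s exceeds the configured limit %s", queryRange, limit)
	}
	return nil
}

func main() {
	fmt.Println(validateMaxQueryRange(48*time.Hour, 24*time.Hour)) // error: range too large
	fmt.Println(validateMaxQueryRange(48*time.Hour, 0))            // nil: limit disabled
}
```
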
9159c1dac3 |
Loki: Improve spans usage (#8927)
**What this PR does / why we need it**: - At different places, inherit the span/spanlogger from the given context instead of instantiating a new one from scratch, which fix spans being orphaned on a read/write operation. - At different places, turn spans into events. Events are lighter than spans and by having fewer spans in the trace, trace visualization will be cleaner without losing any details. - Adds new spans/events to places that might be a bottleneck for our writes/reads. |
2 years ago |
1bcf683513 |
Expose optional label matcher for label values handler (#8824)
|
2 years ago |
45775c82f7 |
Implement `RequiredNumberLabels` query limit (#8918)
**What this PR does / why we need it**: As pointed out in https://github.com/grafana/loki/pull/8851, some queries can impose a great workload on a cluster by selecting too many streams. Similarly to the `RequiredLabels` limit introduced at https://github.com/grafana/loki/pull/8851, here we add a new limit `RequiredNumberLabels` to require queries to specify at least N labels. For example, if the limit is set to 2, then the query should contain at least 2 label matchers. This limit can be configured per tenant and at query time. **Which issue(s) this PR fixes**: Fixes https://github.com/grafana/loki-private/issues/699 **Special notes for your reviewer**: **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Co-authored-by: Dylan Guedes <djmgguedes@gmail.com> |
2 years ago |
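
A minimal Go sketch of the check described in the commit above (the `validateRequiredNumberLabels` helper is hypothetical and simplifies matchers down to their label names): queries must carry at least the configured number of label matchers.

```go
package main

import "fmt"

// validateRequiredNumberLabels enforces that a stream selector carries at
// least `required` label matchers.
func validateRequiredNumberLabels(matcherNames []string, required int) error {
	if len(matcherNames) < required {
		return fmt.Errorf("query needs at least %d label matchers, found %d", required, len(matcherNames))
	}
	return nil
}

func main() {
	fmt.Println(validateRequiredNumberLabels([]string{"cluster"}, 2))        // error
	fmt.Println(validateRequiredNumberLabels([]string{"cluster", "job"}, 2)) // nil
}
```
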
ee69f2bd37 |
Split index request in 24h intervals (#8909)
**What this PR does / why we need it**: At https://github.com/grafana/loki/pull/8670, we applied a time split of 24h intervals to all index stats requests to enforce the `max_query_bytes_read` and `max_querier_bytes_read` limits. When the limit is surpassed, the following message gets displayed. As can be seen, the reported bytes read by the query are not the same as those reported by Grafana in the lower right corner of the query editor. This is because: 1. The index stats request for enforcing the limit is split into subqueries of 24h. The other index stats request is not time split. 2. When enforcing the limit, we are not displaying the bytes in powers of 2, but powers of 10 ([see here][2]). I.e. 1KB is 1000B vs 1KiB is 1024B. This PR adds the same logic to all index stats requests so we also time split by 24h intervals all requests that hit the Index Stats API endpoint. We also use powers of 2 instead of 10 on the message when enforcing `max_query_bytes_read` and `max_querier_bytes_read`. Note that the library we use under the hood to print the bytes rounds up and down to the nearest integer ([see][3]); that's why we see 16GiB compared to the 15.5GB in the Grafana query editor. **Which issue(s) this PR fixes**: Fixes https://github.com/grafana/loki/issues/8910 **Special notes for your reviewer**: - I refactored the `newQuerySizeLimiter` function and the rest of the _Tripperwares_ in `roundtrip.go` to reuse the new IndexStatsTripperware. So we configure the split-by-time middleware only once. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: https://grafana.com/docs/loki/latest/api/#index-stats [2]: https://github.com/grafana/loki/blob/main/pkg/querier/queryrange/limits.go#L367-L368 [3]: https://github.com/dustin/go-humanize/blob/master/bytes.go#L75-L78 |
2 years ago |
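
A Go sketch of the two changes described in the commit above, under simplified assumptions (the `splitByInterval` helper is hypothetical; `humanize.Bytes` and `humanize.IBytes` are the go-humanize helpers referenced by link [3]): split a stats request's time range into 24h sub-ranges, and report bytes in powers of two rather than powers of ten.

```go
package main

import (
	"fmt"
	"time"

	"github.com/dustin/go-humanize"
)

// splitByInterval cuts [from, through) into sub-ranges of at most `interval`,
// mirroring the 24h splitting applied to index stats requests.
func splitByInterval(from, through time.Time, interval time.Duration) [][2]time.Time {
	var out [][2]time.Time
	for start := from; start.Before(through); start = start.Add(interval) {
		end := start.Add(interval)
		if end.After(through) {
			end = through
		}
		out = append(out, [2]time.Time{start, end})
	}
	return out
}

func main() {
	from := time.Date(2023, 3, 1, 6, 0, 0, 0, time.UTC)
	through := from.Add(60 * time.Hour)
	fmt.Println(len(splitByInterval(from, through, 24*time.Hour))) // 3 sub-ranges

	// Powers of two vs powers of ten when reporting bytes read.
	fmt.Println(humanize.IBytes(16 << 30)) // "16 GiB"
	fmt.Println(humanize.Bytes(16 << 30))  // "17 GB"
}
```
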
336e08fc4b |
Salvacorts/max querier size messaging (#8916)
**What this PR does / why we need it**: In https://github.com/grafana/loki/pull/8670 we introduced a new limit `max_querier_bytes_read`. When the limit was surpassed the following error message is printed: ``` query too large to execute on a single querier, either because parallelization is not enabled, the query is unshardable, or a shard query is too big to execute: (query: %s, limit: %s). Consider adding more specific stream selectors or reduce the time range of the query ``` As pointed out in [this comment][1], a user would have a hard time figuring out whether the cause was `parallelization is not enabled`, `the query is unshardable` or `a shard query is too big to execute`. This PR improves the error messaging for the `max_querier_bytes_read` limit to raise a different error for each of the causes above. **Which issue(s) this PR fixes**: Followup for https://github.com/grafana/loki/pull/8670 **Special notes for your reviewer**: **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: https://github.com/grafana/loki/pull/8670#discussion_r1146008266 --------- Co-authored-by: Danny Kopping <danny.kopping@grafana.com> |
2 years ago |
d24fe3e68b |
Max bytes read limit (#8670)
**What this PR does / why we need it**: This PR implements two new per-tenant limits that are enforced on log and metric queries (both range and instant) when TSDB is used: - `max_query_bytes_read`: Refuse queries that would read more than the configured bytes here. Overall limit regardless of splitting/sharding. The goal is to refuse queries that would take too long. The default value of 0 disables this limit. - `max_querier_bytes_read`: Refuse queries in which any of their subqueries after splitting and sharding would read more than the configured bytes here. The goal is to prevent a querier from running a query that would load too much data in memory and can potentially get OOMed. The default value of 0 disables this limit. These new limits can be configured per tenant and per query (see https://github.com/grafana/loki/pull/8727). The bytes a query would read are estimated through TSDB's index stats. Even though they are not exact, they are good enough to have a rough estimation of whether a query is too big to run or not. For more details on this refer to this discussion in the PR: https://github.com/grafana/loki/pull/8670#discussion_r1124858508. Both limits are implemented in the frontend. We considered implementing `max_querier_bytes_read` in the querier, but this way the limits for pre- and post-splitting/sharding queries are enforced close to each other on the same component. Moreover, this way we can reduce the number of index stats requests issued to the index gateways by reusing the stats gathered while sharding the query. With regard to how index stats requests are issued: - We parallelize index stats requests by splitting them into queries that span up to 24h since our indices are sharded by 24h periods. On top of that, this prevents a single index gateway from processing a single huge request like `{app=~".+"} for 30d`. - If sharding is enabled and the query is shardable, for `max_querier_bytes_read`, we re-use the stats requests issued by the sharding middleware. Specifically, we look at the [bytesPerShard][1] to enforce this limit. Note that once we merge this PR and enable these limits, the load of index stats requests will increase substantially and we may discover bottlenecks in our index gateways and TSDB. After speaking with @owen-d, we think it should be fine as, if needed, we can scale up our index gateways and support caching index stats requests. Here's a demo of this working: <img width="1647" alt="image" src="https://user-images.githubusercontent.com/8354290/226918478-d4b6c2fd-de4d-478a-9c8b-e38fe148fa95.png"> <img width="1647" alt="image" src="https://user-images.githubusercontent.com/8354290/226918798-a71b1db8-ea68-4d00-933b-e5eb1524d240.png"> **Which issue(s) this PR fixes**: This PR addresses https://github.com/grafana/loki-private/issues/674. **Special notes for your reviewer**: - @jeschkies has reviewed the changes related to query-time limits. - I've done some refactoring in this PR: - Extracted logic to get stats for a set of matchers into a new function [getStatsForMatchers][2]. - Extracted the _Handler_ interface implementation for [queryrangebase.roundTripper][3] into a new type [queryrangebase.roundTripperHandler][4]. This is used to create the handler that skips the rest of configured middlewares when sending an index stats request ([example][5]).
**Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: |
2 years ago |
94725e7908 |
Define `RequiredLabels` query limit. (#8851)
**What this PR does / why we need it**: Some end-users can impose great workload on a cluster by selecting too many streams in their queries. We should be able to limit them. Therefore we introduce a new limit `RequiredLabelMatchers` which list label names that must be included in the stream selectors. The implementation follows the same approach as for max query limit. **Which issue(s) this PR fixes**: Fixes #8745 **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` |
2 years ago |
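
A minimal Go sketch of the check described in the commit above (the `validateRequiredLabels` helper is hypothetical and simplifies matchers down to their label names): every configured required label must appear among the query's stream selector matchers.

```go
package main

import "fmt"

// validateRequiredLabels checks that every required label name appears among
// the label matchers of the query's stream selector.
func validateRequiredLabels(matcherNames []string, required []string) error {
	present := make(map[string]bool, len(matcherNames))
	for _, name := range matcherNames {
		present[name] = true
	}
	for _, name := range required {
		if !present[name] {
			return fmt.Errorf("stream selector is missing required label matcher %q", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRequiredLabels([]string{"job"}, []string{"cluster", "namespace"}))                         // error
	fmt.Println(validateRequiredLabels([]string{"cluster", "namespace", "job"}, []string{"cluster", "namespace"})) // nil
}
```
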
f5f1753851 |
Print duration in error messages with more readable. (#8816)
**What this PR does / why we need it**:
The old error messages would print only up to hours. E.g. `169h30s`.
This change will print it as `7d1h30s`. See
[model.Duration](
|
2 years ago |
be8b4eece3 |
Scheduler: Add query fairness control across multiple actors within a tenant (#8752)
**What this PR does / why we need it**: This PR wires up the scheduler with the hierarchical queues. It is the last PR to implement https://github.com/grafana/loki/pull/8585. When these changes are in place, the client performing query requests can control their QoS (query fairness) using the `X-Actor-Path` HTTP header. This header controls in which sub-queue of the tenant's scheduler queue the query request is enqueued. The place within the hierarchy where it is enqueued defines the probability with which the request gets dequeued. A common use-case for this QoS control is giving each Grafana user within a tenant their fair share of query execution time. Any documentation is still missing and will be provided by follow-up PRs. **Special notes for your reviewer**: ```console $ gotest -count=1 -v ./pkg/scheduler/queue/... -test.run=TestQueryFairness === RUN TestQueryFairness === RUN TestQueryFairness/use_hierarchical_queues_=_false dequeue_qos_test.go:109: duration actor a 2.007765568s dequeue_qos_test.go:109: duration actor b 2.209088331s dequeue_qos_test.go:112: total duration 2.209280772s === RUN TestQueryFairness/use_hierarchical_queues_=_true dequeue_qos_test.go:109: duration actor b 605.283144ms dequeue_qos_test.go:109: duration actor a 2.270931324s dequeue_qos_test.go:112: total duration 2.271108551s --- PASS: TestQueryFairness (4.48s) --- PASS: TestQueryFairness/use_hierarchical_queues_=_false (2.21s) --- PASS: TestQueryFairness/use_hierarchical_queues_=_true (2.27s) PASS ok github.com/grafana/loki/pkg/scheduler/queue 4.491s ``` ```console $ gotest -count=5 -v ./pkg/scheduler/queue/... -bench=Benchmark -test.run=^$ -benchtime=10000x -benchmem goos: linux goarch: amd64 pkg: github.com/grafana/loki/pkg/scheduler/queue cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz BenchmarkGetNextRequest BenchmarkGetNextRequest/without_sub-queues BenchmarkGetNextRequest/without_sub-queues-8 10000 29337 ns/op 1600 B/op 100 allocs/op BenchmarkGetNextRequest/without_sub-queues-8 10000 21348 ns/op 1600 B/op 100 allocs/op BenchmarkGetNextRequest/without_sub-queues-8 10000 21595 ns/op 1600 B/op 100 allocs/op BenchmarkGetNextRequest/without_sub-queues-8 10000 21189 ns/op 1600 B/op 100 allocs/op BenchmarkGetNextRequest/without_sub-queues-8 10000 21602 ns/op 1600 B/op 100 allocs/op BenchmarkGetNextRequest/with_1_level_of_sub-queues BenchmarkGetNextRequest/with_1_level_of_sub-queues-8 10000 33770 ns/op 2400 B/op 200 allocs/op BenchmarkGetNextRequest/with_1_level_of_sub-queues-8 10000 33596 ns/op 2400 B/op 200 allocs/op BenchmarkGetNextRequest/with_1_level_of_sub-queues-8 10000 34432 ns/op 2400 B/op 200 allocs/op BenchmarkGetNextRequest/with_1_level_of_sub-queues-8 10000 33760 ns/op 2400 B/op 200 allocs/op BenchmarkGetNextRequest/with_1_level_of_sub-queues-8 10000 33664 ns/op 2400 B/op 200 allocs/op BenchmarkGetNextRequest/with_2_levels_of_sub-queues BenchmarkGetNextRequest/with_2_levels_of_sub-queues-8 10000 71405 ns/op 3200 B/op 300 allocs/op BenchmarkGetNextRequest/with_2_levels_of_sub-queues-8 10000 59472 ns/op 3200 B/op 300 allocs/op BenchmarkGetNextRequest/with_2_levels_of_sub-queues-8 10000 117163 ns/op 3200 B/op 300 allocs/op BenchmarkGetNextRequest/with_2_levels_of_sub-queues-8 10000 106505 ns/op 3200 B/op 300 allocs/op BenchmarkGetNextRequest/with_2_levels_of_sub-queues-8 10000 64374 ns/op 3200 B/op 300 allocs/op BenchmarkQueueRequest BenchmarkQueueRequest-8 10000 168391 ns/op 320588 B/op 1156 allocs/op BenchmarkQueueRequest-8 10000 166203 ns/op 320587 B/op 1156 allocs/op BenchmarkQueueRequest-8 
10000 149518 ns/op 320584 B/op 1156 allocs/op BenchmarkQueueRequest-8 10000 219776 ns/op 320583 B/op 1156 allocs/op BenchmarkQueueRequest-8 10000 185198 ns/op 320597 B/op 1156 allocs/op PASS ok github.com/grafana/loki/pkg/scheduler/queue 64.648s ``` Signed-off-by: Christian Haudum <christian.haudum@gmail.com> |
2 years ago |
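
The `X-Actor-Path` header named in the commit above is set by the querying client. A minimal Go sketch of tagging a query with it (the endpoint, tenant ID, and actor value are illustrative, not prescribed by the PR):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Build a query_range request and tag it with an actor path so the
	// scheduler can enqueue it into that actor's sub-queue.
	params := url.Values{}
	params.Set("query", `{job="example"}`)

	req, err := http.NewRequest(http.MethodGet,
		"http://localhost:3100/loki/api/v1/query_range?"+params.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Scope-OrgID", "tenant-a")       // tenant (illustrative)
	req.Header.Set("X-Actor-Path", "grafana/user-42") // actor hierarchy (illustrative)

	fmt.Println(req.Header.Get("X-Actor-Path"))
}
```
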
33e44ed39d |
Ruler: remote rule evaluation (#8744)
**What this PR does / why we need it**: Adds the ability to evaluate recording & alerting rules against a given `query-frontend`, allowing these queries to be executed with all the parallelisation & optimisation that regular adhoc queries have. This is important because with `local` evaluation all queries are single-threaded, and rules that evaluate a large range/volume of data may timeout or OOM the `ruler` itself, leading to missed metrics or alerts. When `remote` evaluation mode is enabled, the `ruler` effectively just becomes a gRPC client for the `query-frontend`, which will dramatically improve the reliability of the `ruler` and also drastically reduce its resource requirements. **Which issue(s) this PR fixes**: This PR implements the feature discussed in https://github.com/grafana/loki/pull/8129 (**LID 0002: Remote Rule Evaluation**). |
2 years ago |
a4eb536fb2 |
Loki: remove `subqueries` from metrics.go logging and replace it with separate split and shard counters (#8761)
**What this PR does / why we need it**: Currently the `metrics.go` log line emitted after every query includes a metric called "subqueries". This currently tracks the number of queries created by the split_by_time operations done in Loki, but does not include any counts for subqueries created as a result of sharding. It becomes difficult to make a single subqueries counter that gives useful information to determine how much a query is split by time and sharded by a shard factor, especially now that sharding in TSDB indexes is dynamic. This PR removes and deprecates the `subqueries` stat and instead creates a `splits` and `shards` statistic which records how much a query was split_by_time and the total number of shards created as well. **Which issue(s) this PR fixes**: Fixes #<issue number> **Special notes for your reviewer**: **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Edward Welch <edward.welch@grafana.com> |
2 years ago |
5a85f6647e |
Add initial implementation of per-query limits (#8727)
**What this PR does / why we need it**: Sometimes we want to limit the impact of a single query by imposing limits that are stricter than the current tenant limit. E.g. the maximum query length could be seven days, but based on the query or an admin's decision a query should just have a maximum length of one day. This is where per-request limits come into play. They are passed via the `X-Loki-Query-Limit` header and extracted into the request's context. It is the responsibility of the operator or admin to ensure that the header is valid. **Which issue(s) this PR fixes**: Fixes #8762 **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [x] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Karsten Jeschkies <karsten.jeschkies@grafana.com> |
2 years ago |
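
A hedged Go sketch of the "extract the header into the request context" idea described above: the header name comes from the commit, but the middleware shape, context key, and example payload are hypothetical (the real serialization of per-query limits is defined by Loki, not shown here).

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type perQueryLimitsKey struct{}

// extractQueryLimits pulls the raw per-query limits header into the request
// context so downstream query handling can apply stricter-than-tenant limits.
func extractQueryLimits(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if raw := r.Header.Get("X-Loki-Query-Limit"); raw != "" {
			r = r.WithContext(context.WithValue(r.Context(), perQueryLimitsKey{}, raw))
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := extractQueryLimits(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println(r.Context().Value(perQueryLimitsKey{}))
	}))
	req := httptest.NewRequest(http.MethodGet, "/loki/api/v1/query_range", nil)
	req.Header.Set("X-Loki-Query-Limit", `{"max_query_length":"24h"}`) // illustrative payload
	h.ServeHTTP(httptest.NewRecorder(), req)
}
```
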
9a2a038f43 |
Allow passing of context to query related limits functions (#8689)
In this PR we're allowing for passing of a `context.Context` via the Limits interfaces (some of which are new, to clean up hardcoding/embedding of `validation.Overrides`) This is based on work/ideas by @jeschkies . Fixes #8694 --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Karsten Jeschkies <karsten.jeschkies@grafana.com> |
2 years ago |
6fd4b5e89b |
Update prometheus/prometheus from 2.41 to 2.42 (#8571)
**What this PR does / why we need it**: Brings in the latest updates from upstream. These open up some opportunities for optimisations in TSDB indexing. Dependencies updated: * github.com/Azure/go-autorest/autorest/adal v0.9.21 -> v0.9.22 (comment-only change) * github.com/docker/docker v20.10.21 -> v20.10.23 (fixing filters bug) * golang.org/x/exp fae10dda9338 -> d38c7dcee874 (optimisations in `BinarySearch` function) Indirect dependencies also updated: * github.com/digitalocean/godo v1.91.1 -> v1.95.0 (nothing alarming in [release notes](https://github.com/digitalocean/godo/releases)) * github.com/google/pprof aee1124e3a93 -> 76d1ae5aea2b (no changes pulled into vendor) * golang.org/x/tools v0.4.0 -> v0.5.0 (relating to Go compiler utilities) * google.golang.org/genproto 76db0878b65f -> 31e0e69b6fc2 * k8s.io/api v0.26.0 -> v0.26.1 (comment-only change) * k8s.io/apimachinery v0.26.0 -> v0.26.1 (no changes pulled into vendor) * k8s.io/client-go v0.26.0 -> v0.26.1 (small fixes) **Special notes for your reviewer**: A couple of interfaces changed; these have required matching changes in Loki code. Those changes are split into separate commits. I also note that most calls to `relabel` ignore when the rule says "drop". Maybe this is wrong? |
2 years ago |
433d5bf913 |
fix panics when cloning a special query (#8531)
Signed-off-by: garrettlish <garrett.li.sh@gmail.com> |
2 years ago |
6a7403c4f5 |
correctly calculate max shards (#8494)
|
2 years ago |
9f0834793b |
Loki: set a maximum number of shards for "limited" queries instead of fixed number (#8487)
Signed-off-by: Edward Welch <edward.welch@grafana.com> |
2 years ago |
37169ca444 |
Loki: Process "Limited" queries sequentially and not in parallel (#8482)
Signed-off-by: Edward Welch <edward.welch@grafana.com> |
2 years ago |
96d5227532 |
Fix parsing of vector expression (#8448)
Signed-off-by: Christian Haudum <christian.haudum@gmail.com> |
2 years ago |
b13995e201 |
logs sharding astmapperware to spans in addition to logs (#8457)
|
2 years ago |
322783e3d8 |
LogQL: [optimization] syntax: Replace "panic" in "/pkg/logql/syntax" with "error" (#7208)
|
2 years ago |
35510ba4eb |
Loki: only log "executing query" once per query in the frontend (#8337)
Signed-off-by: Edward Welch <edward.welch@grafana.com> Co-authored-by: Danny Kopping <danny.kopping@grafana.com> |
2 years ago |
4cd1246b88 |
Logproto: Extract push.proto from logproto package to the separate module (#8259)
Co-authored-by: Owen Diehl <ow.diehl@gmail.com> |
2 years ago |
07487cd89d |
fixes bug with queryIngesterWithin logic in asyncStore ingester stats… (#8145)
Fixes a previous mistake in the logic calculating when to skip querying ingesters in the async store Statistics method. Notably `through.After` should be `through.Before` when skipping querying ingesters as that's when there's no overlap with the `query-ingesters-within` period: ```go // OLD CODE BELOW if a.queryIngestersWithin != 0 { // don't query ingesters if the query does not overlap with queryIngestersWithin. if !through.After(model.Now().Add(-a.queryIngestersWithin)) { // <----- should be through.Before return a.Store.Stats(ctx, userID, from, through, matchers...) } } ``` I discovered the problem while debugging querier OOMs during a boltdb-shipper -> tsdb migration and ultimately found this happened under the following circumstances: * Queries over high volumes of _recent_ data wouldn't query ingesters for index metadata * Without index metadata, we only checked storage metadata * There is a delay before we ship the index to storage, meaning we don't see it if ingesters are skipped * Calculating ideal shard factors without recent data for queries that only touch recent data _dramatically_ underestimates the desired shard factor * Queries aren't split enough and get scheduled onto too few querier replicas * They oom. We had some fun examples like a single replica trying to download 419,000 chunks/query To be clear, this is still a hypothesis, but a plausible one, especially after finding the boolean logic error this PR fixes. Instead of changing this one line, I took the opportunity to refactor this into a shared utility used by our other `GetChunkRefs` method which is already tested, ensuring the logic works as expected. I also added some more logging visibility into this code so we can understand what the difference is when querying statistics from storage vs ingesters. Finally, I added a helper to prepare our `Stats` objects to be logged which is now used in a few places. |
2 years ago |
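
A standalone Go restatement of the corrected condition described in the commit above, i.e. `through.Before` rather than `through.After` (the `canSkipIngesters` helper is hypothetical, not the actual asyncStore code):

```go
package main

import (
	"fmt"
	"time"
)

// canSkipIngesters reports whether ingesters can safely be skipped for a
// stats request: only when the query ends before the query-ingesters-within
// window, i.e. there is no overlap with data still held by ingesters.
func canSkipIngesters(through, now time.Time, queryIngestersWithin time.Duration) bool {
	if queryIngestersWithin == 0 {
		return false // feature disabled: always query ingesters
	}
	return through.Before(now.Add(-queryIngestersWithin))
}

func main() {
	now := time.Now()
	within := 3 * time.Hour
	fmt.Println(canSkipIngesters(now.Add(-1*time.Hour), now, within)) // false: recent data, ask ingesters
	fmt.Println(canSkipIngesters(now.Add(-6*time.Hour), now, within)) // true: old data, storage only
}
```
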
24deb6ed3b |
fix bugs in logs results caching and its tests (#7925)
**What this PR does / why we need it**:
When a logs query results in an empty response, we cache it to avoid
doing that query again and respond straight away with an empty response.
However, we cache a single entry per time split interval with the query
itself to keep things simple. For example, if the time split config for
tenant `A` is `30m`, then queries for intervals `10m`-`20m` and
`21m`-`25m` would have the same cache key.
Here is roughly how cache hit is handled:
* If the new query is within the cached query bounds, return empty
results
* If the start of new query is before the start time of the cached
query, do a query from `newQuery.Start` to `cachedQuery.Start`
* If the response of last query is also empty, set `cachedQuery.Start` =
`newQuery.Start`
* If the end of new query is after the end time of the cached query, do
a query from `cachedQuery.End` to `newQuery.End`
* If the response of last query is also empty, set `cachedQuery.End` =
`newQuery.End`
* If we have changes in `cachedQuery.Start/End`, update it in the cache.
The problem here is when we do queries to fill the gap, we sometimes do
queries for the range outside of what the user requested and respond
back without reducing the response to what the user requested. For
example, if the cached query is from `21m`-`25m` and the user query is
from `10m`-`15m`, we will query for the whole gap i.e `10m`-`21m`. If
there are logs from `15m`-`21m` in the response, we will unexpectedly
send it back to the user.
This PR takes care of this issue by extracting the data and sending back
only the user's requested logs.
I have also found the tests for logs results cache were incorrect. They
heavily use
[mergeLokiResponse](
|
2 years ago |
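
A minimal Go sketch of the fix described in the commit above (the `entry` type and `trimToRange` helper are hypothetical): after a gap-filling query that may cover a wider range than the user asked for, only entries inside the requested bounds are returned.

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	ts   time.Time
	line string
}

// trimToRange drops entries that fall outside the user's requested [start, end)
// window, so gap-filling queries issued for a wider range never leak extra
// log lines back to the caller.
func trimToRange(entries []entry, start, end time.Time) []entry {
	out := entries[:0:0]
	for _, e := range entries {
		if !e.ts.Before(start) && e.ts.Before(end) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	base := time.Date(2023, 3, 1, 0, 0, 0, 0, time.UTC)
	entries := []entry{
		{base.Add(12 * time.Minute), "inside"},
		{base.Add(18 * time.Minute), "outside"},
	}
	fmt.Println(len(trimToRange(entries, base.Add(10*time.Minute), base.Add(15*time.Minute)))) // 1
}
```
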
089ec1b05f |
Fix various linter errors
These changes have been generated using `make lint`. ```console $ golangci-lint --version golangci-lint has version 1.50.0 built from 704109c6 on 2022-10-04T10:25:07Z ``` Signed-off-by: Christian Haudum <christian.haudum@gmail.com> |
2 years ago |
a4f306399a |
Add store & cache download statistics (#7982)
|
2 years ago |
f93b91bfb5 |
Add configuration documentation generation tool (#7916)
**What this PR does / why we need it**: Add a tool to generate configuration flags documentation based on the flags properties defined on registration on the code. This tool is based on the [Mimir doc generation tool](https://github.com/grafana/mimir/tree/main/tools/doc-generator) and adapted according to Loki configuration specifications. Prior to this PR, the configuration flags documentation was dispersed across two sources: * [_index.md]( |
2 years ago |
e9c93cd0f5 |
consider range and offset in queries while looking for schema config for query sharding (#7880)
**What this PR does / why we need it**: We disable query sharding when a query touches multiple schema configs since they might have different sharding factors and might handle sharding differently. This check was done purely on the start and end time of the query without considering `range` and `offset`, which would change the length of the actual data being queried. Besides being incorrect, it also causes us to fail queries when migrating between tsdb and non-tsdb stores since both handle query sharding differently. For example, if the previous schema is `boltdb-shipper` and, starting today, the `tsdb` index is being used, then a query like `sum(rate({foo="bar"}[24h]))`, even with start and end within the `tsdb` range, will fail with the error `incompatible index shard` if dynamic sharding goes with a shard factor other than `32` (the default shard factor in the ingester for the inverted index). The reason is that the `range` here is `24h`, which causes us to process data from the previous schema as well. This PR takes care of the issue by factoring in `range` and `offset` in the queries while looking for schema config for query sharding. **Checklist** - [x] `CHANGELOG.md` updated |
2 years ago |
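
A small Go sketch of the adjustment described in the commit above (the `effectiveStart` helper is hypothetical): the data a range vector selector actually reads begins at the query start minus the selector range and offset, and schema-boundary checks should use that adjusted time.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveStart returns the earliest timestamp a selector such as
// `rate({foo="bar"}[24h] offset 1h)` actually reads: the query start minus
// the selector range and the offset.
func effectiveStart(start time.Time, selectorRange, offset time.Duration) time.Time {
	return start.Add(-selectorRange).Add(-offset)
}

func main() {
	start := time.Date(2023, 3, 16, 0, 0, 0, 0, time.UTC)
	fmt.Println(effectiveStart(start, 24*time.Hour, time.Hour)) // 2023-03-14 23:00:00 +0000 UTC
}
```
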
85392a9728 |
Update Prometheus dependency to latest release (v2.40.4) (#7826)
Closes #7811, which is needed for Grafana Agent to update to v2.40 and add support for native histograms. I did not add support for native histograms to Loki, sorry :) |
3 years ago |
feaf9c3232 |
Log query string on retry alongside the error (#7834)
**What this PR does / why we need it**: For better observability of query retries. Signed-off-by: Christian Haudum <christian.haudum@gmail.com> |
3 years ago |
37b1c0fce0 | guard against divide by 0 when splitting parallelism (#7831) | 3 years ago

**What this PR does / why we need it**: We saw a spike in divide-by-zero panics in the code introduced in #7769. I was able to reproduce the error with a test that calculates `WeightedParallelism` with a start that is after the end. It is not clear how that happens, but we definitely saw it in our ops environment, so something is causing it, and the fix should guard against it in any case.

**Checklist**
- [x] Tests updated

Co-authored-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
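The failure mode is a zero (or negative) total duration when the start is not before the end, which then ends up as a denominator. A minimal guard of the kind described above, with an invented signature rather than the actual `WeightedParallelism` code:

```go
package guardsketch

import "time"

// weightFor returns the fraction of the query that falls inside a period,
// guarding against a zero or negative total duration (start >= end), which
// would otherwise cause a divide-by-zero panic.
func weightFor(start, end, periodStart, periodEnd time.Time) float64 {
	total := end.Sub(start)
	if total <= 0 {
		return 0
	}
	overlapStart := maxTime(start, periodStart)
	overlapEnd := minTime(end, periodEnd)
	overlap := overlapEnd.Sub(overlapStart)
	if overlap <= 0 {
		return 0
	}
	return float64(overlap) / float64(total)
}

func maxTime(a, b time.Time) time.Time {
	if a.After(b) {
		return a
	}
	return b
}

func minTime(a, b time.Time) time.Time {
	if a.Before(b) {
		return a
	}
	return b
}
```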
89d81020ce | fix lint issues from PR 7804 (#7814) | 3 years ago

**What this PR does / why we need it**: I had enabled auto-merge on PR #7804, but somehow it merged without all the checks passing. This PR fixes the failing lint and tests.
1410808ee9 | use grpc for communicating with compactor for query time filtering of data requested for deletion (#7804) | 3 years ago

**What this PR does / why we need it**: Add gRPC support to the compactor for getting delete requests and the generation number for query-time filtering. Since these requests are internal to Loki, it is better to use gRPC instead of HTTP, the same as the other internal requests Loki makes. I have added a new config option for the compactor's gRPC address. I tried keeping just the existing config and detecting whether it points at a gRPC server, but it was hard to do reliably given the different deployment modes we support. I think it is safe to keep both and eventually deprecate the existing config.

**Checklist**
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
a63ad06509 | Querier/Ruler: query blocker (#7785) | 3 years ago

Block malicious or expensive queries using a per-tenant runtime configuration.
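The idea is a per-tenant list of blocked queries consulted before execution. The sketch below uses a hypothetical in-memory `Limits` lookup and substring matching rather than Loki's actual runtime-config plumbing:

```go
package blocker

import "strings"

// BlockedQuery is a per-tenant rule: either an exact query string or a
// pattern (simplified here to substring matching).
type BlockedQuery struct {
	Pattern string
	Exact   bool
}

// Limits returns the blocked-query rules for a tenant; in Loki this would
// come from the per-tenant runtime configuration.
type Limits interface {
	BlockedQueries(tenant string) []BlockedQuery
}

// IsBlocked reports whether the given query should be rejected for the tenant.
func IsBlocked(limits Limits, tenant, query string) bool {
	for _, rule := range limits.BlockedQueries(tenant) {
		if rule.Exact && rule.Pattern == query {
			return true
		}
		if !rule.Exact && strings.Contains(query, rule.Pattern) {
			return true
		}
	}
	return false
}
```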
22089415e8 | Split parallelism across Period Configs (#7769) | 3 years ago

One of the things we watch while updating non-TSDB period configs to TSDB period configs is the difference in query parallelism. TSDB dynamically shards queries into (potentially) much smaller units of work compared to the static shard factors used previously. To account for this, we use much higher query parallelism configurations with TSDB period configs.

This creates a potential problem when querying across `non-tsdb, tsdb` period boundaries: we may want a query parallelism of 512 for the tsdb portion but only 64 for the non-tsdb portion. However, we only had one limit to specify this per tenant, meaning it would be too high when querying non-tsdb periods or too low when querying tsdb ones.

This PR:

* Introduces `tsdb_max_query_parallelism` (default `512`) to `limits_config`.
* Uses the `tsdb_max_query_parallelism` and `max_query_parallelism` limits to find a better parallelism _per query_ by weighting the two respective configs by the proportion of each query spent on TSDB or non-TSDB period configurations.

Signed-off-by: Owen Diehl <ow.diehl@gmail.com>
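The weighting described above can be read as a proportional blend of the two limits. A sketch of that arithmetic, with invented names (`tsdbDur`, `tsdbMax`, `otherMax`) rather than the actual configuration fields:

```go
package weightsketch

import "time"

// weightedParallelism blends the TSDB and non-TSDB parallelism limits by the
// share of the query's time range spent in each kind of period. With a 512
// TSDB limit and a 64 non-TSDB limit, a query that spends three quarters of
// its range in TSDB periods gets roughly 0.75*512 + 0.25*64 = 400.
func weightedParallelism(start, end time.Time, tsdbDur time.Duration, tsdbMax, otherMax int) int {
	total := end.Sub(start)
	if total <= 0 {
		// Guard against a zero denominator when start is not before end
		// (the panic fixed in the later follow-up, #7831).
		return otherMax
	}
	tsdbFrac := float64(tsdbDur) / float64(total)
	return int(tsdbFrac*float64(tsdbMax) + (1-tsdbFrac)*float64(otherMax))
}
```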
ad2260aec2 | Loki: Fix multitenant querying (#7708) | 3 years ago

**What this PR does / why we need it**: We recently broke multitenant querying because of changes to how timeouts work across Loki. This PR fixes it by:

- Adapting the timeout wrapper to work with multitenant queries: it takes the shortest timeout across all given tenants.
- Adapting the query engine's timeout assignment to work with multitenant queries: it takes the shortest timeout across all the tenants.
- Adapting query sharding to use the smallest max query parallelism across the given tenants.
- Adding a functional test to ensure multitenant querying behaves as expected.

Signed-off-by: DylanGuedes <djmgguedes@gmail.com>
Signed-off-by: Mehmet Burak Devecí <mhmtbrkdvc@gmail.com>

**Which issue(s) this PR fixes**: https://github.com/grafana/loki/issues/7696

**Special notes for your reviewer**: The regression was probably introduced by https://github.com/grafana/loki/pull/7555
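Taking the shortest timeout across all given tenants is a small reduction over the per-tenant limits. A sketch with a hypothetical `Limits` interface standing in for Loki's tenant limits:

```go
package multitenant

import "time"

// Limits exposes the per-tenant query timeout; in Loki this would be backed
// by the per-tenant limits/runtime configuration.
type Limits interface {
	QueryTimeout(tenant string) time.Duration
}

// shortestTimeout returns the smallest timeout across all tenants in the
// request, so a multi-tenant query never runs longer than its most
// restrictive tenant would permit. Returns 0 for an empty tenant list.
func shortestTimeout(limits Limits, tenants []string) time.Duration {
	var min time.Duration
	for i, t := range tenants {
		d := limits.QueryTimeout(t)
		if i == 0 || d < min {
			min = d
		}
	}
	return min
}
```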
e0a7b28a61 | Add single compactor http client for delete and gennumber clients (#7453) | 3 years ago
020631ebac | add user-id transformer for logs results cache (#7581) | 3 years ago

**What this PR does / why we need it**: Add a user-ID transformer to the logs results cache, the same as was added to the metrics results cache in PR #7542.
16761723f4 | Add way to override userId for caching (#7542) | 3 years ago

**What this PR does / why we need it**: Add a way to change the userId used in caching.

**Which issue(s) this PR fixes**: Fixes #<issue number>

**Checklist**
- [ ] Reviewed the `CONTRIBUTING.md` guide
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`

Signed-off-by: Michel Hollands <michel.hollands@grafana.com>
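A user-ID override of this kind is essentially a hook applied to the tenant ID before it is embedded in the cache key. A minimal sketch with hypothetical names (`UserIDTransformer`, `KeyGen`) rather than the actual interfaces:

```go
package cachekey

import "fmt"

// UserIDTransformer optionally rewrites the user/tenant ID used in cache
// keys, e.g. to share cached results across a group of tenants.
type UserIDTransformer func(userID string) string

// KeyGen builds results-cache keys; Transform is applied to the user ID
// first when it is set.
type KeyGen struct {
	Transform UserIDTransformer
}

// Key returns a cache key built from the (possibly transformed) user ID,
// the query string, and a time bucket.
func (k KeyGen) Key(userID, query string, bucket int64) string {
	if k.Transform != nil {
		userID = k.Transform(userID)
	}
	return fmt.Sprintf("%s:%s:%d", userID, query, bucket)
}
```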
45caba4459 | Loki: Remove the bypass for "limited" queries (#7510) | 3 years ago

**What this PR does / why we need it**: Limited queries are queries which don't have a filter expression. All log queries will now be handled by the LogFilterTripperware, which results in them being split by time (they previously were not).

**Which issue(s) this PR fixes**: We have found that `limited` queries over very large timeframes are sent to queriers and then to ingesters and can stall out the read path because of the large time ranges. Splitting these by time avoids the problem by keeping any subquery limited in length to the `split_by_interval`.

It is important to note that this will likely add some extra work for limited queries, as more of their work will be parallelized, which can result in extra data being processed. This is the same as how filter queries are handled now, so it will be no worse than that. In fact, the bypass was an optimization based on the premise that it is advantageous not to split/shard limited queries, but we are seeing that this is not the case when limited queries run over very large time ranges, match streams with huge volumes, and are combined with parsers like `| json`.

**Checklist**
- [x] Tests updated
- [x] `CHANGELOG.md` updated

Signed-off-by: Edward Welch <edward.welch@grafana.com>
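Routing limited queries through the same tripperware means they now get split by the configured interval just like filter queries. The splitting itself is just chopping the range into interval-sized pieces, roughly as sketched below (invented function name, not Loki's actual splitter):

```go
package splitsketch

import "time"

// splitByInterval chops [start, end) into subranges no longer than step, so
// each subquery handled by a querier covers a bounded slice of time. If step
// is not positive, the range is returned unsplit.
func splitByInterval(start, end time.Time, step time.Duration) [][2]time.Time {
	if step <= 0 {
		return [][2]time.Time{{start, end}}
	}
	var out [][2]time.Time
	for s := start; s.Before(end); s = s.Add(step) {
		e := s.Add(step)
		if e.After(end) {
			e = end
		}
		out = append(out, [2]time.Time{s, e})
	}
	return out
}
```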
c1bccac141 | Results cache fix improvements (#7444) | 3 years ago

Move the middleware somewhere more sensible and incorporate feedback missed in the review for this.
f2297d3d2a | Fix result cache misses on sharded queries (#7429) | 3 years ago

Headers weren't being propagated from query responses on sharded queries, causing result cache misses. This PR:

- Adds headers to `logproto.Result` so we can save response headers as they come back from queries.
- Metric-query results are turned into numbers well away from where responses are sent to the frontend, so all headers are saved in the context so they can be pulled out when the final result is calculated.
- Propagates headers in the final responses.
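Stashing response headers in the request context so they survive until the final merged result is assembled can be sketched with a plain context value; the key and collector types here are illustrative, not the actual Loki ones:

```go
package headersketch

import (
	"context"
	"net/http"
	"sync"
)

type headerKey struct{}

// headerCollector accumulates response headers from subqueries so they can
// be re-attached when the final (merged) response is built.
type headerCollector struct {
	mu      sync.Mutex
	headers http.Header
}

// WithHeaderCollection returns a context that downstream query handlers can
// record response headers into, plus the collector to read them back later.
func WithHeaderCollection(ctx context.Context) (context.Context, *headerCollector) {
	c := &headerCollector{headers: http.Header{}}
	return context.WithValue(ctx, headerKey{}, c), c
}

// RecordHeaders is called where subquery responses are handled; it is a
// no-op if the context was not prepared with WithHeaderCollection.
func RecordHeaders(ctx context.Context, h http.Header) {
	c, ok := ctx.Value(headerKey{}).(*headerCollector)
	if !ok {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, vs := range h {
		for _, v := range vs {
			c.headers.Add(k, v)
		}
	}
}
```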
044d06015d | adds result cache key version comparison metrics (#7323) | 3 years ago

Follow-up to https://github.com/grafana/loki/pull/7300, giving more metric visibility into the underlying errors when comparing cache key versions on the queriers vs. the query frontends.
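Metrics of this kind are typically a labelled counter incremented on each comparison outcome. A sketch using the Prometheus client library; the metric name and label values are invented for illustration, not the ones added by the PR:

```go
package cachemetrics

import "github.com/prometheus/client_golang/prometheus"

// cacheKeyVersionComparisons counts comparisons of the cache key version
// generated on the querier versus the one received from the query frontend,
// labelled by outcome (e.g. "match", "mismatch", "parse_error").
var cacheKeyVersionComparisons = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "loki",
		Name:      "results_cache_key_version_comparisons_total",
		Help:      "Comparisons of result cache key versions between queriers and query frontends.",
	},
	[]string{"outcome"},
)

func init() {
	prometheus.MustRegister(cacheKeyVersionComparisons)
}

// recordComparison is called wherever the two versions are compared.
func recordComparison(outcome string) {
	cacheKeyVersionComparisons.WithLabelValues(outcome).Inc()
}
```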