apitech/loki - loki

Commit Graph

Author	SHA1	Message	Date
Karsten Jeschkies	fbcaa1d5d8	Lazily decode series protobuf. (#10071 ) What this PR does / why we need it: The protobuf decoding for series responses runs into a memory issue for many series. The response is only merged and passed through the front end. It is more efficient to decode the protobuf encoded series lazily. A `decode`, `merge`, `encode` benchmark shows the benefit of transcoding the protobuf series message into a JSON response directly. This change will not impact production code since it only applies to protobuf-encoded messaging between the querier and query frontend that must be explicitly enabled. ``` › go test -v -run=^$ -bench "Benchmark_DecodeMergeEncodeCycle" -memprofile memory_base.prof -count=10 ./pkg/querier/queryrange > before.txt › go test -v -run=^$ -bench "Benchmark_DecodeMergeEncodeCycle" -memprofile memory_base.prof -count=10 ./pkg/querier/queryrange > after.txt › benchstat before.txt after.txt before.txt:5: missing iteration count after.txt:5: missing iteration count goos: linux goarch: amd64 pkg: github.com/grafana/loki/pkg/querier/queryrange cpu: AMD Ryzen 7 3700X 8-Core Processor │ before.txt │ after.txt │ │ sec/op │ sec/op vs base │ _DecodeMergeEncodeCycle-16 2537.7m ± 2% 934.2m ± 1% -63.19% (p=0.000 n=10) │ before.txt │ after.txt │ │ B/op │ B/op vs base │ _DecodeMergeEncodeCycle-16 1723.4Mi ± 0% 641.1Mi ± 0% -62.80% (p=0.000 n=10) │ before.txt │ after.txt │ │ allocs/op │ allocs/op vs base │ _DecodeMergeEncodeCycle-16 20240.6k ± 0% 203.0k ± 0% -99.00% (p=0.000 n=10) ``` Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/setup/upgrade/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Travis Patterson	77b04b7963	Logging improvements for volume requests (#10099 ) I noticed we were logging volume requests as metric requests in `metrics.go`. This PR logs volume requests on their own with relevant data.	3 years ago
Travis Patterson	bfd196aa66	Sort log volumes by size (#10045 ) To save some frontend work, this PR sorts all the log volumes by volume rather than by name. For range queries, only the first value is used.	3 years ago
Trevor Whitney	a12311c5c8	Allow volume to be aggregated by label (#9988 ) Add an `aggregateBy` query parameter to the `series_volume` endpoint, and rename that endpoint to just `volume`. This allows users to get volumes aggregated into top-level labels, rather than label+value pairs. --------- Co-authored-by: Travis Patterson <travis.patterson@grafana.com>	3 years ago
Salva Corts	3f161f5c1a	Improve observability for non-indexed labels usage (#9993 ) What this PR does / why we need it: In https://github.com/grafana/loki/pull/9700, we support encoding and decoding metadata for each entry into the chunks. In this PR we: - Update the bytes processed stats to account for the bytes from those non-indexed labels - Add new stats for bytes processed for those non-indexed labels - Add new ingestion metrics to track ingested non-indexed bytes Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Karsten Jeschkies	efac690328	Use series hash to identify uniques in merge. (#9985 ) What this PR does / why we need it: A run of the protobuf encoding for the querier showed a spike in memory usage in the `codec.MergeResponse` method. The routine was allocating a lot of memory for the `series.String()` method which is then hashed. Instead, we can use a `uint64` hash. ``` › go test -v -run=^$ -bench "Benchmark_MergeResponses$" -count=10 ./pkg/querier/queryrange ... │ before.txt │ after.txt │ │ sec/op │ sec/op vs base │ _MergeResponses-16 4.563 ± 1% 1.197 ± 1% -73.77% (p=0.000 n=10) │ before.txt │ after.txt │ │ B/op │ B/op vs base │ _MergeResponses-16 6578.715Mi ± 0% 2.362Mi ± 0% -99.96% (p=0.000 n=10) │ before.txt │ after.txt │ │ allocs/op │ allocs/op vs base │ _MergeResponses-16 40500.1k ± 0% 100.1k ± 0% -99.75% (p=0.000 n=10) ``` Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
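The deduplication idea in the commit above can be sketched as follows. This is a minimal illustration, not Loki's `codec.MergeResponse`: the label-map representation, `seriesHash`, and `mergeUnique` are hypothetical names, and FNV-1a is an assumed hash choice. It shows a `uint64` key replacing a per-series string as the set key.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// seriesHash returns a uint64 hash of a label set (hypothetical helper).
// Label names are sorted first so that equal label sets hash equally.
func seriesHash(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
		h.Write([]byte(labels[name]))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

// mergeUnique concatenates responses and keeps the first occurrence of each
// series. Hash collisions are ignored in this sketch.
func mergeUnique(responses ...[]map[string]string) []map[string]string {
	seen := make(map[uint64]struct{})
	var merged []map[string]string
	for _, resp := range responses {
		for _, series := range resp {
			key := seriesHash(series)
			if _, dup := seen[key]; dup {
				continue
			}
			seen[key] = struct{}{}
			merged = append(merged, series)
		}
	}
	return merged
}

func main() {
	a := []map[string]string{{"app": "foo"}, {"app": "bar"}}
	b := []map[string]string{{"app": "foo"}}
	fmt.Println(len(mergeUnique(a, b))) // 2
}
```

The set key is a fixed-size integer, so no string is built per series just to be used as a map key.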
Ed Welch	fa6f9c638a	Loki: add 'post_filter_lines' stat for tracking how many lines match a query's filter expression(s) (#9983 ) What this PR does / why we need it: To better understand users' query behavior and labeling strategies, logging the number of log lines "post filtering" can be very useful. We already log the total_lines processed in a query, so this will allow us to see how many of the lines selected by the label selector matched what the query was looking for. Which issue(s) this PR fixes: Special notes for your reviewer: Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`) Signed-off-by: Edward Welch <edward.welch@grafana.com>	3 years ago
Karsten Jeschkies	a09cb07e98	Define protobufs for topk and cms. (#9933 ) What this PR does / why we need it: This change introduces Protobuf models for the sketch data structures which will be used to shard topk queries. Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`) --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Callum Styan <callumstyan@gmail.com>	3 years ago
Karsten Jeschkies	c2c5249676	Remove goroutines from limiter middleware. (#9923 ) What this PR does / why we need it: The `limitedRoundTripper` would spawn thousands if not millions of goroutines. However, they were mostly idle waiting for downstream calls. This would congest the Go scheduler. A semaphore is more suitable to control concurrency as the [example](https://pkg.go.dev/golang.org/x/sync/semaphore#example-package-WorkerPool) in the documentation illustrates.	3 years ago
Salva Corts	aae13c376d	Add metadata to push payload (#9694 ) What this PR does / why we need it: We are adding support for attaching labels to each log line. This is one of the series of the PRs broken up to make it easier to review changes. This PR updates the push payload to send labels with each log entry optionally. The log labels are supposed to be in the same format as the stream labels. Just to put it out, here is how it would look for proto and json push payload with same data: proto(`endpoint`: `(/loki/api/v1/push\|/api/prom/push)`, `Content-Type`: `application/x-protobuf`)(payload built using [push.Stream](`4cd1246b88/pkg/push/types.go (L12)`)): ``` push.Stream{ Entries: []logproto.Entry{ { Timestamp: time.Unix(0, 1688515200000000000), Line: "log line", Labels: `{foo="bar"}`, }, }, Labels: `{app="test"}`, } ``` v1(`endpoint`: `/loki/api/v1/push`, `Content-Type`: `application/json`): ```json { "streams": [{ "stream": { "app": "test" }, "values": [ ["1688515200000000000", "log line", { "foo": "bar" }] ] }] } ``` legacy-json(`/api/prom/push`, `Content-Type`: `application/json`): ```json { "streams": [{ "labels": "{app=\"test\"}", "entries": [{ "ts": "2023-07-05T00:00:00.000000000Z", "line": "log line", "labels": "{foo=\"bar\"}" }] }] } ``` Which issue(s) this PR fixes: Special notes for your reviewer: We may need to add more thoughtful tests. Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`) --------- Co-authored-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>	3 years ago
Trevor Whitney	9e19ff006c	Add targetLabels to SeriesVolume requests (#9878 ) Adds optional `targetLabels` parameter to `series_volume` and `series_volume_range` requests that controls how volumes are aggregated. When provided, volumes are aggregated into the intersections of the provided `targetLabels` only.	3 years ago
Travis Patterson	e2e695e8cd	Get basic statistics from series volume requests (#9832 ) The `series_volume` endpoint returns stats about the call. This PR wires things up such that basic stats are returned: - Execution time - Number of responses	3 years ago
Karsten Jeschkies	f5992274f4	Switch on pointer `*logproto.SeriesResponse` in JSON serialization. (#9868 ) What this PR does / why we need it: Commit `b35bbd80d6` introduced a regression in the deserialization code. It would switch on `logproto.SeriesResponse` instead of the pointer type `*logproto.SeriesResponse`. Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Karsten Jeschkies	b35bbd80d6	Support content negotiation between query frontend and querier. (#9813 ) What this PR does / why we need it: Currently, the querier sends results to the query frontend in JSON which is then decoded to Protobuf. It is more efficient to send the results as Protobuf. This will also allow to extend the results with custom data structures. The change is backwards compatible through content negotiation. Special notes for your reviewer: Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Owen Diehl	b251a10234	uses lowercase standard for logging and adds total bytes to shard res… (#9828 ) Minor logging cleanup + helpful to have total bytes on this log line to correlate without needing to derive total bytes from factor * bytes_per_shard	3 years ago
Trevor Whitney	d4cfbaac4f	Implement series volume range queries (#9812 ) This PR adds a `series_volume_range` endpoint which allows for series volume queries over time with a specified step, returning timeseries data in the form of a Prometheus matrix response. The existing `series_volume` endpoint still returns Prometheus vector responses, and hardcodes the step to 0.	3 years ago
Salva Corts	6cc581bd26	Add back cache stats for index stats requests (#9816 ) What this PR does / why we need it: In https://github.com/grafana/loki/pull/9536, we added cache stats for index stats requests. That PR had a bug that inflated the query stats due to reusing the stats context in the query engine. Therefore, we had to revert the PR at https://github.com/grafana/loki/pull/9721. This PR brings back the changes from https://github.com/grafana/loki/pull/9536 but fixes the inflated stats by no longer reusing the same context in the query engine, but rather creating a new one for the shard resolver. I tested it on a dev cluster and it seems to be working fine. Here's the output for the same query: Stats with the bug from #9536: ``` ... Cache.StatsResult.Requests 980 Cache.StatsResult.EntriesRequested 490 Cache.StatsResult.EntriesFound 0 Cache.StatsResult.EntriesStored 490 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 0 B ... Summary.BytesProcessedPerSecond 43 GB Summary.LinesProcessedPerSecond 93305142 Summary.TotalBytesProcessed 945 GB Summary.TotalLinesProcessed 2059694183 ``` Stats from _main_ ``` ... Summary.BytesProcessedPerSecond 1.6 GB Summary.LinesProcessedPerSecond 3403718 Summary.TotalBytesProcessed 95 GB Summary.TotalLinesProcessed 207971404 ``` Stats with fix in this PR ``` ... Cache.StatsResult.Requests 132 Cache.StatsResult.EntriesRequested 66 Cache.StatsResult.EntriesFound 0 Cache.StatsResult.EntriesStored 66 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 0 B ... Summary.BytesProcessedPerSecond 4.3 GB Summary.LinesProcessedPerSecond 9468900 Summary.TotalBytesProcessed 95 GB Summary.TotalLinesProcessed 207793816 ``` As can be seen, with the changes in this PR, the summary stats are no longer inflated. 
Which issue(s) this PR fixes: Fixes https://github.com/grafana/loki/pull/9536 Special notes for your reviewer: I think it's ok to skip reviewing the changes from the commit cherry-picking the changes from https://github.com/grafana/loki/pull/9536 Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Owen Diehl	e2a63e582c	adds tsdb-max-bytes-per-shard limit (#9811 ) Adds the per tenant limit `tsdb-max-bytes-per-shard` which is used in configuring the shard size for tsdb subqueries. This effectively gives control over how big subqueries should be (assuming they're shardable). The default is no different (`600MB`) than the previously hardcoded initial value. This should help us iterate to find optimal shard sizes to improve operations in the long term.	3 years ago
Travis Patterson	8ca035ffbf	Log Volume: Do the prometheus-format conversion in dedicated middleware (#9776 ) This PR moves the Prometheus-format conversion to its own middleware. This keeps it all in one place and consolidates logic that was in many places. It also sets us up to make range queries as a next step.	3 years ago
Sandeep Sukhani	3e1f2fc273	caching: do not try to fill the gap in log results cache when the new query interval does not overlap the cached query interval (#9757 ) What this PR does / why we need it: Currently, when we find a relevant cached negative response for a logs query, we do the following: * If the cached query completely covers the new query: * return back an empty response. * else: * fill the gaps on either/both sides of the cached query. The problem with filling the gaps is that when the cached query does not overlap at all with the new query, we have to extend the query beyond what the query requests for. However, with the logs query, we have a limit on the number of lines we can send back in the response. So, this could result in the query response having logs which were not requested by the query, which then get filtered out by the [response extractor](`b78d3f0552/pkg/querier/queryrange/log_result_cache.go (L299)`), unexpectedly resulting in an empty response. For example, if the query was cached for start=15, end=20 and we get a `backwards` query for start=5, end=10. To fill the gap, the query would be executed for start=5, end=15. Now, if we have logs more than the query `limit` in the range 10-15, we would filter out all the data in the response extractor and send back an empty response to the user. This PR fixes the issue by doing the following changes when handling cache hit: * If the cached query completely covers the new query: * return back an empty response[_existing_]. * else if the cached query does not overlap with the new query: * do the new query as requested. 
* If the new query results in an empty response and has a higher interval than the cached query: * update the cache * else: * query the data for missing intervals on both/either side[_existing_] * update the cache with extended intervals if the new queries resulted in an empty response[_existing_] Special notes for your reviewer: We could do further improvements in the handling of queries not overlapping with cached query by selectively extending the queries based on query direction and cached query lying before/after the new query. For example, if the new query is doing `backwards` query and the `cachedQuery.End` < `newQuery.Start`, it should be okay to extend the query and do `cachedQuery.End` to `newQuery.End` to fill the cache since query would first fill the most relevant data before hitting the limits. I did not want to complicate the fix so went without implementing this approach. We can revisit later if we feel we need to improve our caching. Checklist - [x] Tests updated - [x] `CHANGELOG.md` updated --------- Co-authored-by: Travis Patterson <travis.patterson@grafana.com>	3 years ago
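The cache-hit handling described in the commit above can be sketched as a small decision function. This is an illustration only, not the code in `log_result_cache.go`: `interval` and `planQueries` are hypothetical names, timestamps are plain integers, and intervals that merely touch at a boundary are assumed to overlap. The cache-update steps are left out.

```go
package main

import "fmt"

// interval is a closed time range with integer timestamps (hypothetical type).
type interval struct{ start, end int64 }

// planQueries returns the intervals that still have to be queried given a
// cached negative (empty) response for `cached` and a new request `req`.
func planQueries(cached, req interval) []interval {
	switch {
	case cached.start <= req.start && req.end <= cached.end:
		// The cached query completely covers the new one: nothing to query,
		// an empty response can be returned.
		return nil
	case req.end < cached.start || cached.end < req.start:
		// No overlap: run the new query exactly as requested instead of
		// stretching it across the space between the two intervals.
		return []interval{req}
	}
	// Partial overlap: query only the missing parts on either/both sides.
	var gaps []interval
	if req.start < cached.start {
		gaps = append(gaps, interval{req.start, cached.start})
	}
	if req.end > cached.end {
		gaps = append(gaps, interval{cached.end, req.end})
	}
	return gaps
}

func main() {
	cached := interval{15, 20}
	fmt.Println(planQueries(cached, interval{16, 18})) // []
	fmt.Println(planQueries(cached, interval{5, 10}))  // [{5 10}]
	fmt.Println(planQueries(cached, interval{10, 25})) // [{10 15} {20 25}]
}
```

With the commit's example (cached start=15, end=20; new query start=5, end=10), the sketch queries 5 to 10 rather than 5 to 15, so logs from 10 to 15 cannot use up the line limit.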
Susana Ferreira	35465d0297	Fix instant query summary split stats (#9773 ) What this PR does / why we need it: Fix instant query summary statistic's `splits` corresponding to the number of subqueries a query is split into based on `split_queries_by_interval`. * Update rangemapper with a statistics structure to include the number of split queries a query is mapped into. * In the `split_by_range` middleware once the mapped query is returned update the middleware statistics with the number of split queries. This value will then be merged with the statistics of the Loki response. Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Travis Patterson	4da0f63789	Remove unused Value field (#9774 ) We didn't end up needing the `Value` field because we can express everything we need to as selectors	3 years ago
Travis Patterson	806674fdaa	Add log-volume feature flag (#9762 ) Adds a feature flag for use with the new log-volume endpoints so associated features can be rolled out incrementally.	3 years ago
Trevor Whitney	dbc3040739	Convert SeriesVolume response to prometheus-style (#9703 ) Changes the response type of the label volume stats endpoint to return volumes as prometheus-style timeseries metrics. It currently only supports instant queries, but is a necessary step to eventually supporting range queries.	3 years ago
Salva Corts	b7359c5d53	Revert "Add summary stats and metrics for stats cache (#9536 )" (#9721 ) This reverts commit `af287ac3eb`. There is a bug in this PR that inflates the stats returned for the query since we reuse the stats ctx in the query execution engine.	3 years ago
Travis Patterson	db97058a84	Series volume endpoint (#9704 ) This changes the `label_volume` endpoint to the `series_volume` endpoint. The new endpoint still returns volumes, but now it does so for the requested streams defined by the selector names passed rather than for individual labels. All relevant non-requested labels are aggregated into the returned results. For example, assume we have the following streams: ``` {cluster="prod", team="A", component="foo"} {cluster="prod", team="B", component="foo"} {cluster="dev", team="A", component="foo"} {cluster="dev", team="B", component="foo"} ``` - requesting `{cluster="prod"}` returns one result for all streams containing `{cluster="prod"}` - requesting `{cluster=~".+"}` returns two results for the streams containing `{cluster="prod"}` and `{cluster="dev"}` - requesting `{cluster=~".+", team=~".+"}` returns four results for the streams containing: ``` {cluster="prod", team="A"} {cluster="prod", team="B"} {cluster="dev", team="A"} {cluster="dev", team="B"} ``` --------- Co-authored-by: Trevor Whitney <trevorjwhitney@gmail.com>	3 years ago
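The aggregation in the example above can be sketched as grouping streams by the values of the requested label names. This is a simplified illustration, not the endpoint's implementation: `stream`, `aggregate`, and the byte counts are hypothetical, and matcher filtering (for example `cluster="prod"` excluding the `dev` streams) is not modeled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// stream is a label set with a volume in bytes (hypothetical type).
type stream struct {
	labels map[string]string
	bytes  uint64
}

// aggregate sums volumes per distinct combination of values for the
// requested label names. Labels that were not requested are folded into
// the matching group; streams missing a requested label are skipped.
func aggregate(streams []stream, names []string) map[string]uint64 {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)

	volumes := make(map[string]uint64)
	for _, s := range streams {
		parts := make([]string, 0, len(sorted))
		complete := true
		for _, name := range sorted {
			value, ok := s.labels[name]
			if !ok {
				complete = false
				break
			}
			parts = append(parts, fmt.Sprintf("%s=%q", name, value))
		}
		if !complete {
			continue
		}
		volumes["{"+strings.Join(parts, ", ")+"}"] += s.bytes
	}
	return volumes
}

func main() {
	streams := []stream{
		{map[string]string{"cluster": "prod", "team": "A", "component": "foo"}, 10},
		{map[string]string{"cluster": "prod", "team": "B", "component": "foo"}, 10},
		{map[string]string{"cluster": "dev", "team": "A", "component": "foo"}, 10},
		{map[string]string{"cluster": "dev", "team": "B", "component": "foo"}, 10},
	}
	fmt.Println(len(aggregate(streams, []string{"cluster"})))         // 2
	fmt.Println(len(aggregate(streams, []string{"cluster", "team"}))) // 4
}
```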
Trevor Whitney	4a56445686	Upgrade `golangci-lint` and fix linting errors (#9601 ) What this PR does / why we need it: Upgrade `golangci-lint` and fixes all the errors. The upgrade includes some stricter linting.	3 years ago
Travis Patterson	065bee7e72	Label Volume Endpoint (#9588 ) For a given set of matchers, returns the top N associated label/value pairs by volume. A query for `{cluster=prod}` will return ``` cluster=prod: size (total logs matching this matcher) . . . nth-label=nth-value ``` This is to service use cases where users want to understand where their log volume has come from by label without making multiple requests to the stats endpoint. Note: This PR is a monster but it's mostly plumbing. I've pointed out the most interesting bits that actually get the volumes from ingesters/indexes	3 years ago
Salva Corts	73ac208981	Improve docs for empty value in cache compression config (#9649 ) What this PR does / why we need it: Follow up PR for https://github.com/grafana/loki/pull/9535#discussion_r1218167670 Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Salva Corts	c6fbff26e1	Add config to avoid caching stats for recent data (#9537 ) What this PR does / why we need it: When we query the stats for recent data, we query both the ingesters and the index gateways for the stats. `ebdb2b1800/pkg/storage/async_store.go (L112-L114)` `ebdb2b1800/pkg/storage/async_store.go (L126-L127)` Then we merge all the responses, which means summing up all the stats `ebdb2b1800/pkg/storage/async_store.go (L157-L158)` `ebdb2b1800/pkg/storage/stores/index/stats/stats.go (L23-L26)` Because we have a replication factor of 3, this means that we will get the stats from the ingesters repeated up to 3 times, hence inflating the stats. In the stats cache, we store the stats for a given matcher set for the whole day, then we extract the stats from the cache by the factor of time from the request that is stored in the cache: `336283acad/pkg/querier/queryrange/index_stats_cache.go (L33)` `336283acad/pkg/querier/queryrange/index_stats_cache.go (L40)` Inflated stats for recent data will be cached, so subsequent stats extracted from the cache will be inflated regardless of the time. This PR adds a new per-tenant limit `max_stats_cache_freshness` to not cache requests with an end time that falls within Now minus this duration. Here's a scenario illustrating this. The graphs below show the bytes stats queried in the sharding middleware. We are running a log filter query that won't match any log, every 5 seconds with a length of 3h. ![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39) As can be seen, after enabling the stats cache and configuring `do_not_cache_request_within` to not cache stats for requests within 30m, the bytes stats used in the sharding middleware stopped increasing. In both cases the stats cache hit ratio was 100%. 
![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23) Special notes for your reviewer: - Blocked by https://github.com/grafana/loki/pull/9535 - Note that this PR doesn't fix the root issue of inflated stats from the ingesters, but rather buys us some time to work on that. Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Ivana Huckova	eb7dae4583	Loki: Improve error message when step too low (#9641 ) What this PR does / why we need it: In https://github.com/grafana/grafana/pull/69648 we are introducing a step editor for Loki in Grafana. Unfortunately, the error message shown when a user sets the step parameter too low is hard to understand, so I am proposing the following change to make it more understandable and actionable. Let me know what you think. --------- Co-authored-by: J Stickler <julie.stickler@grafana.com>	3 years ago
Salva Corts	af287ac3eb	Add summary stats and metrics for stats cache (#9536 ) What this PR does / why we need it: When a query finishes, we return (and log) the following stats: ```go Cache.Chunk.Requests 0 Cache.Chunk.EntriesRequested 0 Cache.Chunk.EntriesFound 0 Cache.Chunk.EntriesStored 0 Cache.Chunk.BytesSent 0 B Cache.Chunk.BytesReceived 0 B Cache.Chunk.DownloadTime 0s Cache.Index.Requests 0 Cache.Index.EntriesRequested 0 Cache.Index.EntriesFound 0 Cache.Index.EntriesStored 0 Cache.Index.BytesSent 0 B Cache.Index.BytesReceived 0 B Cache.Index.DownloadTime 0s Cache.Result.Requests 13 Cache.Result.EntriesRequested 13 Cache.Result.EntriesFound 13 Cache.Result.EntriesStored 0 Cache.Result.BytesSent 0 B Cache.Result.BytesReceived 2.5 kB Cache.Result.DownloadTime 4.600266ms ``` In addition to that, we log the following in metrics.go: ``` level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.\", container=~\"distributor\|ingester\|promtail\|index-gateway\|compactor\"} \|= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 limit=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms ``` With the goal of being able to better monitor how the stats cache is performing; this PR adds stats for the index stats cache, similarly to how it's done for the results cache. 
Here's an example of the new stats being returned and printed: ```go ... Cache.StatsResult.Requests 180 Cache.StatsResult.EntriesRequested 129 Cache.StatsResult.EntriesFound 129 Cache.StatsResult.EntriesStored 51 Cache.StatsResult.BytesSent 0 B Cache.StatsResult.BytesReceived 75 kB ... ``` And the new stats from metrics.go ``` ... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_time=156.864429ms ... ``` Special notes for your reviewer: - Blocked by https://github.com/grafana/loki/pull/9535 - Note the new `stats.GetOrCreateContext` func. It's used inside the `query.Exec` method so we don't overwrite the stats added in the stats middleware. Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Salva Corts	1694ad0f9b	Stats cache can be configured independently (#9535 ) What this PR does / why we need it: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. Which issue(s) this PR fixes: This is needed for: - https://github.com/grafana/loki/pull/9537 - https://github.com/grafana/loki/pull/9536 Special notes for your reviewer: - This PR also refactors all the tripperwares in roundtrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in roundtrip.go is a requirement for https://github.com/grafana/loki/pull/9536 so the stats summary can distinguish between the stats cache and the results cache. Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Salva Corts	87a659a6db	Add span events for index stats and result cache (#9552 ) What this PR does / why we need it: This PR adds events to the traces to have some extra observability for how we compute the index stats. We also add some trace events to the results cache. ![image](https://github.com/grafana/loki/assets/8354290/7566b755-8193-4e46-ba10-37d3377ea31a) ![image](https://github.com/grafana/loki/assets/8354290/d1990150-84b1-4522-9898-6e37c2782c5b) ![image](https://github.com/grafana/loki/assets/8354290/a8c23e7f-a06d-4a47-8cd4-e900fce01e80) ![image](https://github.com/grafana/loki/assets/8354290/d1e15fb6-fb6c-4fe1-9c5f-f1c8164889de) ![image](https://github.com/grafana/loki/assets/8354290/0c0d001e-7083-488c-8809-0446b4b7c852) Special notes for your reviewer: Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](`d10549e3ec`)	3 years ago
Owen Diehl	2efd059b49	Slight improvements to `GetFactorOfTime` (#9473 ) * correctly returns zero for non-overlapping data * adds tests	3 years ago
Paul Rogers	14370bb8ce	Revert "Augment statistics.." PR 9400. (#9430 ) What this PR does / why we need it: This PR reverts PR 9400. The data collected within that PR was not sufficient. When queries are done, they are filtered before the merge iterator, resulting in an inability to collect an accurate count of duplicated data. Which issue(s) this PR fixes: Special notes for your reviewer: Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`	3 years ago
Paul Rogers	1671751cbd	Augment statistics to note how many bytes are in duplicate lines due to replicas (#9400 ) What this PR does / why we need it: This PR is for counting the number of bytes of log lines that were marked as duplicates. This will be utilized to collect better statistics.	3 years ago
Peter Štibraný	90a1d4593e	Update Prometheus dependency (#9205 )	3 years ago
Salva Corts	422560b6b1	Flag to disable index stats cache (#9177 ) What this PR does / why we need it: At https://github.com/grafana/loki/pull/8972 we started caching all index stats requests. If the results cache gets overloaded, it can quickly take down the rest of the loki cell due to all the increased work. This PR adds a new flag so we can easily disable caching index stats requests. Which issue(s) this PR fixes: This PR is a follow up for https://github.com/grafana/loki/pull/8972 Special notes for your reviewer: Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`	3 years ago
Salva Corts	fd16425062	Cache index stats requests (#8972 ) What this PR does / why we need it: As described in https://github.com/grafana/loki/issues/8973, we are substantially increasing the load of index stat requests we send to our index gateways. Many of these requests should be easily re-used by caching them. This PR adds caching for index stat requests by reusing the results cache. Here's a demo ([source][1]): ![image](https://user-images.githubusercontent.com/8354290/229104609-4dd26f0a-9260-4f21-85ef-ac4a86ebba7a.png) Which issue(s) this PR fixes: Fixes https://github.com/grafana/loki/issues/8973 Special notes for your reviewer: Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: https://ops.grafana-ops.net/d/afcaef21-e5ad-49e7-ab06-42a9d7d915eb/index-stats?orgId=1&var-datasource=dev-cortex&var-cluster=dev-eu-west-2&var-namespace=loki-dev-009&var-loki_datasource=Grafana%20Logging&from=1680259907288&to=1680260431814&var-operation=All --------- Co-authored-by: Owen Diehl <ow.diehl@gmail.com>	3 years ago
Salva Corts	8cf921a145	Pass engine opts down to middlewares (#9130 ) What this PR does / why we need it: The following middlewares in the query frontend use a downstream engine: - `NewQuerySizeLimiterMiddleware` and `NewQuerierSizeLimiterMiddleware` - `NewQueryShardMiddleware` - `NewSplitByRangeMiddleware` These were all creating the downstream engine as follows: ```go logql.NewDownstreamEngine(logql.EngineOpts{LogExecutingQuery: false}, DownstreamHandler{next: next, limits: limits}, limits, logger), ``` As can be seen, the [engine options configured in Loki][1] were not being used at all. In the case of `NewQuerySizeLimiterMiddleware`, `NewQuerierSizeLimiterMiddleware` and `NewQueryShardMiddleware`, the downstream engine was created to get the `MaxLookBackPeriod`. When creating a new Downstream Engine as above, the `MaxLookBackPeriod` [would always be the default][2] (30 seconds). This PR fixes this by passing down the engine config to these middlewares, so this config is used to create the new downstream engines. Which issue(s) this PR fixes: Addresses some pending tasks from https://github.com/grafana/loki/pull/8670#issuecomment-1507031976. Special notes for your reviewer: Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: `1bcf683513/pkg/querier/querier.go (L52)` [2]: `edc6b0bff7/pkg/logql/engine.go (L136-L140)`	3 years ago
Trevor Whitney	c587b538ed	Fail through to next middleware when querySizeLimit cannot be applied (#9050 ) What this PR does / why we need it: When the query size limiter can't limit the query, fail through to the next middleware instead of erroring. This can happen, for example, when a query spans schemas, which is still a valid query case, so we want to make sure to fall back to existing behavior. --------- Co-authored-by: Owen Diehl <ow.diehl@gmail.com>	3 years ago
Owen Diehl	acb40ed40e	Eager stream merge (#8968 ) This PR introduces a specialized heap-based data structure to merge incoming log results in the frontend. Recently we've experienced an increase in OOMs on frontends due to logs queries which match lots of data. Sharded requests in loki split based on the amount of data we expect and some queries see thousands of sub requests. For log queries, we'll fetch up to the `limit` from each shard, return them to the frontend, and merge. High shard counts * limit log lines, especially combined with large log lines (in byte terms) are accumulated on the frontend. Once they all are received, the frontend merges them. This creates opportunity for OOMs as it can hold up a lot of memory. This PR addresses one of these problems by eagerly accumulating responses as they're received and only retaining a total `limit` number of entries. There's still OOM potential due to race conditions between sub requests returning to the query-frontend and the query-frontend merging other sub requests, but this definitely improves the situation. I've been able to consistently run large limited queries that touch TBs of data (i.e. `{cluster=~".+"} \|= "a"`) that previously OOMed frontends. --------- Signed-off-by: Owen Diehl <ow.diehl@gmail.com>	3 years ago
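The eager merge described in #8968 can be illustrated with a bounded min-heap that keeps only the newest `limit` entries as each shard response arrives. This is a minimal sketch, not the data structure from the PR: the `entry` and `accumulator` types and the `add`/`result` methods are hypothetical names, and it assumes a backward (newest-first) log query.

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry is a hypothetical log entry: a timestamp in nanoseconds and a line.
type entry struct {
	ts   int64
	line string
}

// minHeap keeps the oldest retained entry at the root, so it is the
// first candidate for eviction when a newer entry arrives.
type minHeap []entry

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].ts < h[j].ts }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// accumulator (hypothetical) retains at most limit entries in total,
// no matter how many shard responses are added.
type accumulator struct {
	limit int
	h     minHeap
}

// add merges one shard response eagerly instead of buffering it.
func (a *accumulator) add(batch []entry) {
	for _, e := range batch {
		if a.h.Len() < a.limit {
			heap.Push(&a.h, e)
			continue
		}
		// Heap is full: replace the oldest entry only if e is newer.
		if a.limit > 0 && e.ts > a.h[0].ts {
			a.h[0] = e
			heap.Fix(&a.h, 0)
		}
	}
}

// result drains the heap and returns the entries newest first.
func (a *accumulator) result() []entry {
	out := make([]entry, a.h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(&a.h).(entry)
	}
	return out
}

func main() {
	acc := &accumulator{limit: 3}
	acc.add([]entry{{1, "a"}, {5, "e"}, {3, "c"}}) // response from shard 1
	acc.add([]entry{{4, "d"}, {2, "b"}, {6, "f"}}) // response from shard 2
	for _, e := range acc.result() {
		fmt.Println(e.ts, e.line)
	}
}
```

Memory stays proportional to `limit` rather than to shard count times `limit`, which is the property the commit message is after.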
Owen Diehl	62403350a5	remove redundant splitby middleware (#8996 ) Found this double-copied line which is a mistake. This PR removes one of them which won't change behavior (besides removing duplicate spans/etc).	3 years ago
Ed Welch	b892cade6a	Loki: Fixes incorrect query result when querying with start time == end time (#8979 ) What this PR does / why we need it: In several places within Loki we need to determine if a query is a `range query` or `instant query`; this is done by checking whether the start and end time are equal and `step=0`. The downstream handler was not checking for `step=0` and thus it incorrectly mapped a range query to an instant query when a query had a start time equal to its end time. There are a few other things at play here, mainly that we should really error anytime someone tries to run an instant query for logs which would have exposed this error much more easily. But that's something I'd like to handle in a different PR as it will be considered a breaking change depending on how we do it. This PR uses an existing function we have for testing the query type and addresses the issue found in #8885 Which issue(s) this PR fixes: Fixes #8885 Special notes for your reviewer: Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [ ] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Edward Welch <edward.welch@grafana.com>	3 years ago
Ed Welch	edc6b0bff7	Loki: Add a limit for the [range] value on range queries (#8343 ) Signed-off-by: Edward Welch <edward.welch@grafana.com> What this PR does / why we need it: Loki does not currently split queries by time to a value smaller than what's in the [range] of a range query. Example ``` sum(rate({job="foo"}[2d])) ``` Imagine now this query being executed over a longer window of a few days with a step of something like 30m. Every step evaluation would query the last [2d] of data. There are use cases where this is desired, specifically if you force the step to match the value in the range, however what is more common is someone accidentally uses `[$__range]` in here instead of `[$__interval]` within Grafana and then sets the query time selector to a large value like 7 days. This PR adds a limit which will fail queries that set the [range] value higher than the configured limit. It's disabled by default. In the future it may be possible for Loki to perform splits within the [range] and remove the need for this limit, but until then this can be an important safeguard in clusters with a lot of data. Which issue(s) this PR fixes: Fixes #8746 Special notes for your reviewer: Checklist - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Edward Welch <edward.welch@grafana.com> Co-authored-by: Karsten Jeschkies <karsten.jeschkies@grafana.com> Co-authored-by: Vladyslav Diachenko <82767850+vlad-diachenko@users.noreply.github.com>	3 years ago
Dylan Guedes	9159c1dac3	Loki: Improve spans usage (#8927 ) What this PR does / why we need it: - At different places, inherit the span/spanlogger from the given context instead of instantiating a new one from scratch, which fixes spans being orphaned on a read/write operation. - At different places, turn spans into events. Events are lighter than spans and by having fewer spans in the trace, trace visualization will be cleaner without losing any details. - Adds new spans/events to places that might be a bottleneck for our writes/reads.	3 years ago
Periklis Tsirakidis	1bcf683513	Expose optional label matcher for label values handler (#8824 )	3 years ago
Salva Corts	45775c82f7	Implement `RequiredNumberLabels` query limit (#8918 ) What this PR does / why we need it: As pointed out in https://github.com/grafana/loki/pull/8851, some queries can impose a great workload on a cluster by selecting too many streams. Similarly to the `RequiredLabels` limit introduced at https://github.com/grafana/loki/pull/8851, here we add a new limit `RequiredNumberLabels` to require queries to specify at least N labels. For example, if the limit is set to 2, then the query should contain at least 2 label matchers. This limit can be configured per tenant and at query time. ![image](https://user-images.githubusercontent.com/8354290/228271398-4b9bcc49-f539-4e94-86c1-071e519a30a9.png) Which issue(s) this PR fixes: Fixes https://github.com/grafana/loki-private/issues/699 Special notes for your reviewer: Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>	3 years ago
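The `RequiredNumberLabels` check from #8918 amounts to counting the label matchers in a stream selector. A minimal sketch follows, assuming the selector has already been parsed; the `matcher` type, the `checkRequiredNumberLabels` function, and the error text are hypothetical and not taken from the Loki code.

```go
package main

import "fmt"

// matcher is a hypothetical, already-parsed label matcher such as job="foo".
type matcher struct {
	name, op, value string
}

// checkRequiredNumberLabels rejects a selector with fewer than required
// matchers. A value of 0 or less is treated here as "limit disabled",
// which is an assumption of this sketch.
func checkRequiredNumberLabels(matchers []matcher, required int) error {
	if required <= 0 {
		return nil
	}
	if len(matchers) < required {
		return fmt.Errorf("stream selector has %d label matchers, but at least %d are required",
			len(matchers), required)
	}
	return nil
}

func main() {
	sel := []matcher{{"job", "=", "foo"}}
	fmt.Println(checkRequiredNumberLabels(sel, 2)) // rejected: only 1 matcher
	sel = append(sel, matcher{"env", "=", "prod"})
	fmt.Println(checkRequiredNumberLabels(sel, 2)) // <nil>
}
```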
Salva Corts	ee69f2bd37	Split index request in 24h intervals (#8909 ) What this PR does / why we need it: At https://github.com/grafana/loki/pull/8670, we applied a time split of 24h intervals to all index stats requests to enforce the `max_query_bytes_read` and `max_querier_bytes_read` limits. When the limit is surpassed, the following message gets displayed: ![image](https://user-images.githubusercontent.com/8354290/227960400-b74a0397-13ef-4143-a1fc-48d885af55c0.png) As can be seen, the reported bytes read by the query are not the same as those reported by Grafana in the lower right corner of the query editor. This is because: 1. The index stats request for enforcing the limit is split in subqueries of 24h. The other index stats request is not time split. 2. When enforcing the limit, we are not displaying the bytes in powers of 2, but powers of 10 ([see here][2]). I.e. 1KB is 1000B vs 1KiB is 1024B. This PR adds the same logic to all index stats requests so we also time split by 24h intervals all requests that hit the Index Stats API endpoint. We also use powers of 2 instead of 10 on the message when enforcing `max_query_bytes_read` and `max_querier_bytes_read`. ![image](https://user-images.githubusercontent.com/8354290/227959491-f57cf7d2-de50-4ee6-8737-faeafb528f99.png) Note that the library we use under the hood to print the bytes rounds up and down to the nearest integer ([see][3]); that's why we see 16GiB compared to the 15.5GB in the Grafana query editor. Which issue(s) this PR fixes: Fixes https://github.com/grafana/loki/issues/8910 Special notes for your reviewer: - I refactored the `newQuerySizeLimiter` function and the rest of the _Tripperwares_ in `roundtrip.go` to reuse the new IndexStatsTripperware. So we configure the split-by-time middleware only once. 
Checklist - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (required) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: https://grafana.com/docs/loki/latest/api/#index-stats [2]: https://github.com/grafana/loki/blob/main/pkg/querier/queryrange/limits.go#L367-L368 [3]: https://github.com/dustin/go-humanize/blob/master/bytes.go#L75-L78	3 years ago
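The powers-of-10 versus powers-of-2 difference mentioned in #8909 (1 KB = 1000 B, 1 KiB = 1024 B) can be made concrete with a small formatter. This is a standard-library sketch, not the go-humanize code the PR links to; `formatBytes` and the unit tables are hypothetical, and the one-decimal rounding is a choice of this example.

```go
package main

import "fmt"

var (
	siUnits  = []string{"B", "kB", "MB", "GB", "TB"}     // powers of 10
	iecUnits = []string{"B", "KiB", "MiB", "GiB", "TiB"} // powers of 2
)

// formatBytes divides n by base until the value fits the largest
// applicable unit, then prints it with one decimal place.
func formatBytes(n uint64, base float64, units []string) string {
	v := float64(n)
	i := 0
	for v >= base && i < len(units)-1 {
		v /= base
		i++
	}
	return fmt.Sprintf("%.1f %s", v, units[i])
}

func main() {
	const n = 16000000000 // the same byte count, rendered two ways
	fmt.Println(formatBytes(n, 1000, siUnits))  // 16.0 GB
	fmt.Println(formatBytes(n, 1024, iecUnits)) // 14.9 GiB
}
```

The same count reads about 7% smaller in GiB than in GB, so two displays of one value can disagree unless both use the same base.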


232 Commits (89985399817dde28fe5b94de990133d904bc3e5b)