loki/pkg/querier/queryrange/index_stats_cache.go

package queryrange

import (
	"context"
	"flag"
	"fmt"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/tenant"
	"github.com/prometheus/common/model"

	"github.com/grafana/loki/pkg/logproto"
	"github.com/grafana/loki/pkg/querier/queryrange/queryrangebase"
	"github.com/grafana/loki/pkg/storage/chunk/cache"
	"github.com/grafana/loki/pkg/util"
	"github.com/grafana/loki/pkg/util/validation"
)

// IndexStatsSplitter generates cache keys for index stats requests.
type IndexStatsSplitter struct {
	cacheKeyLimits
}

// GenerateCacheKey generates a cache key based on the userID, Request and interval.
// The key is prefixed with "indexStats:".
func (i IndexStatsSplitter) GenerateCacheKey(ctx context.Context, userID string, r queryrangebase.Request) string {
	cacheKey := i.cacheKeyLimits.GenerateCacheKey(ctx, userID, r)
	return fmt.Sprintf("indexStats:%s", cacheKey)
}

// IndexStatsExtractor extracts the part of a cached index stats response
// that corresponds to a requested time range.
type IndexStatsExtractor struct{}

// Extract favors the ability to cache over exactness of results. It assumes a constant distribution
// of log volumes over a range and will extract subsets proportionally.
// Only Bytes and Entries are scaled by the time factor; Streams and Chunks are returned as cached.
func (p IndexStatsExtractor) Extract(start, end int64, res queryrangebase.Response, resStart, resEnd int64) queryrangebase.Response {
	factor := util.GetFactorOfTime(start, end, resStart, resEnd)

	statsRes := res.(*IndexStatsResponse)
	return &IndexStatsResponse{
		Response: &logproto.IndexStatsResponse{
			Streams: statsRes.Response.GetStreams(),
			Chunks:  statsRes.Response.GetChunks(),
			Bytes:   uint64(float64(statsRes.Response.GetBytes()) * factor),
			Entries: uint64(float64(statsRes.Response.GetEntries()) * factor),
		},
	}
}

// ResponseWithoutHeaders returns a new response that holds only the stats payload.
func (p IndexStatsExtractor) ResponseWithoutHeaders(resp queryrangebase.Response) queryrangebase.Response {
	statsRes := resp.(*IndexStatsResponse)
	return &IndexStatsResponse{
		Response: statsRes.Response,
	}
}
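
// IndexStatsCacheConfig configures the index stats cache. It embeds the
// results cache config so the stats cache can be configured independently.
type IndexStatsCacheConfig struct {
	queryrangebase.ResultsCacheConfig `yaml:",inline"`
}

// RegisterFlags registers flags.
func (cfg *IndexStatsCacheConfig) RegisterFlags(f *flag.FlagSet) {
	cfg.ResultsCacheConfig.RegisterFlagsWithPrefix(f, "frontend.index-stats-results-cache.")
}

// Validate validates the embedded results cache config.
func (cfg *IndexStatsCacheConfig) Validate() error {
	return cfg.ResultsCacheConfig.Validate()
}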

// statsCacheMiddlewareNowTimeFunc is a function that returns the current time.
// It is used to allow tests to override the current time.
var statsCacheMiddlewareNowTimeFunc = model.Now

// shouldCacheStats returns true if the request should be cached.
// It returns false if:
// - The request end time falls within the max_stats_cache_freshness duration.
// A max_stats_cache_freshness of 0 disables this check, so every request is cacheable.
func shouldCacheStats(ctx context.Context, req queryrangebase.Request, lim Limits) (bool, error) {
	tenantIDs, err := tenant.TenantIDs(ctx)
	if err != nil {
		return false, err
	}

	cacheFreshnessCapture := func(id string) time.Duration { return lim.MaxStatsCacheFreshness(ctx, id) }
	maxCacheFreshness := validation.MaxDurationPerTenant(tenantIDs, cacheFreshnessCapture)

	now := statsCacheMiddlewareNowTimeFunc()
	return maxCacheFreshness == 0 || model.Time(req.GetEnd()).Before(now.Add(-maxCacheFreshness)), nil
}

// NewIndexStatsCacheMiddleware returns a results cache middleware for index stats requests.
// A request is cached only if shouldCache (when set) and shouldCacheStats both allow it.
func NewIndexStatsCacheMiddleware(
	log log.Logger,
	limits Limits,
	merger queryrangebase.Merger,
	c cache.Cache,
	cacheGenNumberLoader queryrangebase.CacheGenNumberLoader,
	shouldCache queryrangebase.ShouldCacheFn,
	parallelismForReq func(ctx context.Context, tenantIDs []string, r queryrangebase.Request) int,
	retentionEnabled bool,
	transformer UserIDTransformer,
	metrics *queryrangebase.ResultsCacheMetrics,
) (queryrangebase.Middleware, error) {
	return queryrangebase.NewResultsCacheMiddleware(
		log,
		c,
		IndexStatsSplitter{cacheKeyLimits{limits, transformer}},
		limits,
		merger,
		IndexStatsExtractor{},
		cacheGenNumberLoader,
		func(ctx context.Context, r queryrangebase.Request) bool {
			if shouldCache != nil && !shouldCache(ctx, r) {
				return false
			}

			cacheStats, err := shouldCacheStats(ctx, r, limits)
			if err != nil {
				level.Error(log).Log("msg", "failed to determine if stats should be cached. Won't cache", "err", err)
				return false
			}
			return cacheStats
		},
		parallelismForReq,
		retentionEnabled,
		metrics,
	)
}