---
title: Upgrade Loki
menuTitle: Upgrade
description: Upgrading Grafana Loki
aliases:
  - ../upgrading/
weight: 400
---

# Upgrade Loki

Every attempt is made to keep Grafana Loki backwards compatible, such that upgrades should be low risk and low friction.

Unfortunately Loki is software, software is hard, and sometimes we are forced to make decisions between ease of use and ease of maintenance.

If we have any expectation of difficulty upgrading, we will document it here.

As more versions are released, it becomes more likely that unexpected problems will arise when moving between multiple versions at once. If possible, try to stay current and do sequential updates. If you want to skip versions, try it in a development environment before attempting to upgrade production.

## Checking for config changes

Using Docker you can check changes between two versions of Loki with a command like this:

```
export OLD_LOKI=2.3.0
export NEW_LOKI=2.4.1
export CONFIG_FILE=loki-local-config.yaml
diff --color=always --side-by-side \
  <(docker run --rm -t -v "${PWD}":/config grafana/loki:${OLD_LOKI} -config.file=/config/${CONFIG_FILE} -print-config-stderr 2>&1 | sed '/Starting Loki/q' | tr -d '\r') \
  <(docker run --rm -t -v "${PWD}":/config grafana/loki:${NEW_LOKI} -config.file=/config/${CONFIG_FILE} -print-config-stderr 2>&1 | sed '/Starting Loki/q' | tr -d '\r') \
  | less -R
```

The `tr -d '\r'` is likely unnecessary for most people; it strips Windows carriage return characters that WSL2 can sneak into the output.

The output is very verbose because it shows the entire internal config struct used to run Loki. Adjust the `diff` options if you prefer to show only the changes or a different output style.

## Main / Unreleased

### Loki

#### Configuration `use_boltdb_shipper_as_backup` is removed

The setting `use_boltdb_shipper_as_backup` (`-tsdb.shipper.use-boltdb-shipper-as-backup`) was a remnant from the development of the TSDB storage. It was used to allow writing to both TSDB and BoltDB when TSDB was still highly experimental. Since TSDB is now stable and the recommended index type, the setting has become irrelevant and therefore was removed. The previous default value `false` is applied.

#### Deprecated configuration options are removed

1. Removes the already deprecated `-querier.engine.timeout` CLI flag and the corresponding YAML setting.
1. Also removes `query_timeout` from the `querier` YAML section. Instead of configuring `query_timeout` under `querier`, you now configure it in [Limits Config](/docs/loki/latest/configuration/#limits_config) (see the example after this list).
1. `s3.sse-encryption` is removed. AWS now defaults encryption of all buckets to SSE-S3. Use `sse.type` to set the SSE type.
1. `ruler.wal-cleaer.period` is removed. Use `ruler.wal-cleaner.period` instead.
1. `experimental.ruler.enable-api` is removed. Use `ruler.enable-api` instead.
1. `split_queries_by_interval` is removed from the `query_range` YAML section. You can instead configure it in [Limits Config](/docs/loki/latest/configuration/#limits_config).
1. `frontend.forward-headers-list` CLI flag and its corresponding YAML setting are removed.
1. `frontend.cache-split-interval` CLI flag is removed. The results caching interval is now determined by `querier.split-queries-by-interval`.
1. `querier.worker-parallelism` CLI flag and its corresponding YAML setting are removed as they do not offer additional value over the existing `querier.max-concurrent`. We recommend configuring `querier.max-concurrent` to limit the maximum number of concurrent requests processed by the queriers.
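As referenced in the list above, here is a minimal sketch of the migration for two of the removed options, assuming you previously set `query_timeout` under `querier` and `split_queries_by_interval` under `query_range`; the values shown are placeholders, not recommendations:

```yaml
# Before (no longer accepted):
# querier:
#   query_timeout: 1m
# query_range:
#   split_queries_by_interval: 30m

# After: configure both options under limits_config instead.
limits_config:
  query_timeout: 1m
  split_queries_by_interval: 30m
```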
#### Legacy ingester shutdown handler is removed

The already deprecated handler `/ingester/flush_shutdown` is removed in favor of `/ingester/shutdown?flush=true`.

#### Ingester configuration `max_transfer_retries` is removed

The setting `max_transfer_retries` (`-ingester.max-transfer-retries`) is removed in favor of the Write Ahead Log (WAL).
It was used to allow transferring chunks to new ingesters when the old ingester was shutting down during a rolling restart.
Alternatives to this setting are:

- **A. (Preferred)** Enable the WAL and rely on the new ingester to replay the WAL.
  - Optionally, you can enable `flush_on_shutdown` (`-ingester.flush-on-shutdown`) to flush to long-term storage on shutdowns.
- **B.** Manually flush during shutdowns via [the ingester `/shutdown?flush=true` endpoint]({{< relref "../../reference/api#flush-in-memory-chunks-and-shut-down" >}}).

#### Distributor metric changes

The `loki_distributor_ingester_append_failures_total` metric has been removed in favour of `loki_distributor_ingester_append_timeouts_total`.
This new metric provides a clearer signal that there is an issue with ingesters, and it can be used for high-signal alerting.

#### Changes to default configuration values

{{% responsive-table %}}
| configuration | new default | old default | notes |
| ------------- | ----------- | ----------- | ----- |
| `compactor.delete-max-interval` | 24h | 0 | splits the delete requests into intervals no longer than `delete_max_interval` |
| `distributor.max-line-size` | 256KB | 0 | - |
| `ingester.sync-period` | 1h | 0 | ensures that the chunk cuts for a given stream are synchronized across the ingesters in the replication set. Helps with deduplicating chunks. |
| `ingester.sync-min-utilization` | 0.1 | 0 | - |
| `frontend.max-querier-bytes-read` | 150GB | 0 | - |
| `frontend.max-cache-freshness` | 10m | 1m | - |
| `frontend.max-stats-cache-freshness` | 10m | 0 | - |
| `frontend.embedded-cache.max-size-mb` | 100MB | 1GB | embedded results cache size now defaults to 100MB |
| `memcached.batchsize` | 256 | 1024 | - |
| `memcached.parallelism` | 10 | 100 | - |
| `querier.compress-http-responses` | true | false | compress response if the request accepts gzip encoding |
| `querier.max-concurrent` | 4 | 10 | Consider increasing this if queriers have access to more CPU resources. Note that you risk running into out-of-memory errors if you set this to a very high value. |
| `querier.split-queries-by-interval` | 1h | 30m | - |
| `querier.tsdb-max-query-parallelism` | 128 | 512 | - |
| `query-scheduler.max-outstanding-requests-per-tenant` | 32000 | 100 | - |
| `validation.max-label-names-per-series` | 15 | 30 | - |
{{% /responsive-table %}}

#### Write dedupe cache is deprecated

The write dedupe cache is deprecated because it is not required by the newer single store indexes ([TSDB]({{< relref "../../operations/storage/tsdb" >}}) and [boltdb-shipper]({{< relref "../../operations/storage/boltdb-shipper" >}})).
If you are using a [legacy index type]({{< relref "../../storage#index-storage" >}}), consider migrating to TSDB (recommended).
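If you plan that migration, a minimal sketch of adding a new TSDB period to an existing `schema_config` is shown below; the start date, schema version, and object store name (`s3`) are assumptions that you must adapt to your deployment:

```yaml
schema_config:
  configs:
    # ... keep your existing period configs above ...
    - from: 2024-04-01       # a future date at which the new TSDB period takes effect
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```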
#### Embedded cache metric changes

- The following embedded cache metrics are removed. Instead, use `loki_cache_fetched_keys`, `loki_cache_hits`, and `loki_cache_request_duration_seconds`, which instrument requests made to the configured cache (`embeddedcache`, `memcached`, or `redis`).
  - `querier_cache_added_total`
  - `querier_cache_gets_total`
  - `querier_cache_misses_total`
- The following embedded cache metrics are renamed:
  - `querier_cache_added_new_total` is renamed to `loki_embeddedcache_added_new_total`
  - `querier_cache_evicted_total` is renamed to `loki_embeddedcache_evicted_total`
  - `querier_cache_entries` is renamed to `loki_embeddedcache_entries`
  - `querier_cache_memory_bytes` is renamed to `loki_embeddedcache_memory_bytes`
- The already deprecated metric `querier_cache_stale_gets_total` is now removed.

## 2.9.0

### Loki

#### Index gateway shuffle sharding

The index gateway now supports shuffle sharding of index data when running in "ring" mode. The index data is sharded by tenant, where each tenant gets assigned a sub-set of all available instances of the index gateways in the ring.

If you configured a high replication factor to accommodate for load, because in the past this was the only way to give a tenant more instances for querying, you should consider reducing the replication factor to a meaningful value for replication (for example, from 12 to 3) and instead set the shard factor for individual tenants as required.

If the global shard factor (not the per-tenant one) is 0 (the default value), the global shard factor is set to the replication factor. It can still be overwritten per tenant.

In the context of the index gateway, sharding is synonymous with replication.

#### Index shipper multi-store support

In previous releases, if you did not explicitly configure `-boltdb.shipper.shared-store` or `-tsdb.shipper.shared-store`, those values defaulted to the `object_store` configured in the latest `period_config` of the corresponding index type.

These defaults are removed in favor of uploading indexes to multiple stores. If you do not explicitly configure a `shared-store`, the boltdb and tsdb indexes will be shipped to the `object_store` configured for that period.

#### Shutdown marker file

A shutdown marker file can be written by the `/ingester/prepare_shutdown` endpoint.
If the new `ingester.shutdown_marker_path` config setting has a value, that value is used.
If not, the `common.path_prefix` config setting is used if it has a value. Otherwise a warning is shown in the logs on startup and the `/ingester/prepare_shutdown` endpoint will return a 500 status code.

#### Compactor multi-store support

In previous releases, setting `-boltdb.shipper.compactor.shared-store` configured the following:
- the store used for managing delete requests.
- the store on which index compaction should be performed.

If `-boltdb.shipper.compactor.shared-store` was not set, it used to default to the `object_store` configured in the latest `period_config` that uses either the tsdb or boltdb-shipper index.

Compactor now supports index compaction on multiple buckets/object stores, and going forward Loki will not set any defaults on `-boltdb.shipper.compactor.shared-store`. This has a couple of side effects, detailed as follows:

##### Store on which index compaction should be performed

If `-boltdb.shipper.compactor.shared-store` is configured by the user, Loki runs index compaction only on the store specified by the config.
If it is not set, compaction is performed on all the object stores that contain either a boltdb-shipper or tsdb index.

##### Store used for managing delete requests

A new config option, `-boltdb.shipper.compactor.delete-request-store`, decides where delete requests should be stored. This new option takes precedence over `-boltdb.shipper.compactor.shared-store`.

In the case where neither of these options is set, the `object_store` configured in the latest `period_config` that uses either a tsdb or boltdb-shipper index is used for storing delete requests, to ensure pending requests are processed.
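For illustration, a minimal sketch of setting the new option in YAML, assuming your delete requests should live in an object store named `s3`; the surrounding compactor settings are placeholders:

```yaml
compactor:
  working_directory: /loki/compactor   # placeholder path
  retention_enabled: true
  delete_request_store: s3             # where delete requests are stored
```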
#### logfmt parser non-strict parsing

The logfmt parser now performs non-strict parsing, which helps scan semi-structured log lines. It skips invalid tokens and tries to extract as many key/value pairs as possible from the rest of the log line. If you have a use case that relies on strict parsing, where you expect the parser to throw an error, use `| logfmt --strict` to enable strict mode.

The logfmt parser no longer includes standalone keys (keys without a value) in the resulting label set. You can use the `--keep-empty` flag to retain them.

### Jsonnet

#### Deprecated PodDisruptionBudget definition has been removed

The `policy/v1beta1` API version of PodDisruptionBudget is no longer served as of Kubernetes v1.25.
To support the latest versions of Kubernetes, it was necessary to replace `policy/v1beta1` with the new definition `policy/v1`, which has been available since v1.21.

No impact is expected if you use Kubernetes v1.21 or newer.

Please refer to the [official migration guide](https://kubernetes.io/docs/reference/using-api/deprecation-guide/#poddisruptionbudget-v125) for more details.

## 2.8.0

### Loki

#### Change in LogQL behavior

When there are duplicate labels in a log line, only the first value is now kept. Previously only the last value was kept.

#### Default retention_period has changed

This change will affect you if you have:

```yaml
compactor:
  retention_enabled: true
```

And did *not* define a `retention_period` in `limits_config`, thus relying on the previous default of `744h`.

In this release the default has been changed to `0s`.

A value of `0s` is the same as "retain forever" or "disable retention".

If, **and only if**, you wish to retain the previous default of 744h, apply this config:

```yaml
limits_config:
  retention_period: 744h
```

**Note:** In previous versions, the zero value of `0` or `0s` resulted in **immediate deletion of all logs**; only in 2.8 and later releases does the zero value disable retention.

#### metrics.go log line `subqueries` replaced with `splits` and `shards`

The metrics.go log line emitted for every query had an entry called `subqueries`, which was intended to represent the amount a query was parallelized on execution. In its previous form it only displayed the count of subqueries generated by Loki's split-by-time logic and did not include counts for shards.

There wasn't a clean way to update `subqueries` to include sharding information, and there is value in knowing the difference between the subqueries generated when we split by time vs. sharding factors, especially now that TSDB can do dynamic sharding.

In 2.8 we no longer include `subqueries` in metrics.go. It still exists in the statistics API data, but only for backwards compatibility; the value will always be zero now.

Instead, you can now use `splits` to see how many split-by-time intervals were created and `shards` to see the total number of shards created for a query.

Note: currently not every query can be sharded, and a `shards` value of zero is a good indicator that the query could not be sharded.

### Promtail

#### The go build tag `promtail_journal_enabled` was introduced

The go build tag `promtail_journal_enabled` should be passed to include Journal support in the promtail binary.
If you need Journal support you will need to run go build with the tag `promtail_journal_enabled`:

```shell
go build --tags=promtail_journal_enabled ./clients/cmd/promtail
```

Introducing this tag aims to relieve Linux/CentOS users with CGO enabled from installing the libsystemd-dev/systemd-devel libraries if they don't need Journal support.

### Ruler

#### CLI flag `ruler.wal-cleaer.period` deprecated

The CLI flag `ruler.wal-cleaer.period` is now deprecated and replaced with the typo-fixed `ruler.wal-cleaner.period`. The YAML configuration remains unchanged:

```yaml
ruler:
  wal_cleaner:
    period: 5s
```

### Querier

#### query-frontend Kubernetes headless service changed to load balanced service

*Note:* This is relevant only if you are using [jsonnet for deploying Loki in Kubernetes](/docs/loki/latest/installation/tanka/).

The `query-frontend` Kubernetes service was previously headless and was used for two purposes:

* Distributing the Loki query requests amongst all the available Query Frontend pods.
* Discovering the IPs of Query Frontend pods so that queriers can connect to them as workers.

The problem here is that a headless service does not support load balancing and leaves it up to the client to balance the load. Additionally, a load-balanced service does not let us discover the IPs of the underlying pods.

To meet both these requirements, we have made the following changes:

* Changed the existing `query-frontend` Kubernetes service from headless to load-balanced, to get a fair load distribution across all the Query Frontend instances.
* Added a `query-frontend-headless` service that queriers use to discover Query Frontend pod IPs and connect to them as workers.

If you are deploying Loki with the Query Scheduler by setting the [query_scheduler_enabled](https://github.com/grafana/loki/blob/cc4ab7487ab3cd3b07c63601b074101b0324083b/production/ksonnet/loki/config.libsonnet#L18) config to `true`, then there is nothing to do here for this change.
If you are not using the Query Scheduler, then to avoid any issues on the read path until the rollout finishes, follow these steps:

* Create just the `query-frontend-headless` service without applying any changes to the `query-frontend` service.
* Roll out changes to `queriers`.
* Roll out the rest of the changes.

### General

#### Store & Cache Statistics

Statistics are now logged in `metrics.go` lines about how long it takes to download chunks from the store, as well as how long it takes to download chunk, index query, and result cache responses from the cache.

Example (note the `*_download_time` fields):

```
level=info ts=2022-12-20T15:27:54.858554127Z caller=metrics.go:147 component=frontend org_id=docker latency=fast query="sum(count_over_time({job=\"generated-logs\"}[1h]))" query_type=metric range_type=range length=6h17m48.865587821s start_delta=6h17m54.858533178s end_delta=5.99294552s step=1m30s duration=5.990829396s status=200 limit=30 returned_lines=0 throughput=123MB total_bytes=738MB total_entries=1 store_chunks_download_time=2.319297059s queue_time=2m21.476090991s subqueries=8 cache_chunk_req=81143 cache_chunk_hit=32390 cache_chunk_bytes_stored=1874098 cache_chunk_bytes_fetched=94289610 cache_chunk_download_time=56.96914ms cache_index_req=994 cache_index_hit=710 cache_index_download_time=1.587842ms cache_result_req=7 cache_result_hit=0 cache_result_download_time=380.555µs
```

These statistics are also displayed when using `--stats` with LogCLI.
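For example, a quick way to see these statistics for a single query from the command line, assuming LogCLI is already configured to point at your Loki instance (the selector and time range are placeholders):

```shell
# Shows query statistics, including the cache and store download times, alongside the results.
logcli query --stats --since=1h '{job="generated-logs"}'
```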
## 2.7.0

### Loki

#### Loki Canary Permission

The new `push` mode of [Loki canary](/docs/loki/latest/operations/loki-canary/) can push logs that are generated by a Loki canary directly to a given Loki URL. Previously, it only wrote to a local file and you needed some agent, such as promtail, to scrape and push it to Loki.
So if you run Loki behind a proxy with different authorization policies for reading and writing to Loki, the auth credentials passed to the Loki canary now need to have both `READ` and `WRITE` permissions.

#### `engine.timeout` and `querier.query_timeout` are deprecated

Previously, we had two configurations to define a query timeout: `engine.timeout` and `querier.query-timeout`.
As they were conflicting and `engine.timeout` isn't as expressive as `querier.query-timeout`, we're deprecating it and moving the timeout to [Limits Config](/docs/loki/latest/configuration/#limits_config) as `limits_config.query_timeout`, with the same default values.

#### `fifocache` has been renamed

The in-memory `fifocache` has been renamed to `embedded-cache`. This allows us to replace the implementation (currently a simple FIFO data structure) with something else in the future without causing confusion.

#### Evenly spread Memcached pods for chunks across Kubernetes nodes

We now evenly spread memcached_chunks pods across the available Kubernetes nodes, while allowing more than one pod to be scheduled onto the same node. If you want to run at most a single pod per node, set `$.memcached.memcached_chunks.use_topology_spread` to false.

While we attempt to schedule at most one memcached_chunks pod per Kubernetes node with the `topology_spread_max_skew: 1` field, if no more nodes are available then multiple pods will be scheduled on the same node. This can potentially impact your service's reliability, so consider tuning these values according to your risk tolerance.

#### Evenly spread distributors across Kubernetes nodes

We now evenly spread distributors across the available Kubernetes nodes, while allowing more than one distributor to be scheduled onto the same node. If you want to run at most a single distributor per node, set `$._config.distributors.use_topology_spread` to false.

While we attempt to schedule at most one distributor per Kubernetes node with the `topology_spread_max_skew: 1` field, if no more nodes are available then multiple distributors will be scheduled on the same node. This can potentially impact your service's reliability, so consider tuning these values according to your risk tolerance.

#### Evenly spread queriers across Kubernetes nodes

We now evenly spread queriers across the available Kubernetes nodes, while allowing more than one querier to be scheduled onto the same node. If you want to run at most a single querier per node, set `$._config.querier.use_topology_spread` to false.

While we attempt to schedule at most one querier per Kubernetes node with the `topology_spread_max_skew: 1` field, if no more nodes are available then multiple queriers will be scheduled on the same node. This can potentially impact your service's reliability, so consider tuning these values according to your risk tolerance.

#### Default value for `server.http-listen-port` changed

This value now defaults to 3100, so the Loki process doesn't require special privileges. Previously, it had been set to port 80, which is a privileged port. If you need Loki to listen on port 80, you can set it back to the previous default using `-server.http-listen-port=80`.
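The equivalent YAML, if you configure Loki through a config file, is a small sketch like the following; note that binding to port 80 still requires elevated privileges or the `CAP_NET_BIND_SERVICE` capability:

```yaml
server:
  http_listen_port: 80   # restore the pre-2.7 default; 3100 is the new default
```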
#### docker-compose setup has been updated

The docker-compose [setup](https://github.com/grafana/loki/blob/main/production/docker) has been updated to **v2.6.0** and includes many improvements.

Notable changes include:
- authentication (multi-tenancy) is **enabled** by default; you can disable it in `production/docker/config/loki.yaml` by setting `auth_enabled: false`
- storage now uses MinIO instead of the local filesystem
  - move your current storage into `.data/minio` and it should work transparently
- a log-generator was added; if you don't need it, simply remove the service from `docker-compose.yaml` or don't start the service

#### Configuration for deletes has changed

The global `deletion_mode` option in the compactor configuration moved to runtime configurations.

- The `deletion_mode` option needs to be removed from your compactor configuration.
- The `deletion_mode` global override needs to be set to the desired mode: `disabled`, `filter-only`, or `filter-and-delete`. By default, `filter-and-delete` is enabled.
- Any `allow_delete` per-tenant overrides need to be removed or changed to `deletion_mode` overrides with the desired mode.

#### Metric name for `loki_log_messages_total` changed

The name of this metric was changed to `loki_internal_log_messages_total` to reduce ambiguity. The previous name is still present but is deprecated.

#### Usage Report / Telemetry config has been renamed

The configuration for anonymous usage statistics reporting to Grafana has changed from `usage_report` to `analytics`.

#### TLS `cipher_suites` and `tls_min_version` have moved

These were previously configurable under `server.http_tls_config` and `server.grpc_tls_config` separately. They are now under `server.tls_cipher_suites` and `server.tls_min_version`. These values are also now configurable for individual clients, for example: `distributor.ring.etcd` or `querier.ingester_client.grpc_client_config`.

#### `ruler.storage.configdb` has been removed

ConfigDB was disallowed as a Ruler storage option back in 2.0. The config struct has finally been removed.

#### `ruler.remote_write.client` has been removed

You can no longer specify a remote write client for the ruler.

### Promtail

#### `gcp_push_target_parsing_errors_total` has a new `reason` label

The `gcp_push_target_parsing_errors_total` GCP Push Target metric has a new label named `reason`. This includes details on what might have caused the parsing to fail.

#### Windows event logs: now correctly includes `user_data`

The contents of the `user_data` field were erroneously set to the same value as `event_data` in previous versions. This was fixed in [#7461](https://github.com/grafana/loki/pull/7461), and log queries relying on this broken behaviour may be impacted.

## 2.6.0

### Loki

#### Implementation of unwrapped `rate` aggregation changed

The implementation of the `rate()` aggregation function changed back to the previous implementation, prior to [#5013](https://github.com/grafana/loki/pull/5013). This means that the rate per second is calculated based on the sum of the extracted values, instead of the average increase over time.

If you want the extracted values to be treated as a [Counter](https://prometheus.io/docs/concepts/metric_types/#counter) metric, you should use the new `rate_counter()` aggregation function, which calculates the per-second average rate of increase of the vector.

#### Default value for `azure.container-name` changed

This value now defaults to `loki`; it was previously set to `cortex`. If you are relying on this container name for your chunks or ruler storage, you will have to manually specify `-azure.container-name=cortex` or `-ruler.storage.azure.container-name=cortex`, respectively.
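If you prefer YAML over CLI flags, a sketch of pinning the old container name might look like the following; the key paths are assumptions based on the Azure storage and ruler storage blocks, so verify them against your configuration reference:

```yaml
storage_config:
  azure:
    container_name: cortex   # keep the pre-2.6 default for chunk storage
ruler:
  storage:
    azure:
      container_name: cortex # keep the pre-2.6 default for ruler storage
```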
## 2.5.0

### Loki

#### `split_queries_by_interval` YAML configuration has moved

It was previously possible to define this value in two places:

```yaml
query_range:
  split_queries_by_interval: 10m
```

and/or

```yaml
limits_config:
  split_queries_by_interval: 10m
```

In 2.5.0 it can only be defined in the `limits_config` section. **Loki will fail to start if you do not remove the `split_queries_by_interval` config from the `query_range` section.**

Additionally, it has a new default value of `30m` rather than `0`.

The CLI flag is not changed and remains `querier.split-queries-by-interval`.

#### Dropped support for old Prometheus rules configuration format

Alerting rules could previously be specified in two formats: the 1.x format (the legacy one, named `v0` internally) and 2.x.
We decided to drop support for the `1.x` format as it is fairly old and keeping support for it required a lot of code.

In case you're still using the legacy format, take a look at [Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) for instructions on how to write alerting rules in the new format.

For reference, the newer format follows a structure similar to the one below:

```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
```

Meanwhile, the legacy format is a string in the following format:

```
ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]
```