---
title: Scalability
description: Scaling with Grafana Loki
weight: 30
---
# Scalability
When scaling Loki, operators should consider running several Loki processes
partitioned by role (ingester, distributor, querier) rather than a single Loki
process. Grafana Labs' [production setup](https://github.com/grafana/loki/blob/master/production/ksonnet/loki)
contains `.libsonnet` files that demonstrate configuring separate components
and scaling for resource usage.
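A minimal sketch of this partitioning, assuming one configuration file per role (the file names are illustrative, and shared storage, ring, and schema settings are omitted), sets the top-level `target` option per process:

```yaml
# ingester.yaml -- run with: loki -config.file=ingester.yaml
target: ingester
---
# distributor.yaml
target: distributor
---
# querier.yaml
target: querier
```

The same effect can be achieved by passing `-target=<role>` on the command line instead of setting it in the file.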
## Separate Query Scheduler
The query frontend has an in-memory queue that can be moved out into a separate process, similar to the
[Grafana Mimir query-scheduler](/docs/mimir/latest/operators-guide/architecture/components/query-scheduler/). Moving the queue into its own process allows you to run multiple query frontends.
To run with the query scheduler, the frontend needs to be passed the scheduler's address via `-frontend.scheduler-address` and the querier processes need to be started with `-querier.scheduler-address` set to the same address. Both options can also be defined via the [configuration file]({{< relref "../configuration/_index.md" >}}).
It is not valid to start the querier with both a configured frontend and a scheduler address.
The query scheduler process itself can be started via the `-target=query-scheduler` option of the Loki Docker image. For instance, `docker run grafana/loki:latest -config.file=/etc/loki/config.yaml -target=query-scheduler -server.http-listen-port=8009 -server.grpc-listen-port=9009` starts the query scheduler listening on ports `8009` and `9009`.
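The same pair of settings can also be expressed in YAML. A hedged sketch follows (the service address is a placeholder, and the block names should be verified against the configuration reference):

```yaml
# Query frontend configuration: enqueue requests on the external query scheduler.
frontend:
  scheduler_address: query-scheduler.loki.svc:9095

# Querier configuration: dequeue and execute queries from the same scheduler.
frontend_worker:
  scheduler_address: query-scheduler.loki.svc:9095
```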
## Memory ballast
In compute-constrained environments, garbage collection can become a significant performance factor. Frequent garbage collection runs interfere with the application by consuming CPU. A memory ballast can mitigate this: it allocates extra, but unused, virtual memory in order to inflate the quantity of live heap space. Because garbage collection is triggered by the growth of heap space usage, the inflated heap size reduces the perceived growth, so garbage collection occurs less frequently.
Configure the memory ballast using the `ballast_bytes` configuration option.
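For example, a minimal sketch reserving a 1 GiB ballast (the size is illustrative and should be tuned to the memory available to the process):

```yaml
# Allocate 1 GiB of otherwise unused virtual memory to reduce GC frequency.
ballast_bytes: 1073741824
```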
## Remote rule evaluation
_This feature was first proposed in [`LID-0002`](https://github.com/grafana/loki/pull/8129); it contains the design decisions
which informed the implementation._
By default, the `ruler` component embeds a query engine to evaluate rules. This generally works fine, except when rules
are complex or have to process a large amount of data regularly. Poor performance of the `ruler` manifests as recording rule metrics
with gaps or missed alerts. This situation can be detected by alerting on the `cortex_prometheus_rule_group_iterations_missed_total` metric
when it has a non-zero value.
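A hedged sketch of such an alert as a Prometheus alerting rule (the group name, alert name, window, and severity are illustrative, not taken from the Loki documentation):

```yaml
groups:
  - name: loki-ruler
    rules:
      - alert: LokiRulerMissedIterations
        # Fire when any rule group iterations were missed over the last 10 minutes.
        expr: sum(increase(cortex_prometheus_rule_group_iterations_missed_total[10m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: The Loki ruler is missing rule group iterations.
```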
A solution to this problem is to externalize rule evaluation from the `ruler` process. The `ruler` embedded query engine
is single-threaded, meaning that rules are not split, sharded, or otherwise accelerated like regular Loki queries. The `query-frontend`
component exists explicitly for this purpose and, when combined with a number of `querier` instances, can massively
improve rule evaluation performance and lead to fewer missed iterations.
It is generally recommended to create a separate `query-frontend` deployment and `querier` pool from the existing one that handles ad hoc
queries via Grafana, `logcli`, or the API. Rules should be given priority over ad hoc queries because they are used to produce
metrics or alerts which may be crucial to the reliable operation of your service; if you use the same `query-frontend` and `querier` pool
for both, your rules will be executed with the same priority as ad hoc queries, which could lead to unpredictable performance.
To enable remote rule evaluation, set the following configuration options:
```yaml
ruler:
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///<query-frontend-service>:<grpc-port>
```
See the [ruler configuration](/configuration/#ruler) for further options.
When you enable remote rule evaluation, the `ruler` component becomes a gRPC client to the `query-frontend` service;
this will result in far lower `ruler` resource usage because the majority of the work has been externalized.
The LogQL queries coming from the `ruler` will be executed against the given `query-frontend` service.
Requests will be load-balanced across all `query-frontend` IPs if the `dns:///` prefix is used.
> **Note:** Queries that fail to execute are _not_ retried.
### Limits & Observability
Remote rule evaluation can be tuned with the following options:
- `ruler_remote_evaluation_timeout`: maximum allowable execution time for rule evaluations
- `ruler_remote_evaluation_max_response_size`: maximum allowable response size over gRPC connection from `query-frontend` to `ruler`
Both of these can be specified globally in the [`limits_config`](/configuration/#limits_config) section
or on a [per-tenant basis](/configuration/#runtime-configuration-file).
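For example, a sketch of global overrides in `limits_config` (both values are illustrative):

```yaml
limits_config:
  # Abort remote rule evaluations that run longer than 3 minutes.
  ruler_remote_evaluation_timeout: 3m
  # Cap gRPC responses from the query-frontend at 100 MB.
  ruler_remote_evaluation_max_response_size: 104857600
```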
Remote rule evaluation exposes a number of metrics:
- `loki_ruler_remote_eval_request_duration_seconds`: time taken for rule evaluation (histogram)
- `loki_ruler_remote_eval_response_bytes`: number of bytes in rule evaluation response (histogram)
- `loki_ruler_remote_eval_response_samples`: number of samples in rule evaluation response (histogram)
- `loki_ruler_remote_eval_success_total`: successful rule evaluations (counter)
- `loki_ruler_remote_eval_failure_total`: unsuccessful rule evaluations with reasons (counter)
Each of these metrics is per-tenant, so cardinality must be taken into consideration.
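As an example of putting these metrics to use, the following is a hedged sketch of a Prometheus recording rule that tracks the overall remote-evaluation failure ratio (the rule name is illustrative; add a tenant label to the aggregations only if the resulting cardinality is acceptable):

```yaml
groups:
  - name: loki-ruler-remote-eval
    rules:
      - record: job:loki_ruler_remote_eval_failure_ratio:rate5m
        expr: |
          sum(rate(loki_ruler_remote_eval_failure_total[5m]))
            /
          (sum(rate(loki_ruler_remote_eval_success_total[5m])) + sum(rate(loki_ruler_remote_eval_failure_total[5m])))
```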