@@ -378,4 +378,4 @@ There are a few things coming to increase the robustness of this service. In no
## Misc Details: Metrics backends vs in-memory
Currently the Loki Ruler is decoupled from a backing Prometheus store. Generally, the result of evaluating rules as well as the history of the alert's state are stored as a time series. Loki is unable to store/retrieve these in order to allow it to run independently of i.e. Prometheus. As a workaround, Loki keeps a small in memory store whose purpose is to lazy load past evaluations when rescheduling or resharding Rulers. In the future, Loki will support optional metrics backends, allowing storage of these metrics for auditing & performance benefits.
Currently, the Loki Ruler is decoupled from a backing Prometheus store. Generally, the results of rule evaluations, as well as the history of each alert's state, are stored as time series. Loki does not store or retrieve these series, in order to remain independent of a metrics store such as Prometheus. As a workaround, Loki keeps a small in-memory store whose purpose is to lazy-load past evaluations when rescheduling or resharding Rulers. In the future, Loki will support optional metrics backends, allowing storage of these metrics for auditing and performance benefits.
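To make the workaround concrete, here is a minimal, hypothetical sketch of such an in-memory store in Go. The names (`memStore`, `evalSample`, the key scheme) are illustrative and are not Loki's actual implementation; the sketch only shows a bounded window of recent evaluations per rule that a rescheduled or resharded Ruler could lazily read back.

```go
package ruler

import (
	"sync"
	"time"
)

// evalSample is a hypothetical record of one rule evaluation: the value the
// rule produced and when it was produced.
type evalSample struct {
	Ts    time.Time
	Value float64
}

// memStore is an illustrative in-memory store keyed by "<group>/<rule>". It
// retains only a bounded window of recent evaluations so a Ruler that picks
// up a rule group after resharding can lazily seed alert state (for example,
// `for` clauses) without a backing Prometheus store.
type memStore struct {
	mu      sync.Mutex
	window  time.Duration
	samples map[string][]evalSample
}

func newMemStore(window time.Duration) *memStore {
	return &memStore{window: window, samples: map[string][]evalSample{}}
}

// Append records an evaluation and drops anything older than the window.
func (s *memStore) Append(key string, v float64, ts time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	kept := append(s.samples[key], evalSample{Ts: ts, Value: v})
	cutoff := ts.Add(-s.window)
	i := 0
	for i < len(kept) && kept[i].Ts.Before(cutoff) {
		i++
	}
	s.samples[key] = kept[i:]
}

// Recent returns the retained evaluations for a rule, oldest first.
func (s *memStore) Recent(key string) []evalSample {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]evalSample(nil), s.samples[key]...)
}
```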
@@ -97,7 +97,7 @@ Introduction of the WAL requires that ingesters have persistent disks which are
### Implementation goals
- Use underlying prometheus wal pkg when possible for consistency & to mitigate undifferentiated heavy lifting. Interfaces handle page alignment & use []byte.
- Use the underlying prometheus `wal` pkg when possible, for consistency and to avoid undifferentiated heavy lifting. Its interfaces handle page alignment and use `[]byte`.
- Ensure this package handles arbitrarily long records (log lines in Loki’s case).
- Ensure our in-memory representations can be efficiently moved to/from `[]byte` so that conversions enable fast, efficient loading from checkpoints (see the sketch after this list).
- Ensure chunks which have already been flushed to storage are kept around for `ingester.retain-period`, even after a WAL replay.
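As a sketch of the `[]byte` goal above: one hypothetical length-prefixed encoding for per-stream entries. The type and field names are illustrative only and do not reflect Loki's actual WAL record format; the point is that framing each entry with its length lets a record carry arbitrarily long log lines.

```go
package wal

import (
	"encoding/binary"
	"errors"
)

// walEntry is a hypothetical in-memory representation of one log line for a
// stream; names and layout are illustrative, not Loki's actual record format.
type walEntry struct {
	StreamFP uint64 // fingerprint identifying the stream
	TsNanos  int64
	Line     string
}

// encode appends a length-prefixed entry to buf, so a single WAL record can
// frame entries of arbitrary length (including very long log lines).
func encode(buf []byte, e walEntry) []byte {
	buf = binary.BigEndian.AppendUint64(buf, e.StreamFP)
	buf = binary.BigEndian.AppendUint64(buf, uint64(e.TsNanos))
	buf = binary.BigEndian.AppendUint64(buf, uint64(len(e.Line)))
	return append(buf, e.Line...)
}

// decode reads one entry back and returns the remaining bytes, so a record
// can be replayed entry by entry.
func decode(buf []byte) (walEntry, []byte, error) {
	if len(buf) < 24 {
		return walEntry{}, nil, errors.New("short record header")
	}
	e := walEntry{
		StreamFP: binary.BigEndian.Uint64(buf[0:8]),
		TsNanos:  int64(binary.BigEndian.Uint64(buf[8:16])),
	}
	n := binary.BigEndian.Uint64(buf[16:24])
	if uint64(len(buf))-24 < n {
		return walEntry{}, nil, errors.New("short record body")
	}
	e.Line = string(buf[24 : 24+n])
	return e, buf[24+n:], nil
}
```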
@@ -110,7 +110,7 @@ Since we're not checkpointing from the WAL records but instead doing a memory du
#### Don't build checkpoints from memory; instead, write new WAL elements
Instead of building checkpoints from memory, this would build the same efficiencies into two distinct WAL Record types: `Blocks` and `FlushedChunks`. The former is a record type which will contain an entire compressed block after it's cut and the latter will contain an entire chunk + the sequence of blocks it holds when it's flushed. This may offer good enough amortization of writes because block cuts are assumed to be evenly distributed & chunk flushes have the same property and use jitter for synchronization.
Instead of building checkpoints from memory, this approach would build the same efficiencies into two distinct WAL record types: `Blocks` and `FlushedChunks`. The former would contain an entire compressed block after it is cut, and the latter would contain an entire chunk plus the sequence of blocks it holds when it is flushed. This may offer good enough amortization of writes because block cuts are assumed to be evenly distributed, and chunk flushes have the same property and are jittered to avoid synchronized flushes.
This could be used to drop WAL records that have already outlived the `ingester.retain-period`, allowing for faster WAL replays and more efficient loading.
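Under these assumptions, the two record types might look roughly like this in Go (field names are hypothetical); the `replayable` helper shows how records older than `ingester.retain-period` could be skipped during replay.

```go
package wal

import "time"

// Hypothetical shapes for the two proposed record types; field names are
// illustrative only.
type BlocksRecord struct {
	StreamFP     uint64
	MinTs, MaxTs time.Time
	Compressed   []byte // an entire compressed block, written when the block is cut
}

type FlushedChunksRecord struct {
	StreamFP  uint64
	FlushedAt time.Time
	Blocks    []BlocksRecord // the sequence of blocks the chunk held when it was flushed
}

// replayable illustrates how records whose chunks have outlived
// ingester.retain-period could be skipped during a WAL replay.
func replayable(recs []FlushedChunksRecord, retain time.Duration, now time.Time) []FlushedChunksRecord {
	keep := recs[:0]
	for _, r := range recs {
		if now.Sub(r.FlushedAt) <= retain {
			keep = append(keep, r)
		}
	}
	return keep
}
```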
Thus _all_ blocks must have their metadata checked against a query. In this example, a query for the bounds `[ts1,ts2]` would need to decompress and scan the `[ts1, ts2]` range across both of them, but a query against `[ts3, ts4]` would only decompress & scan _one_ block.
Thus _all_ blocks must have their metadata checked against a query. In this example, a query for the bounds `[ts1, ts2]` would need to decompress and scan the `[ts1, ts2]` range across both of them, but a query against `[ts3, ts4]` would only decompress and scan _one_ block.
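A small sketch of that metadata check, using hypothetical `blockMeta` fields (Loki's real structures differ): only blocks whose bounds intersect the query range need to be decompressed.

```go
package blocks

// blockMeta stands in for the per-block metadata checked against a query;
// Loki's real structures differ, this is only illustrative.
type blockMeta struct {
	MinT, MaxT int64 // nanosecond bounds of the entries in the block
	Offset     int   // where the compressed block starts within the chunk
}

// overlapping returns the blocks whose bounds intersect [from, through);
// only these need to be decompressed and scanned.
func overlapping(metas []blockMeta, from, through int64) []blockMeta {
	var hits []blockMeta
	for _, b := range metas {
		if b.MaxT >= from && b.MinT < through {
			hits = append(hits, b)
		}
	}
	return hits
}
```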
```
Figure 4
@@ -125,7 +125,7 @@ The performance losses against the current approach includes:
2) Blocks may contain overlapping data (although ordering is still guaranteed within each block).
3) Head block scans are now `O(n log(n))` instead of `O(n)`.
### Flushing & Chunk Creation
### Flushing and Chunk Creation
Loki regularly combines multiple blocks into a chunk and "flushes" it to storage. In order to ensure that reads over flushed chunks remain as performant as possible, we will re-order a possibly-overlapping set of blocks into a set of blocks that maintain monotonically increasing order between them. From the perspective of the rest of Loki’s components (queriers/rulers fetching chunks from storage), nothing has changed.
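As an illustration of the re-ordering step (a simplified sketch, not Loki's flush code): collect the entries of the possibly-overlapping blocks, sort them by timestamp, and re-cut them into blocks whose time ranges increase monotonically from one block to the next.

```go
package blocks

import "sort"

type entry struct {
	Ts   int64
	Line string
}

// reorder collects the entries of possibly-overlapping blocks, sorts them by
// timestamp, and re-cuts them into blocks of at most targetSize entries whose
// time ranges increase monotonically from one block to the next.
func reorder(blocks [][]entry, targetSize int) [][]entry {
	if targetSize < 1 {
		targetSize = 1
	}
	var all []entry
	for _, b := range blocks {
		all = append(all, b...)
	}
	// A stable sort keeps the original order of entries that share a timestamp.
	sort.SliceStable(all, func(i, j int) bool { return all[i].Ts < all[j].Ts })

	var out [][]entry
	for len(all) > 0 {
		n := targetSize
		if n > len(all) {
			n = len(all)
		}
		out = append(out, all[:n])
		all = all[n:]
	}
	return out
}
```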
@@ -174,7 +174,7 @@ This ends the initial design portion of this document. Below, I'll describe some
#### Variance Budget
The intended approach of a "sliding validity" window for each stream is simple and effective at preventing misuse & bad actors from writing across the entire acceptable range for incoming timestamps. However, we may in the future wish to take a more sophisticated approach, introducing per tenant "variance" budgets, likely derived from the stream limit. This ingester limit could, for example use an incremental (online) standard deviation/variance algorithm such as [Welford's](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance), which would allow writing to larger ranges than option (2) in the _Chunk Durations_ section.
The intended approach of a "sliding validity" window for each stream is simple and effective at preventing misuse and keeping bad actors from writing across the entire acceptable range for incoming timestamps. However, we may in the future wish to take a more sophisticated approach, introducing per-tenant "variance" budgets, likely derived from the stream limit. This ingester limit could, for example, use an incremental (online) standard deviation/variance algorithm such as [Welford's](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance), which would allow writing to larger ranges than option (2) in the _Chunk Durations_ section.
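For reference, Welford's algorithm maintains a running mean and variance in constant space per stream. A minimal Go sketch follows; the `withinBudget` check at the end is hypothetical, only to show how such a budget might be applied to incoming timestamps.

```go
package limits

import "math"

// welford tracks the running mean and variance of observed timestamps (for
// example, as unix seconds) without storing them, using Welford's online
// algorithm.
type welford struct {
	n    int64
	mean float64
	m2   float64 // sum of squared distances from the current mean
}

func (w *welford) observe(x float64) {
	w.n++
	delta := x - w.mean
	w.mean += delta / float64(w.n)
	w.m2 += delta * (x - w.mean)
}

func (w *welford) stddev() float64 {
	if w.n < 2 {
		return 0
	}
	return math.Sqrt(w.m2 / float64(w.n-1))
}

// withinBudget is a hypothetical check: accept a timestamp while it stays
// within k standard deviations of the stream's running mean.
func (w *welford) withinBudget(ts, k float64) bool {
	return w.n < 2 || math.Abs(ts-w.mean) <= k*w.stddev()
}
```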
#### LSM Tree
@@ -185,7 +185,7 @@ Much of the proposed approach mirrors an [LSM-Tree](http://www.benstopford.com/2
- Keep open a wider validity window for incoming logs
**At the cost of**
- being susceptible to disk-related complexity & problems
- being susceptible to disk-related complexity and problems
##### MemTable (head block)
@@ -197,7 +197,7 @@ Once a Memtable (head block) in an LSM-Tree hits a predefined size, it is flushe
##### Block Index
Incoming reads in an LSM-Tree may need access to the SSTable entries in addition to the currently active memtable (head block). In order to improve this, we may cache the metadata including block offsets, start & end timestamps within an SSTable (block || MemChunk) in memory to mitigate lookups, seeking, and loading unnecessary data from disk.
Incoming reads in an LSM-Tree may need access to the SSTable entries in addition to the currently active memtable (head block). To improve this, we may cache in memory the metadata of each SSTable (block || MemChunk), including block offsets and start and end timestamps, to avoid unnecessary lookups, seeks, and loading of data from disk.
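A rough sketch of such a metadata cache (names and fields are hypothetical): block offsets and time bounds are kept in memory per SSTable/MemChunk, so a read only touches disk for the byte ranges whose blocks can match the query.

```go
package index

import "sync"

// blockRef is cached metadata for one on-disk block: where its bytes live and
// the time range it covers. Names and fields are hypothetical.
type blockRef struct {
	Offset, Len int64
	MinT, MaxT  int64
}

// metaCache keeps block metadata per SSTable (MemChunk) in memory so a read
// can decide which byte ranges to load without touching disk first.
type metaCache struct {
	mu     sync.RWMutex
	tables map[string][]blockRef // key: SSTable/chunk identifier
}

func (c *metaCache) put(id string, refs []blockRef) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.tables == nil {
		c.tables = map[string][]blockRef{}
	}
	c.tables[id] = refs
}

// lookup returns only the byte ranges whose blocks intersect [from, through).
func (c *metaCache) lookup(id string, from, through int64) []blockRef {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []blockRef
	for _, r := range c.tables[id] {
		if r.MaxT >= from && r.MinT < through {
			out = append(out, r)
		}
	}
	return out
}
```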
@@ -47,7 +47,7 @@ In order to mitigate the chance of _losing_ data on any single ingester, the dis
**Caveat: If a write is acknowledged by 2 out of 3 ingesters, we can tolerate the loss of one ingester but not two, as this would result in data loss.**
Replication factor isn't the only thing that prevents data loss, though, and arguably these days its main purpose is to allow writes to continue uninterrupted during rollouts & restarts. The `ingester` component now includes a [write ahead log](https://en.wikipedia.org/wiki/Write-ahead_logging) which persists incoming writes to disk to ensure they're not lost as long as the disk isn't corrupted. The complementary nature of replication factor and WAL ensures data isn't lost unless there are significant failures in both mechanisms (i.e. multiple ingesters die and lose/corrupt their disks).
The replication factor isn't the only thing that prevents data loss, though, and arguably these days its main purpose is to allow writes to continue uninterrupted during rollouts and restarts. The `ingester` component now includes a [write-ahead log](https://en.wikipedia.org/wiki/Write-ahead_logging) which persists incoming writes to disk so that they're not lost as long as the disk isn't corrupted. The complementary nature of the replication factor and the WAL ensures data isn't lost unless there are significant failures in both mechanisms (i.e., multiple ingesters die and lose or corrupt their disks).
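The arithmetic behind the caveat above is the usual quorum rule; a sketch, assuming the distributor waits for a simple majority of the replication factor before acknowledging a write:

```go
package distributor

// quorum returns the minimum number of successful ingester writes required
// before a push is acknowledged, assuming a simple majority of the
// replication factor. With the default factor of 3 this is 2, which is why
// losing one ingester is tolerable but losing two (before a flush or WAL
// replay) may not be.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}
```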
Ingesters temporarily store data in memory. In the event of a crash, there could be data loss. The Write Ahead Log (WAL) helps fill this gap in reliability.
The WAL in Grafana Loki records incoming data and stores it on the local file system in order to guarantee persistence of acknowledged data in the event of a process crash. Upon restart, Loki will "replay" all of the data in the log before registering itself as ready for subsequent writes. This allows Loki to maintain the performance & cost benefits of buffering data in memory _and_ durability benefits (it won't lose data once a write has been acknowledged).
The WAL in Grafana Loki records incoming data and stores it on the local file system in order to guarantee persistence of acknowledged data in the event of a process crash. Upon restart, Loki will "replay" all of the data in the log before registering itself as ready for subsequent writes. This allows Loki to maintain the performance and cost benefits of buffering data in memory _and_ the durability benefit of not losing data once a write has been acknowledged.
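The WAL is configured under the `ingester` block. The snippet below is illustrative only (values are examples, and available options vary by Loki version — consult the configuration reference for yours):

```yaml
ingester:
  wal:
    enabled: true
    # Local directory the WAL is written to; back this with a persistent volume.
    dir: /loki/wal
    # Upper bound on memory used while replaying the WAL at startup (illustrative value).
    replay_memory_ceiling: 4GB
```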
This section will use Kubernetes as a reference deployment paradigm in the examples.
## Disclaimer & WAL nuances
## Disclaimer and WAL nuances
The Write Ahead Log in Loki makes a few particular tradeoffs compared to other WALs you may be familiar with. The WAL aims to add durability guarantees, but _not at the expense of availability_. Specifically, there are two scenarios where the WAL sacrifices these guarantees.
Loki releases (this includes [Promtail](/clients/promtail), [Loki Canary](/operations/loki-canary/), etc.) use the following
naming scheme: `MAJOR`.`MINOR`.`PATCH`.
- `MAJOR` (roughly once a year): these releases include large new features & possible backwards-compatibility breaks.
- `MAJOR` (roughly once a year): these releases include large new features and possible backwards-compatibility breaks.
- `MINOR` (roughly once a quarter): these releases include new features which generally do not break backwards-compatibility, but from time to time we might introduce _minor_ breaking changes, and we will specify these in our upgrade docs.
- `PATCH` (roughly once or twice a month): these releases include bug & security fixes which do not break backwards-compatibility.
- `PATCH` (roughly once or twice a month): these releases include bug and security fixes which do not break backwards-compatibility.
> **NOTE:** While our naming scheme resembles [Semantic Versioning](https://semver.org/), at this time we do not strictly follow its
guidelines to the letter. Our goal is to provide regular releases that are as stable as possible, and we take backwards-compatibility
@@ -20,7 +20,7 @@ This guide will walk you through migrating from the old, two target, scalable co
We recommend having a Grafana instance available to monitor both the existing and new clusters, to make sure there is no data loss during the migration process. The `loki` chart ships with self-monitoring features, including dashboards. These are useful for monitoring the health of the cluster during migration.
**To Migrate from a "read & write" to a "backend, read & write" deployment**
**To Migrate from a "read and write" to a "backend, read and write" deployment**
1. Make sure your deployment is using a new enough version of Loki
@@ -175,7 +175,7 @@ Note, there are a few other DynamoDB provisioning options including DynamoDB aut
## Upgrading Schemas
When a new schema is released and you want to gain the advantages it provides, you can! Loki can transparently query & merge data from across schema boundaries so there is no disruption of service and upgrading is easy.
When a new schema is released and you want to gain the advantages it provides, you can! Loki can transparently query and merge data across schema boundaries, so there is no disruption of service and upgrading is easy.
First, you'll want to create a new [period_config]({{< relref "../configure#period_config" >}}) entry in your [schema_config]({{< relref "../configure#schema_config" >}}). The important thing to remember here is to set this at some point in the _future_ and then roll out the config file changes to Loki. This allows the table manager to create the required table in advance of writes and ensures that existing data isn't queried as if it adheres to the new schema.
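For illustration, here is a `schema_config` with a second `period_config` entry whose `from` date lies in the future at rollout time. Dates, stores, and schema versions below are placeholders — pick the values that match your deployment and the current documentation. Once the `from` date passes, new writes use the new schema while older data remains queryable under the previous entry.

```yaml
schema_config:
  configs:
    # Existing period, already receiving writes.
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
    # New period: the `from` date is in the future relative to when this
    # config is rolled out, so the switch happens cleanly at that boundary.
    - from: 2024-03-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```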
@@ -359,7 +359,7 @@ storage_config:
# Providing a user assigned ID will override use_managed_identity
user_assigned_id: <user-assigned-identity-id>
request_timeout: 0
# Configure this if you are using private azure cloud like azure stack hub and will use this endpoint suffix to compose container & blob storage URL. Ex: https://account_name.endpoint_suffix/container_name/blob_name
# Configure this if you are using a private Azure cloud like Azure Stack Hub; this endpoint suffix is used to compose the container and blob storage URL. Ex: https://account_name.endpoint_suffix/container_name/blob_name