# Troubleshooting Promtail

This document describes known failure modes of `promtail` in edge cases and the
trade-offs adopted to handle them.

## A tailed file is truncated while `promtail` is not running

Given the following order of events:

1. `promtail` is tailing `/app.log`
2. `promtail`'s current position for `/app.log` is `100` (byte offset)
3. `promtail` is stopped
4. `/app.log` is truncated and new logs are appended to it
5. `promtail` is restarted

When `promtail` is restarted, it reads the previous position (`100`) from the
positions file. Two scenarios are then possible:

- `/app.log` size is less than the position before truncating
- `/app.log` size is greater than or equal to the position before truncating

If the `/app.log` file size is less than the previous position, then the file is
detected as truncated and logs will be tailed starting from position `0`.
Otherwise, if the `/app.log` file size is greater than or equal to the previous
position, `promtail` can't detect it was truncated while not running and will
continue tailing the file from position `100`.

Generally speaking, `promtail` uses only the file path as the key in the
positions file. Whenever `promtail` starts, for each file path referenced in
the positions file, it reads the file from the beginning if the file size is
less than the offset stored in the positions file. Otherwise it continues from
the stored offset, regardless of whether the file was truncated or rolled
multiple times while `promtail` was not running.
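
As an illustration, the positions file for the scenario above might look roughly
like the sketch below. The exact layout is an assumption and may differ between
`promtail` versions. The point it shows is that the entry is keyed by path alone,
so nothing in it identifies which "generation" of `/app.log` the offset belongs
to.

```yaml
# Illustrative positions file contents (layout assumed, not authoritative).
positions:
  /app.log: "100"   # last read byte offset, keyed only by the file path
```

With this entry, a restarted `promtail` resumes at offset `100` whenever
`/app.log` is at least 100 bytes long, and starts from `0` otherwise.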

## Loki is unavailable

For each tailed file, `promtail` reads a line, processes it through the
configured `pipeline_stages`, and pushes the log entry to Loki. Log entries are
batched together before being pushed to Loki. A batch is sent when it reaches
either the maximum batch duration (`client.batch-wait`) or the maximum batch
size (`client.batch-size-bytes`), whichever comes first.
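
The two batching limits can also be set in the config file. The fragment below
is a minimal sketch that assumes the client block accepts `batchwait` and
`batchsize` as the counterparts of the two flags. The values are illustrative
only, so check the configuration reference for your version.

```yaml
clients:
  - url: INGESTER-URL
    # Assumed config-file equivalents of client.batch-wait and
    # client.batch-size-bytes; values are examples, not recommendations.
    batchwait: 1s       # send the batch after at most this long
    batchsize: 102400   # or once the batch reaches this many bytes
```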

If any error occurs while sending a batch of log entries, `promtail` adopts a
"retry then discard" strategy:

- `promtail` retries sending the batch to the ingester up to `maxretries` times
- If all retries fail, `promtail` discards the batch of log entries (_which will
  be lost_) and proceeds with the next one

You can configure `maxretries` and the delay between retries via the
`backoff_config` in the `promtail` config file:

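```yaml
clients:
  - url: INGESTER-URL
    backoff_config:
      minbackoff: 100ms   # minimum delay between retries
      maxbackoff: 5s      # maximum delay between retries
      maxretries: 5       # attempts before the batch is discarded
```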

## Log entries pushed after a `promtail` crash / panic / abrupt termination

When `promtail` shuts down gracefully, it saves the last read offsets in the
positions file, so that on a subsequent restart it continues tailing logs
without duplicates or losses.

In the event of a crash or abrupt termination, `promtail` can't save the last
read offsets in the positions file. When restarted, `promtail` reads the
positions file saved at the last sync period and continues tailing the files
from there. This means that if new log entries were read and pushed to the
ingester between the last sync period and the crash, these log entries will be
sent again to the ingester when `promtail` restarts.
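
The size of this replay window depends on how often the positions file is
synced. The fragment below is a sketch assuming the `positions` block of the
config file exposes a `sync_period` setting. The path and value are
hypothetical.

```yaml
positions:
  filename: /var/lib/promtail/positions.yaml   # hypothetical location
  # Entries read since the last sync may be re-sent after a crash, so a
  # shorter period narrows the window at the cost of more frequent writes.
  sync_period: 10s
```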

However, for each log stream (set of unique labels), the Loki ingester skips all
log entries received out of timestamp order. For this reason, even if `promtail`
sends duplicated logs to the ingester, entries whose timestamp is older than the
latest one received are discarded, which avoids duplicated logs. To leverage
this, it's important that your `pipeline_stages` include the `timestamp` stage,
parsing the log entry timestamp from the log line instead of relying on the
default behaviour of setting the timestamp to the point in time when the line
is read by `promtail`.
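
A minimal sketch of such a pipeline follows. It assumes log lines that begin
with an RFC 3339 timestamp followed by a space. The job name, labels, regular
expression, and the captured field name `time` are illustrative assumptions, so
adapt them to your own log format.

```yaml
scrape_configs:
  - job_name: app                # hypothetical job name
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /app.log
    pipeline_stages:
      # Extract the leading timestamp into a field named "time"
      # (assumes lines like "2019-10-01T12:00:00Z some message").
      - regex:
          expression: '^(?P<time>\S+) (?P<message>.*)$'
      # Use the extracted value as the entry timestamp instead of the
      # time at which promtail read the line.
      - timestamp:
          source: time
          format: RFC3339
```

With the timestamp taken from the line itself, an entry re-sent after a crash
carries the same timestamp as the original, which is what lets the ingester's
ordering check treat it as old.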