**What this PR does / why we need it**:
This PR fixes a bug in the shutdown process of the query frontend. To shut
down the frontend gracefully without cancelling any in-flight
operations/sub-requests, the frontend service has to wait until all
in-flight operations have finished. Only then may its sub-services be
stopped.
To enforce a deterministic shutdown order for the service and its child
services, the child services need to be started with their own context
(not with their parent's context). Otherwise, cancelling the frontend's
context also cancels the contexts of its child services, so the parent
and its child services stop in parallel.
During shutdown, the `RoundTripGRPC` function still needs to accept
round-tripping requests. This is because the frontend round tripper is
wrapped in a query-range tripperware, which splits an incoming (HTTP)
query request into multiple (gRPC) query requests if needed, so an
in-flight query may still issue new sub-requests after shutdown has
begun.
**Which issue(s) this PR fixes**:
This fixes a race condition in which queries failed because the query
frontend shut down before the queriers processing in-flight requests
could send their query responses back to the frontend. This manifested
as an increased query error rate during restarts (rollouts) of the query
frontend.
**Special notes for your reviewer**:
**Checklist**
- [x] Reviewed the `CONTRIBUTING.md` guide
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
```go
f.StringVar(&cfg.SchedulerAddress, "frontend.scheduler-address", "", "DNS hostname used for finding query-schedulers.")
f.DurationVar(&cfg.DNSLookupPeriod, "frontend.scheduler-dns-lookup-period", 10*time.Second, "How often to resolve the scheduler-address, in order to look for new query-scheduler instances. Also used to determine how often to poll the scheduler-ring for addresses if the scheduler-ring is configured.")
f.IntVar(&cfg.WorkerConcurrency, "frontend.scheduler-worker-concurrency", 5, "Number of concurrent workers forwarding queries to single query-scheduler.")
f.DurationVar(&cfg.GracefulShutdownTimeout, "frontend.graceful-shutdown-timeout", 5*time.Minute, "Time to wait for inflight requests to finish before forcefully shutting down. This needs to be aligned with the query timeout and the graceful termination period of the process orchestrator.")
f.Var((*flagext.StringSlice)(&cfg.InfNames), "frontend.instance-interface-names", "Name of network interface to read address from. This address is sent to query-scheduler and querier, which uses it to send the query response back to query-frontend.")
```