Try fixing potential deadlocks in discovery

Manager.ApplyConfig() takes two locks, in this order:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() takes the same locks but in the opposite order (see the sketch below):
- First it locks Manager.targetsMtx
- Then it locks Provider.mu
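
Here is a minimal sketch of the two code paths as described above; this is a paraphrase for illustration only, not the real discovery/manager.go (the real functions have different signatures and do more work):

package discovery

import "sync"

// Provider and Manager are trimmed down to just the fields involved.
type Provider struct {
	mu   sync.RWMutex
	subs map[string]struct{}
}

type Manager struct {
	targetsMtx sync.Mutex
	targets    map[poolKey]struct{}
}

type poolKey struct{ setName, provider string }

func (m *Manager) ApplyConfig(p *Provider) {
	p.mu.Lock()         // lock #1: Provider.mu
	m.targetsMtx.Lock() // lock #2: Manager.targetsMtx
	// ... update m.targets for this provider ...
	m.targetsMtx.Unlock()
	p.mu.Unlock()
}

func (m *Manager) cleaner(p *Provider) {
	m.targetsMtx.Lock() // lock #1: Manager.targetsMtx (opposite order!)
	p.mu.RLock()        // lock #2: Provider.mu
	// ... delete this provider's entries from m.targets ...
	p.mu.RUnlock()
	m.targetsMtx.Unlock()
}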

I've seen a few strange cases of Prometheus hanging on shutdown and never completing it.
From a few traces I was given, it appears that while Prometheus is stuck like this, only discovery.Manager and notifier.Manager are still running.
Those traces also suggest that they are stuck on locks in two functions:
- cleaner() waits on an RLock()
- ApplyConfig() waits on a Lock()

I cannot reproduce it, but I suspect this is a lock-ordering race. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig() calls Provider.mu.Lock() and acquires that lock
- at the same time cleaner() is called for the same Provider instance and it acquires Manager.targetsMtx via Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by cleaner(), so ApplyConfig() blocks there
- at the same time cleaner() calls Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig(), so cleaner() blocks too
- we end up with both functions blocking each other with no way to break the deadlock
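
To make that interleaving concrete, here is a standalone program (made-up names, not the actual discovery code) that reproduces the same lock-order inversion; run as-is it deadlocks, and the Go runtime aborts with "all goroutines are asleep - deadlock!":

package main

import (
	"sync"
	"time"
)

var (
	providerMu sync.RWMutex // stands in for Provider.mu
	targetsMtx sync.Mutex   // stands in for Manager.targetsMtx
)

// applyConfig acquires the locks in the order ApplyConfig() uses: Provider.mu first.
func applyConfig() {
	providerMu.Lock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	targetsMtx.Lock()                 // blocks forever: cleaner() holds targetsMtx
	targetsMtx.Unlock()
	providerMu.Unlock()
}

// cleaner acquires the locks in the opposite order: Manager.targetsMtx first.
func cleaner() {
	targetsMtx.Lock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	providerMu.RLock()                // blocks forever: applyConfig() holds providerMu
	providerMu.RUnlock()
	targetsMtx.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); applyConfig() }()
	go func() { defer wg.Done(); cleaner() }()
	wg.Wait() // never returns
}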

Re-order the lock calls so that both functions acquire the locks in the same order, which should avoid this scenario.
I tried writing a test case for it but couldn't reliably hit the issue.
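
For the record, the kind of test I was trying looked roughly like the sketch below. It is hypothetical: it exercises the standalone applyConfig()/cleaner() helpers from the sketch above rather than the real Manager, and the test name, iteration count, and timeout are made up:

// Assumes the sketch above plus imports "sync", "testing", and "time".
// With the old lock order this times out; with the re-ordered locks it passes.
func TestLockOrderInversion(t *testing.T) {
	done := make(chan struct{})
	go func() {
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ {
			wg.Add(2)
			go func() { defer wg.Done(); applyConfig() }()
			go func() { defer wg.Done(); cleaner() }()
		}
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		// all pairs completed; no deadlock
	case <-time.After(30 * time.Second):
		t.Fatal("applyConfig() and cleaner() appear to be deadlocked")
	}
}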

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Branch: pull/16587/head
Author: Lukasz Mierzwa
Parent: 8b0d33e5b2
Commit: 7d55ee8cc8
1 changed file: discovery/manager.go (+2 -2)

@@ -306,13 +306,13 @@ func (m *Manager) startProvider(ctx context.Context, p *Provider) {
 // cleaner cleans resources associated with provider.
 func (m *Manager) cleaner(p *Provider) {
-	m.targetsMtx.Lock()
 	p.mu.RLock()
+	m.targetsMtx.Lock()
 	for s := range p.subs {
 		delete(m.targets, poolKey{s, p.name})
 	}
-	p.mu.RUnlock()
 	m.targetsMtx.Unlock()
+	p.mu.RUnlock()
 	if p.done != nil {
 		p.done()
 	}
