Try fixing potential deadlocks in discovery

Manager.ApplyConfig() takes two locks, in this order:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() takes the same locks but in the opposite order (see the sketch below):
- First it locks Manager.targetsMtx
- Then it locks Provider.mu
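
Here is a minimal sketch of the two code paths as described above; this is a paraphrase for illustration only, not the real discovery/manager.go (the real functions have different signatures and do more work):

package discovery

import "sync"

// Provider and Manager are trimmed down to just the fields involved.
type Provider struct {
	mu   sync.RWMutex
	subs map[string]struct{}
}

type Manager struct {
	targetsMtx sync.Mutex
	targets    map[poolKey]struct{}
}

type poolKey struct{ setName, provider string }

func (m *Manager) ApplyConfig(p *Provider) {
	p.mu.Lock()         // lock #1: Provider.mu
	m.targetsMtx.Lock() // lock #2: Manager.targetsMtx
	// ... update m.targets for this provider ...
	m.targetsMtx.Unlock()
	p.mu.Unlock()
}

func (m *Manager) cleaner(p *Provider) {
	m.targetsMtx.Lock() // lock #1: Manager.targetsMtx (opposite order!)
	p.mu.RLock()        // lock #2: Provider.mu
	// ... delete this provider's entries from m.targets ...
	p.mu.RUnlock()
	m.targetsMtx.Unlock()
}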

I've seen a few strange cases of Prometheus hanging on shutdown and never completing it.
From a few traces I was given, it appears that while Prometheus is stuck like this, only discovery.Manager and notifier.Manager are still running.
Those traces also suggest that they are stuck on locks in two functions:
- cleaner() waits on an RLock()
- ApplyConfig() waits on a Lock()

I cannot reproduce it, but I suspect this is a lock-ordering race. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig() calls Provider.mu.Lock() and acquires that lock
- at the same time cleaner() is called for the same Provider instance and it acquires Manager.targetsMtx via Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by cleaner(), so ApplyConfig() blocks there
- at the same time cleaner() calls Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig(), so cleaner() blocks too
- we end up with both functions blocking each other with no way to break the deadlock
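
To make that interleaving concrete, here is a standalone program (made-up names, not the actual discovery code) that reproduces the same lock-order inversion; run as-is it deadlocks, and the Go runtime aborts with "all goroutines are asleep - deadlock!":

package main

import (
	"sync"
	"time"
)

var (
	providerMu sync.RWMutex // stands in for Provider.mu
	targetsMtx sync.Mutex   // stands in for Manager.targetsMtx
)

// applyConfig acquires the locks in the order ApplyConfig() uses: Provider.mu first.
func applyConfig() {
	providerMu.Lock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	targetsMtx.Lock()                 // blocks forever: cleaner() holds targetsMtx
	targetsMtx.Unlock()
	providerMu.Unlock()
}

// cleaner acquires the locks in the opposite order: Manager.targetsMtx first.
func cleaner() {
	targetsMtx.Lock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	providerMu.RLock()                // blocks forever: applyConfig() holds providerMu
	providerMu.RUnlock()
	targetsMtx.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); applyConfig() }()
	go func() { defer wg.Done(); cleaner() }()
	wg.Wait() // never returns
}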

Re-order the lock calls so that both functions acquire the locks in the same order, which should avoid this scenario.
I tried writing a test case for it but couldn't reliably hit the issue.
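
For the record, the kind of test I was trying looked roughly like the sketch below. It is hypothetical: it exercises the standalone applyConfig()/cleaner() helpers from the sketch above rather than the real Manager, and the test name, iteration count, and timeout are made up:

// Assumes the sketch above plus imports "sync", "testing", and "time".
// With the old lock order this times out; with the re-ordered locks it passes.
func TestLockOrderInversion(t *testing.T) {
	done := make(chan struct{})
	go func() {
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ {
			wg.Add(2)
			go func() { defer wg.Done(); applyConfig() }()
			go func() { defer wg.Done(); cleaner() }()
		}
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		// all pairs completed; no deadlock
	case <-time.After(30 * time.Second):
		t.Fatal("applyConfig() and cleaner() appear to be deadlocked")
	}
}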

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Branch: pull/16587/head
Author: Lukasz Mierzwa
Parent: 8b0d33e5b2
Commit: 7d55ee8cc8
1 changed file: discovery/manager.go (+2 -2)

@@ -306,13 +306,13 @@ func (m *Manager) startProvider(ctx context.Context, p *Provider) {
 // cleaner cleans resources associated with provider.
 func (m *Manager) cleaner(p *Provider) {
-	m.targetsMtx.Lock()
 	p.mu.RLock()
+	m.targetsMtx.Lock()
 	for s := range p.subs {
 		delete(m.targets, poolKey{s, p.name})
 	}
-	p.mu.RUnlock()
 	m.targetsMtx.Unlock()
+	p.mu.RUnlock()
 	if p.done != nil {
 		p.done()
 	}
