bufmgr: Implement buffer content locks independently of lwlocks

Until now, buffer content locks were implemented using lwlocks. That has the
obvious advantage of not needing a separate, efficient lock implementation.
However, the time for a dedicated buffer content lock implementation
has come:

1) Hint bits are currently set while holding only a share lock. This leads to
   having to copy pages while they are being written out if checksums are
   enabled, which is not cheap. We would like to add AIO writes; however, once
   many buffers can be written out at the same time, it gets a lot more
   expensive to copy them, particularly because that copy needs to reside in
   shared buffers (for worker mode to have access to the buffer).

   In addition, modifying buffers while they are being written out can cause
   issues with unbuffered/direct IO, as some filesystems (like btrfs) do not
   tolerate it, because their internal checksums can end up corrupted.

   The solution to this is to require a new share-exclusive lock-level to set
   hint bits and to write out buffers, making those operations mutually
   exclusive. We could introduce such a lock-level into the generic lwlock
   implementation; however, it does not look like there would be other users,
   and it does add some overhead to important code paths.

2) For AIO writes we need to be able to race-freely check whether a buffer is
   undergoing IO and whether an exclusive lock on the page can be acquired. That
   is rather hard to do efficiently when the buffer state and the lock state
   are separate atomic variables. This is a major hindrance to allowing writes
   to be done asynchronously (see the sketch after this list).

3) Buffer locks are by far the most frequently taken locks. Optimizing them
   specifically for their use case is worth the effort. E.g., by merging the
   content lock into the buffer state, we will be able to release a buffer's
   content lock and pin in one atomic operation.

4) There are more complicated optimizations, like long-lived "super pinned &
   locked" pages, that cannot realistically be implemented with the generic
   lwlock implementation.
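
To make point 2 above a bit more concrete, here is a hypothetical sketch (not
code from this commit; the function name is invented) of how a combined state
word lets us check for in-progress IO and acquire the exclusive content lock
in a single compare-and-swap. It uses the BM_LOCK_* definitions this commit
adds to buf_internals.h and assumes BufferDesc.state is the 64-bit atomic that
the BUF_*_BITS definitions below imply. A real implementation additionally has
to cope with the buffer header spinlock bit (BM_LOCKED) being set; that is
omitted here.

#include "postgres.h"

#include "storage/buf_internals.h"

/* invented name, illustration only */
static bool
TryLockBufferExclusiveIfIdle(BufferDesc *buf)
{
	uint64		old_state = pg_atomic_read_u64(&buf->state);

	for (;;)
	{
		uint64		new_state;

		/* give up if IO is in flight or any content lock is held */
		if (old_state & (BM_IO_IN_PROGRESS | BM_LOCK_MASK))
			return false;

		new_state = old_state | BM_LOCK_VAL_EXCLUSIVE;

		/* a failed CAS refreshes old_state, so simply retry */
		if (pg_atomic_compare_exchange_u64(&buf->state, &old_state, new_state))
			return true;
	}
}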

Therefore implement content locks inside bufmgr.c. The lock state is stored as
part of BufferDesc.state. The implementation of buffer content locks is fairly
similar to lwlocks, with a few important differences:

1) An additional lock-level, share-exclusive, has been added. It conflicts
   with exclusive locks and with itself, but not with share locks (see the
   sketch after this list).

2) Error recovery for content locks is implemented as part of the already
   existing private-refcount tracking mechanism in combination with resowners,
   instead of a bespoke mechanism, as is the case for lwlocks. This means we
   do not need to add dedicated error-recovery code paths to release all
   content locks (as is done with LWLockReleaseAll() for lwlocks).

3) The lock state is embedded in BufferDesc.state instead of having its own
   struct.

4) The wakeup logic is a tad more complicated due to needing to support the
   additional lock-level.
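
As a rough illustration of the conflict rules from difference 1) above (again
a hypothetical sketch with an invented helper name, not the committed code),
the three lock-levels map onto the BM_LOCK_* bits from buf_internals.h as
follows; the actual acquisition path presumably folds an equivalent test into
its CAS loop:

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

/* invented name, illustration only */
static inline bool
BufferLockConflicts(uint64 state, BufferLockMode mode)
{
	switch (mode)
	{
		case BUFFER_LOCK_SHARE:
			/* only blocked by an exclusive locker */
			return (state & BM_LOCK_VAL_EXCLUSIVE) != 0;
		case BUFFER_LOCK_SHARE_EXCLUSIVE:
			/* blocked by exclusive and by another share-exclusive locker */
			return (state & (BM_LOCK_VAL_EXCLUSIVE |
							 BM_LOCK_VAL_SHARE_EXCLUSIVE)) != 0;
		case BUFFER_LOCK_EXCLUSIVE:
			/* blocked by any lock, including a nonzero share-lock count */
			return (state & BM_LOCK_MASK) != 0;
		default:
			pg_unreachable();
	}
}

Acquisition would then amount to adding BM_LOCK_VAL_SHARED (incrementing the
share-lock count) or setting the respective single bit, in the same atomic
update that performs the conflict check.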

This commit unfortunately introduces some code that is very similar to the
code in lwlock.c; however, the code is not equivalent enough to be merged
easily. The future wins that this commit makes possible seem worth the cost.

As of this commit nothing uses the new share-exclusive lock mode. It will be
used in a future commit. It seemed too complicated to introduce the lock-level
in a separate commit.

It's worth calling out one wart in this commit: Despite content locks not
being lwlocks anymore, they continue to use the PGPROC->lw* fields; that
seemed better than duplicating the relevant infrastructure.

Another thing worth pointing out is that, after this change, content locks are
not reported as LWLock wait events anymore, but as new wait events in the
"Buffer" wait event class (see also 6c5c393b74). The old BufferContent lwlock
tranche has been removed.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Author: Andres Freund
Branch: master
Parent: dac328c8a6
Commit: fcb9c977aa

 src/backend/storage/buffer/buf_init.c           |   5
 src/backend/storage/buffer/bufmgr.c             | 913
 src/backend/utils/activity/wait_event_names.txt |   4
 src/include/storage/buf_internals.h             |  73
 src/include/storage/bufmgr.h                    |  32
 src/include/storage/lwlocklist.h                |   1
 src/include/storage/proc.h                      |   8

src/backend/storage/buffer/buf_init.c
@@ -17,6 +17,7 @@
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/proclist.h"
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
@@ -128,9 +129,7 @@ BufferManagerShmemInit(void)
pgaio_wref_clear(&buf->io_wref);
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
proclist_init(&buf->lock_waiters);
ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
}

src/backend/storage/buffer/bufmgr.c
(file diff suppressed because it is too large)

src/backend/utils/activity/wait_event_names.txt
@@ -287,6 +287,9 @@ ABI_compatibility:
Section: ClassName - WaitEventBuffer
BUFFER_CLEANUP "Waiting to acquire an exclusive pin on a buffer. Buffer pin waits can be protracted if another process holds an open cursor that last read data from the buffer in question."
BUFFER_SHARED "Waiting to acquire a shared lock on a buffer."
BUFFER_SHARE_EXCLUSIVE "Waiting to acquire a share-exclusive lock on a buffer."
BUFFER_EXCLUSIVE "Waiting to acquire an exclusive lock on a buffer."
ABI_compatibility:
@@ -375,7 +378,6 @@ NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer
NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
ReplicationOriginState "Waiting to read or update the progress of one replication origin."
ReplicationSlotIO "Waiting for I/O on a replication slot."
LockFastPath "Waiting to read or update a process' fast-path lock information."

src/include/storage/buf_internals.h
@@ -23,6 +23,7 @@
#include "storage/condition_variable.h"
#include "storage/lwlock.h"
#include "storage/procnumber.h"
#include "storage/proclist_types.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
#include "storage/spin.h"
@@ -35,22 +36,23 @@
* State of the buffer itself (in order):
* - 18 bits refcount
* - 4 bits usage count
* - 10 bits of flags
* - 12 bits of flags
* - 18 bits share-lock count
* - 1 bit share-exclusive locked
* - 1 bit exclusive locked
*
* Combining these values allows to perform some operations without locking
* the buffer header, by modifying them together with a CAS loop.
*
* NB: A future commit will use a significant portion of the remaining bits to
* implement buffer locking as part of the state variable.
*
* The definition of buffer state components is below.
*/
#define BUF_REFCOUNT_BITS 18
#define BUF_USAGECOUNT_BITS 4
#define BUF_FLAG_BITS 10
#define BUF_FLAG_BITS 12
#define BUF_LOCK_BITS (18+2)
StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
"parts of buffer state space need to equal 32");
StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS + BUF_LOCK_BITS <= 64,
"parts of buffer state space need to be <= 64");
/* refcount related definitions */
#define BUF_REFCOUNT_ONE 1
@@ -71,6 +73,19 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
#define BUF_FLAG_MASK \
(((UINT64CONST(1) << BUF_FLAG_BITS) - 1) << BUF_FLAG_SHIFT)
/* lock state related definitions */
#define BM_LOCK_SHIFT \
(BUF_FLAG_SHIFT + BUF_FLAG_BITS)
#define BM_LOCK_VAL_SHARED \
(UINT64CONST(1) << (BM_LOCK_SHIFT))
#define BM_LOCK_VAL_SHARE_EXCLUSIVE \
(UINT64CONST(1) << (BM_LOCK_SHIFT + MAX_BACKENDS_BITS))
#define BM_LOCK_VAL_EXCLUSIVE \
(UINT64CONST(1) << (BM_LOCK_SHIFT + MAX_BACKENDS_BITS + 1))
#define BM_LOCK_MASK \
((((uint64) MAX_BACKENDS) << BM_LOCK_SHIFT) | BM_LOCK_VAL_SHARE_EXCLUSIVE | BM_LOCK_VAL_EXCLUSIVE)
/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) \
((uint32)((state) & BUF_REFCOUNT_MASK))
@@ -107,6 +122,17 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
#define BM_CHECKPOINT_NEEDED BUF_DEFINE_FLAG( 8)
/* permanent buffer (not unlogged, or init fork) */
#define BM_PERMANENT BUF_DEFINE_FLAG( 9)
/* content lock has waiters */
#define BM_LOCK_HAS_WAITERS BUF_DEFINE_FLAG(10)
/* waiter for content lock has been signalled but not yet run */
#define BM_LOCK_WAKE_IN_PROGRESS BUF_DEFINE_FLAG(11)
StaticAssertDecl(MAX_BACKENDS_BITS <= BUF_REFCOUNT_BITS,
"MAX_BACKENDS_BITS needs to be <= BUF_REFCOUNT_BITS");
StaticAssertDecl(MAX_BACKENDS_BITS <= (BUF_LOCK_BITS - 2),
"MAX_BACKENDS_BITS needs to be <= BUF_LOCK_BITS - 2");
/*
* The maximum allowed value of usage_count represents a tradeoff between
@@ -120,8 +146,6 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
StaticAssertDecl(BM_MAX_USAGE_COUNT < (UINT64CONST(1) << BUF_USAGECOUNT_BITS),
"BM_MAX_USAGE_COUNT doesn't fit in BUF_USAGECOUNT_BITS bits");
StaticAssertDecl(MAX_BACKENDS_BITS <= BUF_REFCOUNT_BITS,
"MAX_BACKENDS_BITS needs to be <= BUF_REFCOUNT_BITS");
/*
* Buffer tag identifies which disk block the buffer contains.
@@ -265,9 +289,6 @@ BufMappingPartitionLockByIndex(uint32 index)
* it is held. However, existing buffer pins may be released while the buffer
* header spinlock is held, using an atomic subtraction.
*
* The LWLock can take care of itself. The buffer header lock is *not* used
* to control access to the data in the buffer!
*
* If we have the buffer pinned, its tag can't change underneath us, so we can
* examine the tag without locking the buffer header. Also, in places we do
* one-time reads of the flags without bothering to lock the buffer header;
@@ -280,6 +301,15 @@ BufMappingPartitionLockByIndex(uint32 index)
* wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER. At present,
* there can be only one such waiter per buffer.
*
* The content of buffers is protected via the buffer content lock,
* implemented as part of the buffer state. Note that the buffer header lock
* is *not* used to control access to the data in the buffer! We used to use
* an LWLock to implement the content lock, but having a dedicated
* implementation of content locks allows us to implement some otherwise hard
* things (e.g. race-freely checking if AIO is in progress before locking a
* buffer exclusively) and enables otherwise impossible optimizations
* (e.g. unlocking and unpinning a buffer in one atomic operation).
*
* We use this same struct for local buffer headers, but the locks are not
* used and not all of the flag bits are useful either. To avoid unnecessary
* overhead, manipulations of the state field should be done without actual
@@ -321,7 +351,12 @@ typedef struct BufferDesc
int wait_backend_pgprocno;
PgAioWaitRef io_wref; /* set iff AIO is in progress */
LWLock content_lock; /* to lock access to buffer contents */
/*
* List of PGPROCs waiting for the buffer content lock. Protected by the
* buffer header spinlock.
*/
proclist_head lock_waiters;
} BufferDesc;
/*
@@ -408,12 +443,6 @@ BufferDescriptorGetIOCV(const BufferDesc *bdesc)
return &(BufferIOCVArray[bdesc->buf_id]).cv;
}
static inline LWLock *
BufferDescriptorGetContentLock(const BufferDesc *bdesc)
{
return (LWLock *) (&bdesc->content_lock);
}
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
@@ -491,18 +520,18 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
/* ResourceOwner callbacks to hold buffer I/Os and pins */
extern PGDLLIMPORT const ResourceOwnerDesc buffer_io_resowner_desc;
extern PGDLLIMPORT const ResourceOwnerDesc buffer_pin_resowner_desc;
extern PGDLLIMPORT const ResourceOwnerDesc buffer_resowner_desc;
/* Convenience wrappers over ResourceOwnerRemember/Forget */
static inline void
ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer)
{
ResourceOwnerRemember(owner, Int32GetDatum(buffer), &buffer_pin_resowner_desc);
ResourceOwnerRemember(owner, Int32GetDatum(buffer), &buffer_resowner_desc);
}
static inline void
ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
{
ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_pin_resowner_desc);
ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_resowner_desc);
}
static inline void
ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
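
To spell out the new state layout above: the share-lock count occupies the
MAX_BACKENDS_BITS bits starting at BM_LOCK_SHIFT, below the two single-bit
lock levels. A hypothetical accessor (not part of the visible diff; the
suppressed bufmgr.c changes may well define an equivalent) would be:

/* mask and accessor for the share-lock count; names invented for illustration */
#define BM_LOCK_SHARE_COUNT_MASK \
	(((uint64) MAX_BACKENDS) << BM_LOCK_SHIFT)
#define BUF_STATE_GET_SHARE_LOCK_COUNT(state) \
	((uint32) (((state) & BM_LOCK_SHARE_COUNT_MASK) >> BM_LOCK_SHIFT))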

src/include/storage/bufmgr.h
@@ -203,7 +203,20 @@ extern PGDLLIMPORT int32 *LocalRefCount;
typedef enum BufferLockMode
{
BUFFER_LOCK_UNLOCK,
/*
* A share lock conflicts with exclusive locks.
*/
BUFFER_LOCK_SHARE,
/*
* A share-exclusive lock conflicts with itself and exclusive locks.
*/
BUFFER_LOCK_SHARE_EXCLUSIVE,
/*
* An exclusive lock conflicts with every other lock type.
*/
BUFFER_LOCK_EXCLUSIVE,
} BufferLockMode;
@@ -302,7 +315,24 @@ extern void BufferGetTag(Buffer buffer, RelFileLocator *rlocator,
extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, BufferLockMode mode);
extern void UnlockBuffer(Buffer buffer);
extern void LockBufferInternal(Buffer buffer, BufferLockMode mode);
/*
* Handling BUFFER_LOCK_UNLOCK in bufmgr.c leads to sufficiently worse branch
* prediction to impact performance. Therefore handle that switch here, where
* most of the time `mode` will be a constant and thus can be optimized out by
* the compiler.
*/
static inline void
LockBuffer(Buffer buffer, BufferLockMode mode)
{
if (mode == BUFFER_LOCK_UNLOCK)
UnlockBuffer(buffer);
else
LockBufferInternal(buffer, mode);
}
extern bool ConditionalLockBuffer(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
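
A usage note on the bufmgr.h hunk above: callers keep the existing
LockBuffer() call pattern; the BUFFER_LOCK_UNLOCK case is merely resolved in
the new inline wrapper. A minimal, hypothetical caller fragment (rel and
blkno stand in for whatever the caller already has; as noted above, nothing
in this commit takes BUFFER_LOCK_SHARE_EXCLUSIVE yet) could look like:

	Buffer		buf = ReadBuffer(rel, blkno);

	LockBuffer(buf, BUFFER_LOCK_SHARE);		/* inlines to LockBufferInternal() */
	/* ... examine the page ... */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);	/* inlines to UnlockBuffer() */

	ReleaseBuffer(buf);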

src/include/storage/lwlocklist.h
@@ -105,7 +105,6 @@ PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
PG_LWLOCKTRANCHE(REPLICATION_ORIGIN_STATE, ReplicationOriginState)
PG_LWLOCKTRANCHE(REPLICATION_SLOT_IO, ReplicationSlotIO)
PG_LWLOCKTRANCHE(LOCK_FASTPATH, LockFastPath)

src/include/storage/proc.h
@@ -242,7 +242,13 @@ struct PGPROC
*/
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
/*
* Info about LWLock the process is currently waiting for, if any.
*
* This is currently used both for lwlocks and buffer content locks, which
* is acceptable, although not pretty, because a backend can't wait for
* both types of locks at the same time.
*/
uint8 lwWaiting; /* see LWLockWaitState */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
