@@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
 * A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement. A spinlock is used here rather than a lightweight
-lock for efficiency; no other locks of any sort should be acquired while
-buffer_strategy_lock is held. This is essential to allow buffer replacement
-to happen in multiple backends with reasonable concurrency.
+exclusion for operations that select buffers for replacement. A spinlock is
+used here rather than a lightweight lock for efficiency; no other locks of any
+sort should be acquired while buffer_strategy_lock is held. This is essential
+to allow buffer replacement to happen in multiple backends with reasonable
+concurrency.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header. This allows operations such as
@@ -158,18 +158,8 @@ unset by sleeping on the buffer's condition variable.
 Normal Buffer Replacement Strategy
 ----------------------------------
 
-There is a "free list" of buffers that are prime candidates for replacement.
-In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
-buffer headers; we maintain head and tail pointers in global variables.
-(Note: although the list links are in the buffer headers, they are
-considered to be protected by the buffer_strategy_lock, not the buffer-header
-spinlocks.) To choose a victim buffer to recycle when there are no free
-buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm. It
+works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned. (This requires only the
@@ -184,20 +174,14 @@ The algorithm for a process that needs to obtain a victim buffer is:
 
 1. Obtain buffer_strategy_lock.
 
-2. If buffer free list is nonempty, remove its head buffer. Release
-buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
-it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
-and return it.
+2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
+nextVictimBuffer for next time. Release buffer_strategy_lock.
 
-3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
-nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
-Release buffer_strategy_lock.
-
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
+3. If the selected buffer is pinned or has a nonzero usage count, it cannot
 be used. Decrement its usage count (if nonzero), reacquire
 buffer_strategy_lock, and return to step 3 to examine the next buffer.
 
-5. Pin the selected buffer, and return.
+4. Pin the selected buffer, and return.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -234,7 +218,7 @@ the ring strategy effectively degrades to the normal strategy.
 
 VACUUM uses a ring like sequential scans, however, the size of this ring is
 controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed
-from the ring. Instead, WAL is flushed if needed to allow reuse of the
+from the ring. Instead, the WAL is flushed if needed to allow reuse of the
 buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers
 were sent to the freelist, which was effectively a buffer ring of 1 buffer,
 resulting in excessive WAL flushing.
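
Reviewer note: the revised victim-selection steps (new numbering 1 to 4 in the third hunk) can be sketched in C roughly as follows. This is a minimal single-threaded sketch, not the server's implementation. The names SketchBuffer, SketchPool, sketch_pin and sketch_get_victim, the pool size, and the usage-count limit of 5 are all assumptions made for illustration. Locking is shown only as comments, and the "all buffers pinned" bail-out is an addition so that the loop always terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

#define NBUFFERS        4   /* hypothetical pool size for this sketch */
#define MAX_USAGE_COUNT 5   /* the "small limit value"; exact number is assumed */

typedef struct
{
    int refcount;           /* number of pins currently held */
    int usage_count;        /* bumped on pin, decayed by the clock sweep */
} SketchBuffer;

typedef struct
{
    SketchBuffer buffers[NBUFFERS];
    size_t       next_victim;   /* plays the role of nextVictimBuffer */
} SketchPool;

/* Pinning increments the usage counter, saturating at the limit. */
void
sketch_pin(SketchBuffer *buf)
{
    buf->refcount++;
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;
}

/*
 * Returns the index of a victim buffer, pinned, or -1 if a full sweep saw
 * only pinned buffers (the bail-out is an assumption of this sketch).
 */
int
sketch_get_victim(SketchPool *pool)
{
    int pinned_run = 0;     /* consecutive pinned buffers seen so far */

    for (;;)
    {
        /* Step 1: real code would take buffer_strategy_lock here. */

        /* Step 2: select the current clock hand and advance it circularly. */
        size_t        idx = pool->next_victim;
        SketchBuffer *buf = &pool->buffers[idx];
        bool          pinned;

        pool->next_victim = (pool->next_victim + 1) % NBUFFERS;
        /* ...and buffer_strategy_lock would be released here. */

        /* Step 3: pinned or recently used buffers cannot be taken. */
        pinned = (buf->refcount > 0);
        if (pinned || buf->usage_count > 0)
        {
            /* Decay the counter as the text states: "if nonzero". */
            if (buf->usage_count > 0)
                buf->usage_count--;

            if (pinned)
            {
                if (++pinned_run >= NBUFFERS)
                    return -1;
            }
            else
                pinned_run = 0;
            continue;       /* examine the next buffer */
        }

        /* Step 4: pin the selected buffer and return it. */
        sketch_pin(buf);
        return (int) idx;
    }
}
```

With a zero-initialized pool in which buffer 0 is pinned and buffer 1 has a usage count of 2, the first call skips both (decaying their counters) and returns buffer 2. A real implementation would also have to handle dirty victims and concurrent pinning, which the note following step 4 mentions and this sketch ignores.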