/*-------------------------------------------------------------------------
 *
 * buf_internals.h
 *	  Internal definitions for buffer manager and the buffer replacement
 *	  strategy.
 *
 *
 * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/storage/buf_internals.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef BUFMGR_INTERNALS_H
#define BUFMGR_INTERNALS_H

#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "storage/latch.h"
|
|
|
|
#include "storage/lwlock.h"
|
|
|
|
#include "storage/shmem.h"
|
|
|
|
#include "storage/smgr.h"
|
|
|
|
#include "storage/spin.h"
|
|
|
|
#include "utils/relcache.h"
|
|
|
|
|
/*
 * Buffer state is a single 32-bit variable where the following data is
 * combined.
 *
 * - 18 bits refcount
 * - 4 bits usage count
 * - 10 bits of flags
 *
 * Combining these values allows us to perform some operations without
 * locking the buffer header, by modifying them together with a CAS loop.
 *
 * The definition of buffer state components is below.
 */
#define BUF_REFCOUNT_ONE 1
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_ONE (1U << 18)
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U

/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)
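
/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * how the packed state word can be updated without taking the buffer header
 * spinlock, using a compare-and-swap loop.  The example bumps the refcount,
 * roughly the shape of what PinBuffer() in bufmgr.c does; the BM_LOCKED
 * retry path and usage-count handling are omitted.
 *
 *		uint32		old_buf_state = pg_atomic_read_u32(&buf->state);
 *
 *		for (;;)
 *		{
 *			uint32		buf_state = old_buf_state + BUF_REFCOUNT_ONE;
 *
 *			if (pg_atomic_compare_exchange_u32(&buf->state,
 *											   &old_buf_state, buf_state))
 *				break;			-- swapped in the new refcount
 *			-- old_buf_state now holds the current value; retry
 *		}
 */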

/*
 * Flags for buffer descriptors
 *
 * Note: BM_TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED				(1U << 22)	/* buffer header is locked */
#define BM_DIRTY				(1U << 23)	/* data needs writing */
#define BM_VALID				(1U << 24)	/* data is valid */
#define BM_TAG_VALID			(1U << 25)	/* tag is assigned */
#define BM_IO_IN_PROGRESS		(1U << 26)	/* read or write in progress */
#define BM_IO_ERROR				(1U << 27)	/* previous I/O failed */
#define BM_JUST_DIRTIED			(1U << 28)	/* dirtied since write started */
#define BM_PIN_COUNT_WAITER		(1U << 29)	/* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED	(1U << 30)	/* must write for checkpoint */
#define BM_PERMANENT			(1U << 31)	/* permanent buffer (not unlogged,
											 * or init fork) */

/*
 * The maximum allowed value of usage_count represents a tradeoff between
 * accuracy and speed of the clock-sweep buffer management algorithm.  A
 * large value (comparable to NBuffers) would approximate LRU semantics.
 * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of
 * clock sweeps to find a free buffer, so in practice we don't want the
 * value to be very large.
 */
#define BM_MAX_USAGE_COUNT	5

/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there are any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
	RelFileNode rnode;			/* physical relation identifier */
	ForkNumber	forkNum;
	BlockNumber blockNum;		/* blknum relative to begin of reln */
} BufferTag;

#define CLEAR_BUFFERTAG(a) \
( \
	(a).rnode.spcNode = InvalidOid, \
	(a).rnode.dbNode = InvalidOid, \
	(a).rnode.relNode = InvalidOid, \
	(a).forkNum = InvalidForkNumber, \
	(a).blockNum = InvalidBlockNumber \
)

#define INIT_BUFFERTAG(a,xx_rnode,xx_forkNum,xx_blockNum) \
( \
	(a).rnode = (xx_rnode), \
	(a).forkNum = (xx_forkNum), \
	(a).blockNum = (xx_blockNum) \
)

#define BUFFERTAGS_EQUAL(a,b) \
( \
	RelFileNodeEquals((a).rnode, (b).rnode) && \
	(a).blockNum == (b).blockNum && \
	(a).forkNum == (b).forkNum \
)

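/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * initializing a tag for a given relation fork/block and comparing it
 * against a buffer descriptor's tag.  "rel", "blockNum" and "bufHdr" are
 * hypothetical local variables; only the macros above come from this file.
 *
 *		BufferTag	newTag;
 *
 *		INIT_BUFFERTAG(newTag, rel->rd_node, MAIN_FORKNUM, blockNum);
 *		if (BUFFERTAGS_EQUAL(newTag, bufHdr->tag))
 *			... the descriptor already maps this block ...
 */
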
/*
 * The shared buffer mapping table is partitioned to reduce contention.
 * To determine which partition lock a given tag requires, compute the tag's
 * hash code with BufTableHashCode(), then apply BufMappingPartitionLock().
 * NB: NUM_BUFFER_PARTITIONS must be a power of 2!
 */
#define BufTableHashPartition(hashcode) \
	((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) \
	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
		BufTableHashPartition(hashcode)].lock)
#define BufMappingPartitionLockByIndex(i) \
	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)

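/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * the usual pattern for probing the shared buffer mapping table under the
 * appropriate partition lock, using BufTableHashCode()/BufTableLookup()
 * declared later in this file.
 *
 *		uint32		newHash = BufTableHashCode(&newTag);
 *		LWLock	   *newPartitionLock = BufMappingPartitionLock(newHash);
 *		int			buf_id;
 *
 *		LWLockAcquire(newPartitionLock, LW_SHARED);
 *		buf_id = BufTableLookup(&newTag, newHash);
 *		LWLockRelease(newPartitionLock);
 *		if (buf_id >= 0)
 *			... block is already present in shared buffers ...
 */
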
/*
 * BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, the buffer header
 * lock is a spinlock which is combined with flags, refcount and usagecount
 * into a single atomic variable.  This layout allows us to do some operations
 * in a single atomic operation, without actually acquiring and releasing the
 * spinlock; for instance, increasing or decreasing the refcount.  The buf_id
 * field never changes after initialization, so it does not need locking.
 * freeNext is protected by the buffer_strategy_lock, not the buffer header
 * lock.  The LWLock can take care of itself.  The buffer header lock is *not*
 * used to control access to the data in the buffer!
 *
 * It's assumed that nobody changes the state field while the buffer header
 * lock is held.  Thus the buffer header lock holder can do complex updates of
 * the state variable in a single write, simultaneously with lock release
 * (clearing the BM_LOCKED flag).  On the other hand, updating the state
 * without holding the buffer header lock is restricted to CAS, which ensures
 * that the BM_LOCKED flag is not set.  Atomic increment/decrement, OR/AND
 * etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either.  To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int			buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pid;	/* backend PID of pin-count waiter */
	int			freeNext;		/* link in freelist chain */

	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;

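/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * per the locking rules above, a backend holding a pin may read the packed
 * state word (and the tag) without taking the header spinlock, e.g. for a
 * one-time flag test.
 *
 *		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 *
 *		if (buf_state & BM_VALID)
 *			... page contents have been read in and can be used ...
 */
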
/*
 * Concurrent access to buffer headers has proven to be more efficient if
 * they're cache line aligned.  So we force the start of the BufferDescriptors
 * array to be on a cache line boundary and force the elements to be cache
 * line sized.
 *
 * XXX: As this primarily matters in highly concurrent workloads which
 * probably all are 64bit these days, and the space wastage would be a bit
 * more noticeable on 32bit systems, we don't force the stride to be cache
 * line sized on those.  If somebody does actual performance testing, we can
 * reevaluate.
 *
 * Note that local buffer descriptors aren't forced to be aligned - as there's
 * no concurrent access to those it's unlikely to be beneficial.
 *
 * We use 64 bytes as the cache line size here, because that's the most common
 * size.  Making it bigger would be a waste of memory.  Even if running on a
 * platform with either 32 or 128 byte line sizes, it's good to align to
 * boundaries and avoid false sharing.
 */
#define BUFFERDESC_PAD_TO_SIZE	(SIZEOF_VOID_P == 8 ? 64 : 1)

typedef union BufferDescPadded
{
	BufferDesc	bufferdesc;
	char		pad[BUFFERDESC_PAD_TO_SIZE];
} BufferDescPadded;

#define GetBufferDescriptor(id) (&BufferDescriptors[(id)].bufferdesc)
#define GetLocalBufferDescriptor(id) (&LocalBufferDescriptors[(id)])

#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)

#define BufferDescriptorGetIOLock(bdesc) \
	(&(BufferIOLWLockArray[(bdesc)->buf_id]).lock)
#define BufferDescriptorGetContentLock(bdesc) \
	((LWLock*) (&(bdesc)->content_lock))

extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;

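/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * mapping a shared (non-local) Buffer number to its descriptor and locking
 * the page contents.  Shared buffer numbers are 1-based while descriptor
 * indexes are 0-based, hence the "- 1" / "+ 1" in the macros above.
 *
 *		BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
 *
 *		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
 *		... read the page ...
 *		LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 */
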
/*
 * The freeNext field is either the index of the next freelist entry,
 * or one of these special values:
 */
#define FREENEXT_END_OF_LIST	(-1)
#define FREENEXT_NOT_IN_LIST	(-2)

/*
 * Functions for acquiring/releasing a shared buffer header's spinlock.  Do
 * not apply these to local buffers!
 */
extern uint32 LockBufHdr(BufferDesc *desc);
#define UnlockBufHdr(desc, s)	\
	do {	\
		pg_write_barrier(); \
		pg_atomic_write_u32(&(desc)->state, (s) & (~BM_LOCKED)); \
	} while (0)

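/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * the usual pattern for changing flag bits under the buffer header spinlock.
 * LockBufHdr() returns the current state word (with BM_LOCKED set); the
 * possibly-modified value is handed back to UnlockBufHdr(), which clears
 * BM_LOCKED in the same atomic write.
 *
 *		uint32		buf_state = LockBufHdr(bufHdr);
 *
 *		buf_state |= BM_DIRTY | BM_JUST_DIRTIED;
 *		UnlockBufHdr(bufHdr, buf_state);
 */
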
/*
 * The PendingWriteback & WritebackContext structures are used to keep
 * information about pending flush requests to be issued to the OS.
 */
typedef struct PendingWriteback
{
	/* could store different types of pending flushes here */
	BufferTag	tag;
} PendingWriteback;

/* struct forward declared in bufmgr.h */
typedef struct WritebackContext
{
	/* pointer to the max number of writeback requests to coalesce */
	int		   *max_pending;

	/* current number of pending writeback requests */
	int			nr_pending;

	/* pending requests */
	PendingWriteback pending_writebacks[WRITEBACK_MAX_PENDING_FLUSHES];
} WritebackContext;

/* in buf_init.c */
extern PGDLLIMPORT BufferDescPadded *BufferDescriptors;
extern PGDLLIMPORT WritebackContext BackendWritebackContext;

/* in localbuf.c */
extern BufferDesc *LocalBufferDescriptors;

/* in bufmgr.c */

/*
 * Structure to sort buffers per file on checkpoints.
 *
 * This structure is allocated per buffer in shared memory, so it should be
 * kept as small as possible.
 */
typedef struct CkptSortItem
{
	Oid			tsId;
	Oid			relNode;
	ForkNumber	forkNum;
	BlockNumber blockNum;
	int			buf_id;
} CkptSortItem;

extern CkptSortItem *CkptBufferIds;

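/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * one plausible ordering for CkptSortItem entries -- group buffers by
 * tablespace, then relation, fork and block number -- so that checkpoint
 * writes to the same file become sequential.  The actual comparator lives in
 * bufmgr.c.
 *
 *		if (a->tsId != b->tsId)
 *			return (a->tsId < b->tsId) ? -1 : 1;
 *		if (a->relNode != b->relNode)
 *			return (a->relNode < b->relNode) ? -1 : 1;
 *		if (a->forkNum != b->forkNum)
 *			return (a->forkNum < b->forkNum) ? -1 : 1;
 *		if (a->blockNum != b->blockNum)
 *			return (a->blockNum < b->blockNum) ? -1 : 1;
 *		return 0;
 */
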
/*
 * Internal buffer management routines
 */
/* bufmgr.c */
extern void WritebackContextInit(WritebackContext *context, int *max_pending);
extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);

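/*
 * Illustrative sketch (editor's addition, not part of the original header):
 * how a writer can batch kernel writeback hints.  "flush_after" stands for
 * one of the *_flush_after settings; the tag is assumed to describe a block
 * that was just written out.  Scheduling issues the writebacks automatically
 * once the configured limit is reached; the final call flushes the rest.
 *
 *		WritebackContext wb_context;
 *
 *		WritebackContextInit(&wb_context, &flush_after);
 *		...
 *		ScheduleBufferTagForWriteback(&wb_context, &bufHdr->tag);
 *		...
 *		IssuePendingWritebacks(&wb_context);
 */
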
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
									 uint32 *buf_state);
extern void StrategyFreeBuffer(BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
								 BufferDesc *buf);

extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(int bgwprocno);

extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
extern bool have_free_buffer(void);

/* buf_table.c */
extern Size BufTableShmemSize(int size);
extern void InitBufTable(int size);
extern uint32 BufTableHashCode(BufferTag *tagPtr);
extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);

/* localbuf.c */
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
												ForkNumber forkNum,
												BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
									BlockNumber blockNum, bool *foundPtr);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelFileNodeLocalBuffers(RelFileNode rnode, ForkNumber forkNum,
										BlockNumber firstDelBlock);
extern void DropRelFileNodeAllLocalBuffers(RelFileNode rnode);
extern void AtEOXact_LocalBuffers(bool isCommit);

#endif /* BUFMGR_INTERNALS_H */