/*-------------------------------------------------------------------------
 *
 * bufmgr.h
 *	  POSTGRES buffer manager definitions.
 *
 *
 * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/storage/bufmgr.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef BUFMGR_H
#define BUFMGR_H

#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/relfilenode.h"
#include "utils/relcache.h"
#include "utils/snapmgr.h"

typedef void *Block;

/* Possible arguments for GetAccessStrategy() */
typedef enum BufferAccessStrategyType
{
	BAS_NORMAL,					/* Normal random access */
	BAS_BULKREAD,				/* Large read-only scan (hint bit updates are
								 * ok) */
	BAS_BULKWRITE,				/* Large multi-block write (e.g. COPY IN) */
	BAS_VACUUM					/* VACUUM */
} BufferAccessStrategyType;
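/*
 * Illustrative sketch, not part of the original header: a large sequential
 * read might ask for a ring-buffer strategy so it does not flush the whole
 * shared buffer pool, roughly:
 *
 *		BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
 *		Buffer		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
 *											 RBM_NORMAL, strategy);
 *		...
 *		ReleaseBuffer(buf);
 *		FreeAccessStrategy(strategy);
 *
 * "rel" and "blkno" stand for a caller-supplied Relation and block number;
 * GetAccessStrategy() and FreeAccessStrategy() are declared below under
 * "in freelist.c".
 */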
/* Possible modes for ReadBufferExtended() */
typedef enum
{
	RBM_NORMAL,					/* Normal read */
	RBM_ZERO_AND_LOCK,			/* Don't read from disk, caller will
								 * initialize.  Also locks the page. */
	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
								 * in "cleanup" mode */
	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
	RBM_NORMAL_NO_LOG			/* Don't log page as invalid during WAL
								 * replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;
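/*
 * Illustrative sketch, not part of the original header: the mode chooses
 * how a missing or damaged page is handled.  A caller about to overwrite a
 * page completely can skip the physical read, roughly:
 *
 *		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
 *								 RBM_ZERO_AND_LOCK, NULL);
 *
 * and gets back a pinned, locked, all-zeros page, while a caller that must
 * survive torn or corrupt pages can pass RBM_ZERO_ON_ERROR and test the
 * result with PageIsNew().  "rel" and "blkno" are caller-supplied
 * placeholders.
 */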
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;

/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;

/* in globals.c ... this duplicates miscadmin.h */
extern PGDLLIMPORT int NBuffers;

/* in bufmgr.c */
extern bool zero_damaged_pages;
extern int	bgwriter_lru_maxpages;
extern double bgwriter_lru_multiplier;
extern bool track_io_timing;
extern int	effective_io_concurrency;
extern int	maintenance_io_concurrency;

extern int	checkpoint_flush_after;
extern int backend_flush_after;
extern int	bgwriter_flush_after;

/* in buf_init.c */
extern PGDLLIMPORT char *BufferBlocks;

/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
extern PGDLLIMPORT Block *LocalBufferBlockPointers;
extern PGDLLIMPORT int32 *LocalRefCount;

/* upper limit for effective_io_concurrency */
#define MAX_IO_CONCURRENCY	1000

/* special block number for ReadBuffer() */
#define P_NEW	InvalidBlockNumber	/* grow the file to get a new page */

/*
 * Buffer content lock modes (mode argument for LockBuffer())
 */
#define BUFFER_LOCK_UNLOCK		0
#define BUFFER_LOCK_SHARE		1
#define BUFFER_LOCK_EXCLUSIVE 2
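/*
 * Illustrative sketch, not part of the original header: the usual pattern
 * is to pin the page, take the content lock in the required mode, work on
 * the page, and then drop both, roughly:
 *
 *		Buffer		buf = ReadBuffer(rel, blkno);
 *
 *		LockBuffer(buf, BUFFER_LOCK_SHARE);
 *		... inspect BufferGetPage(buf) ...
 *		UnlockReleaseBuffer(buf);
 *
 * Use LockBuffer(buf, BUFFER_LOCK_UNLOCK) followed by ReleaseBuffer(buf)
 * instead if the pin must outlive the lock.  "rel" and "blkno" are
 * caller-supplied placeholders.
 */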
/*
 * These routines are beaten on quite heavily, hence the macroization.
 */

/*
 * BufferIsValid
 *		True iff the given buffer number is valid (either as a shared
 *		or local buffer).
 *
 * Note: For a long time this was defined the same as BufferIsPinned,
 * that is it would say False if you didn't hold a pin on the buffer.
 * I believe this was bogus and served only to mask logic errors.
 * Code should always know whether it has a buffer reference,
 * independently of the pin state.
 *
 * Note: For a further long time this was not quite the inverse of the
 * BufferIsInvalid() macro, in that it also did sanity checks to verify
 * that the buffer number was in range.  Most likely, this macro was
 * originally intended only to be used in assertions, but its use has
 * since expanded quite a bit, and the overhead of making those checks
 * even in non-assert-enabled builds can be significant.  Thus, we've
 * now demoted the range checks to assertions within the macro itself.
 */
#define BufferIsValid(bufnum) \
( \
	AssertMacro((bufnum) <= NBuffers && (bufnum) >= -NLocBuffer), \
	(bufnum) != InvalidBuffer \
)

/*
 * BufferGetBlock
 *		Returns a reference to a disk page image associated with a buffer.
 *
 * Note:
 *		Assumes buffer is valid.
 */
#define BufferGetBlock(buffer) \
( \
	AssertMacro(BufferIsValid(buffer)), \
	BufferIsLocal(buffer) ? \
		LocalBufferBlockPointers[-(buffer) - 1] \
	: \
		(Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ) \
)
/*
 * BufferGetPageSize
 *		Returns the page size within a buffer.
 *
 * Notes:
 *		Assumes buffer is valid.
 *
 *		The buffer can be a raw disk block and need not contain a valid
 *		(formatted) disk page.
 */
/* XXX should dig out of buffer descriptor */
#define BufferGetPageSize(buffer) \
( \
	AssertMacro(BufferIsValid(buffer)), \
	(Size) BLCKSZ \
)

/*
 * BufferGetPage
 *		Returns the page associated with a buffer.
 *
 * When this is called as part of a scan, there may be a need for a nearby
 * call to TestForOldSnapshot().  See the definition of that for details.
 */
#define BufferGetPage(buffer) ((Page) BufferGetBlock(buffer))

/*
 * prototypes for functions in bufmgr.c
 */
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
						   BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
								 BlockNumber blockNum, ReadBufferMode mode,
								 BufferAccessStrategy strategy);
extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
										ForkNumber forkNum, BlockNumber blockNum,
										ReadBufferMode mode, BufferAccessStrategy strategy);
extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
								   BlockNumber blockNum);

extern void InitBufferPool(void);
extern void InitBufferPoolAccess(void);
extern void InitBufferPoolBackend(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
extern void CheckPointBuffers(int flags);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
												   ForkNumber forkNum);
extern void FlushOneBuffer(Buffer buffer);
extern void FlushRelationBuffers(Relation rel);
extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
								   int nforks, BlockNumber *firstDelBlock);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);

#define RelationGetNumberOfBlocks(reln) \
	RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)

extern bool BufferIsPermanent(Buffer buffer);
extern XLogRecPtr BufferGetLSNAtomic(Buffer buffer);

#ifdef NOT_USED
extern void PrintPinnedBufs(void);
#endif
extern Size BufferShmemSize(void);
extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
						 ForkNumber *forknum, BlockNumber *blknum);

extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);

extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);

extern void AbortBufferIO(void);

extern void BufmgrCommit(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);

extern void AtProcExit_LocalBuffers(void);

extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);

/* in freelist.c */
extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);


/* inline functions */

/*
 * Although this header file is nominally backend-only, certain frontend
 * programs like pg_waldump include it.  For compilers that emit static
 * inline functions even when they're unused, that leads to unsatisfied
 * external references; hence hide these with #ifndef FRONTEND.
 */

#ifndef FRONTEND

/*
 * Check whether the given snapshot is too old to have safely read the given
 * page from the given table.  If so, throw a "snapshot too old" error.
 *
 * This test generally needs to be performed after every BufferGetPage() call
 * that is executed as part of a scan.  It is not needed for calls made for
 * modifying the page (for example, to position to the right place to insert a
 * new index tuple or for vacuuming).  It may also be omitted where calls to
 * lower-level functions will have already performed the test.
 *
 * Note that a NULL snapshot argument is allowed and causes a fast return
 * without error; this is to support call sites which can be called from
 * either scans or index modification areas.
 *
 * For best performance, keep the tests that are fastest and/or most likely to
 * exclude a page from old snapshot testing near the front.
 */
static inline void
TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
{
	Assert(relation != NULL);

	if (old_snapshot_threshold >= 0
		&& (snapshot) != NULL
		&& ((snapshot)->snapshot_type == SNAPSHOT_MVCC
			|| (snapshot)->snapshot_type == SNAPSHOT_TOAST)
		&& !XLogRecPtrIsInvalid((snapshot)->lsn)
		&& PageGetLSN(page) > (snapshot)->lsn)
		TestForOldSnapshot_impl(snapshot, relation);
}
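/*
 * Illustrative sketch, not part of the original header: in a scan the check
 * is typically made right after the page is fetched, roughly:
 *
 *		page = BufferGetPage(buf);
 *		TestForOldSnapshot(snapshot, relation, page);
 *
 * where "buf", "snapshot", and "relation" are the caller's buffer, scan
 * snapshot, and relation; a NULL snapshot makes the call a no-op.
 */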
#endif /* FRONTEND */
#endif /* BUFMGR_H */