/*-------------------------------------------------------------------------
 *
 * ipci.c
 *	  POSTGRES inter-process communication initialization code.
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/storage/ipc/ipci.c
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include "access/clog.h"
#include "access/commit_ts.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/subtrans.h"
#include "access/twophase.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "replication/origin.h"
#include "storage/bufmgr.h"
#include "storage/dsm.h"
#include "storage/ipc.h"
#include "storage/pg_shmem.h"
#include "storage/pmsignal.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/procsignal.h"
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "utils/snapmgr.h"

/* GUCs */
int			shared_memory_type = DEFAULT_SHARED_MEMORY_TYPE;

shmem_startup_hook_type shmem_startup_hook = NULL;

static Size total_addin_request = 0;
static bool addin_request_allowed = true;


/*
 * RequestAddinShmemSpace
 *		Request that extra shmem space be allocated for use by
 *		a loadable module.
 *
 * This is only useful if called from the _PG_init hook of a library that
 * is loaded into the postmaster via shared_preload_libraries. Once
 * shared memory has been allocated, calls will be ignored. (We could
 * raise an error, but it seems better to make it a no-op, so that
 * libraries containing such calls can be reloaded if needed.)
 */
void
RequestAddinShmemSpace(Size size)
{
	if (IsUnderPostmaster || !addin_request_allowed)
		return;					/* too late */
	total_addin_request = add_size(total_addin_request, size);
}


/*
 * CreateSharedMemoryAndSemaphores
 *		Creates and initializes shared memory and semaphores.
 *
 * This is called by the postmaster or by a standalone backend.
 * It is also called by a backend forked from the postmaster in the
 * EXEC_BACKEND case. In the latter case, the shared memory segment
 * already exists and has been physically attached to, but we have to
 * initialize pointers in local memory that reference the shared structures,
 * because we didn't inherit the correct pointer values from the postmaster
 * as we do in the fork() scenario. The easiest way to do that is to run
 * through the same code as before. (Note that the called routines mostly
 * check IsUnderPostmaster, rather than EXEC_BACKEND, to detect this case.
 * This is a bit code-wasteful and could be cleaned up.)
 */
void
CreateSharedMemoryAndSemaphores(int port)
{
	PGShmemHeader *shim = NULL;

	if (!IsUnderPostmaster)
	{
		PGShmemHeader *seghdr;
		Size		size;
		int			numSemas;

		/* Compute number of semaphores we'll need */
		numSemas = ProcGlobalSemas();
		numSemas += SpinlockSemas();

		/*
		 * Size of the Postgres shared-memory block is estimated via
		 * moderately-accurate estimates for the big hogs, plus 100K for the
		 * stuff that's too small to bother with estimating.
		 *
		 * We take some care during this phase to ensure that the total size
		 * request doesn't overflow size_t. If this gets through, we don't
		 * need to be so careful during the actual allocation phase.
		 */
		size = 100000;
		size = add_size(size, PGSemaphoreShmemSize(numSemas));
		size = add_size(size, SpinlockSemaSize());
		size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
												 sizeof(ShmemIndexEnt)));
		size = add_size(size, BufferShmemSize());
		size = add_size(size, LockShmemSize());
		size = add_size(size, PredicateLockShmemSize());
		size = add_size(size, ProcGlobalShmemSize());
		size = add_size(size, XLOGShmemSize());
		size = add_size(size, CLOGShmemSize());
		size = add_size(size, CommitTsShmemSize());
		size = add_size(size, SUBTRANSShmemSize());
		size = add_size(size, TwoPhaseShmemSize());
		size = add_size(size, BackgroundWorkerShmemSize());
		size = add_size(size, MultiXactShmemSize());
		size = add_size(size, LWLockShmemSize());
		size = add_size(size, ProcArrayShmemSize());
		size = add_size(size, BackendStatusShmemSize());
		size = add_size(size, SInvalShmemSize());
		size = add_size(size, PMSignalShmemSize());
		size = add_size(size, ProcSignalShmemSize());
		size = add_size(size, CheckpointerShmemSize());
		size = add_size(size, AutoVacuumShmemSize());
		size = add_size(size, ReplicationSlotsShmemSize());
		size = add_size(size, ReplicationOriginShmemSize());
		size = add_size(size, WalSndShmemSize());
		size = add_size(size, WalRcvShmemSize());
		size = add_size(size, ApplyLauncherShmemSize());
		size = add_size(size, SnapMgrShmemSize());
		size = add_size(size, BTreeShmemSize());
		size = add_size(size, SyncScanShmemSize());
		size = add_size(size, AsyncShmemSize());
#ifdef EXEC_BACKEND
		size = add_size(size, ShmemBackendArraySize());
#endif

		/* freeze the addin request size and include it */
		addin_request_allowed = false;
		size = add_size(size, total_addin_request);

		/* might as well round it off to a multiple of a typical page size */
		size = add_size(size, 8192 - (size % 8192));

		elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);

		/*
		 * Create the shmem segment
		 */
		seghdr = PGSharedMemoryCreate(size, port, &shim);

		InitShmemAccess(seghdr);

		/*
		 * Create semaphores
		 */
		PGReserveSemaphores(numSemas, port);

		/*
		 * If spinlocks are disabled, initialize emulation layer (which
		 * depends on semaphores, so the order is important here).
		 */
#ifndef HAVE_SPINLOCKS
		SpinlockSemaInit();
#endif
	}
	else
	{
		/*
		 * We are reattaching to an existing shared memory segment. This
		 * should only be reached in the EXEC_BACKEND case.
		 */
#ifndef EXEC_BACKEND
		elog(PANIC, "should be attached to shared memory already");
#endif
	}

	/*
	 * Set up shared memory allocation mechanism
	 */
	if (!IsUnderPostmaster)
		InitShmemAllocation();

	/*
	 * Now initialize LWLocks, which do shared memory allocation and are
	 * needed for InitShmemIndex.
	 */
	CreateLWLocks();

	/*
	 * Set up shmem.c index hashtable
	 */
	InitShmemIndex();

	/*
	 * Set up xlog, clog, and buffers
	 */
	XLOGShmemInit();
	CLOGShmemInit();
|
	CommitTsShmemInit();
	SUBTRANSShmemInit();
	MultiXactShmemInit();
	InitBufferPool();

	/*
	 * Set up lock manager
	 */
	InitLocks();
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with other
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with the holder have completed. That means
that we need to remember an unbounded number of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
15 years ago
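The granularity promotion described in the commit message above (tuple locks collapsing into a page lock, and page locks into a relation lock) can be sketched roughly as follows. This is a minimal illustration only: the thresholds, the `RelLockState` struct, and `lock_tuple()` are hypothetical names invented for this sketch and do not reflect the actual data structures or promotion policy in storage/lmgr/predicate.c.

```c
#include <assert.h>

/* Hypothetical limits, chosen small so promotion is easy to observe. */
#define MAX_TUPLE_LOCKS_PER_PAGE 3
#define MAX_PAGE_LOCKS_PER_REL   2
#define MAX_PAGES                8

typedef enum { LOCK_TUPLE, LOCK_PAGE, LOCK_RELATION } LockLevel;

/* Per-relation bookkeeping for one transaction (illustrative only). */
typedef struct
{
	int tuple_locks[MAX_PAGES];   /* tuple-level locks held on each page */
	int page_locked[MAX_PAGES];   /* 1 if the page is covered by a page lock */
	int page_lock_count;          /* number of page-level locks held */
	int relation_locked;          /* 1 once promoted to a relation lock */
} RelLockState;

/*
 * Record a read of one tuple on the given page and return the lock
 * granularity that now covers that tuple.
 */
static LockLevel
lock_tuple(RelLockState *st, int page)
{
	int i;

	assert(page >= 0 && page < MAX_PAGES);

	if (st->relation_locked)
		return LOCK_RELATION;
	if (st->page_locked[page])
		return LOCK_PAGE;

	if (++st->tuple_locks[page] <= MAX_TUPLE_LOCKS_PER_PAGE)
		return LOCK_TUPLE;

	/* Too many tuple locks on this page: replace them with one page lock. */
	st->tuple_locks[page] = 0;
	st->page_locked[page] = 1;
	if (++st->page_lock_count <= MAX_PAGE_LOCKS_PER_REL)
		return LOCK_PAGE;

	/* Too many page locks: replace them with one relation lock. */
	for (i = 0; i < MAX_PAGES; i++)
	{
		st->tuple_locks[i] = 0;
		st->page_locked[i] = 0;
	}
	st->page_lock_count = 0;
	st->relation_locked = 1;
	return LOCK_RELATION;
}
```

Promotion trades precision for bounded memory: a coarser lock can only produce more false-positive conflicts, never miss one, which matches the conservative behavior the commit message describes.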

	/*
	 * Set up predicate lock manager
	 */
	InitPredicateLocks();

	/*
	 * Set up process table
	 */
	if (!IsUnderPostmaster)
		InitProcGlobal();
	CreateSharedProcArray();
	CreateSharedBackendStatus();
	TwoPhaseShmemInit();
Allow background workers to be started dynamically.
There is a new API, RegisterDynamicBackgroundWorker, which allows
an ordinary user backend to register a new background worker during
normal running. This means that it's no longer necessary for all
background workers to be registered during processing of
shared_preload_libraries, although the option of registering workers
at that time remains available.
When a background worker exits and will not be restarted, the
slot previously used by that background worker is automatically
released and becomes available for reuse. Slots used by background
workers that are configured for automatic restart can't (yet) be
released without shutting down the system.
This commit adds a new source file, bgworker.c, and moves some
of the existing control logic for background workers there.
Previously, there was little enough logic that it made sense to
keep everything in postmaster.c, but not any more.
This commit also makes the worker_spi contrib module into an
extension and adds a new function, worker_spi_launch, which can
be used to demonstrate the new facility.
13 years ago
	BackgroundWorkerShmemInit();

	/*
	 * Set up shared-inval messaging
	 */
	CreateSharedInvalidationState();

	/*
	 * Set up interprocess signaling mechanisms
	 */
Install a "dead man switch" to allow the postmaster to detect cases where
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory. We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead. Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft). If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.
The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway. This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.
This patch also improves the management of the EXEC_BACKEND
ShmemBackendArray array a bit, by reducing search costs.
Although this problem is of long standing, the lack of field complaints
seems to mean it's not critical enough to risk back-patching; at least
not till we get some more testing of this mechanism.
17 years ago
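The arm/disarm protocol described in the commit message above can be sketched as a per-child state flag that the parent inspects after the child exits. This is a simplified, single-process illustration: `child_slots`, `SlotState`, and all four functions are hypothetical names for this sketch, and a plain static array stands in for the shared-memory array the real mechanism would use.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHILDREN 4

typedef enum { SLOT_UNUSED, SLOT_ASSIGNED, SLOT_ACTIVE } SlotState;

/* Stand-in for an array living in shared memory (assumption of this sketch). */
static SlotState child_slots[MAX_CHILDREN];

/* Parent: reserve a slot for a child it is about to start. */
static void
assign_slot(int slot)
{
	assert(child_slots[slot] == SLOT_UNUSED);
	child_slots[slot] = SLOT_ASSIGNED;
}

/* Child: arm the switch when it first attaches to shared resources. */
static void
arm_switch(int slot)
{
	assert(child_slots[slot] == SLOT_ASSIGNED);
	child_slots[slot] = SLOT_ACTIVE;
}

/* Child: disarm the switch as its last touch of shared resources. */
static void
disarm_switch(int slot)
{
	assert(child_slots[slot] == SLOT_ACTIVE);
	child_slots[slot] = SLOT_ASSIGNED;
}

/*
 * Parent: called after the child has exited.  Frees the slot and returns
 * true if the switch was still armed, meaning the child exited without
 * cleaning up and the exit should be treated like a crash.
 */
static bool
release_slot_is_crash(int slot)
{
	bool still_armed = (child_slots[slot] == SLOT_ACTIVE);

	child_slots[slot] = SLOT_UNUSED;
	return still_armed;
}
```

A child that calls `exit()` directly never reaches `disarm_switch()`, so the parent sees the slot still active regardless of the exit status the operating system reports.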
	PMSignalShmemInit();
	ProcSignalShmemInit();
	CheckpointerShmemInit();
	AutoVacuumShmemInit();
	ReplicationSlotsShmemInit();
Introduce replication progress tracking infrastructure.
When implementing a replication solution on top of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consists of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in an efficient and
crash-safe manner.
3) The ability to filter out changes performed at the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows containing origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
11 years ago
	ReplicationOriginShmemInit();
	WalSndShmemInit();
	WalRcvShmemInit();
	ApplyLauncherShmemInit();

	/*
	 * Set up other modules that need some shared memory space
	 */
	SnapMgrInit();
	BTreeShmemInit();
	SyncScanShmemInit();
	AsyncShmemInit();

#ifdef EXEC_BACKEND

	/*
	 * Alloc the win32 shared backend array
	 */
	if (!IsUnderPostmaster)
		ShmemBackendArrayAllocation();
#endif

	/* Initialize dynamic shared memory facilities. */
	if (!IsUnderPostmaster)
		dsm_postmaster_startup(shim);

	/*
	 * Now give loadable modules a chance to set up their shmem allocations
	 */
	if (shmem_startup_hook)
		shmem_startup_hook();
}
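The final step of the function calls `shmem_startup_hook` if a loadable module has installed one. A common way for modules to cooperate on a single hook pointer is to save the previous value and call it from their own hook. The sketch below shows that chaining pattern in isolation: the hook variable is redeclared locally as a stand-in for the server's global, and `load_module_a`, `load_module_b`, `run_startup_hooks`, and `call_order` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the server's hook type and global (assumption). */
typedef void (*shmem_startup_hook_type) (void);
static shmem_startup_hook_type shmem_startup_hook = NULL;

/* Records the order in which module hooks ran, for demonstration. */
static char call_order[8];
static int	n_calls = 0;

/* --- hypothetical module A --- */
static shmem_startup_hook_type prev_hook_a = NULL;

static void
module_a_shmem_startup(void)
{
	if (prev_hook_a)
		prev_hook_a();				/* let earlier modules initialize first */
	call_order[n_calls++] = 'A';	/* a real module would set up its shared state here */
}

static void
load_module_a(void)
{
	prev_hook_a = shmem_startup_hook;	/* remember whoever was installed before */
	shmem_startup_hook = module_a_shmem_startup;
}

/* --- hypothetical module B, loaded after A --- */
static shmem_startup_hook_type prev_hook_b = NULL;

static void
module_b_shmem_startup(void)
{
	if (prev_hook_b)
		prev_hook_b();
	call_order[n_calls++] = 'B';
}

static void
load_module_b(void)
{
	prev_hook_b = shmem_startup_hook;
	shmem_startup_hook = module_b_shmem_startup;
}

/* Mirrors the last step of the function above: call the hook if set. */
static void
run_startup_hooks(void)
{
	if (shmem_startup_hook)
		shmem_startup_hook();
}
```

Because each module calls its predecessor before doing its own work, the hooks run in load order even though only the most recently installed one is reachable from the global pointer.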