pg_createsubscriber: Improve handling of automated recovery configuration

When repurposing a standby to a logical replica, pg_createsubscriber
uses for the new replica a set of configuration parameters saved into
postgresql.auto.conf, to force recovery patterns when the physical
replica is promoted.

While not wrong in practice, this approach can cause issues when forcing
again recovery on a logical replica or its base backup as the recovery
parameters are not reset on the target server once pg_createsubscriber
is done with the node.

This commit aims at improving the situation, by changing the way
recovery parameters are saved on the target node.  Instead of writing
all the configuration to postgresql.auto.conf, this file now uses an
include_if_exists, that points to a pg_createsubscriber.conf.  This new
file contains all the recovery configuration, and is renamed to
pg_createsubscriber.conf.disabled when pg_createsubscriber exits.  This
approach resets the recovery parameters, and offers the benefit to keep
a trace of the setup used when the target node got promoted, for
debugging purposes.  If pg_createsubscriber.conf cannot be renamed
(unlikely scenario), a warning is issued to inform users that a manual
intervention may be required to reset this configuration.

This commit includes a test case to demonstrate the problematic case: a
standby node created from a base backup of what was the target node of
pg_createsubscriber does not get confused when started.  If removing
this new logic, the test fails with the standby not able to start due
to an incorrect recovery target setup, where the startup process fails
quickly with a FATAL.

I have provided the design idea for the patch, that Alyona has written
(with some code adjustments from me).  This could be considered as a
bug, but after discussion this is put into the bucket for improvements.
Redesigning pg_createsubscriber would not be acceptable in the stable
branches anyway.

Author: Alyona Vinter <dlaaren8@gmail.com>
Reviewed-by: Ilyasov Ian <ianilyasov@outlook.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrey Rudometov <unlimitedhikari@gmail.com>
Discussion: https://postgr.es/m/CAGWv16K6L6Pzm99i1KiXLjFWx2bUS3DVsR6yV87-YR9QO7xb3A@mail.gmail.com
master
Michael Paquier 2 weeks ago
parent 28c4b8a05b
commit 639352d904
  1. 12
      doc/src/sgml/ref/pg_createsubscriber.sgml
  2. 82
      src/bin/pg_basebackup/pg_createsubscriber.c
  3. 46
      src/bin/pg_basebackup/t/040_pg_createsubscriber.pl

@ -518,8 +518,11 @@ PostgreSQL documentation
<step>
<para>
Write recovery parameters into the target data directory and restart the
target server. It specifies an LSN (<xref
Write recovery parameters into the separate configuration file
<filename>pg_createsubscriber.conf</filename> that is included from
<filename>postgresql.auto.conf</filename> using
<literal>include_if_exists</literal> in the target data directory, then
restart the target server. It specifies an LSN (<xref
linkend="guc-recovery-target-lsn"/>) of the write-ahead log location up
to which recovery will proceed. It also specifies
<literal>promote</literal> as the action that the server should take
@ -532,7 +535,10 @@ PostgreSQL documentation
server ends standby mode and is accepting read-write transactions. If
<option>--recovery-timeout</option> option is set,
<application>pg_createsubscriber</application> terminates if recovery
does not end until the given number of seconds.
does not end until the given number of seconds. Upon completion, the
included configuration file is renamed to
<filename>pg_createsubscriber.conf.disabled</filename> so as it is no
longer loaded on subsequent restarts.
</para>
</step>

@ -20,6 +20,7 @@
#include "common/connect.h"
#include "common/controldata_utils.h"
#include "common/file_utils.h"
#include "common/logging.h"
#include "common/pg_prng.h"
#include "common/restricted_token.h"
@ -33,6 +34,21 @@
#define DEFAULT_SUB_PORT "50432"
#define OBJECTTYPE_PUBLICATIONS 0x0001
/*
* Configuration files for recovery parameters.
*
* The recovery parameters are set in INCLUDED_CONF_FILE, itself loaded by
* the server through an include_if_exists in postgresql.auto.conf.
*
* INCLUDED_CONF_FILE is renamed to INCLUDED_CONF_FILE_DISABLED when exiting,
* so as the recovery parameters set by this tool never take effect on node
* restart. The contents of INCLUDED_CONF_FILE_DISABLED can be useful for
* debugging.
*/
#define PG_AUTOCONF_FILENAME "postgresql.auto.conf"
#define INCLUDED_CONF_FILE "pg_createsubscriber.conf"
#define INCLUDED_CONF_FILE_DISABLED INCLUDED_CONF_FILE ".disabled"
/* Command-line options */
struct CreateSubscriberOptions
{
@ -156,22 +172,43 @@ static char *subscriber_dir = NULL;
static bool recovery_ended = false;
static bool standby_running = false;
static bool recovery_params_set = false;
/*
* Cleanup objects that were created by pg_createsubscriber if there is an
* error.
* Clean up objects created by pg_createsubscriber.
*
* Publications and replication slots are created on the primary. Depending
* on the step where it failed, already-created objects should be removed if
* possible (sometimes this won't work due to a connection issue).
* There is no cleanup on the target server *after* its promotion, because any
* failure at this point means recreating the physical replica and starting
* again.
*
* Publications and replication slots are created on primary. Depending on the
* step it failed, it should remove the already created objects if it is
* possible (sometimes it won't work due to a connection issue).
* There is no cleanup on the target server. The steps on the target server are
* executed *after* promotion, hence, at this point, a failure means recreate
* the physical replica and start again.
* The recovery configuration is always removed, by renaming the included
* configuration file out of the way.
*/
static void
cleanup_objects_atexit(void)
{
/* Rename the included configuration file, if necessary. */
if (recovery_params_set)
{
char conf_filename[MAXPGPATH];
char conf_filename_disabled[MAXPGPATH];
snprintf(conf_filename, MAXPGPATH, "%s/%s", subscriber_dir,
INCLUDED_CONF_FILE);
snprintf(conf_filename_disabled, MAXPGPATH, "%s/%s", subscriber_dir,
INCLUDED_CONF_FILE_DISABLED);
if (durable_rename(conf_filename, conf_filename_disabled) != 0)
{
/* durable_rename() has already logged something. */
pg_log_warning_hint("A manual removal of the recovery parameters may be required.");
}
}
if (success)
return;
@ -1334,11 +1371,36 @@ setup_recovery(const struct LogicalRepInfo *dbinfo, const char *datadir, const c
{
appendPQExpBuffer(recoveryconfcontents, "recovery_target_lsn = '%s'\n",
lsn);
WriteRecoveryConfig(conn, datadir, recoveryconfcontents);
}
disconnect_database(conn, false);
pg_log_debug("recovery parameters:\n%s", recoveryconfcontents->data);
if (!dry_run)
{
char conf_filename[MAXPGPATH];
FILE *fd;
/* Write the recovery parameters to INCLUDED_CONF_FILE */
snprintf(conf_filename, MAXPGPATH, "%s/%s", datadir,
INCLUDED_CONF_FILE);
fd = fopen(conf_filename, "w");
if (fd == NULL)
pg_fatal("could not open file \"%s\": %m", conf_filename);
if (fwrite(recoveryconfcontents->data, recoveryconfcontents->len, 1, fd) != 1)
pg_fatal("could not write to file \"%s\": %m", conf_filename);
fclose(fd);
recovery_params_set = true;
/* Include conditionally the recovery parameters. */
resetPQExpBuffer(recoveryconfcontents);
appendPQExpBufferStr(recoveryconfcontents,
"include_if_exists '" INCLUDED_CONF_FILE "'\n");
WriteRecoveryConfig(conn, datadir, recoveryconfcontents);
}
disconnect_database(conn, false);
}
/*

@ -160,6 +160,7 @@ primary_slot_name = '$slotname'
primary_conninfo = '$pconnstr dbname=postgres'
hot_standby_feedback = on
]);
my $sconnstr = $node_s->connstr;
$node_s->set_standby_mode();
$node_s->start;
@ -472,6 +473,11 @@ command_ok(
],
'run pg_createsubscriber on node S');
# Check that included file is renamed after success.
my $node_s_datadir = $node_s->data_dir;
ok( -f "$node_s_datadir/pg_createsubscriber.conf.disabled",
"pg_createsubscriber.conf.disabled exists in node S");
# Confirm the physical replication slot has been removed
$result = $node_p->safe_psql($db1,
"SELECT count(*) FROM pg_replication_slots WHERE slot_name = '$slotname'"
@ -579,10 +585,50 @@ is($result, qq($db1|{test_pub3}
$db2|{pub2}),
"subscriptions use the correct publications");
# Verify that node K, set as a standby, is able to start correctly without
# the recovery configuration written by pg_createsubscriber interfering.
# This node is created from node S, where pg_createsubscriber has been run.
# Create a physical standby from the promoted subscriber
$node_s->safe_psql('postgres',
"SELECT pg_create_physical_replication_slot('$slotname');");
# Create backup from promoted subscriber
$node_s->backup('backup_3');
# Initialize new physical standby
my $node_k = PostgreSQL::Test::Cluster->new('node_k');
$node_k->init_from_backup($node_s, 'backup_3', has_streaming => 1);
my $node_k_datadir = $node_k->data_dir;
ok( -f "$node_k_datadir/pg_createsubscriber.conf.disabled",
"pg_createsubscriber.conf.disabled exists in node K");
# Configure the new standby
$node_k->append_conf(
'postgresql.conf', qq[
primary_slot_name = '$slotname'
primary_conninfo = '$sconnstr dbname=postgres'
hot_standby_feedback = on
]);
$node_k->set_standby_mode();
my $node_k_name = $node_s->name;
command_ok(
[
'pg_ctl', '--wait',
'--pgdata' => $node_k->data_dir,
'--log' => $node_k->logfile,
'--options' => "--cluster-name=$node_k_name",
'start'
],
"node K has started");
# clean up
$node_p->teardown_node;
$node_s->teardown_node;
$node_t->teardown_node;
$node_f->teardown_node;
$node_k->teardown_node;
done_testing();

Loading…
Cancel
Save