Before this commit, we excluded 1664_providers from being rewritten
during a backup, since this is one of the files that the user has to
copy to the destination before the backup starts. This was done so the
user can use different global providers and keys to encrypt the
backup's WAL than the source server uses. However, it raised several
issues when the server creates new providers or modifies existing ones
during the backup: those changes would be lost in the backup, data
related to such providers might become unreadable, and redo might
struggle to perform a rotation.
This commit treats 1664_providers like the rest of the files and makes
provider changes safe during the backup.
We still don't rewrite wal_keys, as we generate a unique WAL key for
the backup, and the server can't generate new WAL keys during the
backup.
Fixes PG-1895
This structure isn't really a key but metadata for a range of WAL
using a given key. Rework these structures to better represent this.
Also decouple the structure from the file format and use InternalKey for
the actual decrypted key data.
There is still plenty of cleanup to do in relation to this, but I
believe this is at least easier to understand than what we currently
have. I wish the ends of the ranges were handled better, but that is a
bit out of scope for this commit.
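A rough sketch of the direction, with illustrative names and fields
rather than the actual definitions:

    /* XLogRecPtr from access/xlogdefs.h; uint8 from c.h */

    /* Decrypted key material only; nothing about the file format. */
    typedef struct InternalKey
    {
        uint8       key[32];
    } InternalKey;

    /* Metadata for a range of WAL encrypted with a given key. */
    typedef struct WalKeyRangeMeta
    {
        XLogRecPtr  start_lsn;  /* first LSN covered by this key */
        InternalKey key;        /* decrypted key for the range */
    } WalKeyRangeMeta;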
It gets confusing when this structure is used for both encrypted and
decrypted key data. With this change it's obvious that any allocation of
InternalKey can potentially leak key material to swap files.
Also be more explicit about the empty alignment padding in the
TDEMapEntry struct, as multiple levels of anonymous structs to
maintain alignment would get a bit messy.
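Roughly, the idea is to spell out the padding instead of nesting
anonymous structs; the field layout and sizes here are made up for
illustration:

    /* RelFileLocator from storage/relfilelocator.h */
    typedef struct TDEMapEntry
    {
        RelFileLocator  rel;            /* relation this key belongs to */
        uint32          flags;
        uint8           _padding[4];    /* explicit alignment padding */
        InternalKey     key;
    } TDEMapEntry;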
Exposing and setting the environment variables everywhere makes it
harder to refactor and to understand what is going on. Instead we can
just write a couple of helper functions and use Perl scalar variables.
This only affects EXEC_BACKEND/Windows builds, which we currently do
not support, but we fix this anyway to make the code more consistent
and easier to understand, since we try to care about this in other
places.
In the future we may want to add CI and proper support for EXEC_BACKEND
builds.
The issue was originally found by Zsolt Parragi.
Archiving being enabled during the setup of the SMGR environment
caused one of the test suites for WAL archiving to fail, so we disable
it while running queries in single-user mode.
Previously we ignored the extra arguments to initdb when initializing
the pg_tde directory and just copied the directory from a database
initialized without the extra arguments, or, if available, from the
cache. Now make sure that when extra arguments are supplied we do not
use the cache and that we copy the pg_tde directory from a database
initialized with the extra arguments.
As far as I know this is only relevant to the --allow-group-access flag
but we may as well make the solution generic.
This is currently only added for meson, but could also be added for
make.
This has to be set up before the actual TAP tests are run, as they run
in parallel and as such would all try to set up the template at the
same time if we let them use the same folder for it without it being
pre-generated.
This seems to shorten the test suite run-time by ~25% on my laptop, so
it seems worth doing.
This enables table encryption by default in TAP tests when TDE_MODE=1.
Use TDE_MODE_SMGR=0 to turn off table encryption when running with
pg_tde loaded.
The setup for running regress with TDE turned on has been slightly
modified to match what is done for TAP tests, to let tests that run
the regress suite under TAP work.
This enables WAL encryption by default when the TAP tests are run with
TDE_MODE=1. Use TDE_MODE_WAL=0 to disable WAL encryption while still
having pg_tde enabled.
This makes sure pg_tde is loaded and keys are set up when running the
PostgreSQL TAP suite. No TDE features are enabled at this point.
Single-user mode is used to generate a template of pg_tde setup files,
which are then copied to each created cluster's data directory.
We used to assume that the only errors which could happen were ones
which set errno, but that is not the case. We also want to give nice
output on non-zero return values and when the process was killed by a
signal.
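A sketch of what interpreting the child status looks like with the
standard wait macros; the function name is made up:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    static void
    report_command_result(const char *cmd, int status)
    {
        if (status == -1)
            fprintf(stderr, "%s: failed to execute: %s\n",
                    cmd, strerror(errno));
        else if (WIFSIGNALED(status))
            fprintf(stderr, "%s: killed by signal %d\n",
                    cmd, WTERMSIG(status));
        else if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
            fprintf(stderr, "%s: exited with status %d\n",
                    cmd, WEXITSTATUS(status));
    }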
Before, pg_basebackup would encrypt streamed WAL according to the keys
in pg_tde/wal_keys in the destination dir.
This commit introduces a number of changes:
pg_basebackup encrypts WAL only if the "-E --encrypt-wal" flag is
provided. In that case, it extracts the principal key, truncates
pg_tde/wal_keys, and encrypts WAL with a newly generated WAL key. We
still expect pg_tde/wal_keys and pg_tde/1664_providers in the
destination dir. If these files are not provided but "-E" is
specified, it fails with an error.
We also emit a warning if pg_basebackup runs without -E but there is a
wal_keys file on the source, as the WAL might be compromised and the
backup broken.
For PG-1603, PG-1857
There was a race condition in the WAL archiving tests: if the
end-of-recovery checkpoint had completed, the tests for the WAL
contents were nonsensical and racy. Solve this by explicitly promoting
the server only after we have looked at the WAL contents, while still
making sure to wait until all WAL has been replayed.
Additionally, improve the tests by actually making sure the replica
starts in a good state where all WAL is encrypted, and by testing both
the plaintext and the encrypted scenarios.
Unfortunately the logic for generating a new key to protect the stream
cipher used to encrypt the WAL stream in our restore command was based
on totally incorrect assumptions due to how recovery is implemented.
Recovery is a state machine which can go back and forth between one
mode where it streams from a primary and another where it first tries
to fetch WAL from the archive and, if that fails, from the pg_wal
directory, and in the pg_wal directory we may have files which are
encrypted with whatever keys were there originally.
To handle all the possible scenarios we remove the ability of
pg_tde_restore_encrypt to generate new keys and just have it use
whatever keys there are in the key file. This unfortunately means we open
ourselves to some attacks on the stream cipher if the system is tricked
into encrypting a different WAL stream at the same TLI and LSN as we
already have encrypted. As far as I know this should be rare under
normal operations since normally e.g. the WAL should be the same in the
archive as the one in pg_wal or which we receive through streaming.
Ideally we would want to fix this but for now it is better to have WAL
encryption with this weakness than to not have it at all.
This also incidentally fixes a bug we discovered: generating a new key
invalidated only one key rather than all keys which should have become
invalid. Since we no longer generate a new key, the bug cannot occur.
It seems like there are cases where the postmaster has "restarted"
after a backend crash and the WAL cache inherited from the postmaster
is wrong.
I'm not at all sure exactly how and why this happens, but this patch
fixes a bug with this and allows recovery/013_crash_restart to pass with
WAL encryption enabled.
According to the documentation, each backend is supposed to hold
AddinShmemInitLock when calling ShmemInitStruct. We only did that for
half of our calls before this patch.
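For reference, the documented pattern looks roughly like this; the
struct and its contents are illustrative:

    #include "postgres.h"
    #include "storage/lwlock.h"
    #include "storage/shmem.h"

    typedef struct TdeSharedState
    {
        int     dummy;      /* the real shared state differs */
    } TdeSharedState;

    static TdeSharedState *shared_state;

    static void
    tde_shmem_init(void)
    {
        bool    found;

        LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
        shared_state = ShmemInitStruct("pg_tde shared state",
                                       sizeof(TdeSharedState), &found);
        if (!found)
            memset(shared_state, 0, sizeof(TdeSharedState));
        LWLockRelease(AddinShmemInitLock);
    }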
There is at least one corner case scenario where we have to load the
last record into the cache during a write:
* replica crashes, receives last segment from primary
* replica replays last segment, reaches end
* replica activates new key
* replica replays prepared transaction, has to use old keys again
* old key write function sees that we generated a new key, tries to load
it
In this scenario we could get away with detecting that we are in a
write and asserting if we tried to use the last key. But in a release
build assertions do not fire, and we would end up writing some
unencrypted data to disk, and later failing if we have to run
recovery.
It could be a FATAL, but that would still crash the server, and the next
startup would crash again and again...
Instead, to properly avoid this situation, we preallocate memory for
one more key in the cache during initialization. Since we can only add
one extra key to the cache during the server's run, this means we no
longer try to allocate in the critical section in any corner case.
While this is not the nicest solution, it is simple and keeps the
current cache and decrypt/encrypt logic the same as before. Any other
solution would be more complex, and even more of a hack, as it would
require dealing with a possibly out of date cache.
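A minimal sketch of the idea, with made-up names; the real cache is
more involved:

    /* XLogRecPtr from access/xlogdefs.h; palloc0 from utils/palloc.h */

    typedef struct WalKeyCacheEntry
    {
        XLogRecPtr  start_lsn;
        uint8       key[32];
    } WalKeyCacheEntry;

    typedef struct WalKeyCache
    {
        int                 nkeys;
        int                 capacity;   /* nkeys + 1 reserved slot */
        WalKeyCacheEntry   *entries;
    } WalKeyCache;

    static void
    wal_key_cache_init(WalKeyCache *cache, int nkeys)
    {
        /*
         * Reserve room for one extra key: at most one new key can be
         * generated during a server run, so adding it later never has
         * to allocate inside a critical section.
         */
        cache->nkeys = nkeys;
        cache->capacity = nkeys + 1;
        cache->entries = palloc0(cache->capacity * sizeof(WalKeyCacheEntry));
    }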
To not break recovery when we replay encrypted WAL while WAL
encryption is disabled, the simplest way is to treat disabled WAL
encryption just like enabled WAL encryption. The issue is not big in
practice, since it should only hit users who disable WAL encryption
and then crash the database, but treating both cases the same way
makes the code simpler to understand.
Previously we simply set the LSN for the new key to the first write
location.
This is however not correct, as there are many corner cases around this:
* recovery / replication might write old LSNs
* we can't handle multiple keys with the same TLI/LSN, which can happen
with quick restarts without writes
To support this, this commit modifies the following:
* We only activate new keys outside crash recovery, or immediately if
encryption is turned off
* We also take the already existing last key into account (if it
exists), and only activate a new key if we progressed past its start
location (see the sketch at the end of this message)
The remaining changes are just supporting infrastructure for this:
* Since we might rewrite old records, we use the already existing keys
for those writes, not the active last keys
* We prefetch existing keys during initialization, so loading them
doesn't accidentally happen in the critical section during a write
There is a remaining bug with stopping WAL encryption, also mentioned
in a TODO message in the code. This will be addressed in a later PR,
as this fix already took too long.
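A hedged sketch of the resulting activation rule; all names are
illustrative:

    /* XLogRecPtr from access/xlogdefs.h */
    static bool
    should_activate_new_key(bool has_last_key, XLogRecPtr last_key_start,
                            XLogRecPtr current_lsn,
                            bool in_crash_recovery,
                            bool encryption_enabled)
    {
        /* With encryption turned off we can activate immediately. */
        if (!encryption_enabled)
            return true;

        /* Never activate new keys while replaying old records. */
        if (in_crash_recovery)
            return false;

        /* Only activate once we are past the last key's start. */
        return !has_last_key || current_lsn > last_key_start;
    }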
The min/max comparisons of LSNs assumed that everything is on the same
timeline. In practice, with replication + recovery combinations, it is
possible that keys span at least 3 timelines, which means that the
timeline has to be included in both comparisons, as in other timelines
the restrictions are less strict.
Use a single argument for the wrapped command in the archiving
wrappers.
Instead of giving all of the arguments of the command separately and
trying to figure out which one should be replaced by the path to the
unencrypted WAL segment, we take a single argument and do % parameter
replacement similar to what postgres does with archive_command and
restore_command.
This also means that we can simplify by using system() instead of
exec().
We also clean up usage instructions and make the two wrappers more
symmetrical by requiring the same parameters.
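A simplified sketch of the substitution, showing only %p; buffer
sizing and error handling are elided, and the function name is made
up:

    #include <stdio.h>

    /* Replace %p in the command template with the WAL segment path. */
    static void
    build_command(char *dst, size_t dstlen,
                  const char *template, const char *wal_path)
    {
        size_t  used = 0;

        for (const char *src = template; *src && used + 1 < dstlen; src++)
        {
            if (src[0] == '%' && src[1] == 'p')
            {
                used += snprintf(dst + used, dstlen - used, "%s", wal_path);
                if (used >= dstlen)
                    used = dstlen - 1;
                src++;          /* skip the 'p' */
            }
            else
                dst[used++] = *src;
        }
        dst[used] = '\0';
    }

The resulting string is then passed straight to system().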
Co-authored-by: Andreas Karlsson <andreas.karlsson@percona.com>
Instead of first deleting any leftover key and then writing the new key
we do a single pass through the file where we replace any old key that
we find. To make this happen on redo too we need to stop generating a
separate WAL record for the key deletion for encrypted tables and only
generate that record for unencrypted tables where we still need a key
deletion record.
We expect this optimization to primarily be visible during WAL replay,
where only a single backend is used to replay everything, but it also
speeds up table creation in general on workloads with many tables.
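Schematically, the write now looks like this; a hedged sketch with
made-up names and error handling omitted:

    #include <unistd.h>
    /* RelFileLocator{,Equals} from storage/relfilelocator.h */

    /* Hypothetical on-disk entry; the real format differs. */
    typedef struct MapEntry
    {
        RelFileLocator  rel;
        uint8           encrypted_key[32];
    } MapEntry;

    /*
     * Scan the key map; overwrite a leftover entry for the same relation
     * if one exists, otherwise append at EOF.
     */
    static void
    tde_write_key_entry(int fd, const MapEntry *new_entry)
    {
        MapEntry    entry;
        off_t       pos = 0;

        while (pread(fd, &entry, sizeof(entry), pos) == sizeof(entry))
        {
            if (RelFileLocatorEquals(entry.rel, new_entry->rel))
                break;          /* found a leftover key to replace */
            pos += sizeof(entry);
        }

        /* pos is either at the old entry or at EOF; one write covers both. */
        pwrite(fd, new_entry, sizeof(*new_entry), pos);
    }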
We forgot to have a check against trying to delete leftover SMGR keys
for temporary tables, which is a useless operation since those keys
are stored in memory.
Additionally we forgot to prevent WAL from being written when creating
or removing a key in smgrcreate() for temporary tables.
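A hedged sketch of the two guards; the helper functions are
hypothetical, while RelFileLocatorBackendIsTemp() is the standard
test:

    /* reln is the SMgrRelation passed to smgrcreate() */
    bool    is_temp = RelFileLocatorBackendIsTemp(reln->smgr_rlocator);

    if (!is_temp)
        tde_delete_leftover_key(reln);  /* pointless for in-memory temp keys */

    /* Don't write WAL for key creation/removal on temporary tables. */
    tde_create_key(reln, /* write_wal = */ !is_temp);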
- Fix whitespace
- Make sure to use the right languages
- Do not wrap short SQL queries unnecessarily
- Add missing end of code block
- Add missing semicolon to SQL query
Instead of having the WAL key code include the headers for the SMGR keys
we move the shared code into a separate header file. Additionally we
clean up some minor header issues.
Just logging that the function was called at DEBUG2 is not very
helpful to anyone and is presumably just a leftover from someone's
attempt at debugging a particular issue they had at some point.
Breaking these particular snippets out as separate functions did not
improve readability and was only done because they used to be called
from multiple locations.
This change has already been done in the WAL key code.
When WAL is streamed during the backup (default mode), it comes in
unencrypted. But we need keys to encrypt it. For now, we expect that
the user would put `pg_tde` dir containing the `1664_key` and
`1664_providers` into the destination directory before starting the
backup. We encrypt the streamed WAL according to internal keys. No
`pg_tde` dir means no streamed WAL encryption.
Also rename enum variants for consistency and renumber the types for
the WAL keys, which is fine since this file is newly introduced, so
breaking backwards compatibility is not an issue.
Before this commit, WAL keys didn't take the TLI into account at all.
But after pg_rewind, for example, pg_wal/ may contain segments from
two timelines, and the WAL reader choosing the key may pick the wrong
one because LSNs of different TLIs may overlap. There was also another
bug: there is a key with the start LSN 0/30000 in TLI 1, and after the
start in TLI 2, the WAL writer creates a new key with the LSN 0/30000,
but in TLI 2. The reader wouldn't fetch the latest key because without
the TLI, these keys look the same.
This commit adds the TLI to the Internal keys and makes use of it
along with the LSN for key comparisons.
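A minimal sketch of the comparison: keys compare by (TLI, LSN)
lexicographically, so a key on a newer timeline sorts after one on an
older timeline even when the LSNs overlap. Names are illustrative:

    /* TimeLineID, XLogRecPtr from access/xlogdefs.h */
    static int
    wal_key_cmp(TimeLineID a_tli, XLogRecPtr a_lsn,
                TimeLineID b_tli, XLogRecPtr b_lsn)
    {
        if (a_tli != b_tli)
            return a_tli < b_tli ? -1 : 1;
        if (a_lsn != b_lsn)
            return a_lsn < b_lsn ? -1 : 1;
        return 0;
    }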