postgres

Commit Graph

Author	SHA1	Message	Date
Heikki Linnakangas	9fee3232a1	Fix assertions with RI triggers in heap_update and heap_delete. If the tuple being updated is not visible to the crosscheck snapshot, we return TM_Updated but the assertions would not hold in that case. Move them to before the cross-check. Fixes bug #17893. Backpatch to all supported versions. Author: Alexander Lakhin Backpatch-through: 12 Discussion: https://www.postgresql.org/message-id/17893-35847009eec517b5%40postgresql.org	2 years ago
Tom Lane	f76b975d58	Don't release index root page pin in ginFindParents(). It's clearly stated in the comments that ginFindParents() must keep the pin on the index's root page that's associated with the topmost GinBtreeStack item. However, the code path for the case that the desired downlink has been pushed down to the next index level ignored this proviso, and would release the pin anyway if we were still examining the root level. That led to an assertion failure or "buffer NNNN is not owned by resource owner" error later, when we try to release the pin again at the end of the insertion. This is quite hard to reproduce, since it can only happen if an index root page split occurs concurrently with our own insertion. Thanks to Jeff Janes for finding a test case that triggers it often enough to allow investigation. This has been there since the beginning of GIN, so back-patch to all supported branches. Discussion: https://postgr.es/m/CAMkU=1yCAKtv86dMrD__Ja-7KzjE=uMeKX8y__cx5W-OEWy2ow@mail.gmail.com	2 years ago
David Rowley	ef7c365551	Ensure we use the correct spelling of "ensure" We seem to have accidentally used "insure" in a few places. Correct that. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pv0biqrhA3pMhu40aDsj343mTsD75khKnHsLqR8P04f=Q@mail.gmail.com Backpatch-through: 12, oldest supported version	2 years ago
Michael Paquier	4dccf95753	Enlarge assertion in bloom_init() for false_positive_rate false_positive_rate is a parameter that can be set with the bloom opclass in BRIN, and setting it to a value of exactly 0.25 would trigger an assertion in the first INSERT done on the index with value set. The assertion changed here relied on BLOOM_{MIN\|MAX}_FALSE_POSITIVE_RATE that are somewhat arbitrary values, and specifying an out-of-range value would also trigger a failure when defining such an index. So, as-is, the assertion was just doubling on the min-max check of the reloption. This is now enlarged to check that it is a correct percentage value, instead, based on a suggestion by Tom Lane. Author: Alexander Lakhin Reviewed-by: Tom Lane, Shihao Zhong Discussion: https://postgr.es/m/17969-a6c54de48026d694@postgresql.org Backpatch-through: 14	2 years ago
Bruce Momjian	350bd6bdc1	doc: 1-byte varlena headers can be used for user PLAIN storage This also updates some C comments. Reported-by: suchithjn22@gmail.com Discussion: https://postgr.es/m/167336599095.2667301.15497893107226841625@wrigleys.postgresql.org Author: Laurenz Albe (doc patch) Backpatch-through: 11	2 years ago
Noah Misch	1a368dd3ec	Diagnose !indisvalid in more SQL functions. pgstatindex failed with ERRCODE_DATA_CORRUPTED, of the "can't-happen" class XX. The other functions succeeded on an empty index; they might have malfunctioned if the failed index build left torn I/O or other complex state. Report an ERROR in statistics functions pgstatindex, pgstatginindex, pgstathashindex, and pgstattuple. Report DEBUG1 and skip all index I/O in maintenance functions brin_desummarize_range, brin_summarize_new_values, brin_summarize_range, and gin_clean_pending_list. Back-patch to v11 (all supported versions). Discussion: https://postgr.es/m/20231001195309.a3@google.com	2 years ago
Tomas Vondra	924e0e2ee0	Fix minmax-multi distance for extreme interval values When calculating distance for interval values, the code mostly mimicked interval_mi, i.e. it built a new interval value for the difference. That however does not work for sufficiently distant interval values, when the difference overflows the interval range. Instead, we can calculate the distance directly, without constructing the intermediate (and unnecessary) interval value. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Dean Rasheed Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com	2 years ago
Tomas Vondra	497fc92086	Fix minmax-multi on infinite date/timestamp values Make sure that infinite values in date/timestamp columns are treated as if in infinite distance. Infinite values should not be merged with other values, leaving them as outliers. The code however returned distance 0 in this case, so that infinite values were merged first. While this does not break the index (i.e. it still produces correct query results), it may make it much less efficient. We don't need explicit handling of infinite date/timestamp values when calculating distances, because those values are represented as extreme but regular values (e.g. INT64_MIN/MAX for the timestamp type). We don't need an exact distance, just a value that is much larger than distanced between regular values. With the added cast to double values, we can simply subtract the values. The regression test queries a value in the "gap" and checks the range was properly eliminated by the BRIN index. This only affects minmax-multi indexes on timestamp/date columns with infinite values, which is not very common in practice. The affected indexes may need to be rebuilt. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com	2 years ago
Tomas Vondra	e7965226d5	Fix calculation in brin_minmax_multi_distance_date When calculating the distance between date values, make sure to subtract them in the right order, i.e. (larger - smaller). The distance is used to determine which values to merge, and is expected to be a positive value. The code unfortunately did the subtraction in the opposite order, i.e. (smaller - larger), thus producing negative values and merging values the most distant values first. The resulting index is correct (i.e. produces correct results), but may be significantly less efficient. This affects all minmax-multi indexes on date columns. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com	2 years ago
Tomas Vondra	0635fe02b4	Fix overflow when calculating timestamp distance in BRIN When calculating distances for timestamp values for BRIN minmax-multi indexes, we need to be careful about overflows for extreme values. If the value overflows into a negative value, the index may be inefficient. The new regression test checks this for the timestamp type by adding a table with enough values to force range compaction/merging. The values are close to min/max, which means a risk of overflow. Fixed by converting the int64 values to double first, before calculating the distance. This prevents the overflow. We may lose some precision, of course, but that's good enough. In the worst case we build a slightly less efficient index, but for large distances this won't matter. This only affects minmax-multi indexes on timestamp columns, with ranges containing values sufficiently distant to cause an overflow. That seems like a fairly rare case in practice. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com	2 years ago
Nathan Bossart	d1c56ad37b	Move extra code out of the Pre/PostRestoreCommand() section. If SIGTERM is received within this section, the startup process will immediately proc_exit() in the signal handler, so it is inadvisable to include any more code than is required there (as such code is unlikely to be compatible with doing proc_exit() in a signal handler). This commit moves the code recently added to this section (see `1b06d7bac9` and `7fed801135`) to outside of the section. This ensures that the startup process only calls proc_exit() in its SIGTERM handler for the duration of the system() call, which is how this code worked from v8.4 to v14. Reported-by: Michael Paquier, Thomas Munro Analyzed-by: Andres Freund Suggested-by: Tom Lane Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13 Backpatch-through: 15	2 years ago
Thomas Munro	3090213690	Fix comment from commit `22655aa231`. Per automated complaint from BF animal koel this needed to be re-indented, but there was also a typo. Back-patch to 16.	2 years ago
Andres Freund	0002feb820	Fix bulk table extension when copying into multiple partitions When COPYing into a partitioned table that does now permit the use of table_multi_insert(), we could error out with ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes because BulkInsertState->next_free was not reset between partitions. This problem occurred only when not able to use table_multi_insert(), as a dedicated BulkInsertState for each partition is used in that case. The bug was introduced in `00d1e02be2`, but it was hard to hit at that point, as commonly bulk relation extension is not used when not using table_multi_insert(). It became more likely after `82a4edabd2`, which expanded the use of bulk extension. To fix the bug, reset the bulk relation extension state in BulkInsertState in ReleaseBulkInsertStatePin(). That was added (in `b1ecb9b3fc`) to tackle a very similar issue. Obviously the name is not quite correct, but there might be external callers, and bulk insert state needs to be reset in precisely in the situations that ReleaseBulkInsertStatePin() already needed to be called. Medium term the better fix likely is to disallow reusing BulkInsertState across relations. Add a test that, without the fix, reproduces #18130 in most configurations. The test also catches the problem fixed in `b1ecb9b3fc` when run with small shared_buffers. Reported-by: Ivan Kolombet <enderstd@gmail.com> Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us> Analyzed-by: Andres Freund <andres@anarazel.de> Bug: #18130 Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us Backpatch: 16-	2 years ago
Jeff Davis	b8963e8a28	Fix bug in GenericXLogFinish(). Mark the buffers dirty before writing WAL. Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi Reviewed-by: Heikki Linnakangas Backpatch-through: 11	2 years ago
Michael Paquier	a06efbc3ac	Fail hard on out-of-memory failures in xlogreader.c This commit changes the WAL reader routines so as a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in walreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c, relying on plain palloc() calls. The previous behavior could make WAL replay finish too early than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported. This relies on `bae868caf2`, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that `3f1ce97346` has been using a plain palloc() in this code for some time now. Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from `0ffe11abd3`, where the record length retrieved from its header was not entirely trustable. Reported-by: Ethan Mertz Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz Backpatch-through: 12	2 years ago
Peter Geoghegan	3fa81b62e0	Fix btmarkpos/btrestrpos array key wraparound bug. nbtree's mark/restore processing failed to correctly handle an edge case involving array key advancement and related search-type scan key state. Scans with ScalarArrayScalarArrayOpExpr quals requiring mark/restore processing (for a merge join) could incorrectly conclude that an affected array/scan key must not have advanced during the time between marking and restoring the scan's position. As a result of all this, array key handling within btrestrpos could skip a required call to _bt_preprocess_keys(). This confusion allowed later primitive index scans to overlook tuples matching the true current array keys. The scan's search-type scan keys would still have spurious values corresponding to the final array element(s) -- not values matching the first/now-current array element(s). To fix, remember that "array key wraparound" has taken place during the ongoing btrescan in a flag variable stored in the scan's state, and use that information at the point where btrestrpos decides if another call to _bt_preprocess_keys is required. Oversight in commit `70bc5833`, which taught nbtree to handle array keys during mark/restore processing, but missed this subtlety. That commit was itself a bug fix for an issue in commit `9e8da0f7`, which taught nbtree to handle ScalarArrayOpExpr quals natively. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com Backpatch: 11- (all supported branches).	2 years ago
Etsuro Fujita	8bdfc53be4	Fix typo in src/backend/access/transam/README.	2 years ago
Heikki Linnakangas	d7f521325f	Fix another bug in parent page splitting during GiST index build. Yet another bug in the ilk of commits `a7ee7c851` and `741b88435`. In `741b88435`, we took care to clear the memorized location of the downlink when we split the parent page, because splitting the parent page can move the downlink. But we missed that even updating a tuple on the parent can move it, because updating a tuple on a gist page is implemented as a delete+insert, so the updated tuple gets moved to the end of the page. This commit fixes the bug in two different ways (belt and suspenders): 1. Clear the downlink when we update a tuple on the parent page, even if it's not split. This the same approach as in commits `a7ee7c851` and `741b88435`. I also noticed that gistFindCorrectParent did not clear the 'downlinkoffnum' when it stepped to the right sibling. Fix that too, as it seems like a clear bug even though I haven't been able to find a test case to hit that. 2. Change gistFindCorrectParent so that it treats 'downlinkoffnum' merely as a hint. It now always first checks if the downlink is still at that location, and if not, it scans the page like before. That's more robust if there are still more cases where we fail to clear 'downlinkoffnum' that we haven't yet uncovered. With this, it's no longer necessary to meticulously clear 'downlinkoffnum', so this makes the previous fixes unnecessary, but I didn't revert them because it still seems nice to clear it when we know that the downlink has moved. Also add the test case using the same test data that Alexander posted. I tried to reduce it to a smaller test, and I also tried to reproduce this with different test data, but I was not able to, so let's just include what we have. Backpatch to v12, like the previous fixes. Reported-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/18129-caca016eaf0c3702@postgresql.org	2 years ago
Thomas Munro	10d0591ea2	Fix edge-case for xl_tot_len broken by `bae868ca`. `bae868ca` removed a check that was still needed. If you had an xl_tot_len at the end of a page that was too small for a record header, but not big enough to span onto the next page, we'd immediately perform the CRC check using a bogus large length. Because of arbitrary coding differences between the CRC implementations on different platforms, nothing very bad happened on common modern systems. On systems using the _sb8.c fallback we could segfault. Restore that check, add a new assertion and supply a test for that case. Back-patch to 12, like `bae868ca`. Tested-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com	2 years ago
Thomas Munro	ce497f648e	Don't trust unvalidated xl_tot_len. xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse. Since `70b4f82a4b`, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail. We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic. The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged. This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but it wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk. With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work. Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl. Back-patch to 12. We decided not to put this change into the final release of 11. Author: Thomas Munro <thomas.munro@gmail.com> Author: Michael Paquier <michael@paquier.xyz> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code) Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org	2 years ago
Tom Lane	227c7cf156	Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions. In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate the current transaction's properties to the new transaction if there was any open subtransaction (unreleased savepoint). Instead, some previous transaction's properties would be restored. This is because the "if (s->chain)" check in CommitTransactionCommand examined the wrong instance of the "chain" flag and falsely concluded that it didn't need to save transaction properties. Our regression tests would have noticed this, except they used identical transaction properties for multiple tests in a row, so that the faulty behavior was not distinguishable from correct behavior. Commit `12d768e70` fixed the problem in v15 and later, but only rather accidentally, because I removed the "if (s->chain)" test to avoid a compiler warning, while not realizing that the warning was flagging a real bug. In v14 and before, remove the if-test and save transaction properties unconditionally; just as in the newer branches, that's not expensive enough to justify thinking harder. Add the comment and extra regression test to v15 and later to forestall any future recurrence, but there's no live bug in those branches. Patch by me, per bug #18118 from Liu Xiang. Back-patch to v12 where the AND CHAIN feature was added. Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org	2 years ago
Heikki Linnakangas	64276c7e45	Fix GiST README's explanation of the NSN cross-check. The text got the condition backwards, it's "NSN > LSN", not "NSN < LSN". While we're at it, expand it a little for clarity. Reviewed-by: Daniel Gustafsson Discussion: https://www.postgresql.org/message-id/4cb46e18-e688-524a-0f73-b1f03ed5d6ee@iki.fi	2 years ago
Heikki Linnakangas	d5feb6011e	Report syncscan position at end of scan. The comment in heapgettup_advance_block() says that it reports the scan position before checking for end of scan, but that didn't match the code. The code was refactored in commit `7ae0ab0ad9`, which inadvertently changed the order of the check and reporting. Change it back. This caused a few regression test failures with a small shared_buffers setting like 10 MB. The 'portals' and 'cluster' tests perform seqscans that are large enough that sync seqscans kick in. When the sync scan position is not updated at end of scan, the next seq scan doesn't start at the beginning of the table, and the test queries are sensitive to that. Reviewed-by: Melanie Plageman, David Rowley Discussion: https://www.postgresql.org/message-id/6f991389-ae22-d844-a9d8-9aceb7c01a9a@iki.fi Backpatch-through: 16	2 years ago
Heikki Linnakangas	c3b874c431	Fix _bt_allequalimage() call within critical section. _bt_allequalimage() does complicated things, so it's not OK to call it in a critical section. Per buildfarm failure on 'prion', which uses -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi Backpatch-through: 16, like commit `ccadf73163` that introduced this	2 years ago
Heikki Linnakangas	6bc1fd4e60	Use the buffer cache when initializing an unlogged index. Some of the ambuildempty functions used smgrwrite() directly, followed by smgrimmedsync(). A few small problems with that: Firstly, one is supposed to use smgrextend() when extending a relation, not smgrwrite(). It doesn't make much difference in production builds. smgrextend() updates the relation size cache, so you miss that, but that's harmless because we never use the cached relation size of an init fork. But if you compile with CHECK_WRITE_VS_EXTEND, you get an assertion failure. Secondly, the smgrwrite() calls were performed before WAL-logging, so the page image written to disk had 0/0 as the LSN, not the LSN of the WAL record. That's also harmless in practice, but seems sloppy. Thirdly, it's better to use the buffer cache, because then you don't need to smgrimmedsync() the relation to disk, which adds latency. Bypassing the cache makes sense for bulk operations like index creation, but not when you're just initializing an empty index. Creation of unlogged tables is hardly performance bottleneck in any real world applications, but nevertheless. Backpatch to v16, but no further. These issues should be harmless in practice, so better to not rock the boat in older branches. Reviewed-by: Robert Haas Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi	2 years ago
Thomas Munro	f58af9f416	ExtendBufferedWhat -> BufferManagerRelation. Commit `31966b15` invented a way for functions dealing with relation extension to accept a Relation in online code and an SMgrRelation in recovery code. It seems highly likely that future bufmgr.c interfaces will face the same problem, and need to do something similar. Generalize the names so that each interface doesn't have to re-invent the wheel. Back-patch to 16. Since extension AM authors might start using the constructor macros once 16 ships, we agreed to do the rename in 16 rather than waiting for 17. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com	2 years ago
Andrew Dunstan	f938acd68b	Cache by-reference missing values in a long lived context Attribute missing values might be needed past the lifetime of the tuple descriptors from which they are extracted. To avoid possibly using pointers for by-reference values which might thus be left dangling, we cache a datumCopy'd version of the datum in the TopMemoryContext. Since we first search for the value this only needs to be done once per session for any such value. Original complaint from Tom Lane, idea for mitigation by Andrew Dunstan, tweaked by Tom Lane. Backpatch to version 11 where missing values were introduced. Discussion: https://postgr.es/m/1306569.1687978174@sss.pgh.pa.us	2 years ago
Andres Freund	d37ab378b6	hio: Take number of prior relation extensions into account The new relation extension logic, introduced in `00d1e02be2`, could lead to slowdowns in some scenarios. E.g., when loading narrow rows into a table using COPY, the caller of RelationGetBufferForTuple() will only request a small number of pages. Without concurrency, we just extended using pwritev() in that case. However, if there is some concurrency, we switched between extending by a small number of pages and a larger number of pages, depending on the number of waiters for the relation extension logic. However, some filesystems, XFS in particular, do not perform well when switching between extending files using fallocate() and pwritev(). To avoid that issue, remember the number of prior relation extensions in BulkInsertState and extend more aggressively if there were prior relation extensions. That not just avoids the aforementioned slowdown, but also leads to noticeable performance gains in other situations, primarily due to extending more aggressively when there is no concurrency. I should have done it this way from the get go. Reported-by: Masahiko Sawada <sawada.mshk@gmail.com> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com Backpatch: 16-, where the new relation extension code was added	2 years ago
Noah Misch	7823a298ba	Fix off-by-one in XLogRecordMaxSize check. pg_logical_emit_message(false, '_', repeat('x', 1069547465)) failed with self-contradictory message "WAL record would be 1069547520 bytes (of maximum 1069547520 bytes)". There's no particular benefit from allowing or denying one byte in either direction; XLogRecordMaxSize could rise a few megabytes without trouble. Hence, this is just for cleanliness. Back-patch to v16, where this check first appeared.	2 years ago
Michael Paquier	926aa6d11b	Fix indentation in twophase.c This has been missed in `cb0cca1`, noticed before buildfarm member koel has been able to complain while poking at a different patch. Like the other commit, backpatch all the way down to limit the odds of merge conflicts. Backpatch-through: 11	2 years ago
Michael Paquier	f88bc9f388	Fix recovery of 2PC transaction during crash recovery A crash in the middle of a checkpoint with some two-phase state data already flushed to disk by this checkpoint could cause a follow-up crash recovery to recover twice the same transaction, once from what has been found in pg_twophase/ at the beginning of recovery and a second time when replaying its corresponding record. This would lead to FATAL failures in the startup process during recovery, where the same transaction would have a state recovered twice instead of once: LOG: recovering prepared transaction 731 from shared memory LOG: recovering prepared transaction 731 from shared memory FATAL: lock ExclusiveLock on object 731/0/0 is already held This issue is fixed by skipping the addition of any 2PC state coming from a record whose equivalent 2PC state file has already been loaded in TwoPhaseState at the beginning of recovery by restoreTwoPhaseData(), which is OK as long as the system has not reached a consistent state. The timing to get a messed up recovery processing is very racy, and would very unlikely happen. The thread that has reported the issue has demonstrated the bug using injection points to force a PANIC in the middle of a checkpoint. Issue introduced in `728bd99`, so backpatch all the way down. Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com> Author: Michael Paquier Discussion: https://postgr.es/m/109e6994-b971-48cb-84f6-829646f18b4c.mengjuan.cmj@alibaba-inc.com Backpatch-through: 11	2 years ago
Peter Eisentraut	2631ebab7b	Message wording improvements	2 years ago
Tomas Vondra	1124cb2cf2	Document relaxed HOT for summarizing indexes Commit `19d8e2308b` allowed a weaker check for HOT with summarizing indexes, but it did not update README.HOT. So do that now. Patch by Matthias van de Meent, minor changes by me. Backpatch to 16, where the optimization was introduced. Author: Matthias van de Meent Reviewed-by: Tomas Vondra Backpatch-through: 16 Discussion: https://postgr.es/m/CAEze2WiEOm8V+c9kUeYp2BPhbEc5s473fUf51xNeqvSFGv44Ew@mail.gmail.com	2 years ago
Thomas Munro	12529028a4	Fix race in SSI interaction with gin fast path. The ginfast.c code previously checked for conflicts in before locking the relevant buffer, leaving a window where a RW conflict could be missed. Re-order. There was also a place where buffer ID and block number were confused while trying to predicate-lock a page, noted by visual inspection. Back-patch to all supported releases. Fixes one more problem discovered with the reproducer from bug #17949, in this case when Dmitry tried other index types. Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com> Reported-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org	2 years ago
Thomas Munro	d03d9a2614	Fix race in SSI interaction with bitmap heap scan. When performing a bitmap heap scan, we don't want to miss concurrent writes that occurred after we observed the heap's rs_nblocks, but before we took predicate locks on index pages. Therefore, we can't skip fetching any heap tuples that are referenced by the index, because we need to test them all with CheckForSerializableConflictOut(). The old optimization that would ignore any references to blocks >= rs_nblocks gets in the way of that requirement, because it means that concurrent writes in that window are ignored. Removing that optimization shouldn't affect correctness at any isolation level, because any new tuples shouldn't be visible to an MVCC snapshot. There also shouldn't be any error-causing references to heap blocks past the end, because we should have held at least an AccessShareLock on the table before the index scan. It can't get smaller while our transaction is running. For now, though, we'll keep the optimization at lower levels to avoid making unnecessary changes in a bug fix. Back-patch to all supported releases. In release 11, the code is in a different place but not fundamentally different. Fixes one aspect of bug #17949. Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org	2 years ago
Thomas Munro	0cb1fb2c9b	Fix race in SSI interaction with empty btrees. When predicate-locking btrees, we have a special case for completely empty btrees, since there is no page to lock. This was racy, because, without buffer lock held, a matching key could be inserted between the _bt_search() and the PredicateLockRelation() calls. Fix, by rechecking _bt_search() after taking the relation-level SIREAD lock, if using SERIALIZABLE isolation and an empty btree is discovered. Back-patch to all supported releases. Fixes one aspect of bug #17949. Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org	2 years ago
Thomas Munro	76d5966b3b	Silence "missing contrecord" error. Commit `dd38ff28ad` added a new error message "missing contrecord" when we fail to reassemble a record. Unfortunately that caused noisy messages to be logged by pg_waldump at end of segment, and by walsender when asked to shut down on a segment boundary. Remove the new error message, so that this condition signals end-of- WAL without a message. It's arguably a reportable condition that should not be silenced while performing crash recovery, but fixing that without introducing noise in the other cases will require more research. Back-patch to 15. Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/6a1df56e-4656-b3ce-4b7a-a9cb41df8189%40enterprisedb.com	2 years ago
Peter Geoghegan	5f0762f147	nbtree VACUUM: cope with topparent inconsistencies. Avoid "right sibling %u of block %u is not next child" errors when vacuuming a corrupt nbtree index. Just LOG the issue and press on. That way VACUUM will have a decent chance of finishing off all required processing for the index (and for the table as a whole). This is similar to recent work from commit `5abff197`, as well as work from commit `5b861baa` (later backpatched as commit `43e409ce`), which taught nbtree VACUUM to keep going when its "re-find" check fails. The hardening added by this commit takes place directly after the "re-find" check, right before the critical section for the first stage of page deletion. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=dayg0vjs4+er84TS9ami=csdzjpuiCGbEw=idhwqhzQ@mail.gmail.com Backpatch: 11- (all supported versions).	2 years ago
Michael Paquier	68cb5af46c	Enable archiving in recovery TAP test 009_twophase.pl This is a follow-up of `f663b00`, that has been committed to v13 and v14, tweaking the TAP test for two-phase transactions so as it provides coverage for the bug that has been fixed. This change is done in its own commit for clarity, as v15 and HEAD did not show the problematic behavior, still missed coverage for it. While on it, this adds a comment about the dependency of the last partial segment rename and RecoverPreparedTransactions() at the end of recovery, as that can be easy to miss. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at Backpatch-through: 13	2 years ago
Peter Geoghegan	d088ba5a5a	nbtree: Allocate new pages in separate function. Split nbtree's _bt_getbuf function is two: code that read locks or write locks existing pages remains in _bt_getbuf, while code that deals with allocating new pages is moved to a new, dedicated function called _bt_allocbuf. This simplifies most _bt_getbuf callers, since it is no longer necessary for them to pass a heaprel argument. Many of the changes to nbtree from commit `61b313e4` can be reverted. This minimizes the divergence between HEAD/PostgreSQL 16 and earlier release branches. _bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to _bt_getbuf. There are only 3 affected call sites, all of which continue to pass a heaprel for recovery conflict purposes. Note that nbtree's use of P_NEW was superficial; nbtree never actually relied on the P_NEW code paths in bufmgr.c, so this change is strictly mechanical. GiST already took the same approach; it has a dedicated function for allocating new pages called gistNewBuffer(). That factor allowed commit `61b313e4` to make much more targeted changes to GiST. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com	2 years ago
Jeff Davis	2fcc7ee7af	Revert "Fix search_path to a safe value during maintenance operations." This reverts commit `05e1737351`.	2 years ago
Jeff Davis	05e1737351	Fix search_path to a safe value during maintenance operations. While executing maintenance operations (ANALYZE, CLUSTER, REFRESH MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to 'pg_catalog, pg_temp' to prevent inconsistent behavior. Functions that are used for functional indexes, in index expressions, or in materialized views and depend on a different search path must be declared with CREATE FUNCTION ... SET search_path='...'. This change addresses a security risk introduced in commit `60684dd834`, where a role with MAINTAIN privileges on a table may be able to escalate privileges to the table owner. That commit is not yet part of any release, so no need to backpatch. Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com Reviewed-by: Greg Stark Reviewed-by: Nathan Bossart	2 years ago
Nathan Bossart	9aee26a491	Fix missing word in nbtree/README. Reported-by: Daniel Westermann Author: Gurjeet Singh Reviewed-by: Richard Guo Discussion: https://postgr.es/m/ZR0P278MB0427F0E0CE4ED140F52D1923D250A%40ZR0P278MB0427.CHEP278.PROD.OUTLOOK.COM	2 years ago
Heikki Linnakangas	95f0340c3b	Initialize 'recordXtime' to silence compiler warning. In reality, recordXtime will always be set by the getRecordTimestamp call, but the compiler doesn't necessarily see that. Back-patch to all supported versions. Author: Tristan Partin Discussion: https://www.postgresql.org/message-id/CT5MN8E11U0M.1NYNCHXYUHY41@gonk	2 years ago
Peter Geoghegan	5abff197cc	nbtree VACUUM: cope with right sibling link corruption. Avoid "right sibling's left-link doesn't match" errors when vacuuming a corrupt nbtree index. Just LOG the issue and press on. That way VACUUM will have a decent chance of finishing off all required processing for the index (and for the table as a whole). This error was seen in the field from time to time (it's more than a theoretical risk), so giving VACUUM the ability to press on like this has real value. Nothing short of a REINDEX is expected to fix the underlying index corruption, so giving up (by throwing an error) risks making a bad situation far worse. Anything that blocks forward progress by VACUUM like this might go unnoticed for a long time. This could eventually lead to a wraparound/xidStopLimit outage. Note that _bt_unlink_halfdead_page() has always been able to bail on page deletion when the target page's left sibling page was in an inconsistent state. It now does the same thing (returns false to back out of the second phase of deletion) when it notices sibling link corruption in the target page's right sibling page. This is similar to the work from commit `5b861baa` (later backpatched as commit `43e409ce`), which taught nbtree to press on with vacuuming an index when page deletion fails to "re-find" a downlink in the target page's parent page. The "re-find" check seems to make VACUUM bail on page deletion more often in practice, but there is no reason to take any chances here. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CAH2-Wzko2q2kP1+UvgJyP9g0mF4hopK0NtQZcxwvMv9_ytGhkQ@mail.gmail.com Backpatch: 11- (all supported versions).	2 years ago
Tom Lane	0245f8db36	Pre-beta mechanical code beautification. Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql	2 years ago
Peter Eisentraut	8e7912e73d	Message style improvements	2 years ago
Tomas Vondra	3581cbdcd6	Fix handling of empty ranges and NULLs in BRIN BRIN indexes did not properly distinguish between summaries for empty (no rows) and all-NULL ranges, treating them as essentially the same thing. Summaries were initialized with allnulls=true, and opclasses simply reset allnulls to false when processing the first non-NULL value. This however produces incorrect results if the range starts with a NULL value (or a sequence of NULL values), in which case we forget the range contains NULL values when adding the first non-NULL value. This happens because the allnulls flag is used for two separate purposes - to mark empty ranges (not representing any rows yet) and ranges containing only NULL values. Opclasses don't know which of these cases it is, and so don't know whether to set hasnulls=true. Setting the flag in both cases would make it correct, but it would also make BRIN indexes useless for queries with IS NULL clauses. All ranges start empty (and thus allnulls=true), so all ranges would end up with either allnulls=true or hasnulls=true. The severity of the issue is somewhat reduced by the fact that it only happens when adding values to an existing summary with allnulls=true. This can happen e.g. for small tables (because a summary for the first range exists for all BRIN indexes), or for tables with large fraction of NULL values in the indexed columns. Bulk summarization (e.g. during CREATE INDEX or automatic summarization) that processes all values at once is not affected by this issue. In this case the flags were updated in a slightly different way, not forgetting the NULL values. To identify empty ranges we use a new flag, stored in an unused bit in the BRIN tuple header so the on-disk format remains the same. A matching flag is added to BrinMemTuple, into a 3B gap after bt_placeholder. That means there's no risk of ABI breakage, although we don't actually pass the BrinMemTuple to any public API. We could also skip storing index tuples for empty summaries, but then we'd have to always process such ranges - even if there are no rows in large parts of the table (e.g. after a bulk DELETE), it would still require reading the pages etc. So we store them, but ignore them when building the bitmap. Backpatch to 11. The issue exists since BRIN indexes were introduced in 9.5, but older releases are already EOL. Backpatch-through: 11 Reviewed-by: Justin Pryzby, Matthias van de Meent, Alvaro Herrera Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80afff3b@enterprisedb.com	2 years ago
Tomas Vondra	3ec8a3bfb5	Fix handling of NULLs when merging BRIN summaries When merging BRIN summaries, union_tuples() did not correctly update the target hasnulls/allnulls flags. When merging all-NULL summary into a summary without any NULL values, the result had both flags set to false (instead of having hasnulls=true). This happened because the code only considered the hasnulls flags, ignoring the possibility the source summary has allnulls=true. Discovered while investigating issues with handling empty BRIN ranges and handling of NULL values, but it's a separate problem (has nothing to do with empty ranges). Fixed by considering both flags on the source summary, and updating the hasnulls flag on the target summary. Backpatch to 11. The bug exists since 9.5 (where BRIN indexes were introduced), but those releases are EOL already. Discussion: https://postgr.es/m/9d993d0d-e431-2196-9ccc-0554d0e60154%40enterprisedb.com	2 years ago
Michael Paquier	8961cb9a03	Fix typos in comments The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names. Author: Alexander Lakhin Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com	2 years ago

1 2 3 4 5 ...

4696 Commits (9fee3232a1d3661981c3cbf87227bfd02ee36c7b)