postgres

Commit Graph

Author	SHA1	Message	Date
Tom Lane	8f29467c57	Change "long" numGroups fields to be Cardinality (i.e., double). We've been nibbling away at removing uses of "long" for a long time, since its width is platform-dependent. Here's one more: change the remaining "long" fields in Plan nodes to Cardinality, since the three surviving examples all represent group-count estimates. The upstream planner code was converted to Cardinality some time ago; for example the corresponding fields in Path nodes are type Cardinality, as are the arguments of the make_foo_path functions. Downstream in the executor, it turns out that these all feed to the table-size argument of BuildTupleHashTable. Change that to "double" as well, and fix it so that it safely clamps out-of-range values to the uint32 limit of simplehash.h, as was not being done before. Essentially, this is removing all the artificial datatype-dependent limitations on these values from upstream processing, and applying just one clamp at the moment where we're forced to do so by the datatype choices of simplehash.h. Also, remove BuildTupleHashTable's misguided attempt to enforce work_mem/hash_mem_limit. It doesn't have enough information (particularly not the expected tuple width) to do that accurately, and it has no real business second-guessing the caller's choice. For all these plan types, it's really the planner's responsibility to not choose a hashed implementation if the hashtable is expected to exceed hash_mem_limit. The previous patch improved the accuracy of those estimates, and even if BuildTupleHashTable had more information it should arrive at the same conclusions. Reported-by: Jeff Janes <jeff.janes@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com	2 months ago
Tom Lane	1ea5bdb00b	Improve planner's estimates of tuple hash table sizes. For several types of plan nodes that use TupleHashTables, the planner estimated the expected size of the table as basically numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)). This is pretty far off, especially for small data widths, because it doesn't account for the overhead of the simplehash.h hash table nor for any per-tuple "additional space" the plan node may request. Jeff Janes noted a case where the estimate was off by about a factor of three, even though the obvious hazards such as inaccurate estimates of numEntries or dataWidth didn't apply. To improve matters, create functions provided by the relevant executor modules that can estimate the required sizes with reasonable accuracy. (We're still not accounting for effects like allocator padding, but this at least gets the first-order effects correct.) I added functions that can estimate the tuple table sizes for nodeSetOp and nodeSubplan; these rely on an estimator for TupleHashTables in general, and that in turn relies on one for simplehash.h hash tables. That feels like kind of a lot of mechanism, but if we take any short-cuts we're violating modularity boundaries. The other places that use TupleHashTables are nodeAgg, which took pains to get its numbers right already, and nodeRecursiveunion. I did not try to improve the situation for nodeRecursiveunion because there's nothing to improve: we are not making an estimate of the hash table size, and it wouldn't help us to do so because we have no non-hashed alternative implementation. On top of that, our estimate of the number of entries to be hashed in that module is so suspect that we'd likely often choose the wrong implementation if we did have two ways to do it. 
Reported-by: Jeff Janes <jeff.janes@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com	2 months ago
Peter Eisentraut	8a27d418f8	Mark function arguments of type "Datum *" as "const Datum *" where possible Several functions in the codebase accept "Datum *" parameters but do not modify the pointed-to data. These have been updated to take "const Datum *" instead, improving type safety and making the interfaces clearer about their intent. This change helps the compiler catch accidental modifications and better documents immutability of arguments. Most "Datum *" parameters have a paired "bool *isnull" parameter; those are constified as well. No functional behavior is changed by this patch. Author: Chao Li <lic@highgo.com> Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2msfT0knvzUa72ZBwu9LR_RLY4on85w2a9YpE-o2By5HQ@mail.gmail.com	2 months ago
Tom Lane	c106ef0807	Use BumpContext contexts in TupleHashTables, and do some code cleanup. For all extant uses of TupleHashTables, execGrouping.c itself does nothing with the "tablecxt" except to allocate new hash entries in it, and the callers do nothing with it except to reset the whole context. So this is an ideal use-case for a BumpContext, and the hash tables are frequently big enough for the savings to be significant. (Commit `cc721c459` already taught nodeAgg.c this idea, but neglected the other callers of BuildTupleHashTable.) While at it, let's clean up some ill-advised leftovers from rebasing TupleHashTables on simplehash.h: * Many comments and variable names were based on the idea that the tablecxt holds the whole TupleHashTable, whereas now it only holds the hashed tuples (plus any caller-defined "additional storage"). Rename to names like tuplescxt and tuplesContext, and adjust the comments. Also adjust the memory context names to be like "<Foo> hashed tuples". * Make ResetTupleHashTable() reset the tuplescxt rather than relying on the caller to do so; that was fairly bizarre and seems like a recipe for leaks. This is less efficient in the case where nodeAgg.c uses the same tuplescxt for several different hashtables, but only microscopically so because mcxt.c will short-circuit the extra resets via its isReset flag. I judge the extra safety and intellectual cleanliness well worth those few cycles. * Remove the long-obsolete "allow_jit" check added by ac88807f9; instead, just Assert that metacxt and tuplescxt are different. We need that anyway for this definition of ResetTupleHashTable() to be safe. There is a side issue of the extent to which this change invalidates the planner's estimates of hashtable memory consumption. However, those estimates are already pretty bad, so improving them seems like it can be a separate project. This change is useful to do first to establish consistent executor behavior that the planner can expect. 
A loose end not addressed here is that the "entrysize" calculation in BuildTupleHashTable seems wrong: "sizeof(TupleHashEntryData) + additionalsize" corresponds neither to the size of the simplehash entries nor to the total space needed per tuple. It's questionable why BuildTupleHashTable is second-guessing its caller's nbuckets choice at all, since the original source of the number should have had more information. But that all seems wrapped up with the planner's estimation logic, so let's leave it for the planned followup patch. Reported-by: Jeff Janes <jeff.janes@gmail.com> Reported-by: David Rowley <dgrowleyml@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com Discussion: https://postgr.es/m/2268409.1761512111@sss.pgh.pa.us	2 months ago
Peter Eisentraut	e1ac846f3d	Mark ItemPointer arguments as const throughout This is a follow-up to `991295f`. I searched over src/ and made ItemPointer arguments const wherever possible. Note: We cut out from the original patch the pieces that would have created incompatibilities in the index or table AM APIs. Those could be considered separately. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw%40mail.gmail.com	2 months ago
Michael Paquier	f9a09aa295	Add wal_fpi_bytes to pg_stat_wal and pg_stat_get_backend_wal() This new counter, called "wal_fpi_bytes", tracks the total amount in bytes of full page images (FPIs) generated in WAL. This data becomes available globally via pg_stat_wal, and for backend statistics via pg_stat_get_backend_wal(). Previously, this information could only be retrieved with pg_waldump or pg_walinspect, which may not be available depending on the environment, and are expensive to execute. It offers hints about how much FPIs impact the WAL generated, which could be a large percentage for some workloads, as well as the effects of wal_compression or page holes. Bump catalog version. Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in PgStat_WalCounters. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com	2 months ago
Amit Kapila	f0b3573c3a	Introduce "REFRESH SEQUENCES" for subscriptions. This patch adds support for a new SQL command: ALTER SUBSCRIPTION ... REFRESH SEQUENCES This command updates the sequence entries present in the pg_subscription_rel catalog table with the INIT state to trigger resynchronization. In addition to the new command, the following subscription commands have been enhanced to automatically refresh sequence mappings: ALTER SUBSCRIPTION ... REFRESH PUBLICATION ALTER SUBSCRIPTION ... ADD PUBLICATION ALTER SUBSCRIPTION ... DROP PUBLICATION ALTER SUBSCRIPTION ... SET PUBLICATION These commands will perform the following actions: Add newly published sequences that are not yet part of the subscription. Remove sequences that are no longer included in the publication. This ensures that sequence replication remains aligned with the current state of the publication on the publisher side. Note that the actual synchronization of sequence data/values will be handled in a subsequent patch that introduces a dedicated sequence sync worker. Author: Vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com	2 months ago
Etsuro Fujita	12609fbacb	Fix EvalPlanQual handling of foreign/custom joins in ExecScanFetch. If inside an EPQ recheck, ExecScanFetch would run the recheck method function for foreign/custom joins even if they aren't descendant nodes in the EPQ recheck plan tree, which is problematic at least in the foreign-join case, because such a foreign join isn't guaranteed to have an alternative local-join plan required for running the recheck method function; in the postgres_fdw case this could lead to a segmentation fault or an assert failure in an assert-enabled build when running the recheck method function. Even if inside an EPQ recheck, any scan nodes that aren't descendant ones in the EPQ recheck plan tree should be normally processed by using the access method function; fix by modifying ExecScanFetch so that if inside an EPQ recheck, it runs the recheck method function for foreign/custom joins that are descendant nodes in the EPQ recheck plan tree as before and runs the access method function for foreign/custom joins that aren't. This fix also adds to postgres_fdw an isolation test for an EPQ recheck that caused issues stated above. Oversight in commit `385f337c9`. Reported-by: Kristian Lejao <kristianlejao@gmail.com> Author: Masahiko Sawada <sawada.mshk@gmail.com> Co-authored-by: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com> Discussion: https://postgr.es/m/CAD21AoBpo6Gx55FBOW+9s5X=nUw3Xpq64v35fpDEKsTERnc4TQ@mail.gmail.com Backpatch-through: 13	2 months ago
Álvaro Herrera	1a8b5b11e4	Don't include access/htup_details.h in executor/tuptable.h This is not at all needed; I suspect it was a simple mistake in commit `5408e233f0`. It causes htup_details.h to bleed into a huge number of places via execnodes.h. Remove it and fix fallout. Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql	3 months ago
Peter Eisentraut	d4d1fc527b	Update various forward declarations to use typedef There are a number of forward declarations that use struct but not the customary typedef, because that could have led to repeat typedefs, which was not allowed. This is now allowed in C11, so we can update these to provide the typedefs as well, so that the later uses of the types look more consistent. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org	3 months ago
Dean Rasheed	5386bfb9c1	Fix replica identity check for INSERT ON CONFLICT DO UPDATE. If an INSERT has an ON CONFLICT DO UPDATE clause, the executor must check that the target relation supports UPDATE as well as INSERT. In particular, it must check that the target relation has a REPLICA IDENTITY if it publishes updates. Formerly, it was not doing this check, which could lead to silently breaking replication. Fix by adding such a check to CheckValidResultRel(), which requires adding a new onConflictAction argument. In back-branches, preserve ABI compatibility by introducing a wrapper function with the original signature. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com Backpatch-through: 13	4 months ago
Dean Rasheed	22424953cd	Fix security checks in selectivity estimation functions. Commit `e2d4ef8de8` (the fix for CVE-2017-7484) added security checks to the selectivity estimation functions to prevent them from running user-supplied operators on data obtained from pg_statistic if the user lacks privileges to select from the underlying table. In cases involving inheritance/partitioning, those checks were originally performed against the child RTE (which for plain inheritance might actually refer to the parent table). Commit `553d2ec271` then extended that to also check the parent RTE, allowing access if the user had permissions on either the parent or the child. It turns out, however, that doing any checks using the child RTE is incorrect, since securityQuals is set to NULL when creating an RTE for an inheritance child (whether it refers to the parent table or the child table), and therefore such checks do not correctly account for any RLS policies or security barrier views. Therefore, do the security checks using only the parent RTE. This is consistent with how RLS policies are applied, and the executor's ACL checks, both of which use only the parent table's permissions/policies. Similar checks are performed in the extended stats code, so update that in the same way, centralizing all the checks in a new function. In addition, note that these checks by themselves are insufficient to ensure that the user has access to the table's data because, in a query that goes via a view, they only check that the view owner has permissions on the underlying table, not that the current user has permissions on the view itself. In the selectivity estimation functions, there is no easy way to navigate from underlying tables to views, so add permissions checks for all views mentioned in the query to the planner startup code. If the user lacks permissions on a view, a permissions error will now be reported at planner-startup, and the selectivity estimation functions will not be run. 
Checking view permissions at planner-startup in this way is a little ugly, since the same checks will be repeated at executor-startup. Longer-term, it might be better to move all the permissions checks from the executor to the planner so that permissions errors can be reported sooner, instead of creating a plan that won't ever be run. However, such a change seems too far-reaching to be back-patched. Back-patch to all supported versions. In v13, there is the added complication that UPDATEs and DELETEs on inherited target tables are planned using inheritance_planner(), which plans each inheritance child table separately, so that the selectivity estimation functions do not know that they are dealing with a child table accessed via its parent. Handle that by checking access permissions on the top parent table at planner-startup, in the same way as we do for views. Any securityQuals on the top parent table are moved down to the child tables by inheritance_planner(), so they continue to be checked by the selectivity estimation functions. Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com> Backpatch-through: 13 Security: CVE-2025-8713	5 months ago
Amit Kapila	fd5a1a0c3e	Detect and report update_deleted conflicts. This enhancement builds upon the infrastructure introduced in commit `228c370868`, which enables the preservation of deleted tuples and their origin information on the subscriber. This capability is crucial for handling concurrent transactions replicated from remote nodes. The update introduces support for detecting update_deleted conflicts during the application of update operations on the subscriber. When an update operation fails to locate the target row (typically because it has been concurrently deleted), we perform an additional table scan. This scan uses the SnapshotAny mechanism, and we do it only when the retain_dead_tuples option is enabled for the relevant subscription. The goal of this scan is to locate the most recently deleted tuple, matching the old column values from the remote update, that has not yet been removed by VACUUM and is still visible according to our slot (i.e., its deletion is not older than conflict-detection-slot's xmin). If such a tuple is found, the system reports an update_deleted conflict, including the origin and transaction details responsible for the deletion. This provides groundwork for a more robust and accurate conflict resolution process, preventing unexpected behavior by correctly identifying cases where a remote update clashes with a deletion from another origin. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com	5 months ago
Tom Lane	b27644bade	Sync typedefs.list with the buildfarm. Our maintenance of typedefs.list has been a little haphazard (and apparently we can't alphabetize worth a darn). Replace the file with the authoritative list from our buildfarm, and run pgindent using that. I also updated the additions/exclusions lists in pgindent where necessary to keep pgindent from messing things up significantly. Notably, now that regex_t and some related names are macros not real typedefs, we have to whitelist them explicitly. The exclusions list has also drifted noticeably, presumably due to changes of system headers on the buildfarm animals that contribute to the list. Unlike in prior years, I've not manually added typedef names that are missing from the buildfarm's list because they are not used to declare any variables or fields. So there are a few places where the typedef declaration itself is formatted worse than before, e.g. typedef enum IoMethod. I could preserve the names that were manually added to the list previously, but I'd really prefer to find a less manual way of dealing with these cases. A quick grep finds about 75 such symbols, most of which have never gotten any special treatment. Per discussion among pgsql-release, doing this now seems appropriate even though we're still a week or two away from making the v18 branch.	6 months ago
Amit Langote	1722d5eb05	Revert "Don't lock partitions pruned by initial pruning" As pointed out by Tom Lane, the patch introduced fragile and invasive design around plan invalidation handling when locking of prunable partitions was deferred from plancache.c to the executor. In particular, it violated assumptions about CachedPlan immutability and altered executor APIs in ways that are difficult to justify given the added complexity and overhead. This also removes the firstResultRels field added to PlannedStmt in commit `28317de72`, which was intended to support deferred locking of certain ModifyTable result relations. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/605328.1747710381@sss.pgh.pa.us	7 months ago
Tom Lane	0dca5d68d7	Change SQL-language functions to use the plan cache. In the historical implementation of SQL functions (if they don't get inlined), we built plans for all the contained queries at first call within an outer query, and then re-used those plans for the duration of the outer query, and then forgot everything. This was not ideal, not least because the plans could not be customized to specific values of the function's parameters. Our plancache infrastructure seems mature enough to be used here. That will solve both the problem with not being able to build custom plans and the problem with not being able to share work across successive outer queries. Aside from those performance concerns, this change fixes a longstanding bugaboo with SQL functions: you could not write DDL that would affect later statements in the same function. That's mostly still true with new-style SQL functions, since the results of parse analysis are baked into the stored query trees (and protected by dependency records). But for old-style SQL functions, it will now work much as it does with PL/pgSQL functions, because we delay parse analysis and planning of each query until we're ready to run it. Some edge cases that require replanning are now handled better too; see for example the new rowsecurity test, where we now detect an RLS context change that was previously missed. One other edge-case change that might be worthy of a release note is that we now insist that a SQL function's result be generated by the physically-last query within it. Previously, if the last original query was deleted by a DO INSTEAD NOTHING rule, we'd be willing to take the result from the preceding query instead. This behavior was undocumented except in source-code comments, and it seems hard to believe that anyone's relying on it. 
Along the way to this feature, we needed a few infrastructure changes: * The plancache can now take either a raw parse tree or an analyzed-but-not-rewritten Query as the starting point for a CachedPlanSource. If given a Query, it is the caller's responsibility that nothing will happen to invalidate that form of the query. We use this for new-style SQL functions, where what's in pg_proc is serialized Query(s) and we trust the dependency mechanism to disallow DDL that would break those. * The plancache now offers a way to invoke a post-rewrite callback to examine/modify the rewritten parse tree when it is rebuilding the parse trees after a cache invalidation. We need this because SQL functions sometimes adjust the parse tree to make its output exactly match the declared result type; if the plan gets rebuilt, that has to be re-done. * There is a new backend module utils/cache/funccache.c that abstracts the idea of caching data about a specific function usage (a particular function and set of input data types). The code in it is moved almost verbatim from PL/pgSQL, which has done that for a long time. We use that logic now for SQL-language functions too, and maybe other PLs will have use for it in the future. Author: Alexander Pyhalov <a.pyhalov@postgrespro.ru> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> Discussion: https://postgr.es/m/8216639.NyiUUSuA9g@aivenlaptop	9 months ago
Peter Eisentraut	cdc168ad4b	Add support for not-null constraints on virtual generated columns This was left out of the original patch for virtual generated columns (commit `83ea6c5402`). This just involves a bit of extra work in the executor to expand the generation expressions and run an "IS NOT NULL" test against them. There is also a bit of work to make sure that not-null constraints are checked during a table rewrite. Author: jian he <jian.universality@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Navneet Kumar <thanit3111@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CACJufxHArQysbDkWFmvK+D1TPHQWWTxWN15cMuUaTYX3xhQXgg@mail.gmail.com	9 months ago
Jeff Davis	626df47ad9	Remove 'additional' pointer from TupleHashEntryData. Reduces memory required for hash aggregation by avoiding an allocation and a pointer in the TupleHashEntryData structure. That structure is used for all buckets, whether occupied or not, so the savings is substantial. Discussion: https://postgr.es/m/AApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twSnxuTEytRYP7LFz5K9A@mail.gmail.com Reviewed-by: David Rowley <dgrowleyml@gmail.com>	9 months ago
Jeff Davis	a0942f441e	Add ExecCopySlotMinimalTupleExtra(). Allows an "extra" argument that allocates extra memory at the end of the MinimalTuple. This is important for callers that need to store additional data, but do not want to perform an additional allocation. Suggested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvppeqw2pNM-+ahBOJwq2QmC0hOAGsmCpC89QVmEoOvsdg@mail.gmail.com	9 months ago
Jeff Davis	4d143509cb	Create accessor functions for TupleHashEntry. Refactor for upcoming optimizations. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/1cc3b400a0e8eead18ff967436fa9e42c0c14cfb.camel@j-davis.com	9 months ago
Amit Langote	28317de723	Ensure first ModifyTable rel initialized if all are pruned Commit `cbc127917e` introduced tracking of unpruned relids to avoid processing pruned relations, and changed ExecInitModifyTable() to initialize only unpruned result relations. As a result, MERGE statements that prune all target partitions can now lead to crashes or incorrect behavior during execution. The crash occurs because some executor code paths rely on ModifyTableState.resultRelInfo[0] being present and initialized, even when no result relations remain after pruning. For example, ExecMerge() and ExecMergeNotMatched() use the first resultRelInfo to determine the appropriate action. Similarly, ExecInitPartitionInfo() assumes that at least one result relation exists. To preserve these assumptions, ExecInitModifyTable() now includes the first result relation in the initialized result relation list if all result relations for that ModifyTable were pruned. To enable that, ExecDoInitialPruning() ensures the first relation is locked if it was pruned and locking is necessary. To support this exception to the pruning logic, PlannedStmt now includes a list of RT indexes identifying the first result relation of each ModifyTable node in the plan. This allows ExecDoInitialPruning() to check whether each such relation was pruned and, if so, lock it if necessary. Bug: #18830 Reported-by: Robins Tharakan <tharakan@gmail.com> Diagnosed-by: Tender Wang <tndrwang@gmail.com> Diagnosed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/18830-1f31ea1dc930d444%40postgresql.org	9 months ago
Peter Geoghegan	0fbceae841	Show index search count in EXPLAIN ANALYZE, take 2. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan/index-only scan/bitmap index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index searches can be unpredictable due to implementation details that interact with physical index characteristics (at least with nbtree SAOP scans, since Postgres 17 commit `5bf748b8`). The information shown also provides useful context when EXPLAIN ANALYZE runs a plan with an index scan node that successfully applied the skip scan optimization (set to be added to nbtree by an upcoming patch). The instrumentation works by teaching all index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs already increment the pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). Parallel queries have workers copy their local counter struct into shared memory when an index scan node ends -- even when it isn't a parallel aware scan node. An earlier version of this patch that only worked with parallel aware scans became commit `5ead85fb` (though that was quickly reverted by commit `d00107cd` following "debug_parallel_query=regress" buildfarm failures). Our approach doesn't match the approach used when tracking other index scan related costs (e.g., "Rows Removed by Filter:"). It is comparable to the approach used in similar cases involving costs that are only readily accessible inside an access method, not from the executor proper (e.g., "Heap Blocks:" output for a Bitmap Heap Scan, which was recently enhanced to show per-worker costs by commit `5a1e6df3`, using essentially the same scheme as the one used here). 
It is necessary for index AMs to have direct responsibility for maintaining the new counter, since the counter might need to be incremented multiple times per amgettuple call (or per amgetbitmap call). But it is also necessary for the executor proper to manage the shared memory now used to transfer each worker's counter struct to the leader. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	10 months ago
Daniel Gustafsson	d35d32d711	Add special case fast-paths for strict functions Many STRICT function calls will have one or two arguments, in which case we can speed up checking for NULL input by avoiding setting up a loop over the arguments. This adds EEOP_FUNCEXPR_STRICT_1 and the corresponding EEOP_FUNCEXPR_STRICT_2 for functions with one and two arguments respectively. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de	10 months ago
Daniel Gustafsson	8dd7c7cd0a	Replace EEOP_DONE with special steps for return/no return Knowing when the side-effects of an expression are the intended result of the execution, rather than the return value, is important for being able to generate more efficient JITed code. This replaces EEOP_DONE with two new steps: EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN. Expressions which return a value should use the former step; expressions used for their side-effects which don't return a value should use the latter. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de	10 months ago
Amit Langote	525392d572	Don't lock partitions pruned by initial pruning Before executing a cached generic plan, AcquireExecutorLocks() in plancache.c locks all relations in a plan's range table to ensure the plan is safe for execution. However, this locks runtime-prunable relations that will later be pruned during "initial" runtime pruning, introducing unnecessary overhead. This commit defers locking for such relations to executor startup and ensures that if the CachedPlan is invalidated due to concurrent DDL during this window, replanning is triggered. Deferring these locks avoids unnecessary locking overhead for pruned partitions, resulting in significant speedup, particularly when many partitions are pruned during initial runtime pruning. * Changes to locking when executing generic plans: AcquireExecutorLocks() now locks only unprunable relations, that is, those found in PlannedStmt.unprunableRelids (introduced in commit `cbc127917e`), to avoid locking runtime-prunable partitions unnecessarily. The remaining locks are taken by ExecDoInitialPruning(), which acquires them only for partitions that survive pruning. This deferral does not affect the locks required for permission checking in InitPlan(), which takes place before initial pruning. ExecCheckPermissions() now includes an Assert to verify that all relations undergoing permission checks, none of which can be in the set of runtime-prunable relations, are properly locked. * Plan invalidation handling: Deferring locks introduces a window where prunable relations may be altered by concurrent DDL, invalidating the plan. A new function, ExecutorStartCachedPlan(), wraps ExecutorStart() to detect and handle invalidation caused by deferred locking. If invalidation occurs, ExecutorStartCachedPlan() updates CachedPlan using the new UpdateCachedPlan() function and retries execution with the updated plan. 
To ensure all code paths that may be affected by this handle invalidation properly, all callers of ExecutorStart that may execute a PlannedStmt from a CachedPlan have been updated to use ExecutorStartCachedPlan() instead. UpdateCachedPlan() replaces stale plans in CachedPlan.stmt_list. A new CachedPlan.stmt_context, created as a child of CachedPlan.context, allows freeing old PlannedStmts while preserving the CachedPlan structure and its statement list. This ensures that loops over statements in upstream callers of ExecutorStartCachedPlan() remain intact. ExecutorStart() and ExecutorStart_hook implementations now return a boolean value indicating whether plan initialization succeeded with a valid PlanState tree in QueryDesc.planstate, or false otherwise, in which case QueryDesc.planstate is NULL. Hook implementations are required to call standard_ExecutorStart() at the beginning, and if it returns false, they should do the same without proceeding. * Testing: To verify these changes, the delay_execution module tests scenarios where cached plans become invalid due to changes in prunable relations after deferred locks. * Note to extension authors: ExecutorStart_hook implementations must verify plan validity after calling standard_ExecutorStart(), as explained earlier. For example: if (prev_ExecutorStart) plan_valid = prev_ExecutorStart(queryDesc, eflags); else plan_valid = standard_ExecutorStart(queryDesc, eflags); if (!plan_valid) return false; <extension-code> return true; Extensions accessing child relations, especially prunable partitions, via ExecGetRangeTableRelation() must now ensure their RT indexes are present in es_unpruned_relids (introduced in commit `cbc127917e`), or they will encounter an error. This is a strict requirement after this change, as only relations in that set are locked. The idea of deferring some locks to executor startup, allowing locks for prunable partitions to be skipped, was first proposed by Tom Lane. 
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Reviewed-by: David Rowley <dgrowleyml@gmail.com> (earlier versions) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier versions) Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	10 months ago
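The hook pattern quoted in the note to extension authors can be laid out as compilable C. This is only a sketch: `QueryDesc`, `standard_ExecutorStart()` and the hook typedef are mocked here with minimal stand-ins (the real definitions live in PostgreSQL's headers), and `simulate_invalid` and `extension_setup_calls` are hypothetical test knobs. It follows the API as this commit message describes it; other PostgreSQL versions may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-in for the real QueryDesc. */
typedef struct QueryDesc
{
    void *planstate;        /* NULL when plan initialization failed */
    bool  simulate_invalid; /* hypothetical knob, not a real field */
} QueryDesc;

typedef bool (*ExecutorStart_hook_type) (QueryDesc *queryDesc, int eflags);

static int dummy_planstate;
ExecutorStart_hook_type prev_ExecutorStart = NULL;
int extension_setup_calls = 0;

/* Mock: reports whether the plan was still valid after initialization. */
bool
standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
    (void) eflags;
    if (queryDesc->simulate_invalid)
    {
        queryDesc->planstate = NULL;
        return false;
    }
    queryDesc->planstate = &dummy_planstate;
    return true;
}

/* Extension hook following the pattern from the commit message. */
bool
my_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
    bool plan_valid;

    if (prev_ExecutorStart)
        plan_valid = prev_ExecutorStart(queryDesc, eflags);
    else
        plan_valid = standard_ExecutorStart(queryDesc, eflags);

    /* Plan was invalidated: return without touching planstate. */
    if (!plan_valid)
        return false;

    /* Extension-specific setup would go here. */
    extension_setup_calls++;
    return true;
}
```

The point of the early return is that `QueryDesc.planstate` is NULL when the plan is invalid, so any extension code that walks the PlanState tree must not run in that case.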
Michael Paquier	eaf502747b	Move wal_buffers_full from PgStat_PendingWalStats to WalUsage wal_buffers_full has been introduced in pg_stat_wal in `8d9a935965`, as some information providing metrics for the tuning of the GUC wal_buffers. WalUsage has been introduced before that in `df3b181499`. Moving this field is proving to be beneficial for several reasons: - This information can now be made available in more layers, providing more granularity than just pg_stat_wal, on a per-query basis: EXPLAIN, pgss and VACUUM/ANALYZE logs. - A patch is under discussion to provide statistics for WAL at backend level, and this move simplifies a bit the handling of pending statistics. The remaining data in PgStat_PendingWalStats now relates to write/sync counters and times, with equivalents present in pg_stat_io, that backend statistics are able to already track. So this should cut all the dependencies between PgStat_PendingWalStats and WAL stats at backend level. As of this change, wal_buffers_full only shows in pg_stat_wal. Author: Bertrand Drouvot Reviewed-by: Ilia Evdokimov Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal	10 months ago
Tom Lane	6c7251db0c	Implement new optimization rule for updates of expanded variables. If a read/write expanded variable is declared locally to the assignment statement that is updating it, and it is referenced exactly once in the assignment RHS, then we can optimize the operation as a direct update of the expanded value, whether or not the function(s) operating on it can be trusted not to modify the value before throwing an error. This works because if an error does get thrown, we no longer care what value the variable has. In cases where that doesn't work, fall back to the previous rule that checks for safety of the top-level function. In any case, postpone determination of whether these optimizations are feasible until we are executing a Param referencing the target variable and that variable holds a R/W expanded object. While the previous incarnation of exec_check_rw_parameter was pretty cheap, this is a bit less so, and our plan to invoke support functions will make it even less so. So avoiding the check for variables where it couldn't be useful should be a win. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://postgr.es/m/CACxu=vJaKFNsYxooSnW1wEgsAO5u_v1XYBacfVJ14wgJV_PYeg@mail.gmail.com	11 months ago
Peter Eisentraut	83ea6c5402	Virtual generated columns This adds a new variant of generated columns that are computed on read (like a view, unlike the existing stored generated columns, which are computed on write, like a materialized view). The syntax for the column definition is ... GENERATED ALWAYS AS (...) VIRTUAL and VIRTUAL is also optional. VIRTUAL is the default rather than STORED to match various other SQL products. (The SQL standard makes no specification about this, but it also doesn't know about VIRTUAL or STORED.) (Also, virtual views are the default, rather than materialized views.) Virtual generated columns are stored in tuples as null values. (A very early version of this patch had the ambition to not store them at all. But so much stuff breaks or gets confused if you have tuples where a column in the middle is completely missing. This is a compromise, and it still saves space over being forced to use stored generated columns. If we ever find a way to improve this, a bit of pg_upgrade cleverness could allow for upgrades to a newer scheme.) The capabilities and restrictions of virtual generated columns are mostly the same as for stored generated columns. In some cases, this patch keeps virtual generated columns more restricted than they might technically need to be, to keep the two kinds consistent. Some of that could maybe be relaxed later after separate careful considerations. 
Some functionality that is currently not supported, but could possibly be added as incremental features, some easier than others: - index on or using a virtual column - hence also no unique constraints on virtual columns - extended statistics on virtual columns - foreign-key constraints on virtual columns - not-null constraints on virtual columns (check constraints are supported) - ALTER TABLE / DROP EXPRESSION - virtual column cannot have domain type - virtual columns are not supported in logical replication The tests in generated_virtual.sql have been copied over from generated_stored.sql with the keyword replaced. This way we can make sure the behavior is mostly aligned, and the differences can be visible. Some tests for currently not supported features are currently commented out. Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b0a6@eisentraut.org	11 months ago
Amit Langote	cbc127917e	Track unpruned relids to avoid processing pruned relations This commit introduces changes to track unpruned relations explicitly, making it possible for top-level plan nodes, such as ModifyTable and LockRows, to avoid processing partitions pruned during initial pruning. Scan-level nodes, such as Append and MergeAppend, already avoid the unnecessary processing by accessing partition pruning results directly via part_prune_index. In contrast, top-level nodes cannot access pruning results directly and need to determine which partitions remain unpruned. To address this, this commit introduces a new bitmapset field, es_unpruned_relids, which the executor uses to track the set of unpruned relations. This field is referenced during plan initialization to skip initializing certain nodes for pruned partitions. It is initialized with PlannedStmt.unprunableRelids, a new field that the planner populates with RT indexes of relations that cannot be pruned during runtime pruning. These include relations not subject to partition pruning and those required for execution regardless of pruning. PlannedStmt.unprunableRelids is computed during set_plan_refs() by removing the RT indexes of runtime-prunable relations, identified from PartitionPruneInfos, from the full set of relation RT indexes. ExecDoInitialPruning() then updates es_unpruned_relids by adding partitions that survive initial pruning. To support this, PartitionedRelPruneInfo and PartitionedRelPruningData now include a leafpart_rti_map[] array that maps partition indexes to their corresponding RT indexes. The former is used in set_plan_refs() when constructing unprunableRelids, while the latter is used in ExecDoInitialPruning() to convert partition indexes returned by get_matching_partitions() into RT indexes, which are then added to es_unpruned_relids. These changes make it possible for ModifyTable and LockRows nodes to process only relations that remain unpruned after initial pruning. 
ExecInitModifyTable() trims lists, such as resultRelations, withCheckOptionLists, returningLists, and updateColnosLists, to consider only unpruned partitions. It also creates ResultRelInfo structs only for these partitions. Similarly, child RowMarks for pruned relations are skipped. By avoiding unnecessary initialization of structures for pruned partitions, these changes improve the performance of updates and deletes on partitioned tables during initial runtime pruning. Due to ExecInitModifyTable() changes as described above, EXPLAIN on a plan for UPDATE and DELETE that uses runtime initial pruning no longer lists partitions pruned during initial pruning. Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	11 months ago
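The set arithmetic described in this commit can be illustrated with a small C sketch. A 64-bit mask stands in for a Bitmapset, and the function names are hypothetical; only `leafpart_rti_map`, `unprunableRelids` and `es_unpruned_relids` are names taken from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a Bitmapset: bit i set means RT index i is a member. */
typedef uint64_t RelidSet;

RelidSet
relid_bit(int rti)
{
    return (RelidSet) 1 << rti;
}

/* Planner side: unprunableRelids = all relation RT indexes minus the
 * runtime-prunable ones. */
RelidSet
compute_unprunable(RelidSet all_relids, RelidSet prunable)
{
    return all_relids & ~prunable;
}

/* Executor side: start from the unprunable set and add the partitions that
 * survived initial pruning, translating partition indexes to RT indexes
 * through a leafpart_rti_map-style array. */
RelidSet
add_surviving(RelidSet unpruned, const int *leafpart_rti_map,
              const int *surviving_parts, size_t nsurviving)
{
    for (size_t i = 0; i < nsurviving; i++)
        unpruned |= relid_bit(leafpart_rti_map[surviving_parts[i]]);
    return unpruned;
}

/* What a top-level node would ask before initializing per-relation state. */
int
is_unpruned(RelidSet unpruned, int rti)
{
    return (unpruned & relid_bit(rti)) != 0;
}
```

With RT indexes 1 to 5, partitions at 2, 3 and 4, and only the second partition surviving, the resulting set contains 1, 3 and 5, so state for RT indexes 2 and 4 would never be built.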
Peter Eisentraut	622f678c10	Integrate GistTranslateCompareType() into IndexAmTranslateCompareType() This turns GistTranslateCompareType() into a callback function of the gist index AM instead of a standalone function. The existing callers are changed to use IndexAmTranslateCompareType(). This then makes that code not hardcoded toward gist. This means in particular that the temporal keys code is now independent of gist. Also, this generalizes commit `74edabce7a`, so index access methods other than the previously hardcoded ones could now work as REPLICA IDENTITY in a logical replication subscriber. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	11 months ago
Tom Lane	041e8b95b8	Get rid of our dependency on type "long" for memory size calculations. Consistently use "Size" (or size_t, or in some places int64 or double) as the type for variables holding memory allocation sizes. In most places variables' data types were fine already, but we had an ancient habit of computing bytes from kilobytes-units GUCs with code like "work_mem * 1024L". That risks overflow on Win64 where they did not make "long" as wide as "size_t". We worked around that by restricting such GUCs' ranges, so you couldn't set work_mem et al higher than 2GB on Win64. This patch removes that restriction, after replacing such calculations with "work_mem * (Size) 1024" or variants of that. It should be noted that this patch was constructed by searching outwards from the GUCs that have MAX_KILOBYTES as upper limit. So I can't positively guarantee there are no other places doing memory-size arithmetic in int or long variables. I do however feel pretty confident that increasing MAX_KILOBYTES on Win64 is safe now. Also, nothing in our code should be dealing in multiple-gigabyte allocations without authorization from a relevant GUC, so it seems pretty likely that this search caught everything that could be at risk of overflow. Author: Vladlen Popolitov <v.popolitov@postgrespro.ru> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217	11 months ago
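The overflow hazard behind "work_mem * 1024L" can be shown with a short C sketch. The function names are hypothetical. `uint32_t` models the value range of a 32-bit "long" such as Win64's, using unsigned arithmetic so that the wraparound is well defined (with a signed long the overflow would be undefined behavior); `uint64_t` models a 64-bit `Size`.

```c
#include <stdint.h>

/* Kilobytes to bytes, computed modulo 2^32 as a 32-bit type would hold it. */
uint32_t
kb_to_bytes_32bit(int work_mem_kb)
{
    return (uint32_t) work_mem_kb * (uint32_t) 1024;
}

/* Kilobytes to bytes in the style of "work_mem * (Size) 1024", assuming a
 * 64-bit Size. */
uint64_t
kb_to_bytes_64bit(int work_mem_kb)
{
    return (uint64_t) work_mem_kb * (uint64_t) 1024;
}
```

For a 4 GB setting (4194304 kB) the 32-bit computation wraps to 0, while the 64-bit one yields 4294967296 bytes, which is why the old code needed the GUC range restriction on Win64.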
Amit Langote	d47cbf474e	Perform runtime initial pruning outside ExecInitNode() This commit builds on the prior change that moved PartitionPruneInfos out of individual plan nodes into a list in PlannedStmt, making it possible to initialize PartitionPruneStates without traversing the plan tree and perform runtime initial pruning before ExecInitNode() initializes the plan trees. These tasks are now handled in a new routine, ExecDoInitialPruning(), which is called by InitPlan() before calling ExecInitNode() on various plan trees. ExecDoInitialPruning() performs the initial pruning and saves the result -- a Bitmapset of indexes for surviving child subnodes -- in es_part_prune_results, a list in EState. PartitionPruneStates created for initial pruning are stored in es_part_prune_states, another list in EState, for later use during exec pruning. Both lists are parallel to es_part_prune_infos, which holds the PartitionPruneInfos from PlannedStmt, enabling shared indexing. PartitionPruneStates initialized in ExecDoInitialPruning() now include only the PartitionPruneContexts for initial pruning steps. Exec pruning contexts are initialized later in ExecInitPartitionExecPruning() when the parent plan node is initialized, as the exec pruning step expressions depend on the parent node's PlanState. The existing function PartitionPruneFixSubPlanMap() has been repurposed for this initialization to avoid duplicating a similar loop structure for finding PartitionedRelPruningData to initialize exec pruning contexts for. It has been renamed to InitExecPruningContexts() to reflect its new primary responsibility. The original logic to "fix subplan maps" remains intact but is now encapsulated within the renamed function. This commit removes two obsolete Asserts in partkey_datum_from_expr(). The ExprContext used for pruning expression evaluation is now independent of the parent PlanState, making these Asserts unnecessary. 
By centralizing pruning logic and decoupling it from the plan initialization step (ExecInitNode()), this change sets the stage for future patches that will use the result of initial pruning to save the overhead of redundant processing for pruned partitions. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	11 months ago
Tom Lane	115a365519	Simplify executor's handling of CaseTestExpr & CoerceToDomainValue. Instead of deciding at runtime whether to read from casetest.value or caseValue_datum, split EEOP_CASE_TESTVAL into two opcodes and make the decision during expression compilation. Similarly for EEOP_DOMAIN_TESTVAL. This actually results in net less code, mainly because llvmjit_expr.c's code for handling these opcodes gets shorter. The performance gain is doubtless negligible, but this seems worth changing anyway on grounds of simplicity and understandability. Author: Andreas Karlsson <andreas@proxel.se> Co-authored-by: Xing Guo <higuoxing@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACpMh+AiBYAWn+D1aU7Rsy-V1tox06Cbc0H3qA7rwL5zdJ=anQ@mail.gmail.com	11 months ago
Amit Langote	bb3ec16e14	Move PartitionPruneInfo out of plan nodes into PlannedStmt This moves PartitionPruneInfo from plan nodes to PlannedStmt, simplifying traversal by centralizing all PartitionPruneInfo structures in a single list in it, which holds all instances for the main query and its subqueries. Instead of plan nodes (Append or MergeAppend) storing PartitionPruneInfo pointers, they now reference an index in this list. A bitmapset field is added to PartitionPruneInfo to store the RT indexes corresponding to the apprelids field in Append or MergeAppend. This allows execution pruning logic to verify that it operates on the correct plan node, mainly to facilitate debugging. Duplicated code in set_append_references() and set_mergeappend_references() is refactored into a new function, register_pruneinfo(). This updates RT indexes by applying rtoffset and adds PartitionPruneInfo to the global list in PlannerGlobal. By allowing pruning to be performed without traversing the plan tree, this change lays the groundwork for runtime initial pruning to occur independently of plan tree initialization. Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> (earlier version) Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	11 months ago
Amit Langote	fb9f955025	Refactor ExecScan() to allow inlining of its core logic This commit refactors ExecScan() by moving its tuple-fetching, filtering, and projection logic into an inline-able function, ExecScanExtended(), defined in src/include/executor/execScan.h. ExecScanExtended() accepts parameters for EvalPlanQual state, qualifiers (ExprState), and projection (ProjectionInfo). Specialized variants of the execution function of a given Scan node (for example, ExecSeqScan() for SeqScan) can then pass const-NULL for unused parameters. This allows the compiler to inline the logic and eliminate unnecessary branches or checks. Each variant function thus contains only the necessary code, optimizing execution for scans where these features are not needed. The variant function to be used is determined in the ExecInit*() function of the node and assigned to the ExecProcNode function pointer in the node's PlanState, effectively turning runtime checks and conditional branches on the NULLness of epqstate, qual, and projInfo into static ones, provided the compiler successfully eliminates unnecessary checks from the inlined code of ExecScanExtended(). Currently, only ExecSeqScan() is modified to take advantage of this inline-ability. Other Scan nodes might benefit from such specialized variant functions but that is left as future work. Benchmarks performed by Junwang Zhao, David Rowley and myself show up to a 5% reduction in execution time for queries that rely heavily on Seq Scans. The most significant improvements were observed in scenarios where EvalPlanQual, qualifiers, and projection were not required, but other cases also benefit from reduced runtime overhead due to the inlining and removal of unnecessary code paths. The idea for this patch first came from Andres Freund in an off-list discussion. The refactoring approach implemented here is based on a proposal by David Rowley, significantly improving upon the patch I (amitlan) initially proposed. 
Suggested-by: Andres Freund <andres@anarazel.de> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Tested-by: Junwang Zhao <zhjwpku@gmail.com> Tested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CA+HiwqGaH-otvqW_ce-paL=96JvU4j+Xbuk+14esJNDwefdkOg@mail.gmail.com	11 months ago
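The specialization technique described for ExecScanExtended() can be sketched in standalone C. Everything below is a toy model with hypothetical names: tuples are plain ints, and `scan_extended()` plays the role of the inline-able core that takes optional qual and projection callbacks.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*QualFn) (int tuple);
typedef int (*ProjFn) (int tuple);

/* Generic core: filters and projects only when the callbacks are non-NULL. */
static inline long
scan_extended(const int *tuples, size_t n, QualFn qual, ProjFn proj)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++)
    {
        int t = tuples[i];

        if (qual != NULL && !qual(t))
            continue;
        if (proj != NULL)
            t = proj(t);
        sum += t;
    }
    return sum;
}

/* Variant without qual or projection: passing constant NULLs lets the
 * compiler drop both branches once the core is inlined. */
long
scan_plain(const int *tuples, size_t n)
{
    return scan_extended(tuples, n, NULL, NULL);
}

bool
is_even(int t)
{
    return t % 2 == 0;
}

int
twice(int t)
{
    return 2 * t;
}

/* Variant with both a qual and a projection. */
long
scan_even_doubled(const int *tuples, size_t n)
{
    return scan_extended(tuples, n, is_even, twice);
}
```

In the commit's design the choice between such variants is made once, at node initialization, by assigning the matching function to the node's ExecProcNode pointer; whether the branches really disappear depends on the compiler inlining the core.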
Dean Rasheed	80feb727c8	Add OLD/NEW support to RETURNING in DML queries. This allows the RETURNING list of INSERT/UPDATE/DELETE/MERGE queries to explicitly return old and new values by using the special aliases "old" and "new", which are automatically added to the query (if not already defined) while parsing its RETURNING list, allowing things like: RETURNING old.colname, new.colname, ... RETURNING old.*, new.* Additionally, a new syntax is supported, allowing the names "old" and "new" to be changed to user-supplied alias names, e.g.: RETURNING WITH (OLD AS o, NEW AS n) o.colname, n.colname, ... This is useful when the names "old" and "new" are already defined, such as inside trigger functions, allowing backwards compatibility to be maintained -- the interpretation of any existing queries that happen to already refer to relations called "old" or "new", or use those as aliases for other relations, is not changed. For an INSERT, old values will generally be NULL, and for a DELETE, new values will generally be NULL, but that may change for an INSERT with an ON CONFLICT ... DO UPDATE clause, or if a query rewrite rule changes the command type. Therefore, we put no restrictions on the use of old and new in any DML queries. Dean Rasheed, reviewed by Jian He and Jeff Davis. Discussion: https://postgr.es/m/CAEZATCWx0J0-v=Qjc6gXzR=KtsdvAE7Ow=D=mu50AgOe+pvisQ@mail.gmail.com	11 months ago
Peter Eisentraut	6339f6468e	Rename RowCompareType to CompareType RowCompareType served as a way to describe the fundamental meaning of an operator, notionally independent of an operator class (although so far this was only really supported for btrees). Its original purpose was for use inside RowCompareExpr, and it has also found some small use outside, such as for get_op_btree_interpretation(). We want to expand this now, as a more general way to describe operator semantics for other index access methods, including gist (to improve GistTranslateStratnum()) and others not written yet. To avoid future confusion, we rename the type to CompareType and the symbols from ROWCOMPARE_XXX to COMPARE_XXX to reflect their more general purpose. Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	12 months ago
Jeff Davis	b4a07f532b	Revert "TupleHashTable: store additional data along with tuple." This reverts commit `e0ece2a981` due to performance regressions. Reported-by: David Rowley	12 months ago
Jeff Davis	e0ece2a981	TupleHashTable: store additional data along with tuple. Previously, the caller needed to allocate the memory and the TupleHashTable would store a pointer to it. That wastes space for the palloc overhead as well as the size of the pointer itself. Now, the TupleHashTable relies on the caller to correctly specify the additionalsize, and allocates that amount of space. The caller can then request a pointer into that space. Discussion: https://postgr.es/m/b9cbf0219a9859dc8d240311643ff4362fd9602c.camel@j-davis.com Reviewed-by: Heikki Linnakangas	12 months ago
Jeff Davis	834c9e807c	Add missing typedefs.list entry for AggStatePerGroupData. Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com	12 months ago
Bruce Momjian	50e6eb731d	Update copyright for 2025 Backpatch-through: 13	12 months ago
Tom Lane	e0a2721f7c	Get rid of old version of BuildTupleHashTable(). It was reasonable to preserve the old API of BuildTupleHashTable() in the back branches, but in HEAD we should actively discourage use of that version. There are no remaining callers in core, so just get rid of it. Then rename BuildTupleHashTableExt() back to BuildTupleHashTable(). While at it, fix up the miserably-poorly-maintained header comment for BuildTupleHashTable[Ext]. It looks like more than one patch in this area has had the opinion that updating comments is beneath them. Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us	1 year ago
Tom Lane	2762792952	Convert SetOp to read its inputs as outerPlan and innerPlan. The original design for set operations involved appending the two input relations into one and adding a flag column that allows distinguishing which side each row came from. Then the SetOp node pries them apart again based on the flag. This is bizarre. The only apparent reason to do it is that when sorting, we'd only need one Sort node not two. But since sorting is at least O(N log N), sorting all the data is actually worse than sorting each side separately --- plus, we have no chance of taking advantage of presorted input. On top of that, adding the flag column frequently requires an additional projection step that adds cycles, and then the Append node isn't free either. Let's get rid of all of that and make the SetOp node have two separate children, using the existing outerPlan/innerPlan infrastructure. This initial patch re-implements nodeSetop.c and does a bare minimum of work on the planner side to generate correctly-shaped plans. In particular, I've tried not to change the cost estimates here, so that the visible changes in the regression test results will only involve removal of useless projection steps and not any changes in whether to use sorted vs hashed mode. For SORTED mode, we combine successive identical tuples from each input into groups, and then merge-join the groups. The tuple comparisons now use SortSupport instead of simple equality, but the group-formation part should involve roughly the same number of tuple comparisons as before. The cross-comparisons between left and right groups probably add to that, but I'm not sure how to quantify how many more comparisons we might need. For HASHED mode, nodeSetop's logic is almost the same as before, just refactored into two separate loops instead of one loop that has an assumption that it will see all the left-hand inputs first. 
In both modes, I added early-exit logic to not bother reading the right-hand relation if the left-hand input is empty, since neither INTERSECT nor EXCEPT modes can produce any output if the left input is empty. This could have been done before in the hashed mode, but not in sorted mode. Sorted mode can also stop as soon as it exhausts the left input; any remaining right-hand tuples cannot have matches. Also, this patch adds some infrastructure for detecting whether child plan nodes all output the same type of tuple table slot. If they do, the hash table logic can use slightly more efficient code based on assuming that that's the input slot type it will see. We'll make use of that infrastructure in other plan node types later. Patch by me; thanks to Richard Guo and David Rowley for review. Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us	1 year ago
David Rowley	d96d1d5152	Fix incorrect slot type in BuildTupleHashTableExt `0f5738202` adjusted the execGrouping.c code so it made use of ExprStates to generate hash values. That commit made a wrong assumption that the slot type to pass to ExecBuildHash32FromAttrs() is always &TTSOpsMinimalTuple. That's not the case as the slot type depends on the slot type passed to LookupTupleHashEntry(), which for nodeRecursiveunion.c, could be any of the current slot types. Here we fix this by adding a new parameter to BuildTupleHashTableExt() to allow the slot type to be passed in. In the case of nodeSubplan.c and nodeAgg.c the slot type is always &TTSOpsVirtual, so for both of those cases, it's beneficial to pass the known slot type as that allows ExecBuildHash32FromAttrs() to skip adding the tuple deform step to the resulting ExprState. Another possible fix would have been to have ExecBuildHash32FromAttrs() set "fetch.kind" to NULL so that ExecComputeSlotInfo() always determines that EEOP_INNER_FETCHSOME is required, however, that option isn't favorable as it slows down aggregation and hashed subplan evaluation due to the extra (needless) deform step. Thanks to Nathan Bossart for bisecting to find the offending commit based on Paul's report. Reported-by: Paul Ramsey <pramsey@cleverelephant.ca> Discussion: https://postgr.es/m/99F064C1-B3EB-4BE7-97D2-D2A0AA487A71@cleverelephant.ca	1 year ago
David Rowley	0f5738202b	Use ExprStates for hashing in GROUP BY and SubPlans This speeds up obtaining hash values for GROUP BY and hashed SubPlans by using the ExprState support for hashing, thus allowing JIT compilation for obtaining hash values for these operations. This, even without JIT compilation, has been shown to improve Hash Aggregate performance in some cases by around 15% and hashed NOT IN queries in one case by over 30%, however, real-world cases are likely to see smaller gains as the test cases used were purposefully designed to have high hashing overheads by keeping the hash table small to prevent additional memory overheads that would be a factor when working with large hash tables. In passing, fix a hypothetical bug in ExecBuildHash32Expr() so that the initial value is stored directly in the ExprState's result field if there are no expressions to hash. None of the current users of this function use an initial value, so the bug is only hypothetical. Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CAApHDvpYSO3kc9UryMevWqthTBrxgfd9djiAjKHMPUSQeX9vdQ@mail.gmail.com	1 year ago
Peter Eisentraut	a2a475b011	Replace get_equal_strategy_number_for_am() by get_equal_strategy_number() get_equal_strategy_number_for_am() gets the equal strategy number for an AM. This currently only supports btree and hash. In the more general case, this also depends on the operator class (see for example GistTranslateStratnum()). To support that, replace this function with get_equal_strategy_number() that takes an opclass and derives it from there. (This function already existed before as a static function, so the signature is kept for simplicity.) This patch is only a refactoring, it doesn't add support for other index AMs such as gist. This will be done separately. Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com	1 year ago
Tom Lane	3eea7a0c97	Simplify executor's determination of whether to use parallelism. Our parallel-mode code only works when we are executing a query in full, so ExecutePlan must disable parallel mode when it is asked to do partial execution. The previous logic for this involved passing down a flag (variously named execute_once or run_once) from callers of ExecutorRun or PortalRun. This is overcomplicated, and unsurprisingly some of the callers didn't get it right, since it requires keeping state that not all of them have handy; not to mention that the requirements for it were undocumented. That led to assertion failures in some corner cases. The only state we really need for this is the existing QueryDesc.already_executed flag, so let's just put all the responsibility in ExecutePlan. (It could have been done in ExecutorRun too, leading to a slightly shorter patch -- but if there's ever more than one caller of ExecutePlan, it seems better to have this logic in the subroutine than the callers.) This makes those ExecutorRun/PortalRun parameters unnecessary. In master it seems okay to just remove them, returning the API for those functions to what it was before parallelism. Such an API break is clearly not okay in stable branches, but for them we can just leave the parameters in place after documenting that they do nothing. Per report from Yugo Nagata, who also reviewed and tested this patch. Back-patch to all supported branches. Discussion: https://postgr.es/m/20241206062549.710dc01cf91224809dd6c0e1@sraoss.co.jp	1 year ago
Tom Lane	5b8728cd7f	Fix NULLIF()'s handling of read-write expanded objects. If passed a read-write expanded object pointer, the EEOP_NULLIF code would hand that same pointer to the equality function and then (unless equality was reported) also return the same pointer as its value. This is no good, because a function that receives a read-write expanded object pointer is fully entitled to scribble on or even delete the object, thus corrupting the NULLIF output. (This problem is likely unobservable with the equality functions provided in core Postgres, but it's easy to demonstrate with one coded in plpgsql.) To fix, make sure the pointer passed to the equality function is read-only. We can still return the original read-write pointer as the NULLIF result, allowing optimization of later operations. Per bug #18722 from Alexander Lakhin. This has been wrong since we invented expanded objects, so back-patch to all supported branches. Discussion: https://postgr.es/m/18722-fd9e645448cc78b4@postgresql.org	1 year ago
David Rowley	9ca67658d1	Don't store intermediate hash values in ExprState->resvalue `adf97c156` made it so ExprStates could support hashing and changed Hash Join to use that instead of manually extracting Datums from tuples and hashing them one column at a time. When hashing multiple columns or expressions, the code added in that commit stored the intermediate hash value in the ExprState's resvalue field. That was a mistake as steps may be injected into the ExprState between each hashing step that look at or overwrite the stored intermediate hash value. EEOP_PARAM_SET is an example of such a step. Here we fix this by adding a new dedicated field for storing intermediate hash values and adjust the code so that all apart from the final hashing step store their result in the intermediate field. In passing, rename a variable so that it's more aligned to the surrounding code and also so a few lines stay within the 80 char margin. Reported-by: Andres Freund Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAApHDvqo9eenEFXND5zZ9JxO_k4eTA4jKMGxSyjdTrsmYvnmZw@mail.gmail.com	1 year ago
Amit Kapila	9758174e2e	Log the conflicts while applying changes in logical replication. This patch provides the additional logging information in the following conflict scenarios while applying changes: insert_exists: Inserting a row that violates a NOT DEFERRABLE unique constraint. update_differ: Updating a row that was previously modified by another origin. update_exists: The updated row value violates a NOT DEFERRABLE unique constraint. update_missing: The tuple to be updated is missing. delete_differ: Deleting a row that was previously modified by another origin. delete_missing: The tuple to be deleted is missing. For insert_exists and update_exists conflicts, the log can include the origin and commit timestamp details of the conflicting key with track_commit_timestamp enabled. update_differ and delete_differ conflicts can only be detected when track_commit_timestamp is enabled on the subscriber. We do not offer additional logging for exclusion constraint violations because these constraints can specify rules that are more complex than simple equality checks. Resolving such conflicts won't be straightforward. This area can be further enhanced if required. Author: Hou Zhijie Reviewed-by: Shveta Malik, Amit Kapila, Nisha Moond, Hayato Kuroda, Dilip Kumar Discussion: https://postgr.es/m/OS0PR01MB5716352552DFADB8E9AD1D8994C92@OS0PR01MB5716.jpnprd01.prod.outlook.com	1 year ago

785 Commits (266543a62055541ddefe4e66797e0e1e4aa8705a)