mirror of https://github.com/postgres/postgres
columns, and the new version can be stored on the same heap page, we no longer generate extra index entries for the new version. Instead, index searches follow the HOT-chain links to ensure they find the correct tuple version. In addition, this patch introduces the ability to "prune" dead tuples on a per-page basis, without having to do a complete VACUUM pass to recover space. VACUUM is still needed to clean up dead index entries, however. Pavan Deolasee, with help from a bunch of other people.
REL8_3_STABLE
parent bbf4fdc253
commit 282d2a03dd
@@ -0,0 +1,489 @@
||||
$PostgreSQL: pgsql/src/backend/access/heap/README.HOT,v 1.1 2007/09/20 17:56:30 tgl Exp $ |
||||
|
||||
Heap Only Tuples (HOT) |
||||
|
||||
Introduction |
||||
------------ |
||||
|
||||
The Heap Only Tuple (HOT) feature eliminates redundant index entries and |
||||
allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples |
||||
without performing a table-wide vacuum. It does this by allowing |
||||
single-page vacuuming, also called "defragmentation". |
||||
|
||||
Note: there is a Glossary at the end of this document that may be helpful |
||||
for first-time readers. |
||||
|
||||
|
||||
Technical Challenges |
||||
-------------------- |
||||
|
||||
Page-at-a-time vacuuming is normally impractical because of the costs of |
||||
finding and removing the index entries that link to the tuples to be |
||||
reclaimed. Standard vacuuming scans the indexes to ensure all such index |
||||
entries are removed, amortizing the index scan cost across as many dead |
||||
tuples as possible; this approach does not scale down well to the case of |
||||
reclaiming just a few tuples. In principle one could recompute the index |
||||
keys and do standard index searches to find the index entries, but this is |
||||
risky in the presence of possibly-buggy user-defined functions in |
||||
functional indexes. An allegedly immutable function that in fact is not |
||||
immutable might prevent us from re-finding an index entry (and we cannot |
||||
throw an error for not finding it, in view of the fact that dead index |
||||
entries are sometimes reclaimed early). That would lead to a seriously |
||||
corrupt index, in the form of entries pointing to tuple slots that by now |
||||
contain some unrelated content. In any case we would prefer to be able |
||||
to do vacuuming without invoking any user-written code. |
||||
|
||||
HOT solves this problem for a restricted but useful special case: |
||||
where a tuple is repeatedly updated in ways that do not change its |
||||
indexed columns. (Here, "indexed column" means any column referenced |
||||
at all in an index definition, including for example columns that are |
||||
tested in a partial-index predicate but are not stored in the index.) |
||||
|
||||
An additional property of HOT is that it reduces index size by avoiding |
||||
the creation of identically-keyed index entries. This improves search |
||||
speeds. |
||||
|
||||
|
||||
Update Chains With a Single Index Entry |
||||
--------------------------------------- |
||||
|
||||
Without HOT, every version of a row in an update chain has its own index |
||||
entries, even if all indexed columns are the same. With HOT, a new tuple |
||||
placed on the same page and with all indexed columns the same as its |
||||
parent row version does not get new index entries. This means there is |
||||
only one index entry for the entire update chain on the heap page. |
||||
An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag. |
||||
The prior row version is marked HEAP_HOT_UPDATED, and (as always in an |
||||
update chain) its t_ctid field links forward to the newer version. |
||||
|
||||
For example: |
||||
|
||||
    Index points to 1
    lp [1]  [2]

    [111111111]->[2222222222]
||||
|
||||
In the above diagram, the index points to line pointer 1, and tuple 1 is |
||||
marked as HEAP_HOT_UPDATED. Tuple 2 is a HOT tuple, meaning it has |
||||
no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE. |
||||
Although tuple 2 is not directly referenced by the index, it can still be |
||||
found by an index search: after traversing from the index to tuple 1, |
||||
the index search proceeds forward to child tuples as long as it sees the |
||||
HEAP_HOT_UPDATED flag set. Since we restrict the HOT chain to lie within |
||||
a single page, this requires no additional page fetches and doesn't |
||||
introduce much performance penalty. |
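
The chain walk described above can be sketched in C on a toy page model. This is only an illustration: ToyTuple, ToyPage, toy_hot_search and the flag values are hypothetical names invented here, not PostgreSQL structures, and snapshot visibility is reduced to a boolean. The sketch stops at the first visible tuple, which corresponds to the MVCC-snapshot case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for the toy model (not the real infomask values). */
#define TOY_HOT_UPDATED 0x1     /* a heap-only successor exists on this page */
#define TOY_HEAP_ONLY   0x2     /* no index entry points at this tuple */

#define TOY_MAX_ITEMS 8

typedef struct ToyTuple
{
    unsigned    flags;
    int         next;       /* 1-based slot of the next version, 0 if none */
    bool        visible;    /* stand-in for a snapshot visibility test */
} ToyTuple;

typedef struct ToyPage
{
    int         nitems;
    ToyTuple    items[TOY_MAX_ITEMS + 1];   /* slot 0 is unused */
} ToyPage;

/*
 * Start at the slot an index entry points to and walk forward while the
 * current tuple is marked HOT-updated.  Returns the slot of the first
 * visible tuple, or 0 if the chain holds none.
 */
int
toy_hot_search(const ToyPage *page, int root)
{
    int         slot = root;
    int         steps;

    assert(root >= 1 && root <= page->nitems);

    /* Bound the walk by the item count so a corrupt link cannot loop. */
    for (steps = 0; steps < page->nitems; steps++)
    {
        const ToyTuple *tup = &page->items[slot];

        if (tup->visible)
            return slot;
        if (!(tup->flags & TOY_HOT_UPDATED))
            break;              /* end of the HOT chain */
        slot = tup->next;
        if (slot < 1 || slot > page->nitems)
            break;              /* link leaves the page: not a HOT link */
    }
    return 0;
}
```

With the two-tuple page from the diagram, a search that starts at line pointer 1 and finds tuple 1 invisible moves on to tuple 2 without touching any other page.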
||||
|
||||
Eventually, tuple 1 will no longer be visible to any transaction. |
||||
At that point its space could be reclaimed, but its line pointer cannot, |
||||
since the index still links to that line pointer and we still need to |
||||
be able to find tuple 2 in an index search. HOT handles this by turning |
||||
line pointer 1 into a "redirecting line pointer", which links to tuple 2 |
||||
but has no actual tuple attached. This state of affairs looks like |
||||
|
||||
    Index points to 1
    lp [1]->[2]

    [2222222222]
||||
|
||||
If now the row is updated again, to version 3, the page looks like this: |
||||
|
||||
    Index points to 1
    lp [1]->[2]  [3]

    [2222222222]->[3333333333]
||||
|
||||
At some later time when no transaction can see tuple 2 in its snapshot, |
||||
tuple 2 and its line pointer can be pruned entirely: |
||||
|
||||
    Index points to 1
    lp [1]------>[3]

    [3333333333]
||||
|
||||
This is safe because no index entry points to line pointer 2. Subsequent |
||||
insertions into the page can now recycle both line pointer 2 and the |
||||
space formerly used by tuple 2. |
||||
|
||||
If an update changes any indexed column, or there is not room on the |
||||
same page for the new tuple, then the HOT chain ends: the last member |
||||
has a regular t_ctid link to the next version and is not marked |
||||
HEAP_HOT_UPDATED. (In principle we could continue a HOT chain across |
||||
pages, but this would destroy the desired property of being able to |
||||
reclaim space with just page-local manipulations. Anyway, we don't |
||||
want to have to chase through multiple heap pages to get from an index |
||||
entry to the desired tuple, so it seems better to create a new index |
||||
entry for the new tuple.) If further updates occur, the next version |
||||
could become the root of a new HOT chain. |
||||
|
||||
Line pointer 1 has to remain as long as there is any non-dead member of |
||||
the chain on the page. When there is not, it is marked "dead". |
||||
This lets us reclaim the last child line pointer and associated tuple |
||||
immediately. The next regular VACUUM pass can reclaim the index entries |
||||
pointing at the line pointer and then the line pointer itself. Since a |
||||
line pointer is small compared to a tuple, this does not represent an |
||||
undue space cost. |
||||
|
||||
Note: we can use a "dead" line pointer for any DELETEd tuple, |
||||
whether it was part of a HOT chain or not. This allows space reclamation |
||||
in advance of running VACUUM for plain DELETEs as well as HOT updates. |
||||
|
||||
The requirement for doing a HOT update is that none of the indexed |
||||
columns are changed. This is checked at execution time by comparing the |
||||
binary representation of the old and new values. We insist on bitwise |
||||
equality rather than using datatype-specific equality routines. The |
||||
main reason to avoid the latter is that there might be multiple notions |
||||
of equality for a datatype, and we don't know exactly which one is |
||||
relevant for the indexes at hand. We assume that bitwise equality |
||||
guarantees equality for all purposes. |
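
The bitwise comparison rule can be illustrated with a small C sketch. ToyDatum, toy_datum_image_eq and toy_update_is_hot_safe are hypothetical names for this example only; the real check operates on heap tuples and their attribute descriptors, which are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened column value: a byte image plus a NULL marker. */
typedef struct ToyDatum
{
    const void *data;
    size_t      len;
    bool        isnull;
} ToyDatum;

/* Bitwise equality: same nullness, same length, same bytes. */
bool
toy_datum_image_eq(const ToyDatum *a, const ToyDatum *b)
{
    if (a->isnull || b->isnull)
        return a->isnull && b->isnull;
    if (a->len != b->len)
        return false;
    return a->len == 0 || memcmp(a->data, b->data, a->len) == 0;
}

/*
 * An update is HOT-safe when every indexed column is bitwise unchanged.
 * indexed[] lists the positions of the indexed columns in the row arrays.
 */
bool
toy_update_is_hot_safe(const ToyDatum *oldrow, const ToyDatum *newrow,
                       const int *indexed, int nindexed)
{
    int         i;

    assert(nindexed >= 0);
    for (i = 0; i < nindexed; i++)
    {
        int         col = indexed[i];

        if (!toy_datum_image_eq(&oldrow[col], &newrow[col]))
            return false;
    }
    return true;
}
```

The test is deliberately conservative. On IEEE 754 platforms, for instance, 0.0 and -0.0 compare equal with the C == operator but have different byte images, so an update from one to the other in an indexed column would be treated as a change by this sketch.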
||||
|
||||
|
||||
Abort Cases |
||||
----------- |
||||
|
||||
If a heap-only tuple's xmin is aborted, then it can be removed immediately: |
||||
it was never visible to any other transaction, and all descendant row |
||||
versions must be aborted as well. Therefore we need not consider it part |
||||
of a HOT chain. By the same token, if a HOT-updated tuple's xmax is |
||||
aborted, there is no need to follow the chain link. However, there is a |
||||
race condition here: the transaction that did the HOT update might abort |
||||
between the time we inspect the HOT-updated tuple and the time we reach |
||||
the descendant heap-only tuple. It is conceivable that someone prunes |
||||
the heap-only tuple before that, and even conceivable that the line pointer |
||||
is re-used for another purpose. Therefore, when following a HOT chain, |
||||
it is always necessary to be prepared for the possibility that the |
||||
linked-to item pointer is unused, dead, or redirected; and if it is a |
||||
normal item pointer, we still have to check that XMIN of the tuple matches |
||||
the XMAX of the tuple we left. Otherwise we should assume that we have |
||||
come to the end of the HOT chain. Note that this sort of XMIN/XMAX |
||||
matching is required when following ordinary update chains anyway. |
||||
|
||||
(Early versions of the HOT code assumed that holding pin on the page |
||||
buffer while following a HOT link would prevent this type of problem, |
||||
but checking XMIN/XMAX matching is a much more robust solution.) |
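
A minimal sketch of the link check described in this section, assuming a toy item representation: ToyLpState, ToyItem and toy_link_is_valid are hypothetical names, and XIDs are compared as plain 32-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical line pointer states for the toy model. */
typedef enum ToyLpState
{
    TOY_LP_UNUSED,
    TOY_LP_NORMAL,
    TOY_LP_REDIRECT,
    TOY_LP_DEAD
} ToyLpState;

typedef struct ToyItem
{
    ToyLpState  state;
    uint32_t    xmin;       /* inserting transaction, valid if NORMAL */
    uint32_t    xmax;       /* updating/deleting transaction, 0 if none */
} ToyItem;

/*
 * Decide whether the item a chain link points to really is the successor
 * of the tuple we just left.  prior_xmax is the XMAX of that tuple.
 * Any other answer means the chain should be treated as ended.
 */
bool
toy_link_is_valid(const ToyItem *target, uint32_t prior_xmax)
{
    assert(target != NULL);

    /* The slot may have been pruned, killed, or turned into a redirect. */
    if (target->state != TOY_LP_NORMAL)
        return false;

    /* A recycled slot holds an unrelated tuple with a different XMIN. */
    return target->xmin == prior_xmax;
}
```

The point of the second test is that a NORMAL state alone proves nothing: the slot may have been recycled for an unrelated tuple since the link was read.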
||||
|
||||
|
||||
Index/Sequential Scans |
||||
---------------------- |
||||
|
||||
When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose |
||||
xmax is not aborted, we need to follow its t_ctid link and check that |
||||
entry as well; possibly repeatedly until we reach the end of the HOT |
||||
chain. (When using an MVCC snapshot it is possible to optimize this a |
||||
bit: there can be at most one visible tuple in the chain, so we can stop |
||||
when we find it. This rule does not work for non-MVCC snapshots, though.) |
||||
|
||||
Sequential scans do not need to pay attention to the HOT links because |
||||
they scan every item pointer on the page anyway. The same goes for a |
||||
bitmap heap scan with a lossy bitmap. |
||||
|
||||
|
||||
Pruning |
||||
------- |
||||
|
||||
HOT pruning means updating item pointers so that HOT chains are |
||||
reduced in length, by collapsing out line pointers for intermediate dead |
||||
tuples. Although this makes those line pointers available for re-use, |
||||
it does not immediately make the space occupied by their tuples available. |
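
The line pointer transitions involved in pruning can be sketched as follows. This is a simplified model, not heap_prune_chain itself: ToyItem, ToyLpState and toy_prune_chain are hypothetical, "dead" is a precomputed boolean instead of a visibility test against OldestXmin, and only line pointers are touched (tuple storage is left for defragmentation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyLpState
{
    TOY_LP_UNUSED,
    TOY_LP_NORMAL,
    TOY_LP_REDIRECT,
    TOY_LP_DEAD
} ToyLpState;

typedef struct ToyItem
{
    ToyLpState  state;
    int         next;   /* NORMAL: next chain member; REDIRECT: target; 0 = none */
    bool        dead;   /* NORMAL only: tuple is dead to every transaction */
} ToyItem;

/*
 * Prune the chain rooted at items[root] (slots are 1-based, slot 0 unused).
 * Dead versions at the front of the chain are freed.  The root line pointer
 * survives as a redirect to the first remaining member, or as a dead stub
 * when nothing remains.  Returns the number of tuples removed.
 */
int
toy_prune_chain(ToyItem *items, int nitems, int root)
{
    int         slot = root;
    int         removed = 0;
    int         steps;

    assert(root >= 1 && root <= nitems);

    /* A redirecting root has no tuple; start at the tuple it links to. */
    if (items[root].state == TOY_LP_REDIRECT)
        slot = items[root].next;

    for (steps = 0; steps < nitems && slot >= 1 && slot <= nitems; steps++)
    {
        int         next;

        if (items[slot].state != TOY_LP_NORMAL || !items[slot].dead)
            break;
        next = items[slot].next;
        items[slot].state = TOY_LP_UNUSED;
        items[slot].next = 0;
        removed++;
        slot = next;
    }

    if (slot == root)
        return 0;               /* root tuple is still needed: no change */

    if (slot >= 1 && slot <= nitems && items[slot].state == TOY_LP_NORMAL)
    {
        items[root].state = TOY_LP_REDIRECT;    /* index entries stay valid */
        items[root].next = slot;
    }
    else
    {
        items[root].state = TOY_LP_DEAD;        /* left for VACUUM */
        items[root].next = 0;
    }
    return removed;
}
```

Replaying the diagrams from the "Update Chains With a Single Index Entry" section against this sketch yields a redirect from slot 1 to slot 2, then a redirect from slot 1 to slot 3 with slot 2 unused, and finally a dead slot 1 once version 3 dies as well.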
||||
|
||||
|
||||
Defragmentation |
||||
--------------- |
||||
|
||||
Defragmentation centralizes unused space. After we have converted root |
||||
line pointers to redirected line pointers and pruned away any dead |
||||
intermediate line pointers, the tuples they linked to are free space. |
||||
But unless that space is adjacent to the central "hole" on the page |
||||
(the pd_lower-to-pd_upper area) it cannot be used by tuple insertion. |
||||
Defragmentation moves the surviving tuples to coalesce all the free |
||||
space into one "hole". This is done with the same PageRepairFragmentation |
||||
function that regular VACUUM uses. |
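
The effect of defragmentation can be shown on a toy page. The layout below is an assumption made for illustration (a 64-byte data area, a separate line pointer array, and len == 0 meaning "no storage"); ToyPage, ToyLp and toy_defragment are hypothetical and do not reproduce PageRepairFragmentation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_MAX_LP    8

typedef struct ToyLp
{
    unsigned short off;     /* offset of the tuple in data[] */
    unsigned short len;     /* tuple length, 0 if the slot has no storage */
} ToyLp;

typedef struct ToyPage
{
    unsigned char data[TOY_PAGE_SIZE];
    ToyLp       lp[TOY_MAX_LP];
    int         nlp;
    int         upper;      /* free space ends here; tuples live above it */
} ToyPage;

/*
 * Pack all surviving tuples against the end of the page so that the free
 * space forms a single hole below "upper".  Tuples are assumed not to
 * overlap.  Line pointers keep their slot numbers; only offsets change.
 */
void
toy_defragment(ToyPage *page)
{
    unsigned char scratch[TOY_PAGE_SIZE];
    int         upper = TOY_PAGE_SIZE;
    int         i;

    for (i = 0; i < page->nlp; i++)
    {
        ToyLp      *lp = &page->lp[i];

        if (lp->len == 0)
            continue;           /* unused, dead, or redirect: no storage */
        assert(lp->off + lp->len <= TOY_PAGE_SIZE);
        assert(upper >= lp->len);
        upper -= lp->len;
        memcpy(scratch + upper, page->data + lp->off, lp->len);
        lp->off = (unsigned short) upper;
    }

    memcpy(page->data + upper, scratch + upper, TOY_PAGE_SIZE - upper);
    page->upper = upper;
}
```

Because slot numbers never change, index entries and redirects that refer to a line pointer remain valid after the tuples have moved.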
||||
|
||||
|
||||
When can/should we prune or defragment? |
||||
--------------------------------------- |
||||
|
||||
This is the most interesting question in HOT implementation, since there |
||||
is no simple right answer: we must use heuristics to determine when it's |
||||
most efficient to perform pruning and/or defragmenting. |
||||
|
||||
We cannot prune or defragment unless we can get a "buffer cleanup lock" |
||||
on the target page; otherwise, pruning might destroy line pointers that |
||||
other backends have live references to, and defragmenting might move |
||||
tuples that other backends have live pointers to. Thus the general |
||||
approach must be to heuristically decide if we should try to prune |
||||
or defragment, and if so try to acquire the buffer cleanup lock without |
||||
blocking. If we succeed we can proceed with our housekeeping work. |
||||
If we cannot get the lock (which should not happen often, except under |
||||
very heavy contention) then the housekeeping has to be postponed till |
||||
some other time. The worst-case consequence of this is only that an |
||||
UPDATE cannot be made HOT but has to link to a new tuple version placed on |
||||
some other page, for lack of centralized space on the original page. |
||||
|
||||
Ideally we would do defragmenting only when we are about to attempt |
||||
heap_update on a HOT-safe tuple. The difficulty with this approach |
||||
is that the update query has certainly got a pin on the old tuple, and |
||||
therefore our attempt to acquire a buffer cleanup lock will always fail. |
||||
(This corresponds to the idea that we don't want to move the old tuple |
||||
out from under where the query's HeapTuple pointer points. It might |
||||
be possible to finesse that, but it seems fragile.) |
||||
|
||||
Pruning, however, is potentially useful even when we are not about to |
||||
insert a new tuple, since shortening a HOT chain reduces the cost of |
||||
subsequent index searches. However it is unclear that this gain is |
||||
large enough to justify any extra maintenance burden.
||||
|
||||
The currently planned heuristic is to prune and defrag when first accessing |
||||
a page that potentially has prunable tuples (flagged by the PD_PRUNABLE |
||||
page hint bit) and that either has free space less than MAX(fillfactor |
||||
target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to |
||||
find enough free space to store an updated tuple version. (These rules |
||||
are subject to change.) |
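
The heuristic can be written down as a small predicate. This is a sketch under stated assumptions: TOY_BLCKSZ is fixed at 8192, the fillfactor is passed as a percentage, and toy_target_free_space and toy_should_prune are hypothetical names. It covers only the decision itself, not the attempt to take the buffer cleanup lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLCKSZ 8192     /* assumed block size for this example */

/* Free space a page should keep for a given fillfactor percentage. */
size_t
toy_target_free_space(int fillfactor)
{
    assert(fillfactor >= 10 && fillfactor <= 100);
    return (size_t) TOY_BLCKSZ * (100 - fillfactor) / 100;
}

/*
 * prunable:   page is hinted to contain something prunable
 * page_full:  a recent UPDATE failed to find room on this page
 * free_space: current free space on the page, in bytes
 */
bool
toy_should_prune(bool prunable, bool page_full,
                 size_t free_space, int fillfactor)
{
    size_t      minfree = toy_target_free_space(fillfactor);

    /* Never let the threshold drop below a tenth of the block. */
    if (minfree < TOY_BLCKSZ / 10)
        minfree = TOY_BLCKSZ / 10;

    if (!prunable)
        return false;
    return page_full || free_space < minfree;
}
```

With a fillfactor of 100 the threshold is 819 bytes (8192 / 10 in integer arithmetic); with a fillfactor of 70 it is 2457 bytes.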
||||
|
||||
We have effectively implemented the "truncate dead tuples to just line |
||||
pointer" idea that has been proposed and rejected before because of fear |
||||
of line pointer bloat: we might end up with huge numbers of line pointers |
||||
and just a few actual tuples on a page. To limit the damage in the worst |
||||
case, and to keep various work arrays as well as the bitmaps in bitmap |
||||
scans reasonably sized, the maximum number of line pointers per page |
||||
is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that |
||||
could fit without HOT pruning). |
||||
|
||||
|
||||
VACUUM |
||||
------ |
||||
|
||||
There is little change to regular vacuum. It performs pruning to remove |
||||
dead heap-only tuples, and cleans up any dead line pointers as if they were |
||||
regular dead tuples. |
||||
|
||||
|
||||
VACUUM FULL |
||||
----------- |
||||
|
||||
VACUUM FULL performs an extra operation of collapsing out redirecting line |
||||
pointers, by moving the first non-DEAD tuple of each HOT chain to the root |
||||
position and clearing its heap-only-tuple flag. This effectively changes |
||||
the user-visible CTID of that tuple. This would be completely unsafe |
||||
during normal concurrent operation, but since VACUUM FULL takes full |
||||
exclusive lock on the table, it should be OK. (Note that VACUUM FULL has |
||||
always felt free to change tuples' CTIDs by moving them across pages.) |
||||
Eliminating redirection links means that the main body of VACUUM FULL |
||||
doesn't have to deal with them, which seems a good thing since VACUUM FULL |
||||
is horrendously complex already. |
||||
|
||||
When VACUUM FULL tries to move tuple chains, it does not distinguish regular |
||||
and heap-only tuples, but moves both types in the same way. This is OK because
||||
it will move the entire non-DEAD tail of an update chain and remove index |
||||
entries for each item moved. At worst, we'll uselessly search for index |
||||
entries matching the heap-only tuples included in the move. |
||||
|
||||
|
||||
Statistics |
||||
---------- |
||||
|
||||
Currently, we count HOT updates the same as cold updates for statistics |
||||
purposes, though there is an additional per-table counter that counts |
||||
only HOT updates. When a page pruning operation is able to remove a |
||||
physical tuple by eliminating an intermediate heap-only tuple or |
||||
replacing a physical root tuple by a redirect pointer, a decrement in |
||||
the table's number of dead tuples is reported to pgstats, which may |
||||
postpone autovacuuming. Note that we do not count replacing a root tuple |
||||
by a DEAD item pointer as decrementing n_dead_tuples; we still want |
||||
autovacuum to run to clean up the index entries and DEAD item. |
||||
|
||||
This area probably needs further work ... |
||||
|
||||
|
||||
CREATE INDEX |
||||
------------ |
||||
|
||||
CREATE INDEX presents a problem for HOT updates. While the existing HOT |
||||
chains all have the same index values for existing indexes, the columns |
||||
in the new index might change within a pre-existing HOT chain, creating |
||||
a "broken" chain that can't be indexed properly. |
||||
|
||||
To address this issue, regular (non-concurrent) CREATE INDEX makes the |
||||
new index usable only by transactions newer than the CREATE INDEX |
||||
command. This prevents transactions that can see the inconsistent HOT |
||||
chains from trying to use the new index and getting incorrect results. |
||||
New transactions can only see the rows visible after the index was |
||||
created, hence the HOT chains are consistent for them. |
||||
|
||||
Entries in the new index point to root tuples (tuples with current index |
||||
pointers) so that our index uses the same index pointers as all other |
||||
indexes on the table. However the row we want to index is actually at |
||||
the *end* of the chain, i.e., the most recent live tuple on the HOT chain.
||||
That is the one we compute the index entry values for, but the TID |
||||
we put into the index is that of the root tuple. Since transactions that |
||||
will be allowed to use the new index cannot see any of the older tuple |
||||
versions in the chain, the fact that they might not match the index entry |
||||
isn't a problem. (Such transactions will check the tuple visibility |
||||
information of the older versions and ignore them, without ever looking at |
||||
their contents, so the content inconsistency is OK.) Subsequent updates |
||||
to the live tuple will be allowed to extend the HOT chain only if they are |
||||
HOT-safe for all the indexes. |
||||
|
||||
Because we have ShareLock on the table, any DELETE_IN_PROGRESS or |
||||
INSERT_IN_PROGRESS tuples should have come from our own transaction. |
||||
Therefore we can consider them committed since if the CREATE INDEX |
||||
commits, they will be committed, and if it aborts the index is discarded. |
||||
An exception to this is that early lock release is customary for system |
||||
catalog updates, and so we might find such tuples when reindexing a system |
||||
catalog. In that case we deal with it by waiting for the source |
||||
transaction to commit or roll back. (We could do that for user tables |
||||
too, but since the case is unexpected we prefer to throw an error.) |
||||
|
||||
Practically, we prevent old transactions from using the new index by |
||||
setting pg_index.indcheckxmin to TRUE. Queries are allowed to use such an |
||||
index only after pg_index.xmin is below their TransactionXmin horizon, |
||||
thereby ensuring that any incompatible rows in HOT chains are dead to them. |
||||
(pg_index.xmin will be the XID of the CREATE INDEX transaction. The reason |
||||
for using xmin rather than a normal column is that the regular vacuum |
||||
freezing mechanism will take care of converting xmin to FrozenTransactionId |
||||
before it can wrap around.) |
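
The usability rule can be sketched as a predicate. The function names are hypothetical, and the XID comparison is a simplified modulo-2^32 test that assumes both XIDs are normal ones (3 or greater); special values such as a frozen xmin are not handled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Circular comparison of two normal XIDs: true if a is older than b. */
bool
toy_xid_precedes(uint32_t a, uint32_t b)
{
    assert(a >= 3 && b >= 3);   /* toy model: normal XIDs only */
    return (int32_t) (a - b) < 0;
}

/*
 * May a query whose oldest snapshot xmin is transaction_xmin use an index
 * whose pg_index row was created by transaction index_xmin?
 */
bool
toy_index_usable(bool indcheckxmin, uint32_t index_xmin,
                 uint32_t transaction_xmin)
{
    if (!indcheckxmin)
        return true;            /* no broken HOT chains were present */
    return toy_xid_precedes(index_xmin, transaction_xmin);
}
```

A transaction whose horizon equals the creating XID fails the test, which is consistent with the statement below that the transaction creating the index cannot use it.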
||||
|
||||
This means in particular that the transaction creating the index will be |
||||
unable to use the index. We alleviate that problem somewhat by not setting |
||||
indcheckxmin unless the table actually contains HOT chains with |
||||
RECENTLY_DEAD members. (In 8.4 we may be able to improve the situation, |
||||
at least for non-serializable transactions, because we expect to be able to |
||||
advance TransactionXmin intratransaction.) |
||||
|
||||
Another unpleasant consequence is that it is now risky to use SnapshotAny |
||||
in an index scan: if the index was created more recently than the last |
||||
vacuum, it's possible that some of the visited tuples do not match the |
||||
index entry they are linked to. This does not seem to be a fatal |
||||
objection, since there are few users of SnapshotAny and most use seqscans. |
||||
The only exception at this writing is CLUSTER, which is okay because it |
||||
does not require perfect ordering of the indexscan readout (and especially |
||||
so because CLUSTER tends to write recently-dead tuples out of order anyway). |
||||
|
||||
|
||||
CREATE INDEX CONCURRENTLY |
||||
------------------------- |
||||
|
||||
In the concurrent case we must take a different approach. We create the |
||||
pg_index entry immediately, before we scan the table. The pg_index entry |
||||
is marked as "not ready for inserts". Then we commit and wait for any |
||||
transactions which have the table open to finish. This ensures that no |
||||
new HOT updates will change the key value for our new index, because all |
||||
transactions will see the existence of the index and will respect its |
||||
constraint on which updates can be HOT. Other transactions must include |
||||
such an index when determining HOT-safety of updates, even though they |
||||
must ignore it for both insertion and searching purposes. |
||||
|
||||
We must do this to avoid making incorrect index entries. For example, |
||||
suppose we are building an index on column X and we make an index entry for |
||||
a non-HOT tuple with X=1. Then some other backend, unaware that X is an |
||||
indexed column, HOT-updates the row to have X=2, and commits. We now have |
||||
an index entry for X=1 pointing at a HOT chain whose live row has X=2. |
||||
We could make an index entry with X=2 during the validation pass, but |
||||
there is no nice way to get rid of the wrong entry with X=1. So we must |
||||
have the HOT-safety property enforced before we start to build the new |
||||
index. |
||||
|
||||
After waiting for transactions which had the table open, we build the index |
||||
for all rows that are valid in a fresh snapshot. Any tuples visible in the |
||||
snapshot will have only valid forward-growing HOT chains. (They might have |
||||
older HOT updates behind them which are broken, but this is OK for the same |
||||
reason it's OK in a regular index build.) As above, we point the index |
||||
entry at the root of the HOT-update chain but we use the key value from the |
||||
live tuple. |
||||
|
||||
We mark the index open for inserts (but still not ready for reads) then |
||||
we again wait for transactions which have the table open. Then we take |
||||
a second reference snapshot and validate the index. This searches for |
||||
tuples missing from the index, and inserts any missing ones. Again, |
||||
the index entries have to have TIDs equal to HOT-chain root TIDs, but |
||||
the value to be inserted is the one from the live tuple. |
||||
|
||||
Then we wait until every transaction that could have a snapshot older than |
||||
the second reference snapshot is finished. This ensures that nobody is |
||||
alive any longer who could need to see any tuples that might be missing |
||||
from the index, as well as ensuring that no one can see any inconsistent |
||||
rows in a broken HOT chain (the first condition is stronger than the |
||||
second). Finally, we can mark the index valid for searches. |
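
What other backends must do with the new index in each phase of the concurrent build can be summarized in a few lines of C. ToyCicPhase, ToyIndexDuties and toy_index_duties are hypothetical names for this summary; they do not correspond to actual pg_index columns.

```c
#include <assert.h>
#include <stdbool.h>

/* Phases of a concurrent index build, as described in this section. */
typedef enum ToyCicPhase
{
    TOY_CIC_NOT_READY,          /* catalog entry exists, no inserts yet */
    TOY_CIC_READY_FOR_INSERTS,  /* open for inserts, not for searches */
    TOY_CIC_VALID               /* usable for searches as well */
} ToyCicPhase;

typedef struct ToyIndexDuties
{
    bool        hot_safety;     /* count its columns when judging HOT-safety */
    bool        insert;         /* add entries on insert and cold update */
    bool        search;         /* planner may use it for scans */
} ToyIndexDuties;

ToyIndexDuties
toy_index_duties(ToyCicPhase phase)
{
    ToyIndexDuties duties;

    assert(phase <= TOY_CIC_VALID);

    /* HOT-safety must respect the index from the moment it is visible. */
    duties.hot_safety = true;
    duties.insert = (phase >= TOY_CIC_READY_FOR_INSERTS);
    duties.search = (phase == TOY_CIC_VALID);
    return duties;
}
```

The waits between phases exist so that every transaction still running has seen the current phase before the build relies on it.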
||||
|
||||
|
||||
Limitations and Restrictions |
||||
---------------------------- |
||||
|
||||
It is worth noting that HOT forever forecloses alternative approaches |
||||
to vacuuming, specifically the recompute-the-index-keys approach alluded |
||||
to in Technical Challenges above. It'll be tough to recompute the index |
||||
keys for a root line pointer you don't have data for anymore ... |
||||
|
||||
|
||||
Glossary |
||||
-------- |
||||
|
||||
Broken HOT Chain |
||||
|
||||
A HOT chain in which the key value for an index has changed. |
||||
|
||||
This is not allowed to occur normally but if a new index is created |
||||
it can happen. In that case various strategies are used to ensure |
||||
that no transaction for which the older tuples are visible can |
||||
use the index. |
||||
|
||||
Cold update |
||||
|
||||
A normal, non-HOT update, in which index entries are made for |
||||
the new version of the tuple. |
||||
|
||||
Dead line pointer |
||||
|
||||
A stub line pointer, that does not point to anything, but cannot |
||||
be removed or reused yet because there are index pointers to it. |
||||
Semantically the same as a dead tuple. It has state LP_DEAD.
||||
|
||||
Heap-only tuple |
||||
|
||||
A heap tuple with no index pointers, which can only be reached |
||||
from indexes indirectly through its ancestral root tuple. |
||||
Marked with HEAP_ONLY_TUPLE flag. |
||||
|
||||
HOT-safe |
||||
|
||||
A proposed tuple update is said to be HOT-safe if it changes |
||||
none of the tuple's indexed columns. It will only become an |
||||
actual HOT update if we can find room on the same page for |
||||
the new tuple version. |
||||
|
||||
HOT update |
||||
|
||||
An UPDATE where the new tuple becomes a heap-only tuple, and no |
||||
new index entries are made. |
||||
|
||||
HOT-updated tuple |
||||
|
||||
An updated tuple, for which the next tuple in the chain is a |
||||
heap-only tuple. Marked with HEAP_HOT_UPDATED flag. |
||||
|
||||
Indexed column |
||||
|
||||
A column used in an index definition. The column might not |
||||
actually be stored in the index --- it could be used in a |
||||
functional index's expression, or used in a partial index |
||||
predicate. HOT treats all these cases alike. |
||||
|
||||
Redirecting line pointer |
||||
|
||||
A line pointer that points to another line pointer and has no |
||||
associated tuple. It has the special lp_flags state LP_REDIRECT, |
||||
and lp_off is the OffsetNumber of the line pointer it links to. |
||||
This is used when a root tuple becomes dead but we cannot prune |
||||
the line pointer because there are non-dead heap-only tuples |
||||
further down the chain. |
||||
|
||||
Root tuple |
||||
|
||||
The first tuple in a HOT update chain; the one that indexes point to. |
||||
|
||||
Update chain |
||||
|
||||
A chain of updated tuples, in which each tuple's ctid points to |
||||
the next tuple in the chain. A HOT update chain is an update chain |
||||
(or portion of an update chain) that consists of a root tuple and |
||||
one or more heap-only tuples. A complete update chain can contain |
||||
both HOT and non-HOT (cold) updated tuples. |
@@ -0,0 +1,702 @@
||||
/*-------------------------------------------------------------------------
|
||||
* |
||||
* pruneheap.c |
||||
* heap page pruning and HOT-chain management code |
||||
* |
||||
* Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group |
||||
* Portions Copyright (c) 1994, Regents of the University of California |
||||
* |
||||
* |
||||
* IDENTIFICATION |
||||
* $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.1 2007/09/20 17:56:30 tgl Exp $ |
||||
* |
||||
*------------------------------------------------------------------------- |
||||
*/ |
||||
#include "postgres.h" |
||||
|
||||
#include "access/heapam.h" |
||||
#include "access/transam.h" |
||||
#include "miscadmin.h" |
||||
#include "pgstat.h" |
||||
#include "utils/inval.h" |
||||
|
||||
|
||||
/* Local functions */ |
||||
static int heap_prune_chain(Relation relation, Buffer buffer, |
||||
OffsetNumber rootoffnum, |
||||
TransactionId OldestXmin, |
||||
OffsetNumber *redirected, int *nredirected, |
||||
OffsetNumber *nowdead, int *ndead, |
||||
OffsetNumber *nowunused, int *nunused, |
||||
bool redirect_move); |
||||
static void heap_prune_record_redirect(OffsetNumber *redirected, |
||||
int *nredirected, |
||||
OffsetNumber offnum, |
||||
OffsetNumber rdoffnum); |
||||
static void heap_prune_record_dead(OffsetNumber *nowdead, int *ndead, |
||||
OffsetNumber offnum); |
||||
static void heap_prune_record_unused(OffsetNumber *nowunused, int *nunused, |
||||
OffsetNumber offnum); |
||||
|
||||
|
||||
/*
|
||||
* Optionally prune and repair fragmentation in the specified page. |
||||
* |
||||
* This is an opportunistic function. It will perform housekeeping |
||||
* only if the page heuristically looks like a candidate for pruning and we |
||||
* can acquire buffer cleanup lock without blocking. |
||||
* |
||||
* Note: this is called quite often. It's important that it fall out quickly |
||||
* if there's not any use in pruning. |
||||
* |
||||
* Caller must have pin on the buffer, and must *not* have a lock on it. |
||||
* |
||||
* OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD |
||||
* or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum). |
||||
*/ |
||||
void |
||||
heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin) |
||||
{ |
||||
PageHeader dp = (PageHeader) BufferGetPage(buffer); |
||||
Size minfree; |
||||
|
||||
/*
|
||||
* Let's see if we really need pruning. |
||||
* |
||||
* Forget it if page is not hinted to contain something prunable |
||||
*/ |
||||
if (!PageIsPrunable(dp)) |
||||
return; |
||||
|
||||
/*
|
||||
* We prune when a previous UPDATE failed to find enough space on the |
||||
* page for a new tuple version, or when free space falls below the |
||||
* relation's fill-factor target (but not less than 10%). |
||||
* |
||||
* Checking free space here is questionable since we aren't holding |
||||
* any lock on the buffer; in the worst case we could get a bogus |
||||
* answer. It's unlikely to be *seriously* wrong, though, since |
||||
* reading either pd_lower or pd_upper is probably atomic. Avoiding |
||||
* taking a lock seems better than sometimes getting a wrong answer |
||||
* in what is after all just a heuristic estimate. |
||||
*/ |
||||
minfree = RelationGetTargetPageFreeSpace(relation, |
||||
HEAP_DEFAULT_FILLFACTOR); |
||||
minfree = Max(minfree, BLCKSZ / 10); |
||||
|
||||
if (PageIsFull(dp) || PageGetHeapFreeSpace((Page) dp) < minfree) |
||||
{ |
||||
/* OK, try to get exclusive buffer lock */ |
||||
if (!ConditionalLockBufferForCleanup(buffer)) |
||||
return; |
||||
|
||||
/*
|
||||
* Now that we have buffer lock, get accurate information about the |
||||
* page's free space, and recheck the heuristic about whether to prune. |
||||
*/ |
||||
if (PageIsFull(dp) || PageGetHeapFreeSpace((Page) dp) < minfree) |
||||
{ |
||||
/* OK to prune (though not to remove redirects) */ |
||||
(void) heap_page_prune(relation, buffer, OldestXmin, false, true); |
||||
} |
||||
|
||||
/* And release buffer lock */ |
||||
LockBuffer(buffer, BUFFER_LOCK_UNLOCK); |
||||
} |
||||
} |
||||
|
||||
|
||||
/*
 * Prune and repair fragmentation in the specified page.
 *
 * Caller must have pin and buffer cleanup lock on the page.
 *
 * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
 * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
 *
 * If redirect_move is set, we remove redirecting line pointers by
 * updating the root line pointer to point directly to the first non-dead
 * tuple in the chain. NOTE: eliminating the redirect changes the first
 * tuple's effective CTID, and is therefore unsafe except within VACUUM FULL.
 * The only reason we support this capability at all is that by using it,
 * VACUUM FULL need not cope with LP_REDIRECT items at all; which seems a
 * good thing since VACUUM FULL is overly complicated already.
 *
 * If report_stats is true then we send the number of reclaimed heap-only
 * tuples to pgstats. (This must be FALSE during vacuum, since vacuum will
 * send its own new total to pgstats, and we don't want this delta applied
 * on top of that.)
 *
 * Returns the number of tuples deleted from the page.
 */
int
heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
                bool redirect_move, bool report_stats)
{
    int          ndeleted = 0;
    Page         page = BufferGetPage(buffer);
    OffsetNumber offnum,
                 maxoff;
    OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
    OffsetNumber nowdead[MaxHeapTuplesPerPage];
    OffsetNumber nowunused[MaxHeapTuplesPerPage];
    int          nredirected = 0;
    int          ndead = 0;
    int          nunused = 0;

    START_CRIT_SECTION();

    /*
     * Mark the page as clear of prunable tuples. If we find a tuple which
     * may soon become prunable, we shall set the hint again. Also clear
     * the "page is full" flag, since there's no point in repeating the
     * prune/defrag process until something else happens to the page.
     */
    PageClearPrunable(page);
    PageClearFull(page);

    /* Scan the page */
    maxoff = PageGetMaxOffsetNumber(page);
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        /* Nothing to do if slot is empty or already dead */
        if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
            continue;

        /* Process this item or chain of items */
        ndeleted += heap_prune_chain(relation, buffer, offnum,
                                     OldestXmin,
                                     redirected, &nredirected,
                                     nowdead, &ndead,
                                     nowunused, &nunused,
                                     redirect_move);
    }

    /* Have we pruned any items? */
    if (nredirected > 0 || ndead > 0 || nunused > 0)
    {
        /*
         * Repair page fragmentation, and update the page's hint bit about
         * whether it has free line pointers.
         */
        PageRepairFragmentation((Page) page);

        MarkBufferDirty(buffer);

        /*
         * Emit a WAL HEAP_CLEAN or HEAP_CLEAN_MOVE record showing what we did
         */
        if (!relation->rd_istemp)
        {
            XLogRecPtr  recptr;

            recptr = log_heap_clean(relation, buffer,
                                    redirected, nredirected,
                                    nowdead, ndead,
                                    nowunused, nunused,
                                    redirect_move);
            PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
            PageSetLSN(BufferGetPage(buffer), recptr);
        }
    }

    END_CRIT_SECTION();

    /*
     * If requested, report the number of tuples reclaimed to pgstats.
     * This is ndeleted minus ndead, because we don't want to count a now-DEAD
     * root item as a deletion for this purpose.
     */
    if (report_stats && ndeleted > ndead)
        pgstat_update_heap_dead_tuples(relation, ndeleted - ndead);

    /*
     * XXX Should we update the FSM information of this page ?
     *
     * There are two schools of thought here. We may not want to update
     * FSM information so that the page is not used for unrelated
     * UPDATEs/INSERTs and any free space in this page will remain
     * available for further UPDATEs in *this* page, thus improving
     * chances for doing HOT updates.
     *
     * But for a large table and where a page does not receive further
     * UPDATEs for a long time, we might waste this space by not
     * updating the FSM information. The relation may get extended and
     * fragmented further.
     *
     * One possibility is to leave "fillfactor" worth of space in this
     * page and update FSM with the remaining space.
     *
     * In any case, the current FSM implementation doesn't accept
     * one-page-at-a-time updates, so this is all academic for now.
     */

    return ndeleted;
}


/*
 * Prune specified item pointer or a HOT chain originating at that item.
 *
 * If the item is an index-referenced tuple (i.e. not a heap-only tuple),
 * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
 * chain. We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
 * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
 * DEAD, the OldestXmin test is just too coarse to detect it.
 *
 * The root line pointer is redirected to the tuple immediately after the
 * latest DEAD tuple. If all tuples in the chain are DEAD, the root line
 * pointer is marked LP_DEAD. (This includes the case of a DEAD simple
 * tuple, which we treat as a chain of length 1.)
 *
 * OldestXmin is the cutoff XID used to identify dead tuples.
 *
 * Redirected items are added to the redirected[] array (two entries per
 * redirection); items set to LP_DEAD state are added to nowdead[]; and
 * items set to LP_UNUSED state are added to nowunused[]. (These arrays
 * will be used to generate a WAL record after all chains are pruned.)
 *
 * If redirect_move is true, we get rid of redirecting line pointers.
 *
 * Returns the number of tuples deleted from the page.
 */
static int
heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
                 TransactionId OldestXmin,
                 OffsetNumber *redirected, int *nredirected,
                 OffsetNumber *nowdead, int *ndead,
                 OffsetNumber *nowunused, int *nunused,
                 bool redirect_move)
{
    int             ndeleted = 0;
    Page            dp = (Page) BufferGetPage(buffer);
    TransactionId   priorXmax = InvalidTransactionId;
    ItemId          rootlp;
    HeapTupleHeader htup;
    OffsetNumber    latestdead = InvalidOffsetNumber,
                    maxoff = PageGetMaxOffsetNumber(dp),
                    offnum;
    OffsetNumber    chainitems[MaxHeapTuplesPerPage];
    int             nchain = 0,
                    i;

    rootlp = PageGetItemId(dp, rootoffnum);

    /*
     * If it's a heap-only tuple, then it is not the start of a HOT chain.
     */
    if (ItemIdIsNormal(rootlp))
    {
        htup = (HeapTupleHeader) PageGetItem(dp, rootlp);
        if (HeapTupleHeaderIsHeapOnly(htup))
        {
            /*
             * If the tuple is DEAD and doesn't chain to anything else, mark it
             * unused immediately. (If it does chain, we can only remove it as
             * part of pruning its chain.)
             *
             * We need this primarily to handle aborted HOT updates, that is,
             * XMIN_INVALID heap-only tuples. Those might not be linked to
             * by any chain, since the parent tuple might be re-updated before
             * any pruning occurs. So we have to be able to reap them
             * separately from chain-pruning.
             *
             * Note that we might first arrive at a dead heap-only tuple
             * either here or while following a chain below. Whichever path
             * gets there first will mark the tuple unused.
             */
            if (HeapTupleSatisfiesVacuum(htup, OldestXmin, buffer)
                == HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
            {
                ItemIdSetUnused(rootlp);
                heap_prune_record_unused(nowunused, nunused, rootoffnum);
                ndeleted++;
            }

            /* Nothing more to do */
            return ndeleted;
        }
    }

    /* Start from the root tuple */
    offnum = rootoffnum;

    /* while not end of the chain */
    for (;;)
    {
        ItemId      lp;
        bool        tupdead,
                    recent_dead;

        /* Some sanity checks */
        if (offnum < FirstOffsetNumber || offnum > maxoff)
            break;

        lp = PageGetItemId(dp, offnum);

        if (!ItemIdIsUsed(lp))
            break;

        /*
         * If we are looking at the redirected root line pointer,
         * jump to the first normal tuple in the chain. If we find
         * a redirect somewhere else, stop --- it must not be same chain.
         */
        if (ItemIdIsRedirected(lp))
        {
            if (nchain > 0)
                break;          /* not at start of chain */
            chainitems[nchain++] = offnum;
            offnum = ItemIdGetRedirect(rootlp);
            continue;
        }

        /*
         * Likewise, a dead item pointer can't be part of the chain.
         * (We already eliminated the case of dead root tuple outside
         * this function.)
         */
        if (ItemIdIsDead(lp))
            break;

        Assert(ItemIdIsNormal(lp));
        htup = (HeapTupleHeader) PageGetItem(dp, lp);

        /*
         * Check the tuple XMIN against prior XMAX, if any
         */
        if (TransactionIdIsValid(priorXmax) &&
            !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
            break;

        /*
         * OK, this tuple is indeed a member of the chain.
         */
        chainitems[nchain++] = offnum;

        /*
         * Check tuple's visibility status.
         */
        tupdead = recent_dead = false;

        switch (HeapTupleSatisfiesVacuum(htup, OldestXmin, buffer))
        {
            case HEAPTUPLE_DEAD:
                tupdead = true;
                break;

            case HEAPTUPLE_RECENTLY_DEAD:
                recent_dead = true;
                /*
                 * This tuple may soon become DEAD. Re-set the hint bit so
                 * that the page is reconsidered for pruning in future.
                 */
                PageSetPrunable(dp);
                break;

            case HEAPTUPLE_DELETE_IN_PROGRESS:
                /*
                 * This tuple may soon become DEAD. Re-set the hint bit so
                 * that the page is reconsidered for pruning in future.
                 */
                PageSetPrunable(dp);
                break;

            case HEAPTUPLE_LIVE:
            case HEAPTUPLE_INSERT_IN_PROGRESS:
                /*
                 * If we wanted to optimize for aborts, we might consider
                 * marking the page prunable when we see INSERT_IN_PROGRESS.
                 * But we don't. See related decisions about when to mark
                 * the page prunable in heapam.c.
                 */
                break;

            default:
                elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
                break;
        }

        /*
         * Remember the last DEAD tuple seen. We will advance past
         * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
         * but we can't advance past anything else. (XXX is it really worth
         * continuing to scan beyond RECENTLY_DEAD? The case where we will
         * find another DEAD tuple is a fairly unusual corner case.)
         */
        if (tupdead)
            latestdead = offnum;
        else if (!recent_dead)
            break;

        /*
         * If the tuple is not HOT-updated, then we are at the end of this
         * HOT-update chain.
         */
        if (!HeapTupleHeaderIsHotUpdated(htup))
            break;

        /*
         * Advance to next chain member.
         */
        Assert(ItemPointerGetBlockNumber(&htup->t_ctid) ==
               BufferGetBlockNumber(buffer));
        offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
        priorXmax = HeapTupleHeaderGetXmax(htup);
    }

    /*
     * If we found a DEAD tuple in the chain, adjust the HOT chain so that all
     * the DEAD tuples at the start of the chain are removed and the root line
     * pointer is appropriately redirected.
     */
    if (OffsetNumberIsValid(latestdead))
    {
        /*
         * Mark as unused each intermediate item that we are able to remove
         * from the chain.
         *
         * When the previous item is the last dead tuple seen, we are at
         * the right candidate for redirection.
         */
        for (i = 1; (i < nchain) && (chainitems[i - 1] != latestdead); i++)
        {
            ItemId      lp = PageGetItemId(dp, chainitems[i]);

            ItemIdSetUnused(lp);
            heap_prune_record_unused(nowunused, nunused, chainitems[i]);
            ndeleted++;
        }

        /*
         * If the root entry had been a normal tuple, we are deleting it,
         * so count it in the result. But changing a redirect (even to
         * DEAD state) doesn't count.
         */
        if (ItemIdIsNormal(rootlp))
            ndeleted++;

        /*
         * If the DEAD tuple is at the end of the chain, the entire chain is
         * dead and the root line pointer can be marked dead. Otherwise
         * just redirect the root to the correct chain member.
         */
        if (i >= nchain)
        {
            ItemIdSetDead(rootlp);
            heap_prune_record_dead(nowdead, ndead, rootoffnum);
        }
        else
        {
            ItemIdSetRedirect(rootlp, chainitems[i]);
            heap_prune_record_redirect(redirected, nredirected,
                                       rootoffnum,
                                       chainitems[i]);
        }
    }
    else if (nchain < 2 && ItemIdIsRedirected(rootlp))
    {
        /*
         * We found a redirect item that doesn't point to a valid follow-on
         * item. This can happen if the loop in heap_page_prune caused us
         * to visit the dead successor of a redirect item before visiting
         * the redirect item. We can clean up by setting the redirect item
         * to DEAD state.
         */
        ItemIdSetDead(rootlp);
        heap_prune_record_dead(nowdead, ndead, rootoffnum);
    }

    /*
     * If requested, eliminate LP_REDIRECT items by moving tuples. Note that
     * if the root item is LP_REDIRECT and doesn't point to a valid follow-on
     * item, we already killed it above.
     */
    if (redirect_move && ItemIdIsRedirected(rootlp))
    {
        OffsetNumber  firstoffnum = ItemIdGetRedirect(rootlp);
        ItemId        firstlp = PageGetItemId(dp, firstoffnum);
        HeapTupleData firsttup;

        Assert(ItemIdIsNormal(firstlp));
        /* Set up firsttup to reference the tuple at its existing CTID */
        firsttup.t_data = (HeapTupleHeader) PageGetItem(dp, firstlp);
        firsttup.t_len = ItemIdGetLength(firstlp);
        ItemPointerSet(&firsttup.t_self,
                       BufferGetBlockNumber(buffer),
                       firstoffnum);
        firsttup.t_tableOid = RelationGetRelid(relation);

        /*
         * Mark the tuple for invalidation. Needed because we're changing
         * its CTID.
         */
        CacheInvalidateHeapTuple(relation, &firsttup);

        /*
         * Change heap-only status of the tuple because after the line
         * pointer manipulation, it's no longer a heap-only tuple, but is
         * directly pointed to by index entries.
         */
        Assert(HeapTupleIsHeapOnly(&firsttup));
        HeapTupleClearHeapOnly(&firsttup);

        /* Now move the item pointer */
        *rootlp = *firstlp;
        ItemIdSetUnused(firstlp);

        /*
         * If latestdead is valid, we have already recorded the redirection
         * above. Otherwise, do it now.
         *
         * We don't record firstlp in the nowunused[] array, since the
         * redirection entry is enough to tell heap_xlog_clean what to do.
         */
        if (!OffsetNumberIsValid(latestdead))
            heap_prune_record_redirect(redirected, nredirected, rootoffnum,
                                       firstoffnum);
    }

    return ndeleted;
}


/* Record newly-redirected item pointer */
static void
heap_prune_record_redirect(OffsetNumber *redirected, int *nredirected,
                           OffsetNumber offnum, OffsetNumber rdoffnum)
{
    Assert(*nredirected < MaxHeapTuplesPerPage);
    redirected[*nredirected * 2] = offnum;
    redirected[*nredirected * 2 + 1] = rdoffnum;
    (*nredirected)++;
}

/* Record newly-dead item pointer */
static void
heap_prune_record_dead(OffsetNumber *nowdead, int *ndead,
                       OffsetNumber offnum)
{
    Assert(*ndead < MaxHeapTuplesPerPage);
    nowdead[*ndead] = offnum;
    (*ndead)++;
}

/* Record newly-unused item pointer */
static void
heap_prune_record_unused(OffsetNumber *nowunused, int *nunused,
                         OffsetNumber offnum)
{
    Assert(*nunused < MaxHeapTuplesPerPage);
    nowunused[*nunused] = offnum;
    (*nunused)++;
}


/*
 * For all items in this page, find their respective root line pointers.
 * If item k is part of a HOT-chain with root at item j, then we set
 * root_offsets[k - 1] = j.
 *
 * The passed-in root_offsets array must have MaxHeapTuplesPerPage entries.
 * We zero out all unused entries.
 *
 * The function must be called with at least share lock on the buffer, to
 * prevent concurrent prune operations.
 *
 * Note: The information collected here is valid only as long as the caller
 * holds a pin on the buffer. Once pin is released, a tuple might be pruned
 * and reused by a completely unrelated tuple.
 */
void
heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
{
    OffsetNumber offnum, maxoff;

    MemSet(root_offsets, 0, MaxHeapTuplesPerPage * sizeof(OffsetNumber));

    maxoff = PageGetMaxOffsetNumber(page);
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId          lp = PageGetItemId(page, offnum);
        HeapTupleHeader htup;
        OffsetNumber    nextoffnum;
        TransactionId   priorXmax;

        /* skip unused and dead items */
        if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
            continue;

        if (ItemIdIsNormal(lp))
        {
            htup = (HeapTupleHeader) PageGetItem(page, lp);

            /*
             * Check if this tuple is part of a HOT-chain rooted at some other
             * tuple. If so, skip it for now; we'll process it when we find
             * its root.
             */
            if (HeapTupleHeaderIsHeapOnly(htup))
                continue;

            /*
             * This is either a plain tuple or the root of a HOT-chain.
             * Remember it in the mapping.
             */
            root_offsets[offnum - 1] = offnum;

            /* If it's not the start of a HOT-chain, we're done with it */
            if (!HeapTupleHeaderIsHotUpdated(htup))
                continue;

            /* Set up to scan the HOT-chain */
            nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
            priorXmax = HeapTupleHeaderGetXmax(htup);
        }
        else
        {
            /* Must be a redirect item. We do not set its root_offsets entry */
            Assert(ItemIdIsRedirected(lp));
            /* Set up to scan the HOT-chain */
            nextoffnum = ItemIdGetRedirect(lp);
            priorXmax = InvalidTransactionId;
        }

        /*
         * Now follow the HOT-chain and collect other tuples in the chain.
         *
         * Note: Even though this is a nested loop, the complexity of the
         * function is O(N) because a tuple in the page should be visited not
         * more than twice, once in the outer loop and once in HOT-chain
         * chases.
         */
        for (;;)
        {
            lp = PageGetItemId(page, nextoffnum);

            /* Check for broken chains */
            if (!ItemIdIsNormal(lp))
                break;

            htup = (HeapTupleHeader) PageGetItem(page, lp);

            if (TransactionIdIsValid(priorXmax) &&
                !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
                break;

            /* Remember the root line pointer for this item */
            root_offsets[nextoffnum - 1] = offnum;

            /* Advance to next chain member, if any */
            if (!HeapTupleHeaderIsHotUpdated(htup))
                break;

            nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
            priorXmax = HeapTupleHeaderGetXmax(htup);
        }
    }
}
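
The root-offset mapping that heap_get_root_tuples builds can be tried outside the backend. What follows is a minimal standalone sketch, not PostgreSQL code: SimplePage, SimpleItem, set_item, build_demo_page, get_root_offsets and MAX_ITEMS are hypothetical stand-ins for the real page layout and macros, and the bounds check on the chain offset is an addition made only for this sketch.

```c
#include <assert.h>
#include <string.h>

#define MAX_ITEMS 8     /* hypothetical stand-in for MaxHeapTuplesPerPage */

typedef unsigned short OffsetNum;   /* 1-based item offset; 0 means "none" */
typedef unsigned int Xid;           /* 0 means "invalid" */

typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_REDIRECT, ITEM_DEAD } ItemState;

typedef struct
{
    ItemState   state;
    int         heap_only;      /* tuple has no index entry of its own */
    int         hot_updated;    /* tuple has a successor on the same page */
    OffsetNum   next;           /* successor (NORMAL) or target (REDIRECT) */
    Xid         xmin;
    Xid         xmax;
} SimpleItem;

typedef struct
{
    int         nitems;
    SimpleItem  items[MAX_ITEMS];
} SimplePage;

/* Fill in one item; offsets are 1-based, as in the real page layout. */
static void
set_item(SimplePage *page, OffsetNum off, ItemState state, int heap_only,
         int hot_updated, OffsetNum next, Xid xmin, Xid xmax)
{
    SimpleItem *item = &page->items[off - 1];

    item->state = state;
    item->heap_only = heap_only;
    item->hot_updated = hot_updated;
    item->next = next;
    item->xmin = xmin;
    item->xmax = xmax;
}

/* Demo page: chain 1 -> 3 -> 5, plain tuple 2, redirect 4 -> 6. */
void
build_demo_page(SimplePage *page)
{
    memset(page, 0, sizeof(*page));
    page->nitems = 6;
    set_item(page, 1, ITEM_NORMAL, 0, 1, 3, 100, 101);
    set_item(page, 2, ITEM_NORMAL, 0, 0, 0, 90, 0);
    set_item(page, 3, ITEM_NORMAL, 1, 1, 5, 101, 102);
    set_item(page, 4, ITEM_REDIRECT, 0, 0, 6, 0, 0);
    set_item(page, 5, ITEM_NORMAL, 1, 0, 0, 102, 0);
    set_item(page, 6, ITEM_NORMAL, 1, 0, 0, 200, 0);
}

/* For each chain member k with root j, set root_offsets[k - 1] = j. */
void
get_root_offsets(const SimplePage *page, OffsetNum *root_offsets)
{
    OffsetNum   offnum;

    assert(page->nitems <= MAX_ITEMS);
    memset(root_offsets, 0, MAX_ITEMS * sizeof(OffsetNum));

    for (offnum = 1; offnum <= page->nitems; offnum++)
    {
        const SimpleItem *item = &page->items[offnum - 1];
        OffsetNum   next = item->next;
        Xid         prior_xmax = 0;     /* a redirect carries no xmax */

        if (item->state == ITEM_UNUSED || item->state == ITEM_DEAD)
            continue;

        if (item->state == ITEM_NORMAL)
        {
            if (item->heap_only)
                continue;       /* recorded when its root is visited */
            root_offsets[offnum - 1] = offnum;
            if (!item->hot_updated)
                continue;
            prior_xmax = item->xmax;
        }

        /* Follow the chain, stopping at the first broken link. */
        while (next >= 1 && next <= page->nitems)
        {
            const SimpleItem *member = &page->items[next - 1];

            if (member->state != ITEM_NORMAL)
                break;
            if (prior_xmax != 0 && member->xmin != prior_xmax)
                break;
            root_offsets[next - 1] = offnum;
            if (!member->hot_updated)
                break;
            prior_xmax = member->xmax;
            next = member->next;
        }
    }
}
```

On the demo page, items 1, 3 and 5 form one chain rooted at item 1, item 2 is a plain tuple that is its own root, and item 6 is reached through the redirect at item 4, whose own entry stays zero, mirroring the "We do not set its root_offsets entry" rule above.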