src/backend/access/gist/README

GiST Indexing
=============

This directory contains an implementation of GiST indexing for Postgres.

GiST stands for Generalized Search Tree. It was introduced in the seminal paper
"Generalized Search Trees for Database Systems", 1995, Joseph M. Hellerstein,
Jeffrey F. Naughton, Avi Pfeffer:

    http://www.sai.msu.su/~megera/postgres/gist/papers/gist.ps

and implemented by J. Hellerstein and P. Aoki in an early version of
PostgreSQL (more details are available from The GiST Indexing Project
at Berkeley at http://gist.cs.berkeley.edu/). As a "university"
project it had a limited number of features and was rarely used.

The current implementation of GiST supports:

  * Variable length keys
  * Composite keys (multi-key)
  * Ordered search (nearest-neighbor search)
  * NULL-safe interface to GiST core
  * Concurrency
  * Recovery support via WAL logging
The support for concurrency implemented in PostgreSQL was developed based on
the paper "Access Methods for Next-Generation Database Systems" by
Marcel Kornacker:

    http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz

The original algorithms were modified in several ways:

  * They had to be adapted to PostgreSQL conventions. For example, the SEARCH
    algorithm was considerably changed, because in PostgreSQL the search
    function should return one tuple (next), not all tuples at once. Also, it
    should release page locks between calls.
  * Since we added support for variable length keys, it's not possible to
    guarantee enough free space for all keys on pages after splitting. The
    user-defined function picksplit has no information about the size of
    tuples (each tuple may contain several keys, as in a multicolumn index,
    while picksplit works with only one key) or of pages.
  * We modified the original INSERT algorithm for performance reasons. In
    particular, it is now a single-pass algorithm.
  * Since the papers were theoretical, some details were omitted and we
    had to work out ourselves how to solve some specific problems.

For the above reasons, we have revised the interaction of the GiST
core and the PostgreSQL WAL system. Moreover, we encountered (and solved)
the problem of incomplete insertions when recovering after a crash, which
was not addressed in the paper.
Search Algorithm
----------------

The search code maintains a queue of unvisited items, where an "item" is
either a heap tuple known to satisfy the search conditions, or an index
page that is consistent with the search conditions according to inspection
of its parent page's downlink item. Initially the root page is searched
to find unvisited items in it. Then we pull items from the queue. A
heap tuple pointer is just returned immediately; an index page entry
causes that page to be searched, generating more queue entries.

The queue is kept ordered with heap tuple items at the front, then
index page entries, with any newly-added index page entry inserted
before existing index page entries. This ensures depth-first traversal
of the index, and in particular causes the first few heap tuples to be
returned as soon as possible. That is helpful in case there is a LIMIT
that requires only a few tuples to be produced.

To implement nearest-neighbor search, the queue entries are augmented
with distance data: heap tuple entries are labeled with exact distance
from the search argument, while index-page entries must be labeled with
the minimum distance that any of their children could have. Then,
queue entries are retrieved in smallest-distance-first order, with
entries having identical distances managed as stated in the previous
paragraph.
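
The queue ordering described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the PostgreSQL
code: the names SearchQueue, push, pop and drain are hypothetical, and a
binary heap with a counter-based tie-breaker stands in for whatever structure
the real implementation uses. Plain (non-ordered) searches simply leave every
distance at zero.

```python
import heapq
import itertools

HEAP_TUPLE, INDEX_PAGE = 0, 1  # heap tuples sort ahead of index pages


class SearchQueue:
    """Toy model of the scan queue (hypothetical, for illustration only)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, kind, item, distance=0.0):
        n = next(self._counter)
        # Among index pages at equal distance, the newest entry must come
        # first (depth-first), so its tie-breaker is the negated counter.
        tiebreak = -n if kind == INDEX_PAGE else n
        heapq.heappush(self._heap, (distance, kind, tiebreak, item))

    def pop(self):
        distance, kind, _, item = heapq.heappop(self._heap)
        return kind, item, distance

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Pop everything and return the items in retrieval order."""
    order = []
    while len(queue):
        order.append(queue.pop()[1])
    return order
```

Pushing two index pages and then a heap tuple, all at the same distance,
yields the heap tuple first, followed by the most recently added page. An
index page whose minimum distance is smaller than a heap tuple's exact
distance is retrieved before that tuple.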
The search algorithm keeps an index page locked only long enough to scan
its entries and queue those that satisfy the search conditions. Since
insertions can occur concurrently with searches, it is possible for an
index child page to be split between the time we make a queue entry for it
(while visiting its parent page) and the time we actually reach and scan
the child page. To avoid missing the entries that were moved to the right
sibling, we detect whether a split has occurred by comparing the child
page's NSN to the LSN that the parent had when visited. If a split has
occurred, the sibling page is immediately added to the front of the queue,
ensuring that its items will be scanned in the same order as if they were
still on the original child page.
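
The NSN/LSN comparison can be illustrated with a small Python sketch. It is
a simplified model, not the PostgreSQL code: the Page class and the function
pages_to_scan are hypothetical names, LSNs are plain integers, and the sketch
collects the right siblings eagerly in a list instead of pushing them onto the
front of the scan queue one at a time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    name: str
    nsn: int = 0                          # LSN recorded at this page's last split
    rightlink: Optional["Page"] = None    # right sibling, if any


def pages_to_scan(child: Page, parent_lsn: int) -> List[Page]:
    """Return the child plus any right siblings split off after the
    parent was read at parent_lsn."""
    pages = [child]
    page = child
    # A page whose NSN is newer than the parent LSN we remembered was split
    # after we looked at the parent, so its right sibling may hold entries
    # that the downlink we saw was supposed to cover.
    while page.nsn > parent_lsn and page.rightlink is not None:
        page = page.rightlink
        pages.append(page)
    return pages
```

For example, if the parent was read at LSN 100 and the child was split at
LSN 150, both the child and its right sibling are scanned. If the parent was
read at LSN 200, after the split, only the child is scanned, because the
parent already carried a downlink for the sibling.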
As is usual in Postgres, the search algorithm only guarantees to find index
entries that existed before the scan started; index entries added during
the scan might or might not be visited. This is okay as long as all
searches use MVCC snapshot rules to reject heap tuples newer than the time
of scan start. In particular, this means that we need not worry about
cases where a parent page's downlink key is "enlarged" after we look at it.
Any such enlargement would be to add child items that we aren't interested
in returning anyway.
Insert Algorithm
----------------

INSERT guarantees that the GiST tree remains balanced. The user-defined key
method Penalty is used for choosing a subtree to insert into; method PickSplit
is used for the node splitting algorithm; method Union is used for propagating
changes upward to maintain the tree properties.
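
As a rough illustration of how Penalty and Union cooperate, the following
Python sketch uses one-dimensional intervals (lo, hi) as keys. Everything
here is an assumption made for the example: the function names and signatures
are hypothetical and do not match the C support functions that an operator
class actually provides, and "growth in interval length" is just one possible
penalty.

```python
def union(keys):
    """Smallest interval covering every interval in keys."""
    return (min(lo for lo, _ in keys), max(hi for _, hi in keys))


def penalty(existing, new):
    """How much the existing bounding interval must grow to admit new."""
    lo, hi = union([existing, new])
    return (hi - lo) - (existing[1] - existing[0])


def choose_subtree(downlinks, new):
    """Index of the downlink whose bounding key grows the least."""
    return min(range(len(downlinks)), key=lambda i: penalty(downlinks[i], new))
```

With downlinks [(0, 10), (20, 30)], the key (11, 12) goes to the first
subtree (growth 2 versus 9), while (22, 25) goes to the second (growth 0).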
NOTICE: We modified the original INSERT algorithm for performance reasons. In
particular, it is now a single-pass algorithm.

Function findLeaf is used to identify the subtree for insertion. The page on
which the insertion proceeds is locked, as is its parent page. Functions
findParent and findPath are used to find parent pages, which could have
changed because of concurrent access. Function pageSplit is recursive and can
split a page into more than two pages, which may be necessary if keys have
different lengths or more than one key is inserted (in such a situation, the
user-defined function pickSplit cannot guarantee free space on the page).
findLeaf(new-key)
    push(stack, [root, 0]) //page, LSN
    while(true)
        ptr = top of stack
        latch( ptr->page, S-mode )
        ptr->lsn = ptr->page->lsn
        if ( exists ptr->parent AND ptr->parent->lsn < ptr->page->nsn )
            unlatch( ptr->page )
            pop stack
        else if ( ptr->page is not leaf )
            push( stack, [get_best_child(ptr->page, new-key), 0] )
            unlatch( ptr->page )
        else
            unlatch( ptr->page )
            latch( ptr->page, X-mode )
            if ( ptr->page is not leaf )
                //only the root page can become a non-leaf
                unlatch( ptr->page )
            else if ( ptr->parent->lsn < ptr->page->nsn )
                unlatch( ptr->page )
                pop stack
            else
                return stack
            end
        end
    end

findPath( stack item )
    push stack, [root, 0, 0] // page, LSN, parent
    while( stack )
        ptr = top of stack
        latch( ptr->page, S-mode )
        if ( ptr->parent->page->lsn < ptr->page->nsn )
            push stack, [ ptr->page->rightlink, 0, ptr->parent ]
        end
        for( each tuple on page )
            if ( tuple->pagepointer == item->page )
                return stack
            else
                add to stack at the end [tuple->pagepointer,0, ptr]
            end
        end
        unlatch( ptr->page )
        pop stack
    end

findParent( stack item )
    parent = item->parent
    latch( parent->page, X-mode )
    if ( parent->page->lsn != parent->lsn )
        while(true)
            search parent tuple on parent->page, if found then return
            rightlink = parent->page->rightlink
            unlatch( parent->page )
            if ( rightlink is incorrect )
                break loop
            end
            parent->page = rightlink
            latch( parent->page, X-mode )
        end
        newstack = findPath( item->parent )
        replace part of stack with new one
        return findParent( item )
    end
pageSplit(page, allkeys)
    (lkeys, rkeys) = pickSplit( allkeys )
    if ( page is root )
        lpage = new page
    else
        lpage = page
    end
    rpage = new page
    if ( no space left on rpage )
        newkeys = pageSplit( rpage, rkeys )
    else
        push newkeys, union(rkeys)
    end
    if ( no space left on lpage )
        push newkeys, pageSplit( lpage, lkeys )
    else
        push newkeys, union(lkeys)
    end
    return newkeys
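
To show why the recursion in pageSplit can produce more than two pages, here
is a small Python sketch under simplifying assumptions. Keys are strings whose
size is their length, a page holds at most "capacity" characters, the
hypothetical pick_split default (halve) just cuts the key list in the middle,
and the hypothetical union default (bounds) returns the (min, max) pair. None
of these names exist in PostgreSQL, and page allocation, locking and WAL are
left out entirely.

```python
def halve(keys):
    """Stand-in pickSplit: cut the key list in the middle."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid:]


def bounds(keys):
    """Stand-in union: smallest and largest key of the group."""
    return (min(keys), max(keys))


def page_split(keys, capacity, pick_split=halve, union=bounds, size=len):
    """Split keys into groups that each fit on a page; return one
    bounding (downlink) key per resulting page."""
    if len(keys) < 2:
        raise ValueError("a single key must fit on a page by itself")
    downlinks = []
    for group in pick_split(keys):
        if sum(size(k) for k in group) > capacity:
            # pick_split knows nothing about sizes, so a half may still
            # overflow; split that half again.
            downlinks.extend(page_split(group, capacity, pick_split, union, size))
        else:
            downlinks.append(union(group))
    return downlinks
```

With a capacity of 10 and the keys "aaaaaaaa", "b", "cc", "ddddddd", "e",
"ffffff", both halves of the first split still overflow, so each is split
again and four downlink keys are returned.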
placetopage(page, keysarray)
    if ( no space left on page )
        keysarray = pageSplit(page, [ extract_keys(page), keysarray])
        last page in chain gets old NSN,
        original and others - new NSN equal to LSN
        if ( page is root )
            make new root with keysarray
        end
    else
        put keysarray on page
        if ( length of keysarray > 1 )
            keysarray = [ union(keysarray) ]
        end
    end

insert(new-key)
    stack = findLeaf(new-key)
    keysarray = [new-key]
    ptr = top of stack
    while(true)
        findParent( ptr ) //findParent latches parent page
        keysarray = placetopage(ptr->page, keysarray)
        unlatch( ptr->page )
        pop stack;
        ptr = top of stack
        if (length of keysarray == 1)
            newboundingkey = union(oldboundingkey, keysarray)
            if (newboundingkey == oldboundingkey)
                unlatch ptr->page
                break loop
            end
        end
    end
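
The early exit at the end of insert (stop as soon as the bounding key is
unchanged) can be modeled in isolation. The sketch below is an assumption-laden
toy: keys are one-dimensional intervals, path_keys is a hypothetical list of
the downlink keys on the path from the leaf's parent up to the root, and page
splits, latching and the stack itself are ignored.

```python
def propagate(path_keys, new_key):
    """Enlarge bounding keys bottom-up; return how many levels changed.

    path_keys is modified in place. Propagation stops at the first level
    whose bounding key already covers the key coming from below.
    """
    updated = 0
    key = new_key
    for level, old in enumerate(path_keys):
        enlarged = (min(old[0], key[0]), max(old[1], key[1]))
        if enlarged == old:
            break                  # nothing changed, ancestors are fine too
        path_keys[level] = enlarged
        key = enlarged
        updated += 1
    return updated
```

For a path [(5, 10), (0, 20), (0, 100)] and the new key (3, 4), only the
lowest bounding key changes, to (3, 10); the next level already covers it,
so the loop ends without touching the remaining ancestors.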
Authors:
    Teodor Sigaev <teodor@sigaev.ru>
    Oleg Bartunov <oleg@sai.msu.su>