mirror of https://github.com/postgres/postgres
duplicate keys by letting search go to the left rather than right when an equal key is seen at an upper tree level. Fix poor choice of page split point (leading to insertion failures) that was forced by the chaining logic. Don't store the leftmost key in non-leaf pages, since it's not necessary. Don't create the root page until something is first stored in the index, so an unused index is now 8K not 16K. (It doesn't seem to be as easy to get rid of the metadata page, unfortunately.) Massive cleanup of unreadable code; fix poor, obsolete, and just plain wrong documentation and comments. See src/backend/access/nbtree/README for the gory details.

Branch: REL7_1_STABLE
parent c9537ca88f
commit 9e85183bfc
$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $

This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).

We have made the following changes in order to incorporate their algorithm
into Postgres:

+ The requirement that all btree keys be unique is too onerous,
but the algorithm won't work correctly without it. Fortunately, it is
only necessary that keys be unique on a single tree level, because L&Y
only use the assumption of key uniqueness when re-finding a key in a
parent node (to determine where to insert the key for a split page).
Therefore, we can use the link field to disambiguate multiple
occurrences of the same user key: only one entry in the parent level
will be pointing at the page we had split. (Indeed we need not look at
the real "key" at all, just at the link field.) We can distinguish
items at the leaf level in the same way, by examining their links to
heap tuples; we'd never have two items for the same heap tuple.

+ Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
node. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page. An
insertion that sees that the high key of its target page is equal to
the key to be inserted has a choice whether or not to move right, since
the new key could go on either page. (Currently, we try to find a page
where there is room for the new key without a split.)

+ Lehman and Yao don't require read locks, but assume that in-memory
copies of tree nodes are unshared. Postgres shares in-memory buffers
among backends. As a result, we do page-level read locking on btree
nodes in order to guarantee that no record is modified while we are
examining it. This reduces concurrency but guarantees correct
behavior. An advantage is that when trading in a read lock for a
write lock, we need not re-read the page after getting the write lock.
Since we're also holding a pin on the shared buffer containing the
page, we know that buffer still contains the page and is up-to-date.

+ We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups. A scan in the forward
direction is no problem; we just use the right-sibling pointers that
L&Y require anyway. (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.) To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling". (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link. This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from.

+ Read locks on a page are held for as long as a scan has a pointer
to the page. However, locks are always surrendered before the
sibling page lock is acquired (for readers), so we remain deadlock-
free. I will do a formal proof if I get bored anytime soon.
NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
on the current page of a scan before control leaves nbtree. When we
come back to resume the scan, we have to re-grab the read lock and
then move right if the current item moved (see _bt_restscan()).

+ Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
root in the same way that any other page would be split, then construct
a new root page holding pointers to both of the resulting pages (which
now become siblings on level 2 of the tree). The new root page is then
installed by altering the root pointer in the meta-data page (see
below). This works because the root is not treated specially in any
other way --- in particular, searches will move right using its link
pointer if the link is set. Therefore, searches will find the data
that's been moved into the right sibling even if they read the metadata
page before it got updated. This is the same reasoning that makes a
split of a non-root page safe. The locking considerations are similar too.

+ Lehman and Yao assume fixed-size keys, but we must deal with
variable-size keys. Therefore there is not a fixed maximum number of
keys per page; we just stuff in as many as will fit. When we split a
page, we try to equalize the number of bytes, not items, assigned to
each of the resulting pages. Note we must include the incoming item in
this calculation, otherwise it is possible to find that the incoming
item doesn't fit on the split page where it needs to go!

In addition, the following things are handy to know:

+ Page zero of every btree is a meta-data page. This page stores
the location of the root page, a pointer to a list of free
pages, and other stuff that's handy to know. (Currently, we
never shrink btree indexes so there are never any free pages.)

+ The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items). Therefore it's unsafe
to accept items larger than 1/3rd page size. Larger items would
work sometimes, but could cause failures later on depending on
what else gets put on their page.

+ This algorithm doesn't guarantee btree consistency after a kernel crash
or hardware failure. To do that, we'd need ordered writes, and UNIX
doesn't support ordered writes (short of fsync'ing every update, which
is too high a price). Rebuilding corrupted indexes during restart
seems more attractive.

+ On deletions, we need to adjust the position of active scans on
the index. The code in nbtscan.c handles this. We don't need to
do this for insertions or splits because _bt_restscan can find the
new position of the previously-found item. NOTE that nbtscan.c
only copes with deletions issued by the current backend. This
essentially means that concurrent deletions are not supported, but
that's true already in the Lehman and Yao algorithm. nbtscan.c
exists only to support VACUUM and allow it to delete items while
it's scanning the index.

Notes about data representation:

+ The right-sibling link required by L&Y is kept in the page "opaque
data" area, as is the left-sibling link and some flags.

+ We also keep a parent link in the opaque data, but this link is not
very trustworthy because it is not updated when the parent page splits.
Thus, it points to some page on the parent level, but possibly a page
well to the left of the page's actual current parent. In most cases
we do not need this link at all. Normally we return to a parent page
using a stack of entries that are made as we descend the tree, as in L&Y.
There is exactly one case where the stack will not help: concurrent
root splits. If an inserter process needs to split what had been the
root when it started its descent, but finds that that page is no longer
the root (because someone else split it meanwhile), then it uses the
parent link to move up to the next level. This is OK because we do fix
the parent link in a former root page when splitting it. This logic
will work even if the root is split multiple times (even up to creation
of multiple new levels) before an inserter returns to it. The same
could not be said of finding the new root via the metapage, since that
would work only for a single level of added root.

+ The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.

+ On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused. A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

+ On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.

+ On a non-leaf page, the data items are down-links to child pages with
bounding keys. The key in each data item is the *lower* bound for
keys on that child page, so logically the key is to the left of that
downlink. The high key (if present) is the upper bound for the last
downlink. The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer. The comparison
routines must treat it accordingly. The actual key stored in the
item is irrelevant, and need not be stored at all. This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.

Notes to operator class implementors:

+ With this implementation, we require the user to supply us with
a procedure for pg_amproc. This procedure should take two keys
A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
respectively. See the contents of that relation for the btree
access method for some samples.
