Make handling of redundant nbtree keys more robust.

nbtree preprocessing's handling of redundant (and contradictory) keys created problems for scans with = arrays. It was just about possible for a scan with an = array key and one or more redundant keys (keys that preprocessing could not eliminate due an incomplete opfamily and a cross-type key) to get stuck. Testing has shown that infinite cycling where the scan never manages to make forward progress was possible. This could happen when the scan's arrays were reset in _bt_readpage's forcenonrequired=true path (added by bugfix commit 5f4d98d4) when the arrays weren't at least advanced up to the same point that they were in at the start of the _bt_readpage call. Earlier redundant keys prevented the finaltup call to _bt_advance_array_keys from reaching lower-order keys that needed to be used to sufficiently advance the scan's arrays. To fix, make preprocessing leave the scan's keys in a state that is as close as possible to how it'll usually leave them (in the common case where there's no redundant keys that preprocessing failed to eliminate). Now nbtree preprocessing _reliably_ leaves behind at most one required >/>= key per index column, and at most one required </<= key per index column. Columns that have one or more = keys that are eligible to be marked required (based on the traditional rules) prioritize the = keys over redundant inequality keys; they'll _reliably_ be left with only one of the = keys as the index column's only required key. Keys that are not marked required (whether due to the new preprocessing step running or for some other reason) are relocated to the end of the so->keyData[] array as needed. That way they'll always be evaluated after the scan's required keys, and so cannot prevent code in places like _bt_advance_array_keys and _bt_first from reaching a required key. Also teach _bt_first to decide which initial positioning keys to use based on the same requiredness markings that have long been used by _bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for reliably avoiding infinite cycling. _bt_advance_array_keys expects to be able to reason about what'll happen in the next _bt_first call should it start another primitive index scan, by evaluating inequality keys that were marked required in the opposite-to-scan scan direction only. Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys) will always agree on which exact key will be used on each index column to start and/or end the scan (except when row compare keys are involved, which have similar problems not addressed by this commit). An upcoming commit will finish off the work started by this commit by harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys apply row compare keys to start and end scans. This fixes what was arguably an oversight in either commit 5f4d98d4 or commit 8a510275. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com Backpatch-through: 18
2 months ago · 7c365eb504
parent 87f0d3cd8d
commit 7c365eb504
3 changed files with 455 additions and 269 deletions
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@ -16,6 +16,7 @@
 #include "postgres.h"
 #include "access/nbtree.h"
 #include "common/int.h"
 #include "lib/qunique.h"
 #include "utils/array.h"
 #include "utils/lsyscache.h"
@ -56,6 +57,8 @@ static void _bt_skiparray_strat_decrement(IndexScanDesc scan, ScanKey arraysk,
 										  BTArrayKeyInfo *array);
 static void _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
 										  BTArrayKeyInfo *array);
 static void _bt_unmark_keys(IndexScanDesc scan, int *keyDataMap);
 static int	_bt_reorder_array_cmp(const void *a, const void *b);
 static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *new_numberOfKeys);
 static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
 static int	_bt_num_array_keys(IndexScanDesc scan, Oid *skip_eq_ops_out,
@ -96,7 +99,7 @@ static int	_bt_compare_array_elements(const void *a, const void *b, void *arg);
 * incomplete sets of cross-type operators, we may fail to detect redundant
 * or contradictory keys, but we can survive that.)
 *
- * The output keys must be sorted by index attribute.  Presently we expect
+ * Required output keys are sorted by index attribute.  Presently we expect
 * (but verify) that the input keys are already so sorted --- this is done
 * by match_clauses_to_index() in indxpath.c.  Some reordering of the keys
 * within each attribute may be done as a byproduct of the processing here.
@ -127,29 +130,36 @@ static int	_bt_compare_array_elements(const void *a, const void *b, void *arg);
 * This has the potential to be much more efficient than a full index scan
 * (though it behaves like a full scan when there's many distinct "x" values).
 *
- * If possible, redundant keys are eliminated: we keep only the tightest
+ * Typically, redundant keys are eliminated: we keep only the tightest
 * >/>= bound and the tightest </<= bound, and if there's an = key then
 * that's the only one returned.  (So, we return either a single = key,
 * or one or two boundary-condition keys for each attr.)  However, if we
 * cannot compare two keys for lack of a suitable cross-type operator,
- * we cannot eliminate either.  If there are two such keys of the same
+ * we cannot eliminate either key.
- * operator strategy, the second one is just pushed into the output array
+ *
- * without further processing here.  We may also emit both >/>= or both
+ * When all redundant keys could not be eliminated, we'll output a key array
- * </<= keys if we can't compare them.  The logic about required keys still
+ * that can more or less be treated as if it had no redundant keys.  Suppose
- * works if we don't eliminate redundant keys.
+ * we have "x > 4::int AND x > 10::bigint AND x < 70", and we are unable to
- *
+ * determine which > key is more restrictive for lack of a suitable cross-type
- * Note that one reason we need direction-sensitive required-key flags is
+ * operator.  We'll arbitrarily pick one of the > keys; the other > key won't
- * precisely that we may not be able to eliminate redundant keys.  Suppose
+ * be marked required.  Obviously, the scan will be less efficient if we
- * we have "x > 4::int AND x > 10::bigint", and we are unable to determine
+ * choose x > 4 over x > 10 -- but it can still largely proceed as if there
- * which key is more restrictive for lack of a suitable cross-type operator.
+ * was only a single > condition.  "x > 10" will be placed at the end of the
- * _bt_first will arbitrarily pick one of the keys to do the initial
+ * so->keyData[] output array.  It'll always be evaluated last, after the keys
- * positioning with.  If it picks x > 4, then the x > 10 condition will fail
+ * that could be marked required in the usual way (after "x > 4 AND x < 70").
- * until we reach index entries > 10; but we can't stop the scan just because
+ * This can sometimes result in so->keyData[] keys that aren't even in index
- * x > 10 is failing.  On the other hand, if we are scanning backwards, then
+ * attribute order (if the qual involves multiple attributes).  The scan's
- * failure of either key is indeed enough to stop the scan.  (In general, when
+ * required keys will still be in attribute order, though, so it can't matter.
- * inequality keys are present, the initial-positioning code only promises to
+ *
- * position before the first possible match, not exactly at the first match,
+ * This scheme ensures that _bt_first always uses the same set of keys at the
- * for a forward scan; or after the last match for a backward scan.)
+ * start of a forwards scan as those _bt_checkkeys uses to determine when to
 * end a similar backwards scan (and vice-versa).  _bt_advance_array_keys
 * depends on this: it expects to be able to reliably predict what the next
 * _bt_first call will do by testing whether _bt_checkkeys' routines report
 * that the final tuple on the page is past the end of matches for the scan's
 * keys with the scan direction flipped.  If it is (if continuescan=false),
 * then it follows that calling _bt_first will, at a minimum, relocate the
 * scan to the very next leaf page (in the current scan direction).
 *
 * As a byproduct of this work, we can detect contradictory quals such
 * as "x = 1 AND x > 2".  If we see that, we return so->qual_ok = false,
@ -188,7 +198,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	int			numberOfEqualCols;
 	ScanKey		inkeys;
 	BTScanKeyPreproc xform[BTMaxStrategyNumber];
-	bool		test_result;
+	bool		test_result,
 				redundant_key_kept = false;
 	AttrNumber	attno;
 	ScanKey		arrayKeyData;
 	int		   *keyDataMap = NULL;
@ -388,7 +399,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 						xform[j].inkey = NULL;
 						xform[j].inkeyi = -1;
 					}
-					/* else, cannot determine redundancy, keep both keys */
+					else
 						redundant_key_kept = true;
 				}
 				/* track number of attrs for which we have "=" keys */
 				numberOfEqualCols++;
@ -409,6 +421,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 					else
 						xform[BTLessStrategyNumber - 1].inkey = NULL;
 				}
 				else
 					redundant_key_kept = true;
 			}
 			/* try to keep only one of >, >= */
@ -426,6 +440,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 					else
 						xform[BTGreaterStrategyNumber - 1].inkey = NULL;
 				}
 				else
 					redundant_key_kept = true;
 			}
 			/*
@ -466,25 +482,6 @@ _bt_preprocess_keys(IndexScanDesc scan)
 		/* check strategy this key's operator corresponds to */
 		j = inkey->sk_strategy - 1;
 		/* if row comparison, push it directly to the output array */
 		if (inkey->sk_flags & SK_ROW_HEADER)
 		{
 			ScanKey		outkey = &so->keyData[new_numberOfKeys++];
 			memcpy(outkey, inkey, sizeof(ScanKeyData));
 			if (arrayKeyData)
 				keyDataMap[new_numberOfKeys - 1] = i;
 			if (numberOfEqualCols == attno - 1)
 				_bt_mark_scankey_required(outkey);
 			/*
 			 * We don't support RowCompare using equality; such a qual would
 			 * mess up the numberOfEqualCols tracking.
 			 */
 			Assert(j != (BTEqualStrategyNumber - 1));
 			continue;
 		}
 		if (inkey->sk_strategy == BTEqualStrategyNumber &&
 			(inkey->sk_flags & SK_SEARCHARRAY))
 		{
@ -593,9 +590,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 				 * the new scan key.
 				 *
 				 * Note: We do things this way around so that our arrays are
-				 * always in the same order as their corresponding scan keys,
+				 * always in the same order as their corresponding scan keys.
-				 * even with incomplete opfamilies.  _bt_advance_array_keys
+				 * _bt_preprocess_array_keys_final expects this.
 				 * depends on this.
 				 */
 				ScanKey		outkey = &so->keyData[new_numberOfKeys++];
@ -607,6 +603,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
 				xform[j].inkey = inkey;
 				xform[j].inkeyi = i;
 				xform[j].arrayidx = arrayidx;
 				redundant_key_kept = true;
 			}
 		}
 	}
@ -622,6 +619,15 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	if (arrayKeyData)
 		_bt_preprocess_array_keys_final(scan, keyDataMap);
 	/*
 	 * If there are remaining redundant inequality keys, we must make sure
 	 * that each index attribute has no more than one required >/>= key, and
 	 * no more than one required </<= key.  Attributes that have one or more
 	 * required = keys now must keep only one required key (the first = key).
 	 */
 	if (unlikely(redundant_key_kept) && so->qual_ok)
 		_bt_unmark_keys(scan, keyDataMap);
 	/* Could pfree arrayKeyData/keyDataMap now, but not worth the cycles */
 }
@ -847,8 +853,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 				cmp_op;
 	StrategyNumber strat;
-	Assert(!((leftarg->sk_flags | rightarg->sk_flags) &
+	Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_MEMBER));
 			 (SK_ROW_HEADER | SK_ROW_MEMBER)));
 	/*
 	 * First, deal with cases where one or both args are NULL.  This should
@ -924,6 +929,16 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 		return true;
 	}
 	/*
 	 * We don't yet know how to determine redundancy when it involves a row
 	 * compare key (barring simple cases involving IS NULL/IS NOT NULL)
 	 */
 	if ((leftarg->sk_flags | rightarg->sk_flags) & SK_ROW_HEADER)
 	{
 		Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
 		return false;
 	}
 	/*
 	 * If either leftarg or rightarg are equality-type array scankeys, we need
 	 * specialized handling (since by now we know that IS NULL wasn't used)
@ -1467,6 +1482,283 @@ _bt_skiparray_strat_increment(IndexScanDesc scan, ScanKey arraysk,
 	}
 }
 /*
 *	_bt_unmark_keys() -- make superfluous required keys nonrequired after all
 *
 * When _bt_preprocess_keys fails to eliminate one or more redundant keys, it
 * calls here to make sure that no index attribute has more than one > or >=
 * key marked required, and no more than one required < or <= key.  Attributes
 * with = keys will always get one = key as their required key.  All other
 * keys that were initially marked required get "unmarked" here.  That way,
 * _bt_first and _bt_checkkeys will reliably agree on which keys to use to
 * start and/or to end the scan.
 *
 * We also relocate keys that become/started out nonrequired to the end of
 * so->keyData[].  That way, _bt_first and _bt_checkkeys cannot fail to reach
 * a required key due to some earlier nonrequired key getting in the way.
 *
 * Only call here when _bt_compare_scankey_args returned false at least once
 * (otherwise, calling here will just waste cycles).
 */
 static void
 _bt_unmark_keys(IndexScanDesc scan, int *keyDataMap)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	AttrNumber	attno;
 	bool	   *unmarkikey;
 	int			nunmark,
 				nunmarked,
 				nkept,
 				firsti;
 	ScanKey		keepKeys,
 				unmarkKeys;
 	FmgrInfo   *keepOrderProcs = NULL,
 			   *unmarkOrderProcs = NULL;
 	bool		haveReqEquals,
 				haveReqForward,
 				haveReqBackward;
 	/*
 	 * Do an initial pass over so->keyData[] that determines which keys to
 	 * keep as required.  We expect so->keyData[] to still be in attribute
 	 * order when we're called (though we don't expect any particular order
 	 * among each attribute's keys).
 	 *
 	 * When both equality and inequality keys remain on a single attribute, we
 	 * *must* make sure that exactly one of the equalities remains required.
 	 * Any requiredness markings that we might leave on later keys/attributes
 	 * are predicated on there being required = keys on all prior columns.
 	 */
 	unmarkikey = palloc0(so->numberOfKeys * sizeof(bool));
 	nunmark = 0;
 	/* Set things up for first key's attribute */
 	attno = so->keyData[0].sk_attno;
 	firsti = 0;
 	haveReqEquals = false;
 	haveReqForward = false;
 	haveReqBackward = false;
 	for (int i = 0; i < so->numberOfKeys; i++)
 	{
 		ScanKey		origkey = &so->keyData[i];
 		if (origkey->sk_attno != attno)
 		{
 			/* Reset for next attribute */
 			attno = origkey->sk_attno;
 			firsti = i;
 			haveReqEquals = false;
 			haveReqForward = false;
 			haveReqBackward = false;
 		}
 		/* Equalities get priority over inequalities */
 		if (haveReqEquals)
 		{
 			/*
 			 * We already found the first "=" key for this attribute.  We've
 			 * already decided that all its other keys will be unmarked.
 			 */
 			Assert(!(origkey->sk_flags & SK_SEARCHNULL));
 			unmarkikey[i] = true;
 			nunmark++;
 			continue;
 		}
 		else if ((origkey->sk_flags & SK_BT_REQFWD) &&
 				 (origkey->sk_flags & SK_BT_REQBKWD))
 		{
 			/*
 			 * Found the first "=" key for attno.  All other attno keys will
 			 * be unmarked.
 			 */
 			Assert(origkey->sk_strategy == BTEqualStrategyNumber);
 			haveReqEquals = true;
 			for (int j = firsti; j < i; j++)
 			{
 				/* Unmark any prior inequality keys on attno after all */
 				if (!unmarkikey[j])
 				{
 					unmarkikey[j] = true;
 					nunmark++;
 				}
 			}
 			continue;
 		}
 		/* Deal with inequalities next */
 		if ((origkey->sk_flags & SK_BT_REQFWD) && !haveReqForward)
 		{
 			haveReqForward = true;
 			continue;
 		}
 		else if ((origkey->sk_flags & SK_BT_REQBKWD) && !haveReqBackward)
 		{
 			haveReqBackward = true;
 			continue;
 		}
 		/*
 		 * We have either a redundant inequality key that will be unmarked, or
 		 * we have a key that wasn't marked required in the first place
 		 */
 		unmarkikey[i] = true;
 		nunmark++;
 	}
 	/* Should only be called when _bt_compare_scankey_args reported failure */
 	Assert(nunmark > 0);
 	/*
 	 * Next, allocate temp arrays: one for required keys that'll remain
 	 * required, the other for all remaining keys
 	 */
 	unmarkKeys = palloc(nunmark * sizeof(ScanKeyData));
 	keepKeys = palloc((so->numberOfKeys - nunmark) * sizeof(ScanKeyData));
 	nunmarked = 0;
 	nkept = 0;
 	if (so->numArrayKeys)
 	{
 		unmarkOrderProcs = palloc(nunmark * sizeof(FmgrInfo));
 		keepOrderProcs = palloc((so->numberOfKeys - nunmark) * sizeof(FmgrInfo));
 	}
 	/*
 	 * Next, copy the contents of so->keyData[] into the appropriate temp
 	 * array.
 	 *
 	 * Scans with = array keys need us to maintain invariants around the order
 	 * of so->orderProcs[] and so->arrayKeys[] relative to so->keyData[].  See
 	 * _bt_preprocess_array_keys_final for a full explanation.
 	 */
 	for (int i = 0; i < so->numberOfKeys; i++)
 	{
 		ScanKey		origkey = &so->keyData[i];
 		ScanKey		unmark;
 		if (!unmarkikey[i])
 		{
 			/*
 			 * Key gets to keep its original requiredness markings.
 			 *
 			 * Key will stay in its original position, unless we're going to
 			 * unmark an earlier key (in which case this key gets moved back).
 			 */
 			memcpy(keepKeys + nkept, origkey, sizeof(ScanKeyData));
 			if (so->numArrayKeys)
 			{
 				keyDataMap[i] = nkept;
 				memcpy(keepOrderProcs + nkept, &so->orderProcs[i],
 					   sizeof(FmgrInfo));
 			}
 			nkept++;
 			continue;
 		}
 		/*
 		 * Key will be unmarked as needed, and moved to the end of the array,
 		 * next to other keys that will become (or always were) nonrequired
 		 */
 		unmark = unmarkKeys + nunmarked;
 		memcpy(unmark, origkey, sizeof(ScanKeyData));
 		if (so->numArrayKeys)
 		{
 			keyDataMap[i] = (so->numberOfKeys - nunmark) + nunmarked;
 			memcpy(&unmarkOrderProcs[nunmarked], &so->orderProcs[i],
 				   sizeof(FmgrInfo));
 		}
 		/*
 		 * Preprocessing only generates skip arrays when it knows that they'll
 		 * be the only required = key on the attr.  We'll never unmark them.
 		 */
 		Assert(!(unmark->sk_flags & SK_BT_SKIP));
 		/*
 		 * Also shouldn't have to unmark an IS NULL or an IS NOT NULL key.
 		 * They aren't cross-type, so an incomplete opfamily can't matter.
 		 */
 		Assert(!(unmark->sk_flags & SK_ISNULL) ||
 			   !(unmark->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)));
 		/* Clear requiredness flags on redundant key (and on any subkeys) */
 		unmark->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
 		if (unmark->sk_flags & SK_ROW_HEADER)
 		{
 			ScanKey		subkey = (ScanKey) DatumGetPointer(unmark->sk_argument);
 			Assert(subkey->sk_strategy == unmark->sk_strategy);
 			for (;;)
 			{
 				Assert(subkey->sk_flags & SK_ROW_MEMBER);
 				subkey->sk_flags &= ~(SK_BT_REQFWD | SK_BT_REQBKWD);
 				if (subkey->sk_flags & SK_ROW_END)
 					break;
 				subkey++;
 			}
 		}
 		nunmarked++;
 	}
 	/* Copy both temp arrays back into so->keyData[] to reorder */
 	Assert(nkept == so->numberOfKeys - nunmark);
 	Assert(nunmarked == nunmark);
 	memcpy(so->keyData, keepKeys, sizeof(ScanKeyData) * nkept);
 	memcpy(so->keyData + nkept, unmarkKeys, sizeof(ScanKeyData) * nunmarked);
 	/* Done with temp arrays */
 	pfree(unmarkikey);
 	pfree(keepKeys);
 	pfree(unmarkKeys);
 	/*
 	 * Now copy so->orderProcs[] temp entries needed by scans with = array
 	 * keys back (just like with the so->keyData[] temp arrays)
 	 */
 	if (so->numArrayKeys)
 	{
 		memcpy(so->orderProcs, keepOrderProcs, sizeof(FmgrInfo) * nkept);
 		memcpy(so->orderProcs + nkept, unmarkOrderProcs,
 			   sizeof(FmgrInfo) * nunmarked);
 		/* Also fix-up array->scan_key references */
 		for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
 		{
 			BTArrayKeyInfo *array = &so->arrayKeys[arridx];
 			array->scan_key = keyDataMap[array->scan_key];
 		}
 		/*
 		 * Sort so->arrayKeys[] based on its new BTArrayKeyInfo.scan_key
 		 * offsets, so that its order matches so->keyData[] order as expected
 		 */
 		qsort(so->arrayKeys, so->numArrayKeys, sizeof(BTArrayKeyInfo),
 			  _bt_reorder_array_cmp);
 		/* Done with temp arrays */
 		pfree(unmarkOrderProcs);
 		pfree(keepOrderProcs);
 	}
 }
 /*
 * qsort comparator for reordering so->arrayKeys[] BTArrayKeyInfo entries
 */
 static int
 _bt_reorder_array_cmp(const void *a, const void *b)
 {
 	BTArrayKeyInfo *arraya = (BTArrayKeyInfo *) a;
 	BTArrayKeyInfo *arrayb = (BTArrayKeyInfo *) b;
 	return pg_cmp_s32(arraya->scan_key, arrayb->scan_key);
 }
 /*
 *	_bt_preprocess_array_keys() -- Preprocess SK_SEARCHARRAY scan keys
 *
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@ -960,46 +960,51 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 * The selected scan keys (at most one per index column) are remembered by
 	 * storing their addresses into the local startKeys[] array.  The final
 	 * startKeys[] entry's strategy is set in strat_total. (Actually, there
 	 * are a couple of cases where we force a less/more restrictive strategy.)
 	 *
-	 * We want to identify the keys that can be used as starting boundaries;
+	 * We must use the key that was marked required (in the direction opposite
-	 * these are =, >, or >= keys for a forward scan or =, <, <= keys for
+	 * our own scan's) during preprocessing.  Each index attribute can only
-	 * a backwards scan.  We can use keys for multiple attributes so long as
+	 * have one such required key.  In general, the keys that we use to find
-	 * the prior attributes had only =, >= (resp. =, <=) keys.  Once we accept
+	 * an initial position when scanning forwards are the same keys that end
-	 * a > or < boundary or find an attribute with no boundary (which can be
+	 * the scan on the leaf level when scanning backwards (and vice-versa).
 	 * thought of as the same as "> -infinity"), we can't use keys for any
 	 * attributes to its right, because it would break our simplistic notion
 	 * of what initial positioning strategy to use.
 	 *
 	 * When the scan keys include cross-type operators, _bt_preprocess_keys
-	 * may not be able to eliminate redundant keys; in such cases we will
+	 * may not be able to eliminate redundant keys; in such cases it will
-	 * arbitrarily pick a usable one for each attribute.  This is correct
+	 * arbitrarily pick a usable key for each attribute (and scan direction),
-	 * but possibly not optimal behavior.  (For example, with keys like
+	 * ensuring that there is no more than one key required in each direction.
-	 * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
+	 * We stop considering further keys once we reach the first nonrequired
-	 * x=5 would be more efficient.)  Since the situation only arises given
+	 * key (which must come after all required keys), so this can't affect us.
-	 * a poorly-worded query plus an incomplete opfamily, live with it.
+	 *
 	 * The required keys that we use as starting boundaries have to be =, >,
 	 * or >= keys for a forward scan or =, <, <= keys for a backwards scan.
 	 * We can use keys for multiple attributes so long as the prior attributes
 	 * had only =, >= (resp. =, <=) keys.  These rules are very similar to the
 	 * rules that preprocessing used to determine which keys to mark required.
 	 * We cannot always use every required key as a positioning key, though.
 	 * Skip arrays necessitate independently applying our own rules here.
 	 * Skip arrays are always generally considered = array keys, but we'll
 	 * nevertheless treat them as inequalities at certain points of the scan.
 	 * When that happens, it _might_ have implications for the number of
 	 * required keys that we can safely use for initial positioning purposes.
 	 *
-	 * When both equality and inequality keys appear for a single attribute
+	 * For example, a forward scan with a skip array on its leading attribute
-	 * (again, only possible when cross-type operators appear), we *must*
+	 * (with no low_compare/high_compare) will have at least two required scan
-	 * select one of the equality keys for the starting point, because
+	 * keys, but we won't use any of them as boundary keys during the scan's
-	 * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
+	 * initial call here.  Our positioning key during the first call here can
-	 * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
+	 * be thought of as representing "> -infinity".  Similarly, if such a skip
-	 * start at x=4, we will fail and stop before reaching x=10.  If multiple
+	 * array's low_compare is "a > 'foo'", then we position using "a > 'foo'"
-	 * equality quals survive preprocessing, however, it doesn't matter which
+	 * during the scan's initial call here; a lower-order key such as "b = 42"
-	 * one we use --- by definition, they are either redundant or
+	 * can't be used until the "a" array advances beyond MINVAL/low_compare.
 	 * contradictory.
 	 *
-	 * In practice we rarely see any "attribute boundary key gaps" here.
+	 * On the other hand, if such a skip array's low_compare was "a >= 'foo'",
-	 * Preprocessing can usually backfill skip array keys for any attributes
+	 * then we _can_ use "a >= 'foo' AND b = 42" during the initial call here.
-	 * that were omitted from the original scan->keyData[] input keys.  All
+	 * A subsequent call here might have us use "a = 'fop' AND b = 42".  Note
-	 * array keys are always considered = keys, but we'll sometimes need to
+	 * that we treat = and >= as equivalent when scanning forwards (just as we
-	 * treat the current key value as if we were using an inequality strategy.
+	 * treat = and <= as equivalent when scanning backwards).  We effectively
-	 * This happens with range skip arrays, which store inequality keys in the
+	 * do the same thing (though with a distinct "a" element/value) each time.
 	 * array's low_compare/high_compare fields (used to find the first/last
 	 * set of matches, when = key will lack a usable sk_argument value).
 	 * These are always preferred over any redundant "standard" inequality
 	 * keys on the same column (per the usual rule about preferring = keys).
 	 * Note also that any column with an = skip array key can never have an
 	 * additional, contradictory = key.
 	 *
 	 * All keys (with the exception of SK_SEARCHNULL keys and SK_BT_SKIP
 	 * array keys whose array is "null_elem=true") imply a NOT NULL qualifier.
@ -1014,18 +1019,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * first (leftmost) columns.  We'll add on lower-order columns of the row
 	 * comparison below, if possible.
 	 *
-	 * The selected scan keys (at most one per index column) are remembered by
+	 * _bt_advance_array_keys needs to know exactly how we'll reposition the
-	 * storing their addresses into the local startKeys[] array.
+	 * scan (should it opt to schedule another primitive index scan).  It is
-	 *
+	 * critical that primscans only be scheduled when they'll definitely make
-	 * _bt_checkkeys/_bt_advance_array_keys decide whether and when to start
+	 * some useful progress.  _bt_advance_array_keys does this by calling
-	 * the next primitive index scan (for scans with array keys) based in part
+	 * _bt_checkkeys routines that report whether a tuple is past the end of
-	 * on an understanding of how it'll enable us to reposition the scan.
+	 * matches for the scan's keys (given the scan's current array elements).
-	 * They're directly aware of how we'll sometimes cons up an explicit
+	 * If the page's final tuple is "after the end of matches" for a scan that
-	 * SK_SEARCHNOTNULL key.  They'll even end primitive scans by applying a
+	 * uses the *opposite* scan direction, then it must follow that it's also
-	 * symmetric "deduce NOT NULL" rule of their own.  This allows top-level
+	 * "before the start of matches" for the actual current scan direction.
-	 * scans to skip large groups of NULLs through repeated deductions about
+	 * It is therefore essential that all of our initial positioning rules are
-	 * key strictness (for a required inequality key) and whether NULLs in the
+	 * symmetric with _bt_checkkeys's corresponding continuescan=false rule.
 	 * key's index column are stored last or first (relative to non-NULLs).
 	 * If you update anything here, _bt_checkkeys/_bt_advance_array_keys might
 	 * need to be kept in sync.
 	 *----------
@ -1034,18 +1038,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (so->numberOfKeys > 0)
 	{
 		AttrNumber	curattr;
-		ScanKey		chosen;
+		ScanKey		bkey;
 		ScanKey		impliesNN;
 		ScanKey		cur;
 		/*
-		 * chosen is the so-far-chosen key for the current attribute, if any.
+		 * bkey will be set to the key that preprocessing left behind as the
-		 * We don't cast the decision in stone until we reach keys for the
+		 * boundary key for this attribute, in this scan direction (if any)
 		 * next attribute.
 		 */
 		cur = so->keyData;
 		curattr = 1;
-		chosen = NULL;
+		bkey = NULL;
 		/* Also remember any scankey that implies a NOT NULL constraint */
 		impliesNN = NULL;
@ -1058,23 +1061,29 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		{
 			if (i >= so->numberOfKeys || cur->sk_attno != curattr)
 			{
 				/* Done looking for the curattr boundary key */
 				Assert(bkey == NULL ||
 					   (bkey->sk_attno == curattr &&
 						(bkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))));
 				Assert(impliesNN == NULL ||
 					   (impliesNN->sk_attno == curattr &&
 						(impliesNN->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))));
 				/*
 				 * Done looking at keys for curattr.
 				 *
 				 * If this is a scan key for a skip array whose current
 				 * element is MINVAL, choose low_compare (when scanning
 				 * backwards it'll be MAXVAL, and we'll choose high_compare).
 				 *
-				 * Note: if the array's low_compare key makes 'chosen' NULL,
+				 * Note: if the array's low_compare key makes 'bkey' NULL,
 				 * then we behave as if the array's first element is -inf,
 				 * except when !array->null_elem implies a usable NOT NULL
 				 * constraint.
 				 */
-				if (chosen != NULL &&
+				if (bkey != NULL &&
-					(chosen->sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)))
+					(bkey->sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)))
 				{
-					int			ikey = chosen - so->keyData;
+					int			ikey = bkey - so->keyData;
-					ScanKey		skipequalitykey = chosen;
+					ScanKey		skipequalitykey = bkey;
 					BTArrayKeyInfo *array = NULL;
 					for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
@ -1087,35 +1096,35 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (ScanDirectionIsForward(dir))
 					{
 						Assert(!(skipequalitykey->sk_flags & SK_BT_MAXVAL));
-						chosen = array->low_compare;
+						bkey = array->low_compare;
 					}
 					else
 					{
 						Assert(!(skipequalitykey->sk_flags & SK_BT_MINVAL));
-						chosen = array->high_compare;
+						bkey = array->high_compare;
 					}
-					Assert(chosen == NULL ||
+					Assert(bkey == NULL ||
-						   chosen->sk_attno == skipequalitykey->sk_attno);
+						   bkey->sk_attno == skipequalitykey->sk_attno);
 					if (!array->null_elem)
 						impliesNN = skipequalitykey;
 					else
-						Assert(chosen == NULL && impliesNN == NULL);
+						Assert(bkey == NULL && impliesNN == NULL);
 				}
 				/*
 				 * If we didn't find a usable boundary key, see if we can
 				 * deduce a NOT NULL key
 				 */
-				if (chosen == NULL && impliesNN != NULL &&
+				if (bkey == NULL && impliesNN != NULL &&
 					((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
 					 ScanDirectionIsForward(dir) :
 					 ScanDirectionIsBackward(dir)))
 				{
 					/* Yes, so build the key in notnullkeys[keysz] */
-					chosen = &notnullkeys[keysz];
+					bkey = &notnullkeys[keysz];
-					ScanKeyEntryInitialize(chosen,
+					ScanKeyEntryInitialize(bkey,
 										   (SK_SEARCHNOTNULL | SK_ISNULL |
 											(impliesNN->sk_flags &
 											 (SK_BT_DESC | SK_BT_NULLS_FIRST))),
@ -1130,12 +1139,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				}
 				/*
-				 * If we still didn't find a usable boundary key, quit; else
+				 * If preprocessing didn't leave a usable boundary key, quit;
-				 * save the boundary key pointer in startKeys.
+				 * else save the boundary key pointer in startKeys[]
 				 */
-				if (chosen == NULL)
+				if (bkey == NULL)
 					break;
-				startKeys[keysz++] = chosen;
+				startKeys[keysz++] = bkey;
 				/*
 				 * We can only consider adding more boundary keys when the one
@ -1143,7 +1152,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 * (during backwards scans we can only do so when the key that
 				 * we just added to startKeys[] uses the = or <= strategy)
 				 */
-				strat_total = chosen->sk_strategy;
+				strat_total = bkey->sk_strategy;
 				if (strat_total == BTGreaterStrategyNumber ||
 					strat_total == BTLessStrategyNumber)
 					break;
@ -1154,19 +1163,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				 * make strat_total > or < (and stop adding boundary keys).
 				 * This can only happen with opclasses that lack skip support.
 				 */
-				if (chosen->sk_flags & (SK_BT_NEXT | SK_BT_PRIOR))
+				if (bkey->sk_flags & (SK_BT_NEXT | SK_BT_PRIOR))
 				{
-					Assert(chosen->sk_flags & SK_BT_SKIP);
+					Assert(bkey->sk_flags & SK_BT_SKIP);
 					Assert(strat_total == BTEqualStrategyNumber);
 					if (ScanDirectionIsForward(dir))
 					{
-						Assert(!(chosen->sk_flags & SK_BT_PRIOR));
+						Assert(!(bkey->sk_flags & SK_BT_PRIOR));
 						strat_total = BTGreaterStrategyNumber;
 					}
 					else
 					{
-						Assert(!(chosen->sk_flags & SK_BT_NEXT));
+						Assert(!(bkey->sk_flags & SK_BT_NEXT));
 						strat_total = BTLessStrategyNumber;
 					}
@ -1180,24 +1189,30 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				/*
 				 * Done if that was the last scan key output by preprocessing.
-				 * Also done if there is a gap index attribute that lacks a
+				 * Also done if we've now examined all keys marked required.
 				 * usable key (only possible when preprocessing was unable to
 				 * generate a skip array key to "fill in the gap").
 				 */
 				if (i >= so->numberOfKeys ||
-					cur->sk_attno != curattr + 1)
+					!(cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
 					break;
 				/*
 				 * Reset for next attr.
 				 */
 				Assert(cur->sk_attno == curattr + 1);
 				curattr = cur->sk_attno;
-				chosen = NULL;
+				bkey = NULL;
 				impliesNN = NULL;
 			}
 			/*
-			 * Can we use this key as a starting boundary for this attr?
+			 * If we've located the starting boundary key for curattr, we have
 			 * no interest in curattr's other required key
 			 */
 			if (bkey != NULL)
 				continue;
 			/*
 			 * Is this key the starting boundary key for curattr?
 			 *
 			 * If not, does it imply a NOT NULL constraint?  (Because
 			 * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
@ -1207,27 +1222,20 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			{
 				case BTLessStrategyNumber:
 				case BTLessEqualStrategyNumber:
-					if (chosen == NULL)
+					if (ScanDirectionIsBackward(dir))
-					{
+						bkey = cur;
-						if (ScanDirectionIsBackward(dir))
+					else if (impliesNN == NULL)
-							chosen = cur;
+						impliesNN = cur;
 						else
 							impliesNN = cur;
 					}
 					break;
 				case BTEqualStrategyNumber:
-					/* override any non-equality choice */
+					bkey = cur;
 					chosen = cur;
 					break;
 				case BTGreaterEqualStrategyNumber:
 				case BTGreaterStrategyNumber:
-					if (chosen == NULL)
+					if (ScanDirectionIsForward(dir))
-					{
+						bkey = cur;
-						if (ScanDirectionIsForward(dir))
+					else if (impliesNN == NULL)
-							chosen = cur;
+						impliesNN = cur;
 						else
 							impliesNN = cur;
 					}
 					break;
 			}
 		}
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@ -44,7 +44,6 @@ static bool _bt_array_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *arra
 static bool _bt_array_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
 static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 											 bool *skip_array_set);
 static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
 										 IndexTuple tuple, TupleDesc tupdesc, int tupnatts,
 										 bool readpagetup, int sktrig, bool *scanBehind);
@ -52,7 +51,6 @@ static bool _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
 								   IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 								   int sktrig, bool sktrig_required);
 #ifdef USE_ASSERT_CHECKING
 static bool _bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_verify_keys_with_arraykeys(IndexScanDesc scan);
 #endif
 static bool _bt_oppodir_checkkeys(IndexScanDesc scan, ScanDirection dir,
@ -1034,73 +1032,6 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir,
 	return false;
 }
 /*
 * _bt_rewind_nonrequired_arrays() -- Rewind SAOP arrays not marked required
 *
 * Called when _bt_advance_array_keys decides to start a new primitive index
 * scan on the basis of the current scan position being before the position
 * that _bt_first is capable of repositioning the scan to by applying an
 * inequality operator required in the opposite-to-scan direction only.
 *
 * Although equality strategy scan keys (for both arrays and non-arrays alike)
 * are either marked required in both directions or in neither direction,
 * there is a sense in which non-required arrays behave like required arrays.
 * With a qual such as "WHERE a IN (100, 200) AND b >= 3 AND c IN (5, 6, 7)",
 * the scan key on "c" is non-required, but nevertheless enables positioning
 * the scan at the first tuple >= "(100, 3, 5)" on the leaf level during the
 * first descent of the tree by _bt_first.  Later on, there could also be a
 * second descent, that places the scan right before tuples >= "(200, 3, 5)".
 * _bt_first must never be allowed to build an insertion scan key whose "c"
 * entry is set to a value other than 5, the "c" array's first element/value.
 * (Actually, it's the first in the current scan direction.  This example uses
 * a forward scan.)
 *
 * Calling here resets the array scan key elements for the scan's non-required
 * arrays.  This is strictly necessary for correctness in a subset of cases
 * involving "required in opposite direction"-triggered primitive index scans.
 * Not all callers are at risk of _bt_first using a non-required array like
 * this, but advancement always resets the arrays when another primitive scan
 * is scheduled, just to keep things simple.  Array advancement even makes
 * sure to reset non-required arrays during scans that have no inequalities.
 * (Advancement still won't call here when there are no inequalities, though
 * that's just because it's all handled indirectly instead.)
 *
 * Note: _bt_verify_arrays_bt_first is called by an assertion to enforce that
 * everybody got this right.
 *
 * Note: In practice almost all SAOP arrays are marked required during
 * preprocessing (if necessary by generating skip arrays).  It is hardly ever
 * truly necessary to call here, but consistently doing so is simpler.
 */
 static void
 _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	int			arrayidx = 0;
 	for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
 	{
 		ScanKey		cur = so->keyData + ikey;
 		BTArrayKeyInfo *array = NULL;
 		if (!(cur->sk_flags & SK_SEARCHARRAY) ||
 			cur->sk_strategy != BTEqualStrategyNumber)
 			continue;
 		array = &so->arrayKeys[arrayidx++];
 		Assert(array->scan_key == ikey);
 		if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
 			continue;
 		Assert(array->num_elems != -1); /* No non-required skip arrays */
 		_bt_array_set_low_or_high(rel, cur, array,
 								  ScanDirectionIsForward(dir));
 	}
 }
 /*
 * _bt_tuple_before_array_skeys() -- too early to advance required arrays?
 *
@ -1380,8 +1311,6 @@ _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir)
 	 */
 	if (so->needPrimScan)
 	{
 		Assert(_bt_verify_arrays_bt_first(scan, dir));
 		/*
 		 * Flag was set -- must call _bt_first again, which will reset the
 		 * scan's needPrimScan flag
@ -2007,14 +1936,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
 	 */
 	else if (has_required_opposite_direction_only && pstate->finaltup &&
 			 unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
 	{
 		/*
 		 * Make sure that any SAOP arrays that were not marked required by
 		 * preprocessing are reset to their first element for this direction
 		 */
 		_bt_rewind_nonrequired_arrays(scan, dir);
 		goto new_prim_scan;
 	}
 continue_scan:
@ -2045,8 +1967,6 @@ continue_scan:
 		 */
 		so->oppositeDirCheck = has_required_opposite_direction_only;
 		_bt_rewind_nonrequired_arrays(scan, dir);
 		/*
 		 * skip by setting "look ahead" mechanism's offnum for forwards scans
 		 * (backwards scans check scanBehind flag directly instead)
@ -2142,48 +2062,6 @@ end_toplevel_scan:
 }
 #ifdef USE_ASSERT_CHECKING
 /*
 * Verify that the scan's qual state matches what we expect at the point that
 * _bt_start_prim_scan is about to start a just-scheduled new primitive scan.
 *
 * We enforce a rule against non-required array scan keys: they must start out
 * with whatever element is the first for the scan's current scan direction.
 * See _bt_rewind_nonrequired_arrays comments for an explanation.
 */
 static bool
 _bt_verify_arrays_bt_first(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	int			arrayidx = 0;
 	for (int ikey = 0; ikey < so->numberOfKeys; ikey++)
 	{
 		ScanKey		cur = so->keyData + ikey;
 		BTArrayKeyInfo *array = NULL;
 		int			first_elem_dir;
 		if (!(cur->sk_flags & SK_SEARCHARRAY) ||
 			cur->sk_strategy != BTEqualStrategyNumber)
 			continue;
 		array = &so->arrayKeys[arrayidx++];
 		if (((cur->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
 			((cur->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
 			continue;
 		if (ScanDirectionIsForward(dir))
 			first_elem_dir = 0;
 		else
 			first_elem_dir = array->num_elems - 1;
 		if (array->cur_elem != first_elem_dir)
 			return false;
 	}
 	return _bt_verify_keys_with_arraykeys(scan);
 }
 /*
 * Verify that the scan's "so->keyData[]" scan keys are in agreement with
 * its array key state
@ -2194,6 +2072,7 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	int			last_sk_attno = InvalidAttrNumber,
 				arrayidx = 0;
 	bool		nonrequiredseen = false;
 	if (!so->qual_ok)
 		return false;
@ -2217,8 +2096,16 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
 		if (array->num_elems != -1 &&
 			cur->sk_argument != array->elem_values[array->cur_elem])
 			return false;
-		if (last_sk_attno > cur->sk_attno)
+		if (cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD))
-			return false;
+		{
 			if (last_sk_attno > cur->sk_attno)
 				return false;
 			if (nonrequiredseen)
 				return false;
 		}
 		else
 			nonrequiredseen = true;
 		last_sk_attno = cur->sk_attno;
 	}
@ -2551,7 +2438,6 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 		if (!(key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
 		{
 			/* Scan key isn't marked required (corner case) */
 			Assert(!(key->sk_flags & SK_ROW_HEADER));
 			break;				/* unsafe */
 		}
 		if (key->sk_flags & SK_ROW_HEADER)