Collect and use multi-column dependency stats

Follow on patch in the multi-variate statistics patch series. CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t; ANALYZE; will collect dependency stats on (a, b) and then use the measured dependency in subsequent query planning. Commit 7b504eb282 added CREATE STATISTICS with n-distinct coefficients. These are now specified using the mutually exclusive option WITH (ndistinct). Author: Tomas Vondra, David Rowley Reviewed-by: Kyotaro HORIGUCHI, Álvaro Herrera, Dean Rasheed, Robert Haas and many other comments and contributions Discussion: https://postgr.es/m/56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com
9 years ago · 2686ee1b7c
parent 00b6b6feb1
commit 2686ee1b7c
31 changed files with 2035 additions and 79 deletions
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@ -1013,6 +1013,7 @@ estimate_size(PlannerInfo *root, RelOptInfo *baserel,
 							   baserel->baserestrictinfo,
 							   0,
 							   JOIN_INNER,
 							   NULL,
 							   NULL);
 	nrows = clamp_row_est(nrows);
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@ -591,6 +591,7 @@ postgresGetForeignRelSize(PlannerInfo *root,
 													 fpinfo->local_conds,
 													 baserel->relid,
 													 JOIN_INNER,
 													 NULL,
 													 NULL);
 	cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, root);
@ -2572,6 +2573,7 @@ estimate_path_cost_size(PlannerInfo *root,
 										   local_param_join_conds,
 										   foreignrel->relid,
 										   JOIN_INNER,
 										   NULL,
 										   NULL);
 		local_sel *= fpinfo->local_conds_sel;
@ -4455,6 +4457,7 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 													 fpinfo->local_conds,
 													 0,
 													 JOIN_INNER,
 													 NULL,
 													 NULL);
 	cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, root);
@ -4465,7 +4468,7 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	if (!fpinfo->use_remote_estimate)
 		fpinfo->joinclause_sel = clauselist_selectivity(root, fpinfo->joinclauses,
 														0, fpinfo->jointype,
-														extra->sjinfo);
+														extra->sjinfo, NULL);
 	/* Estimate costs for bare join relation */
 	estimate_path_cost_size(root, joinrel, NIL, NIL, &rows,
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@ -4339,6 +4339,15 @@
      </entry>
     </row>
     <row>
      <entry><structfield>stadependencies</structfield></entry>
      <entry><type>pg_dependencies</type></entry>
      <entry></entry>
      <entry>
       Functional dependencies, serialized as <structname>pg_dependencies</> type.
      </entry>
     </row>
    </tbody>
   </tgroup>
  </table>
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@ -446,6 +446,160 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
   in <filename>src/backend/utils/adt/selfuncs.c</filename>.
  </para>
  <sect2 id="functional-dependencies">
   <title>Functional Dependencies</title>
   <para>
    The simplest type of extended statistics are functional dependencies,
    used in definitions of database normal forms. When simplified, saying that
    <literal>b</> is functionally dependent on <literal>a</> means that
    knowledge of value of <literal>a</> is sufficient to determine value of
    <literal>b</>.
   </para>
   <para>
    In normalized databases, only functional dependencies on primary keys
    and superkeys are allowed. However, in practice, many data sets are not
    fully normalized, for example, due to intentional denormalization for
    performance reasons.
   </para>
   <para>
    Functional dependencies directly affect accuracy of the estimates, as
    conditions on the dependent column(s) do not restrict the result set,
    resulting in underestimates.
   </para>
   <para>
    To inform the planner about the functional dependencies, we collect
    measurements of dependency during <command>ANALYZE</>. Assessing
    dependency between all sets of columns would be prohibitively
    expensive, so we limit our search to potential dependencies defined
    using the <command>CREATE STATISTICS</> command.
 <programlisting>
 CREATE TABLE t (a INT, b INT);
 INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
 CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
 ANALYZE t;
 EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
                                          QUERY PLAN                                             
 -------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
 Planning time: 0.367 ms
 Execution time: 3.380 ms
 (5 rows)
 </programlisting>
    The planner is now aware of the functional dependencies and considers
    them when computing the selectivity of the second condition.  Running
    the query without the statistics would lead to quite different estimates.
 <programlisting>
 DROP STATISTICS s1;
 EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
                                          QUERY PLAN                                           
 -----------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
 Planning time: 0.000 ms
 Execution time: 6.379 ms
 (5 rows)
 </programlisting>
   </para>
   <para>
    If no dependency exists, the collected statistics do not influence the
    query plan. The only effect is to slow down <command>ANALYZE</>. Should
    partial dependencies exist these will also be stored and applied
    during planning.
   </para>
   <para>
    Similarly to per-column statistics, extended statistics are stored in
    a system catalog called <structname>pg_statistic_ext</structname>, but
    there is also a more convenient view <structname>pg_stats_ext</structname>.
    To inspect the statistics <literal>s1</literal> defined above,
    you may do this:
 <programlisting>
 SELECT tablename, staname, attnums, depsbytes
  FROM pg_stats_ext WHERE staname = 's1';
 tablename | staname | attnums | depsbytes 
 -----------+---------+---------+-----------
 t         | s1      | 1 2     |        40
 (1 row)
 </programlisting>
     This shows that the statistics are defined on table <structname>t</>,
     <structfield>attnums</structfield> lists attribute numbers of columns
     (references <structname>pg_attribute</structname>). It also shows
     the length in bytes of the functional dependencies, as found by
     <command>ANALYZE</> when serialized into a <literal>bytea</> column.
   </para>
   <para>
    When computing the selectivity, the planner inspects all conditions and
    attempts to identify which conditions are already implied by other
    conditions.  The selectivity estimates from any redundant conditions are
    ignored from a selectivity point of view. In the example query above,
    the selectivity estimates for either of the conditions may be eliminated,
    thus improving the overall estimate.
   </para>
    <sect3 id="functional-dependencies-limitations">
     <title>Limitations of functional dependencies</title>
     <para>
      Functional dependencies are a very simple type of statistics, and
      as such have several limitations. The first limitation is that they
      only work with simple equality conditions, comparing columns and constant
      values. It's not possible to use them to eliminate equality conditions
      comparing two columns or a column to an expression, range clauses,
      <literal>LIKE</> or any other type of conditions.
     </para>
     <para>
      When eliminating the implied conditions, the planner assumes that the
      conditions are compatible. Consider the following example, violating
      this assumption:
 <programlisting>
 EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
                                          QUERY PLAN
 -----------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 10))
   Rows Removed by Filter: 10000
 Planning time: 0.232 ms
 Execution time: 3.033 ms
 (5 rows)
 </programlisting>
      While there are no rows with such combination of values, the planner
      is unable to verify whether the values match - it only knows that
      the columns are functionally dependent.
     </para>
     <para>
      This assumption is more about queries executed on the database - in many
      cases, it's actually satisfied (e.g. when the GUI only allows selecting
      compatible values). But if that's not the case, functional dependencies
      may not be a viable option.
     </para>
     <para>
      For additional information about functional dependencies, see
      <filename>src/backend/statistics/README.dependencies</>.
     </para>
    </sect3>
  </sect2>
 </sect1>
 </chapter>
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@ -21,8 +21,9 @@ PostgreSQL documentation
 <refsynopsisdiv>
 <synopsis>
-CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_name</replaceable> ON (
+CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_name</replaceable>
-  <replaceable class="PARAMETER">column_name</replaceable>, <replaceable class="PARAMETER">column_name</replaceable> [, ...])
+  WITH ( <replaceable class="PARAMETER">option</replaceable> [= <replaceable class="PARAMETER">value</replaceable>] [, ... ] )
  ON ( <replaceable class="PARAMETER">column_name</replaceable>, <replaceable class="PARAMETER">column_name</replaceable> [, ...])
  FROM <replaceable class="PARAMETER">table_name</replaceable>
 </synopsis>
@ -94,6 +95,41 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
  </variablelist>
  <refsect2 id="SQL-CREATESTATISTICS-parameters">
   <title id="SQL-CREATESTATISTICS-parameters-title">Parameters</title>
 <indexterm zone="sql-createstatistics-parameters">
  <primary>statistics parameters</primary>
 </indexterm>
   <para>
    The <literal>WITH</> clause can specify <firstterm>options</>
    for the statistics. Available options are listed below.
   </para>
   <variablelist>
   <varlistentry>
    <term><literal>dependencies</> (<type>boolean</>)</term>
    <listitem>
     <para>
      Enables functional dependencies for the statistics.
     </para>
    </listitem>
   </varlistentry>
   <varlistentry>
    <term><literal>ndistinct</> (<type>boolean</>)</term>
    <listitem>
     <para>
      Enables ndistinct coefficients for the statistics.
     </para>
    </listitem>
   </varlistentry>
   </variablelist>
  </refsect2>
 </refsect1>
 <refsect1>
@ -122,7 +158,7 @@ CREATE TABLE t1 (
 INSERT INTO t1 SELECT i/100, i/500
                 FROM generate_series(1,1000000) s(i);
-CREATE STATISTICS s1 ON (a, b) FROM t1;
+CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t1;
 ANALYZE t1;
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@ -192,7 +192,8 @@ CREATE VIEW pg_stats_ext AS
        C.relname AS tablename,
        S.staname AS staname,
        S.stakeys AS attnums,
-        length(s.standistinct) AS ndistbytes
+        length(s.standistinct::bytea) AS ndistbytes,
        length(S.stadependencies::bytea) AS depsbytes
    FROM (pg_statistic_ext S JOIN pg_class C ON (C.oid = S.starelid))
        LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace);
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@ -62,10 +62,11 @@ CreateStatistics(CreateStatsStmt *stmt)
 	Oid			relid;
 	ObjectAddress parentobject,
 				childobject;
-	Datum		types[1];		/* only ndistinct defined now */
+	Datum		types[2];		/* one for each possible type of statistics */
 	int			ntypes;
 	ArrayType  *staenabled;
 	bool		build_ndistinct;
 	bool		build_dependencies;
 	bool		requested_type = false;
 	Assert(IsA(stmt, CreateStatsStmt));
@ -159,7 +160,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 				 errmsg("statistics require at least 2 columns")));
 	/*
-	 * Sort the attnums, which makes detecting duplicies somewhat easier, and
+	 * Sort the attnums, which makes detecting duplicities somewhat easier, and
 	 * it does not hurt (it does not affect the efficiency, unlike for
 	 * indexes, for example).
 	 */
@ -182,6 +183,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	 * recognized.
 	 */
 	build_ndistinct = false;
 	build_dependencies = false;
 	foreach(l, stmt->options)
 	{
 		DefElem    *opt = (DefElem *) lfirst(l);
@ -191,6 +193,11 @@ CreateStatistics(CreateStatsStmt *stmt)
 			build_ndistinct = defGetBoolean(opt);
 			requested_type = true;
 		}
 		else if (strcmp(opt->defname, "dependencies") == 0)
 		{
 			build_dependencies = defGetBoolean(opt);
 			requested_type = true;
 		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@ -199,12 +206,17 @@ CreateStatistics(CreateStatsStmt *stmt)
 	}
 	/* If no statistic type was specified, build them all. */
 	if (!requested_type)
 	{
 		build_ndistinct = true;
 		build_dependencies = true;
 	}
 	/* construct the char array of enabled statistic types */
 	ntypes = 0;
 	if (build_ndistinct)
 		types[ntypes++] = CharGetDatum(STATS_EXT_NDISTINCT);
 	if (build_dependencies)
 		types[ntypes++] = CharGetDatum(STATS_EXT_DEPENDENCIES);
 	Assert(ntypes > 0);
 	staenabled = construct_array(types, ntypes, CHAROID, 1, true, 'c');
@ -222,6 +234,7 @@ CreateStatistics(CreateStatsStmt *stmt)
 	/* no statistics build yet */
 	nulls[Anum_pg_statistic_ext_standistinct - 1] = true;
 	nulls[Anum_pg_statistic_ext_stadependencies - 1] = true;
 	/* insert it into pg_statistic_ext */
 	statrel = heap_open(StatisticExtRelationId, RowExclusiveLock);
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@ -22,6 +22,7 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
 /*
@ -60,23 +61,30 @@ static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
 * subclauses.  However, that's only right if the subclauses have independent
 * probabilities, and in reality they are often NOT independent.  So,
 * we want to be smarter where we can.
-
+ *
- * Currently, the only extra smarts we have is to recognize "range queries",
+ * When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
- * such as "x > 34 AND x < 42".  Clauses are recognized as possible range
+ * selectivity estimates using any extended statistcs on 'rel'.
- * query components if they are restriction opclauses whose operators have
+ *
- * scalarltsel() or scalargtsel() as their restriction selectivity estimator.
+ * If we identify such extended statistics exist, we try to apply them.
- * We pair up clauses of this form that refer to the same variable.  An
+ * Currently we only have (soft) functional dependencies, so apply these in as
- * unpairable clause of this kind is simply multiplied into the selectivity
+ * many cases as possible, and fall back on normal estimates for remaining
- * product in the normal way.  But when we find a pair, we know that the
+ * clauses.
- * selectivities represent the relative positions of the low and high bounds
+ *
- * within the column's range, so instead of figuring the selectivity as
+ * We also recognize "range queries", such as "x > 34 AND x < 42".  Clauses
- * hisel * losel, we can figure it as hisel + losel - 1.  (To visualize this,
+ * are recognized as possible range query components if they are restriction
- * see that hisel is the fraction of the range below the high bound, while
+ * opclauses whose operators have scalarltsel() or scalargtsel() as their
- * losel is the fraction above the low bound; so hisel can be interpreted
+ * restriction selectivity estimator.  We pair up clauses of this form that
- * directly as a 0..1 value but we need to convert losel to 1-losel before
+ * refer to the same variable.  An unpairable clause of this kind is simply
- * interpreting it as a value.  Then the available range is 1-losel to hisel.
+ * multiplied into the selectivity product in the normal way.  But when we
- * However, this calculation double-excludes nulls, so really we need
+ * find a pair, we know that the selectivities represent the relative
- * hisel + losel + null_frac - 1.)
+ * positions of the low and high bounds within the column's range, so instead
 * of figuring the selectivity as hisel * losel, we can figure it as hisel +
 * losel - 1.  (To visualize this, see that hisel is the fraction of the range
 * below the high bound, while losel is the fraction above the low bound; so
 * hisel can be interpreted directly as a 0..1 value but we need to convert
 * losel to 1-losel before interpreting it as a value.  Then the available
 * range is 1-losel to hisel.  However, this calculation double-excludes
 * nulls, so really we need hisel + losel + null_frac - 1.)
 *
 * If either selectivity is exactly DEFAULT_INEQ_SEL, we forget this equation
 * and instead use DEFAULT_RANGE_INEQ_SEL.  The same applies if the equation
@ -93,33 +101,70 @@ clauselist_selectivity(PlannerInfo *root,
 					   List *clauses,
 					   int varRelid,
 					   JoinType jointype,
-					   SpecialJoinInfo *sjinfo)
+					   SpecialJoinInfo *sjinfo,
 					   RelOptInfo *rel)
 {
 	Selectivity s1 = 1.0;
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	Bitmapset  *estimatedclauses = NULL;
 	int			listidx;
 	/*
-	 * If there's exactly one clause, then no use in trying to match up pairs,
+	 * If there's exactly one clause, then extended statistics is futile at
-	 * so just go directly to clause_selectivity().
+	 * this level (we might be able to apply them later if it's AND/OR
 	 * clause). So just go directly to clause_selectivity().
 	 */
 	if (list_length(clauses) == 1)
 		return clause_selectivity(root, (Node *) linitial(clauses),
-								  varRelid, jointype, sjinfo);
+								  varRelid, jointype, sjinfo, rel);
 	/*
 	 * When a relation of RTE_RELATION is given as 'rel', we'll try to
 	 * perform selectivity estimation using extended statistics.
 	 */
 	if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
 	{
 		/*
 		 * Perform selectivity estimations on any clauses found applicable by
 		 * dependencies_clauselist_selectivity. The 0-based list position of
 		 * estimated clauses will be populated in 'estimatedclauses'.
 		 */
 		s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
 								   jointype, sjinfo, rel, &estimatedclauses);
 		/*
 		 * This would be the place to apply any other types of extended
 		 * statistics selectivity estimations for remaining clauses.
 		 */
 	}
 	/*
-	 * Initial scan over clauses.  Anything that doesn't look like a potential
+	 * Apply normal selectivity estimates for remaining clauses. We'll be
-	 * rangequery clause gets multiplied into s1 and forgotten. Anything that
+	 * careful to skip any clauses which were already estimated above.
-	 * does gets inserted into an rqlist entry.
+	 *
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
 	 * an rqlist entry.
 	 */
 	listidx = -1;
 	foreach(l, clauses)
 	{
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
 		listidx++;
 		/*
 		 * Skip this clause if it's already been estimated by some other
 		 * statistics above.
 		 */
 		if (bms_is_member(listidx, estimatedclauses))
 			continue;
 		/* Always compute the selectivity using clause_selectivity */
-		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
+		s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo, rel);
 		/*
 		 * Check for being passed a RestrictInfo.
@ -484,7 +529,8 @@ clause_selectivity(PlannerInfo *root,
 				   Node *clause,
 				   int varRelid,
 				   JoinType jointype,
-				   SpecialJoinInfo *sjinfo)
+				   SpecialJoinInfo *sjinfo,
 				   RelOptInfo *rel)
 {
 	Selectivity s1 = 0.5;		/* default for any unhandled clause type */
 	RestrictInfo *rinfo = NULL;
@ -604,7 +650,8 @@ clause_selectivity(PlannerInfo *root,
 								  (Node *) get_notclausearg((Expr *) clause),
 									  varRelid,
 									  jointype,
-									  sjinfo);
+									  sjinfo,
 									  rel);
 	}
 	else if (and_clause(clause))
 	{
@ -613,7 +660,8 @@ clause_selectivity(PlannerInfo *root,
 									((BoolExpr *) clause)->args,
 									varRelid,
 									jointype,
-									sjinfo);
+									sjinfo,
 									rel);
 	}
 	else if (or_clause(clause))
 	{
@ -632,7 +680,8 @@ clause_selectivity(PlannerInfo *root,
 												(Node *) lfirst(arg),
 												varRelid,
 												jointype,
-												sjinfo);
+												sjinfo,
 												rel);
 			s1 = s1 + s2 - s1 * s2;
 		}
@ -725,7 +774,8 @@ clause_selectivity(PlannerInfo *root,
 								(Node *) ((RelabelType *) clause)->arg,
 								varRelid,
 								jointype,
-								sjinfo);
+								sjinfo,
 								rel);
 	}
 	else if (IsA(clause, CoerceToDomain))
 	{
@ -734,7 +784,8 @@ clause_selectivity(PlannerInfo *root,
 								(Node *) ((CoerceToDomain *) clause)->arg,
 								varRelid,
 								jointype,
-								sjinfo);
+								sjinfo,
 								rel);
 	}
 	else
 	{
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@ -3750,7 +3750,8 @@ compute_semi_anti_join_factors(PlannerInfo *root,
 									joinquals,
 									0,
 									jointype,
-									sjinfo);
+									sjinfo,
 									NULL);
 	/*
 	 * Also get the normal inner-join selectivity of the join clauses.
@ -3773,7 +3774,8 @@ compute_semi_anti_join_factors(PlannerInfo *root,
 									joinquals,
 									0,
 									JOIN_INNER,
-									&norm_sjinfo);
+									&norm_sjinfo,
 									NULL);
 	/* Avoid leaking a lot of ListCells */
 	if (jointype == JOIN_ANTI)
@ -3940,7 +3942,7 @@ approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
 		Node	   *qual = (Node *) lfirst(l);
 		/* Note that clause_selectivity will be able to cache its result */
-		selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo);
+		selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo, NULL);
 	}
 	/* Apply it to the input relation sizes */
@ -3976,7 +3978,8 @@ set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel)
 							   rel->baserestrictinfo,
 							   0,
 							   JOIN_INNER,
-							   NULL);
+							   NULL,
 							   rel);
 	rel->rows = clamp_row_est(nrows);
@ -4013,7 +4016,8 @@ get_parameterized_baserel_size(PlannerInfo *root, RelOptInfo *rel,
 							   allclauses,
 							   rel->relid,		/* do not use 0! */
 							   JOIN_INNER,
-							   NULL);
+							   NULL,
 							   rel);
 	nrows = clamp_row_est(nrows);
 	/* For safety, make sure result is not more than the base estimate */
 	if (nrows > rel->rows)
@ -4179,12 +4183,14 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 										joinquals,
 										0,
 										jointype,
-										sjinfo);
+										sjinfo,
 										NULL);
 		pselec = clauselist_selectivity(root,
 										pushedquals,
 										0,
 										jointype,
-										sjinfo);
+										sjinfo,
 										NULL);
 		/* Avoid leaking a lot of ListCells */
 		list_free(joinquals);
@ -4196,7 +4202,8 @@ calc_joinrel_size_estimate(PlannerInfo *root,
 										restrictlist,
 										0,
 										jointype,
-										sjinfo);
+										sjinfo,
 										NULL);
 		pselec = 0.0;			/* not used, keep compiler quiet */
 	}
@ -4491,7 +4498,7 @@ get_foreign_key_join_selectivity(PlannerInfo *root,
 				Selectivity csel;
 				csel = clause_selectivity(root, (Node *) rinfo,
-										  0, jointype, sjinfo);
+										  0, jointype, sjinfo, NULL);
 				thisfksel = Min(thisfksel, csel);
 			}
 			fkselec *= thisfksel;
--- a/src/backend/optimizer/util/orclauses.c
+++ b/src/backend/optimizer/util/orclauses.c
@ -280,7 +280,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 	 * saving work later.)
 	 */
 	or_selec = clause_selectivity(root, (Node *) or_rinfo,
-								  0, JOIN_INNER, NULL);
+								  0, JOIN_INNER, NULL, rel);
 	/*
 	 * The clause is only worth adding to the query if it rejects a useful
@ -344,7 +344,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
 		/* Compute inner-join size */
 		orig_selec = clause_selectivity(root, (Node *) join_or_rinfo,
-										0, JOIN_INNER, &sjinfo);
+										0, JOIN_INNER, &sjinfo, NULL);
 		/* And hack cached selectivity so join size remains the same */
 		join_or_rinfo->norm_selec = orig_selec / or_selec;
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@ -1308,6 +1308,18 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
 			stainfos = lcons(info, stainfos);
 		}
 		if (statext_is_kind_built(htup, STATS_EXT_DEPENDENCIES))
 		{
 			StatisticExtInfo *info = makeNode(StatisticExtInfo);
 			info->statOid = statOid;
 			info->rel = rel;
 			info->kind = STATS_EXT_DEPENDENCIES;
 			info->keys = bms_copy(keys);
 			stainfos = lcons(info, stainfos);
 		}
 		ReleaseSysCache(htup);
 		bms_free(keys);
 	}
--- a/src/backend/statistics/Makefile
+++ b/src/backend/statistics/Makefile
@ -12,6 +12,6 @@ subdir = src/backend/statistics
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
-OBJS = extended_stats.o mvdistinct.o
+OBJS = extended_stats.o dependencies.o mvdistinct.o
 include $(top_srcdir)/src/backend/common.mk
--- a/src/backend/statistics/README
+++ b/src/backend/statistics/README
@ -8,10 +8,72 @@ not true, resulting in estimation errors.
 Extended statistics track different types of dependencies between the columns,
 hopefully improving the estimates and producing better plans.
 Currently we only have one type of extended statistics - ndistinct
 coefficients, and we use it to improve estimates of grouping queries. See
 README.ndistinct for details.
 Types of statistics
 -------------------
 There are two kinds of extended statistics:
    (a) ndistinct coefficients
    (b) soft functional dependencies (README.dependencies)
 Compatible clause types
 -----------------------
 Each type of statistics may be used to estimate some subset of clause types.
    (a) functional dependencies - equality clauses (AND), possibly IS NULL
 Currently, only OpExprs in the form Var op Const, or Const op Var are
 supported, however it's feasible to expand the code later to also estimate the
 selectivities on clauses such as Var op Var.
 Complex clauses
 ---------------
 We also support estimating more complex clauses - essentially AND/OR clauses
 with (Var op Const) as leaves, as long as all the referenced attributes are
 covered by a single statistics.
 For example this condition
    (a=1) AND ((b=2) OR ((c=3) AND (d=4)))
 may be estimated using statistics on (a,b,c,d). If we only have statistics on
 (b,c,d) we may estimate the second part, and estimate (a=1) using simple stats.
 If we only have statistics on (a,b,c) we can't apply it at all at this point,
 but it's worth pointing out clauselist_selectivity() works recursively and when
 handling the second part (the OR-clause), we'll be able to apply the statistics.
 Note: The multi-statistics estimation patch also makes it possible to pass some
 clauses as 'conditions' into the deeper parts of the expression tree.
 Selectivity estimation
 ----------------------
 Throughout the planner clauselist_selectivity() still remains in charge of
 most selectivity estimate requests. clauselist_selectivity() can be instructed
 to try to make use of any extended statistics on the given RelOptInfo, which
 it will do, if:
    (a) An actual valid RelOptInfo was given. Join relations are passed in as
        NULL, therefore are invalid.
    (b) The relation given actually has any extended statistics defined which
        are actually built.
 When the above conditions are met, clauselist_selectivity() first attempts to
 pass the clause list off to the extended statistics selectivity estimation
 function. This functions may not find any clauses which is can perform any
 estimations on. In such cases these clauses are simply ignored. When actual
 estimation work is performed in these functions they're expected to mark which
 clauses they've performed estimations for so that any other function
 performing estimations knows which clauses are to be skipped.
 Size of sample in ANALYZE
 -------------------------
--- a/src/backend/statistics/README.dependencies
+++ b/src/backend/statistics/README.dependencies
@ -0,0 +1,119 @@
 Soft functional dependencies
 ============================
 Functional dependencies are a concept well described in relational theory,
 particularly in the definition of normalization and "normal forms". Wikipedia
 has a nice definition of a functional dependency [1]:
    In a given table, an attribute Y is said to have a functional dependency
    on a set of attributes X (written X -> Y) if and only if each X value is
    associated with precisely one Y value. For example, in an "Employee"
    table that includes the attributes "Employee ID" and "Employee Date of
    Birth", the functional dependency
        {Employee ID} -> {Employee Date of Birth}
    would hold. It follows from the previous two sentences that each
    {Employee ID} is associated with precisely one {Employee Date of Birth}.
    [1] https://en.wikipedia.org/wiki/Functional_dependency
 In practical terms, functional dependencies mean that a value in one column
 determines values in some other column. Consider for example this trivial
 table with two integer columns:
    CREATE TABLE t (a INT, b INT)
        AS SELECT i, i/10 FROM generate_series(1,100000) s(i);
 Clearly, knowledge of the value in column 'a' is sufficient to determine the
 value in column 'b', as it's simply (a/10). A more practical example may be
 addresses, where the knowledge of a ZIP code (usually) determines city. Larger
 cities may have multiple ZIP codes, so the dependency can't be reversed.
 Many datasets might be normalized not to contain such dependencies, but often
 it's not practical for various reasons. In some cases, it's actually a conscious
 design choice to model the dataset in a denormalized way, either because of
 performance or to make querying easier.
 Soft dependencies
 -----------------
 Real-world data sets often contain data errors, either because of data entry
 mistakes (user mistyping the ZIP code) or perhaps issues in generating the
 data (e.g. a ZIP code mistakenly assigned to two cities in different states).
 A strict implementation would either ignore dependencies in such cases,
 rendering the approach mostly useless even for slightly noisy data sets, or
 result in sudden changes in behavior depending on minor differences between
 samples provided to ANALYZE.
 For this reason, the statistics implements "soft" functional dependencies,
 associating each functional dependency with a degree of validity (a number
 between 0 and 1). This degree is then used to combine selectivities in a
 smooth manner.
 Mining dependencies (ANALYZE)
 -----------------------------
 The current algorithm is fairly simple - generate all possible functional
 dependencies, and for each one count the number of rows consistent with it.
 Then use the fraction of rows (supporting/total) as the degree.
 To count the rows consistent with the dependency (a => b):
 (a) Sort the data lexicographically, i.e. first by 'a' then 'b'.
 (b) For each group of rows with the same 'a' value, count the number of
     distinct values in 'b'.
 (c) If there's a single distinct value in 'b', the rows are consistent with
     the functional dependency, otherwise they contradict it.
 The algorithm also requires a minimum size of the group to consider it
 consistent (currently 3 rows in the sample). Small groups make it less likely
 to break the consistency.
 Clause reduction (planner/optimizer)
 ------------------------------------
 Applying the functional dependencies is fairly simple - given a list of
 equality clauses, we compute selectivities of each clause and then use the
 degree to combine them using this formula
    P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?))
 Where 'd' is the degree of functional dependence (a=>b).
 With more than two equality clauses, this process happens recursively. For
 example for (a,b,c) we first use (a,b=>c) to break the computation into
    P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(b=?))
 and then apply (a=>b) the same way on P(a=?,b=?).
 Consistency of clauses
 ----------------------
 Functional dependencies only express general dependencies between columns,
 without referencing particular values. This assumes that the equality clauses
 are in fact consistent with the functional dependency, i.e. that given a
 dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
 If that's not the case, the clauses are "inconsistent" with the functional
 dependency and the result will be over-estimation.
 This may happen, for example, when using conditions on the ZIP code and city
 name with mismatching values (ZIP code for a different city), etc. In such a
 case, the result set will be empty, but we'll estimate the selectivity using
 the ZIP code condition.
 In this case, the default estimation based on AVIA principle happens to work
 better, but mostly by chance.
 This issue is the price for the simplicity of functional dependencies. If the
 application frequently constructs queries with clauses inconsistent with
 functional dependencies present in the data, the best solution is not to
 use functional dependencies, but one of the more complex types of statistics.
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@ -47,7 +47,7 @@ static List *fetch_statentries_for_relation(Relation pg_statext, Oid relid);
 static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
 					  int natts, VacAttrStats **vacattrstats);
 static void statext_store(Relation pg_stext, Oid relid,
-			  MVNDistinct *ndistinct,
+			  MVNDistinct *ndistinct, MVDependencies *dependencies,
 			  VacAttrStats **stats);
@ -74,6 +74,7 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 	{
 		StatExtEntry   *stat = (StatExtEntry *) lfirst(lc);
 		MVNDistinct	   *ndistinct = NULL;
 		MVDependencies *dependencies = NULL;
 		VacAttrStats  **stats;
 		ListCell	   *lc2;
@ -93,10 +94,13 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
 			if (t == STATS_EXT_NDISTINCT)
 				ndistinct = statext_ndistinct_build(totalrows, numrows, rows,
 													stat->columns, stats);
 			else if (t == STATS_EXT_DEPENDENCIES)
 				dependencies = statext_dependencies_build(numrows, rows,
 													   stat->columns, stats);
 		}
 		/* store the statistics in the catalog */
-		statext_store(pg_stext, stat->statOid, ndistinct, stats);
+		statext_store(pg_stext, stat->statOid, ndistinct, dependencies, stats);
 	}
 	heap_close(pg_stext, RowExclusiveLock);
@ -117,6 +121,10 @@ statext_is_kind_built(HeapTuple htup, char type)
 			attnum = Anum_pg_statistic_ext_standistinct;
 			break;
 		case STATS_EXT_DEPENDENCIES:
 			attnum = Anum_pg_statistic_ext_stadependencies;
 			break;
 		default:
 			elog(ERROR, "unexpected statistics type requested: %d", type);
 	}
@ -178,7 +186,8 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
 		enabled = (char *) ARR_DATA_PTR(arr);
 		for (i = 0; i < ARR_DIMS(arr)[0]; i++)
 		{
-			Assert(enabled[i] == STATS_EXT_NDISTINCT);
+			Assert((enabled[i] == STATS_EXT_NDISTINCT) ||
 				   (enabled[i] == STATS_EXT_DEPENDENCIES));
 			entry->types = lappend_int(entry->types, (int) enabled[i]);
 		}
@ -256,7 +265,7 @@ lookup_var_attr_stats(Relation rel, Bitmapset *attrs, int natts,
 */
 static void
 statext_store(Relation pg_stext, Oid statOid,
-			  MVNDistinct *ndistinct,
+			  MVNDistinct *ndistinct, MVDependencies *dependencies,
 			  VacAttrStats **stats)
 {
 	HeapTuple	stup,
@ -280,8 +289,17 @@ statext_store(Relation pg_stext, Oid statOid,
 		values[Anum_pg_statistic_ext_standistinct - 1] = PointerGetDatum(data);
 	}
 	if (dependencies != NULL)
 	{
 		bytea	   *data = statext_dependencies_serialize(dependencies);
 		nulls[Anum_pg_statistic_ext_stadependencies - 1] = (data == NULL);
 		values[Anum_pg_statistic_ext_stadependencies - 1] = PointerGetDatum(data);
 	}
 	/* always replace the value (either by bytea or NULL) */
 	replaces[Anum_pg_statistic_ext_standistinct - 1] = true;
 	replaces[Anum_pg_statistic_ext_stadependencies - 1] = true;
 	/* there should already be a pg_statistic_ext tuple */
 	oldtup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
@ -387,3 +405,82 @@ multi_sort_compare_dims(int start, int end,
 	return 0;
 }
 /*
 * has_stats_of_kind
 *	Check that the list contains statistic of a given kind
 */
 bool
 has_stats_of_kind(List *stats, char requiredkind)
 {
 	ListCell   *l;
 	foreach(l, stats)
 	{
 		StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
 		if (stat->kind == requiredkind)
 			return true;
 	}
 	return false;
 }
 /*
 * choose_best_statistics
 *		Look for statistics with the specified 'requiredkind' which have keys
 *		that match at least two attnums.
 *
 * The current selection criteria is very simple - we choose the statistics
 * referencing the most attributes with the least keys.
 *
 * XXX if multiple statistics exists of the same size matching the same number
 * of keys, then the statistics which are chosen depend on the order that they
 * appear in the stats list. Perhaps this needs to be more definitive.
 */
 StatisticExtInfo *
 choose_best_statistics(List *stats, Bitmapset *attnums, char requiredkind)
 {
 	ListCell   *lc;
 	StatisticExtInfo *best_match = NULL;
 	int			best_num_matched = 2;	/* goal #1: maximize */
 	int			best_match_keys = (STATS_MAX_DIMENSIONS + 1);	/* goal #2: minimize */
 	foreach(lc, stats)
 	{
 		StatisticExtInfo *info = (StatisticExtInfo *) lfirst(lc);
 		int			num_matched;
 		int			numkeys;
 		Bitmapset  *matched;
 		/* skip statistics that are not the correct type */
 		if (info->kind != requiredkind)
 			continue;
 		/* determine how many attributes of these stats can be matched to */
 		matched = bms_intersect(attnums, info->keys);
 		num_matched = bms_num_members(matched);
 		bms_free(matched);
 		/*
 		 * save the actual number of keys in the stats so that we can choose
 		 * the narrowest stats with the most matching keys.
 		 */
 		numkeys = bms_num_members(info->keys);
 		/*
 		 * Use these statistics when it increases the number of matched
 		 * clauses or when it matches the same number of attributes but these
 		 * stats have fewer keys than any previous match.
 		 */
 		if (num_matched > best_num_matched ||
 			(num_matched == best_num_matched && numkeys < best_match_keys))
 		{
 			best_match = info;
 			best_num_matched = num_matched;
 			best_match_keys = numkeys;
 		}
 	}
 	return best_match;
 }
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@ -1452,6 +1452,13 @@ pg_get_statisticsext_worker(Oid statextid, bool missing_ok)
 	StringInfoData buf;
 	int			colno;
 	char	   *nsp;
 	ArrayType  *arr;
 	char	   *enabled;
 	Datum		datum;
 	bool		isnull;
 	bool		ndistinct_enabled;
 	bool		dependencies_enabled;
 	int			i;
 	statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid));
@ -1467,10 +1474,55 @@ pg_get_statisticsext_worker(Oid statextid, bool missing_ok)
 	initStringInfo(&buf);
 	nsp = get_namespace_name(statextrec->stanamespace);
-	appendStringInfo(&buf, "CREATE STATISTICS %s ON (",
+	appendStringInfo(&buf, "CREATE STATISTICS %s",
 					 quote_qualified_identifier(nsp,
 												NameStr(statextrec->staname)));
 	/*
 	 * Lookup the staenabled column so that we know how to handle the WITH
 	 * clause.
 	 */
 	datum = SysCacheGetAttr(STATEXTOID, statexttup,
 							Anum_pg_statistic_ext_staenabled, &isnull);
 	Assert(!isnull);
 	arr = DatumGetArrayTypeP(datum);
 	if (ARR_NDIM(arr) != 1 ||
 		ARR_HASNULL(arr) ||
 		ARR_ELEMTYPE(arr) != CHAROID)
 		elog(ERROR, "staenabled is not a 1-D char array");
 	enabled = (char *) ARR_DATA_PTR(arr);
 	ndistinct_enabled = false;
 	dependencies_enabled = false;
 	for (i = 0; i < ARR_DIMS(arr)[0]; i++)
 	{
 		if (enabled[i] == STATS_EXT_NDISTINCT)
 			ndistinct_enabled = true;
 		if (enabled[i] == STATS_EXT_DEPENDENCIES)
 			dependencies_enabled = true;
 	}
 	/*
 	 * If any option is disabled, then we'll need to append a WITH clause to
 	 * show which options are enabled.  We omit the WITH clause on purpose
 	 * when all options are enabled, so a pg_dump/pg_restore will create all
 	 * statistics types on a newer postgres version, if the statistics had all
 	 * options enabled on the original version.
 	 */
 	if (!ndistinct_enabled || !dependencies_enabled)
 	{
 		appendStringInfoString(&buf, " WITH (");
 		if (ndistinct_enabled)
 			appendStringInfoString(&buf, "ndistinct");
 		else if (dependencies_enabled)
 			appendStringInfoString(&buf, "dependencies");
 		appendStringInfoChar(&buf, ')');
 	}
 	appendStringInfoString(&buf, " ON (");
 	for (colno = 0; colno < statextrec->stakeys.dim1; colno++)
 	{
 		AttrNumber	attnum = statextrec->stakeys.values[colno];
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@ -1633,13 +1633,17 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 			case IS_NOT_FALSE:
 				selec = (double) clause_selectivity(root, arg,
 													varRelid,
-													jointype, sjinfo);
+													jointype,
 													sjinfo,
 													NULL);
 				break;
 			case IS_FALSE:
 			case IS_NOT_TRUE:
 				selec = 1.0 - (double) clause_selectivity(root, arg,
 														  varRelid,
-														  jointype, sjinfo);
+														  jointype,
 														  sjinfo,
 														  NULL);
 				break;
 			default:
 				elog(ERROR, "unrecognized booltesttype: %d",
@ -6436,7 +6440,8 @@ genericcostestimate(PlannerInfo *root,
 	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
 											  index->rel->relid,
 											  JOIN_INNER,
-											  NULL);
+											  NULL,
 											  index->rel);
 	/*
 	 * If caller didn't give us an estimate, estimate the number of index
@ -6757,7 +6762,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 		btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
 												  index->rel->relid,
 												  JOIN_INNER,
-												  NULL);
+												  NULL,
 												  index->rel);
 		numIndexTuples = btreeSelectivity * index->rel->tuples;
 		/*
@ -7516,7 +7522,8 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexSelectivity = clauselist_selectivity(root, selectivityQuals,
 											   index->rel->relid,
 											   JOIN_INNER,
-											   NULL);
+											   NULL,
 											   index->rel);
 	/* fetch estimated page cost for tablespace containing index */
 	get_tablespace_page_costs(index->reltablespace,
@ -7748,7 +7755,8 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexSelectivity =
 		clauselist_selectivity(root, indexQuals,
 							   path->indexinfo->rel->relid,
-							   JOIN_INNER, NULL);
+							   JOIN_INNER, NULL,
 							   path->indexinfo->rel);
 	*indexCorrelation = 1;
 	/*
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@ -2331,7 +2331,8 @@ describeOneTableDetails(const char *schemaname,
 						   "    FROM ((SELECT pg_catalog.unnest(stakeys) AS attnum) s\n"
 			   "         JOIN pg_catalog.pg_attribute a ON (starelid = a.attrelid AND\n"
 							  "a.attnum = s.attnum AND not attisdropped))) AS columns,\n"
-							  "  (staenabled::char[] @> '{d}'::char[]) AS ndist_enabled\n"
+							  "  (staenabled::char[] @> '{d}'::char[]) AS ndist_enabled,\n"
 							  "  (staenabled::char[] @> '{f}'::char[]) AS deps_enabled\n"
 			  "FROM pg_catalog.pg_statistic_ext stat WHERE starelid  = '%s'\n"
 			  "ORDER BY 1;",
 							  oid);
@ -2348,7 +2349,7 @@ describeOneTableDetails(const char *schemaname,
 				for (i = 0; i < tuples; i++)
 				{
-					int		cnt = 0;
+					bool		gotone = false;
 					printfPQExpBuffer(&buf, "    ");
@ -2361,7 +2362,12 @@ describeOneTableDetails(const char *schemaname,
 					if (strcmp(PQgetvalue(result, i, 5), "t") == 0)
 					{
 						appendPQExpBufferStr(&buf, "ndistinct");
-						cnt++;
+						gotone = true;
 					}
 					if (strcmp(PQgetvalue(result, i, 6), "t") == 0)
 					{
 						appendPQExpBuffer(&buf, "%sdependencies", gotone ? ", " : "");
 					}
 					appendPQExpBuffer(&buf, ") ON (%s)",
--- a/src/include/catalog/pg_cast.h
+++ b/src/include/catalog/pg_cast.h
@ -258,6 +258,10 @@ DATA(insert (  194	 25    0 i b ));
 DATA(insert (  3361  17    0 i b ));
 DATA(insert (  3361  25    0 i i ));
 /* pg_dependencies can be coerced to, but not from, bytea and text */
 DATA(insert (  3402  17    0 i b ));
 DATA(insert (  3402  25    0 i i ));
 /*
 * Datetime category
 */
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@ -2775,6 +2775,15 @@ DESCR("I/O");
 DATA(insert OID = 3358 (  pg_ndistinct_send PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 17 "3361" _null_ _null_ _null_ _null_ _null_ pg_ndistinct_send _null_ _null_ _null_ ));
 DESCR("I/O");
 DATA(insert OID = 3404 (  pg_dependencies_in	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 3402 "2275" _null_ _null_ _null_ _null_ _null_ pg_dependencies_in _null_ _null_ _null_ ));
 DESCR("I/O");
 DATA(insert OID = 3405 (  pg_dependencies_out	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 2275 "3402" _null_ _null_ _null_ _null_ _null_ pg_dependencies_out _null_ _null_ _null_ ));
 DESCR("I/O");
 DATA(insert OID = 3406 (  pg_dependencies_recv	PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 3402 "2281" _null_ _null_ _null_ _null_ _null_ pg_dependencies_recv _null_ _null_ _null_ ));
 DESCR("I/O");
 DATA(insert OID = 3407 (  pg_dependencies_send	PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 17 "3402" _null_ _null_ _null_ _null_ _null_ pg_dependencies_send _null_ _null_ _null_ ));
 DESCR("I/O");
 DATA(insert OID = 1928 (  pg_stat_get_numscans			PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_numscans _null_ _null_ _null_ ));
 DESCR("statistics: number of scans done for table/index");
 DATA(insert OID = 1929 (  pg_stat_get_tuples_returned	PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_returned _null_ _null_ _null_ ));
--- a/src/include/catalog/pg_statistic_ext.h
+++ b/src/include/catalog/pg_statistic_ext.h
@ -46,6 +46,7 @@ CATALOG(pg_statistic_ext,3381)
 	char		staenabled[1] BKI_FORCE_NOT_NULL;	/* statistic types
 													 * requested to build */
 	pg_ndistinct standistinct;	/* ndistinct coefficients (serialized) */
 	pg_dependencies stadependencies;	/* dependencies (serialized) */
 #endif
 } FormData_pg_statistic_ext;
@ -61,7 +62,7 @@ typedef FormData_pg_statistic_ext *Form_pg_statistic_ext;
 *		compiler constants for pg_statistic_ext
 * ----------------
 */
-#define Natts_pg_statistic_ext					7
+#define Natts_pg_statistic_ext					8
 #define Anum_pg_statistic_ext_starelid			1
 #define Anum_pg_statistic_ext_staname			2
 #define Anum_pg_statistic_ext_stanamespace		3
@ -69,7 +70,9 @@ typedef FormData_pg_statistic_ext *Form_pg_statistic_ext;
 #define Anum_pg_statistic_ext_stakeys			5
 #define Anum_pg_statistic_ext_staenabled		6
 #define Anum_pg_statistic_ext_standistinct		7
 #define Anum_pg_statistic_ext_stadependencies	8
 #define STATS_EXT_NDISTINCT			'd'
 #define STATS_EXT_DEPENDENCIES		'f'
 #endif   /* PG_STATISTIC_EXT_H */
--- a/src/include/catalog/pg_type.h
+++ b/src/include/catalog/pg_type.h
@ -368,6 +368,10 @@ DATA(insert OID = 3361 ( pg_ndistinct		PGNSP PGUID -1 f b S f t \054 0 0 0 pg_nd
 DESCR("multivariate ndistinct coefficients");
 #define PGNDISTINCTOID	3361
 DATA(insert OID = 3402 ( pg_dependencies		PGNSP PGUID -1 f b S f t \054 0 0 0 pg_dependencies_in pg_dependencies_out pg_dependencies_recv pg_dependencies_send - - - i x f 0 -1 0 100 _null_ _null_ _null_ ));
 DESCR("multivariate dependencies");
 #define PGDEPENDENCIESOID	3402
 DATA(insert OID = 32 ( pg_ddl_command	PGNSP PGUID SIZEOF_POINTER t p P f t \054 0 0 0 pg_ddl_command_in pg_ddl_command_out pg_ddl_command_recv pg_ddl_command_send - - - ALIGNOF_POINTER p f 0 -1 0 0 _null_ _null_ _null_ ));
 DESCR("internal type for passing CollectedCommand");
 #define PGDDLCOMMANDOID 32
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@ -203,12 +203,14 @@ extern Selectivity clauselist_selectivity(PlannerInfo *root,
 					   List *clauses,
 					   int varRelid,
 					   JoinType jointype,
-					   SpecialJoinInfo *sjinfo);
+					   SpecialJoinInfo *sjinfo,
 					   RelOptInfo *rel);
 extern Selectivity clause_selectivity(PlannerInfo *root,
 				   Node *clause,
 				   int varRelid,
 				   JoinType jointype,
-				   SpecialJoinInfo *sjinfo);
+				   SpecialJoinInfo *sjinfo,
 				   RelOptInfo *rel);
 extern void cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 							  RelOptInfo *rel, ParamPathInfo *param_info,
 							  Cost input_startup_cost, Cost input_total_cost,
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@ -52,6 +52,11 @@ extern MVNDistinct *statext_ndistinct_build(double totalrows,
 extern bytea *statext_ndistinct_serialize(MVNDistinct *ndistinct);
 extern MVNDistinct *statext_ndistinct_deserialize(bytea *data);
 extern MVDependencies *statext_dependencies_build(int numrows, HeapTuple *rows,
 						Bitmapset *attrs, VacAttrStats **stats);
 extern bytea *statext_dependencies_serialize(MVDependencies *dependencies);
 extern MVDependencies *statext_dependencies_deserialize(bytea *data);
 extern MultiSortSupport multi_sort_init(int ndims);
 extern void multi_sort_add_dimension(MultiSortSupport mss, int sortdim,
 						 Oid oper);
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@ -14,6 +14,7 @@
 #define STATISTICS_H
 #include "commands/vacuum.h"
 #include "nodes/relation.h"
 #define STATS_MAX_DIMENSIONS	8		/* max number of attributes */
@ -44,11 +45,54 @@ typedef struct MVNDistinct
 #define SizeOfMVNDistinct	(offsetof(MVNDistinct, nitems) + sizeof(uint32))
 /* size of the struct excluding the items array */
 #define SizeOfMVNDistinct	(offsetof(MVNDistinct, nitems) + sizeof(uint32))
 #define STATS_DEPS_MAGIC		0xB4549A2C		/* marks serialized bytea */
 #define STATS_DEPS_TYPE_BASIC	1		/* basic dependencies type */
 /*
 * Functional dependencies, tracking column-level relationships (values
 * in one column determine values in another one).
 */
 typedef struct MVDependency
 {
 	double		degree;			/* degree of validity (0-1) */
 	AttrNumber	nattributes;	/* number of attributes */
 	AttrNumber	attributes[FLEXIBLE_ARRAY_MEMBER];		/* attribute numbers */
 } MVDependency;
 /* size of the struct excluding the deps array */
 #define SizeOfDependency \
 	(offsetof(MVDependency, nattributes) + sizeof(AttrNumber))
 typedef struct MVDependencies
 {
 	uint32		magic;			/* magic constant marker */
 	uint32		type;			/* type of MV Dependencies (BASIC) */
 	uint32		ndeps;			/* number of dependencies */
 	MVDependency *deps[FLEXIBLE_ARRAY_MEMBER];	/* dependencies */
 } MVDependencies;
 /* size of the struct excluding the deps array */
 #define SizeOfDependencies	(offsetof(MVDependencies, ndeps) + sizeof(uint32))
 extern MVNDistinct *statext_ndistinct_load(Oid mvoid);
 extern MVDependencies *staext_dependencies_load(Oid mvoid);
 extern void BuildRelationExtStatistics(Relation onerel, double totalrows,
 						   int numrows, HeapTuple *rows,
 						   int natts, VacAttrStats **vacattrstats);
 extern bool statext_is_kind_built(HeapTuple htup, char kind);
 extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
 									List *clauses,
 									int varRelid,
 									JoinType jointype,
 									SpecialJoinInfo *sjinfo,
 									RelOptInfo *rel,
 									Bitmapset **estimatedclauses);
 extern bool has_stats_of_kind(List *stats, char requiredkind);
 extern StatisticExtInfo *choose_best_statistics(List *stats,
 					   Bitmapset *attnums, char requiredkind);
 #endif   /* STATISTICS_H */
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@ -824,11 +824,12 @@ WHERE c.castmethod = 'b' AND
 character varying | character         |        0 | i
 pg_node_tree      | text              |        0 | i
 pg_ndistinct      | bytea             |        0 | i
 pg_dependencies   | bytea             |        0 | i
 cidr              | inet              |        0 | i
 xml               | text              |        0 | a
 xml               | character varying |        0 | a
 xml               | character         |        0 | a
-(8 rows)
+(9 rows)
 -- **************** pg_conversion ****************
 -- Look for illegal values in pg_conversion fields.
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@ -2192,7 +2192,8 @@ pg_stats_ext| SELECT n.nspname AS schemaname,
    c.relname AS tablename,
    s.staname,
    s.stakeys AS attnums,
-    length((s.standistinct)::text) AS ndistbytes
+    length((s.standistinct)::bytea) AS ndistbytes,
    length((s.stadependencies)::bytea) AS depsbytes
   FROM ((pg_statistic_ext s
     JOIN pg_class c ON ((c.oid = s.starelid)))
     LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@ -31,7 +31,7 @@ ALTER TABLE ab1 DROP COLUMN a;
 b      | integer |           |          | 
 c      | integer |           |          | 
 Statistics:
-    "public.ab1_b_c_stats" WITH (ndistinct) ON (b, c)
+    "public.ab1_b_c_stats" WITH (ndistinct, dependencies) ON (b, c)
 DROP TABLE ab1;
 -- Ensure things work sanely with SET STATISTICS 0
@ -135,7 +135,7 @@ SELECT staenabled, standistinct
  FROM pg_statistic_ext WHERE starelid = 'ndistinct'::regclass;
 staenabled |                                          standistinct                                          
 ------------+------------------------------------------------------------------------------------------------
- {d}        | [{(b 3 4), 301.000000}, {(b 3 6), 301.000000}, {(b 4 6), 301.000000}, {(b 3 4 6), 301.000000}]
+ {d,f}      | [{(b 3 4), 301.000000}, {(b 3 6), 301.000000}, {(b 4 6), 301.000000}, {(b 3 4 6), 301.000000}]
 (1 row)
 -- Hash Aggregate, thanks to estimates improved by the statistic
@ -201,7 +201,7 @@ SELECT staenabled, standistinct
  FROM pg_statistic_ext WHERE starelid = 'ndistinct'::regclass;
 staenabled |                                            standistinct                                            
 ------------+----------------------------------------------------------------------------------------------------
- {d}        | [{(b 3 4), 2550.000000}, {(b 3 6), 800.000000}, {(b 4 6), 1632.000000}, {(b 3 4 6), 10000.000000}]
+ {d,f}      | [{(b 3 4), 2550.000000}, {(b 3 6), 800.000000}, {(b 4 6), 1632.000000}, {(b 3 4 6), 10000.000000}]
 (1 row)
 -- plans using Group Aggregate, thanks to using correct esimates
@ -311,3 +311,107 @@ EXPLAIN (COSTS off)
 (3 rows)
 DROP TABLE ndistinct;
 -- functional dependencies tests
 CREATE TABLE functional_dependencies (
    filler1 TEXT,
    filler2 NUMERIC,
    a INT,
    b TEXT,
    filler3 DATE,
    c INT,
    d TEXT
 );
 SET random_page_cost = 1.2;
 CREATE INDEX fdeps_ab_idx ON functional_dependencies (a, b);
 CREATE INDEX fdeps_abc_idx ON functional_dependencies (a, b, c);
 -- random data (no functional dependencies)
 INSERT INTO functional_dependencies (a, b, c, filler1)
     SELECT mod(i, 23), mod(i, 29), mod(i, 31), i FROM generate_series(1,5000) s(i);
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
                    QUERY PLAN                     
 ---------------------------------------------------
 Bitmap Heap Scan on functional_dependencies
   Recheck Cond: ((a = 1) AND (b = '1'::text))
   ->  Bitmap Index Scan on fdeps_abc_idx
         Index Cond: ((a = 1) AND (b = '1'::text))
 (4 rows)
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
                        QUERY PLAN                         
 -----------------------------------------------------------
 Index Scan using fdeps_abc_idx on functional_dependencies
   Index Cond: ((a = 1) AND (b = '1'::text) AND (c = 1))
 (2 rows)
 -- create statistics
 CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
                    QUERY PLAN                     
 ---------------------------------------------------
 Bitmap Heap Scan on functional_dependencies
   Recheck Cond: ((a = 1) AND (b = '1'::text))
   ->  Bitmap Index Scan on fdeps_abc_idx
         Index Cond: ((a = 1) AND (b = '1'::text))
 (4 rows)
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
                        QUERY PLAN                         
 -----------------------------------------------------------
 Index Scan using fdeps_abc_idx on functional_dependencies
   Index Cond: ((a = 1) AND (b = '1'::text) AND (c = 1))
 (2 rows)
 -- a => b, a => c, b => c
 TRUNCATE functional_dependencies;
 DROP STATISTICS func_deps_stat;
 INSERT INTO functional_dependencies (a, b, c, filler1)
     SELECT mod(i,100), mod(i,50), mod(i,25), i FROM generate_series(1,5000) s(i);
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
                        QUERY PLAN                         
 -----------------------------------------------------------
 Index Scan using fdeps_abc_idx on functional_dependencies
   Index Cond: ((a = 1) AND (b = '1'::text))
 (2 rows)
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
                        QUERY PLAN                         
 -----------------------------------------------------------
 Index Scan using fdeps_abc_idx on functional_dependencies
   Index Cond: ((a = 1) AND (b = '1'::text) AND (c = 1))
 (2 rows)
 -- create statistics
 CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
                    QUERY PLAN                     
 ---------------------------------------------------
 Bitmap Heap Scan on functional_dependencies
   Recheck Cond: ((a = 1) AND (b = '1'::text))
   ->  Bitmap Index Scan on fdeps_abc_idx
         Index Cond: ((a = 1) AND (b = '1'::text))
 (4 rows)
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
                    QUERY PLAN                     
 ---------------------------------------------------
 Bitmap Heap Scan on functional_dependencies
   Recheck Cond: ((a = 1) AND (b = '1'::text))
   Filter: (c = 1)
   ->  Bitmap Index Scan on fdeps_ab_idx
         Index Cond: ((a = 1) AND (b = '1'::text))
 (5 rows)
 RESET random_page_cost;
 DROP TABLE functional_dependencies;
--- a/src/test/regress/expected/type_sanity.out
+++ b/src/test/regress/expected/type_sanity.out
@ -68,11 +68,12 @@ WHERE p1.typtype not in ('c','d','p') AND p1.typname NOT LIKE E'\\_%'
     WHERE p2.typname = ('_' || p1.typname)::name AND
           p2.typelem = p1.oid and p1.typarray = p2.oid);
 oid  |     typname     
------+--------------
+------+-----------------
  194 | pg_node_tree
 3361 | pg_ndistinct
 3402 | pg_dependencies
  210 | smgr
-(3 rows)
+(4 rows)
 -- Make sure typarray points to a varlena array type of our own base
 SELECT p1.oid, p1.typname as basetype, p2.typname as arraytype,
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@ -163,3 +163,71 @@ EXPLAIN (COSTS off)
 SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
 DROP TABLE ndistinct;
 -- functional dependencies tests
 CREATE TABLE functional_dependencies (
    filler1 TEXT,
    filler2 NUMERIC,
    a INT,
    b TEXT,
    filler3 DATE,
    c INT,
    d TEXT
 );
 SET random_page_cost = 1.2;
 CREATE INDEX fdeps_ab_idx ON functional_dependencies (a, b);
 CREATE INDEX fdeps_abc_idx ON functional_dependencies (a, b, c);
 -- random data (no functional dependencies)
 INSERT INTO functional_dependencies (a, b, c, filler1)
     SELECT mod(i, 23), mod(i, 29), mod(i, 31), i FROM generate_series(1,5000) s(i);
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
 -- create statistics
 CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
 -- a => b, a => c, b => c
 TRUNCATE functional_dependencies;
 DROP STATISTICS func_deps_stat;
 INSERT INTO functional_dependencies (a, b, c, filler1)
     SELECT mod(i,100), mod(i,50), mod(i,25), i FROM generate_series(1,5000) s(i);
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
 -- create statistics
 CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
 ANALYZE functional_dependencies;
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
 EXPLAIN (COSTS OFF)
 SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
 RESET random_page_cost;
 DROP TABLE functional_dependencies;