mirror of https://github.com/postgres/postgres
Follow on patch in the multi-variate statistics patch series.
CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
ANALYZE;
will collect dependency stats on (a, b) and then use the measured
dependency in subsequent query planning.
Commit 7b504eb282
added
CREATE STATISTICS with n-distinct coefficients. These are now
specified using the mutually exclusive option WITH (ndistinct).
Author: Tomas Vondra, David Rowley
Reviewed-by: Kyotaro HORIGUCHI, Álvaro Herrera, Dean Rasheed, Robert Haas
and many other comments and contributions
Discussion: https://postgr.es/m/56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com
pull/3/merge
parent
00b6b6feb1
commit
2686ee1b7c
@ -0,0 +1,119 @@ |
|||||||
|
Soft functional dependencies |
||||||
|
============================ |
||||||
|
|
||||||
|
Functional dependencies are a concept well described in relational theory, |
||||||
|
particularly in the definition of normalization and "normal forms". Wikipedia |
||||||
|
has a nice definition of a functional dependency [1]: |
||||||
|
|
||||||
|
In a given table, an attribute Y is said to have a functional dependency |
||||||
|
on a set of attributes X (written X -> Y) if and only if each X value is |
||||||
|
associated with precisely one Y value. For example, in an "Employee" |
||||||
|
table that includes the attributes "Employee ID" and "Employee Date of |
||||||
|
Birth", the functional dependency |
||||||
|
|
||||||
|
{Employee ID} -> {Employee Date of Birth} |
||||||
|
|
||||||
|
would hold. It follows from the previous two sentences that each |
||||||
|
{Employee ID} is associated with precisely one {Employee Date of Birth}. |
||||||
|
|
||||||
|
[1] https://en.wikipedia.org/wiki/Functional_dependency |
||||||
|
|
||||||
|
In practical terms, functional dependencies mean that a value in one column |
||||||
|
determines values in some other column. Consider for example this trivial |
||||||
|
table with two integer columns: |
||||||
|
|
||||||
|
CREATE TABLE t (a INT, b INT) |
||||||
|
AS SELECT i, i/10 FROM generate_series(1,100000) s(i); |
||||||
|
|
||||||
|
Clearly, knowledge of the value in column 'a' is sufficient to determine the |
||||||
|
value in column 'b', as it's simply (a/10). A more practical example may be |
||||||
|
addresses, where the knowledge of a ZIP code (usually) determines city. Larger |
||||||
|
cities may have multiple ZIP codes, so the dependency can't be reversed. |
||||||
|
|
||||||
|
Many datasets might be normalized not to contain such dependencies, but often |
||||||
|
it's not practical for various reasons. In some cases, it's actually a conscious |
||||||
|
design choice to model the dataset in a denormalized way, either because of |
||||||
|
performance or to make querying easier. |
||||||
|
|
||||||
|
|
||||||
|
Soft dependencies |
||||||
|
----------------- |
||||||
|
|
||||||
|
Real-world data sets often contain data errors, either because of data entry |
||||||
|
mistakes (user mistyping the ZIP code) or perhaps issues in generating the |
||||||
|
data (e.g. a ZIP code mistakenly assigned to two cities in different states). |
||||||
|
|
||||||
|
A strict implementation would either ignore dependencies in such cases, |
||||||
|
rendering the approach mostly useless even for slightly noisy data sets, or |
||||||
|
result in sudden changes in behavior depending on minor differences between |
||||||
|
samples provided to ANALYZE. |
||||||
|
|
||||||
|
For this reason, the statistics implements "soft" functional dependencies, |
||||||
|
associating each functional dependency with a degree of validity (a number |
||||||
|
between 0 and 1). This degree is then used to combine selectivities in a |
||||||
|
smooth manner. |
||||||
|
|
||||||
|
|
||||||
|
Mining dependencies (ANALYZE) |
||||||
|
----------------------------- |
||||||
|
|
||||||
|
The current algorithm is fairly simple - generate all possible functional |
||||||
|
dependencies, and for each one count the number of rows consistent with it. |
||||||
|
Then use the fraction of rows (supporting/total) as the degree. |
||||||
|
|
||||||
|
To count the rows consistent with the dependency (a => b): |
||||||
|
|
||||||
|
(a) Sort the data lexicographically, i.e. first by 'a' then 'b'. |
||||||
|
|
||||||
|
(b) For each group of rows with the same 'a' value, count the number of |
||||||
|
distinct values in 'b'. |
||||||
|
|
||||||
|
(c) If there's a single distinct value in 'b', the rows are consistent with |
||||||
|
the functional dependency, otherwise they contradict it. |
||||||
|
|
||||||
|
The algorithm also requires a minimum size of the group to consider it |
||||||
|
consistent (currently 3 rows in the sample). Small groups make it less likely |
||||||
|
to break the consistency. |
||||||
|
|
||||||
|
|
||||||
|
Clause reduction (planner/optimizer) |
||||||
|
------------------------------------ |
||||||
|
|
||||||
|
Applying the functional dependencies is fairly simple - given a list of |
||||||
|
equality clauses, we compute selectivities of each clause and then use the |
||||||
|
degree to combine them using this formula |
||||||
|
|
||||||
|
P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?)) |
||||||
|
|
||||||
|
Where 'd' is the degree of functional dependence (a=>b). |
||||||
|
|
||||||
|
With more than two equality clauses, this process happens recursively. For |
||||||
|
example for (a,b,c) we first use (a,b=>c) to break the computation into |
||||||
|
|
||||||
|
P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(b=?)) |
||||||
|
|
||||||
|
and then apply (a=>b) the same way on P(a=?,b=?). |
||||||
|
|
||||||
|
|
||||||
|
Consistency of clauses |
||||||
|
---------------------- |
||||||
|
|
||||||
|
Functional dependencies only express general dependencies between columns, |
||||||
|
without referencing particular values. This assumes that the equality clauses |
||||||
|
are in fact consistent with the functional dependency, i.e. that given a |
||||||
|
dependency (a=>b), the value in (b=?) clause is the value determined by (a=?). |
||||||
|
If that's not the case, the clauses are "inconsistent" with the functional |
||||||
|
dependency and the result will be over-estimation. |
||||||
|
|
||||||
|
This may happen, for example, when using conditions on the ZIP code and city |
||||||
|
name with mismatching values (ZIP code for a different city), etc. In such a |
||||||
|
case, the result set will be empty, but we'll estimate the selectivity using |
||||||
|
the ZIP code condition. |
||||||
|
|
||||||
|
In this case, the default estimation based on AVIA principle happens to work |
||||||
|
better, but mostly by chance. |
||||||
|
|
||||||
|
This issue is the price for the simplicity of functional dependencies. If the |
||||||
|
application frequently constructs queries with clauses inconsistent with |
||||||
|
functional dependencies present in the data, the best solution is not to |
||||||
|
use functional dependencies, but one of the more complex types of statistics. |
File diff suppressed because it is too large
Load Diff
Loading…
Reference in new issue