mirror of https://github.com/postgres/postgres
Follow on patch in the multi-variate statistics patch series.
CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
ANALYZE;
will collect dependency stats on (a, b) and then use the measured
dependency in subsequent query planning.
Commit 7b504eb282
added
CREATE STATISTICS with n-distinct coefficients. These are now
specified using the mutually exclusive option WITH (ndistinct).
Author: Tomas Vondra, David Rowley
Reviewed-by: Kyotaro HORIGUCHI, Álvaro Herrera, Dean Rasheed, Robert Haas
and many other comments and contributions
Discussion: https://postgr.es/m/56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com
pull/3/merge
parent
00b6b6feb1
commit
2686ee1b7c
@ -0,0 +1,119 @@ |
||||
Soft functional dependencies |
||||
============================ |
||||
|
||||
Functional dependencies are a concept well described in relational theory, |
||||
particularly in the definition of normalization and "normal forms". Wikipedia |
||||
has a nice definition of a functional dependency [1]: |
||||
|
||||
In a given table, an attribute Y is said to have a functional dependency |
||||
on a set of attributes X (written X -> Y) if and only if each X value is |
||||
associated with precisely one Y value. For example, in an "Employee" |
||||
table that includes the attributes "Employee ID" and "Employee Date of |
||||
Birth", the functional dependency |
||||
|
||||
{Employee ID} -> {Employee Date of Birth} |
||||
|
||||
would hold. It follows from the previous two sentences that each |
||||
{Employee ID} is associated with precisely one {Employee Date of Birth}. |
||||
|
||||
[1] https://en.wikipedia.org/wiki/Functional_dependency |
||||
|
||||
In practical terms, functional dependencies mean that a value in one column |
||||
determines values in some other column. Consider for example this trivial |
||||
table with two integer columns: |
||||
|
||||
CREATE TABLE t (a INT, b INT) |
||||
AS SELECT i, i/10 FROM generate_series(1,100000) s(i); |
||||
|
||||
Clearly, knowledge of the value in column 'a' is sufficient to determine the |
||||
value in column 'b', as it's simply (a/10). A more practical example may be |
||||
addresses, where the knowledge of a ZIP code (usually) determines city. Larger |
||||
cities may have multiple ZIP codes, so the dependency can't be reversed. |
||||
|
||||
Many datasets might be normalized not to contain such dependencies, but often |
||||
it's not practical for various reasons. In some cases, it's actually a conscious |
||||
design choice to model the dataset in a denormalized way, either because of |
||||
performance or to make querying easier. |
||||
|
||||
|
||||
Soft dependencies |
||||
----------------- |
||||
|
||||
Real-world data sets often contain data errors, either because of data entry |
||||
mistakes (user mistyping the ZIP code) or perhaps issues in generating the |
||||
data (e.g. a ZIP code mistakenly assigned to two cities in different states). |
||||
|
||||
A strict implementation would either ignore dependencies in such cases, |
||||
rendering the approach mostly useless even for slightly noisy data sets, or |
||||
result in sudden changes in behavior depending on minor differences between |
||||
samples provided to ANALYZE. |
||||
|
||||
For this reason, the statistics implements "soft" functional dependencies, |
||||
associating each functional dependency with a degree of validity (a number |
||||
between 0 and 1). This degree is then used to combine selectivities in a |
||||
smooth manner. |
||||
|
||||
|
||||
Mining dependencies (ANALYZE) |
||||
----------------------------- |
||||
|
||||
The current algorithm is fairly simple - generate all possible functional |
||||
dependencies, and for each one count the number of rows consistent with it. |
||||
Then use the fraction of rows (supporting/total) as the degree. |
||||
|
||||
To count the rows consistent with the dependency (a => b): |
||||
|
||||
(a) Sort the data lexicographically, i.e. first by 'a' then 'b'. |
||||
|
||||
(b) For each group of rows with the same 'a' value, count the number of |
||||
distinct values in 'b'. |
||||
|
||||
(c) If there's a single distinct value in 'b', the rows are consistent with |
||||
the functional dependency, otherwise they contradict it. |
||||
|
||||
The algorithm also requires a minimum size of the group to consider it |
||||
consistent (currently 3 rows in the sample). Small groups make it less likely |
||||
to break the consistency. |
||||
|
||||
|
||||
Clause reduction (planner/optimizer) |
||||
------------------------------------ |
||||
|
||||
Applying the functional dependencies is fairly simple - given a list of |
||||
equality clauses, we compute selectivities of each clause and then use the |
||||
degree to combine them using this formula |
||||
|
||||
P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?)) |
||||
|
||||
Where 'd' is the degree of functional dependence (a=>b). |
||||
|
||||
With more than two equality clauses, this process happens recursively. For |
||||
example for (a,b,c) we first use (a,b=>c) to break the computation into |
||||
|
||||
P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(b=?)) |
||||
|
||||
and then apply (a=>b) the same way on P(a=?,b=?). |
||||
|
||||
|
||||
Consistency of clauses |
||||
---------------------- |
||||
|
||||
Functional dependencies only express general dependencies between columns, |
||||
without referencing particular values. This assumes that the equality clauses |
||||
are in fact consistent with the functional dependency, i.e. that given a |
||||
dependency (a=>b), the value in (b=?) clause is the value determined by (a=?). |
||||
If that's not the case, the clauses are "inconsistent" with the functional |
||||
dependency and the result will be over-estimation. |
||||
|
||||
This may happen, for example, when using conditions on the ZIP code and city |
||||
name with mismatching values (ZIP code for a different city), etc. In such a |
||||
case, the result set will be empty, but we'll estimate the selectivity using |
||||
the ZIP code condition. |
||||
|
||||
In this case, the default estimation based on AVIA principle happens to work |
||||
better, but mostly by chance. |
||||
|
||||
This issue is the price for the simplicity of functional dependencies. If the |
||||
application frequently constructs queries with clauses inconsistent with |
||||
functional dependencies present in the data, the best solution is not to |
||||
use functional dependencies, but one of the more complex types of statistics. |
File diff suppressed because it is too large
Load Diff
Loading…
Reference in new issue