mirror of https://github.com/postgres/postgres
Thanks to Christopher Kings-Lynne <chris.kingslynne@gmail.com> for initial version and Jeff Davis <pgsql@j-davis.com> for inspection
REL8_2_STABLE
parent
4eef745fb1
commit
0ca9907ce4
@@ -0,0 +1,231 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.1 2006/09/14 11:16:27 teodor Exp $ -->

<chapter id="GIN">
<title>GIN Indexes</title>

<indexterm>
<primary>index</primary>
<secondary>GIN</secondary>
</indexterm>

<sect1 id="gin-intro">
<title>Introduction</title>

<para>
<acronym>GIN</acronym> stands for Generalized Inverted Index. It is
an index structure storing a set of (key, posting list) pairs, where
a 'posting list' is the set of rows in which the key occurs. Each
row may contain many keys.
</para>
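
<para>
The (key, posting list) structure described above can be illustrated with
a small toy model. This is only a sketch in Python, not how
<productname>PostgreSQL</productname> stores the index; the function name
<literal>build_inverted_index</literal> and the sample rows are assumptions
made up for the illustration.
</para>

<programlisting><![CDATA[
```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each key to its posting list: the sorted ids of rows containing it."""
    postings = defaultdict(set)
    for row_id, keys in rows.items():
        for key in keys:
            postings[key].add(row_id)
    return {key: sorted(ids) for key, ids in postings.items()}


# Hypothetical table: row id -> keys extracted from the indexed value.
rows = {
    1: ["cat", "dog"],
    2: ["dog", "fish"],
    3: ["cat", "dog", "bird"],
}
index = build_inverted_index(rows)
print(index["dog"])  # [1, 2, 3]
```
]]></programlisting>

<para>
Row 3 appears in three posting lists, which shows how one row can
contribute many keys to the index.
</para>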

<para>
It is generalized in the sense that a <acronym>GIN</acronym> index
does not need to be aware of the operation that it accelerates.
Instead, it uses custom strategies defined for particular data types.
</para>

<para>
One advantage of <acronym>GIN</acronym> is that it allows the development
of custom data types with the appropriate access methods by
an expert in the domain of the data type, rather than by a database expert.
This is much the same advantage as using <acronym>GiST</acronym>.
</para>

<para>
The <acronym>GIN</acronym>
implementation in <productname>PostgreSQL</productname> is primarily
maintained by Teodor Sigaev and Oleg Bartunov; more
information is available on their
<ulink url="http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin">website</ulink>.
</para>

</sect1>

<sect1 id="gin-extensibility">
<title>Extensibility</title>

<para>
The <acronym>GIN</acronym> interface has a high level of abstraction,
requiring the access method implementer to implement only the semantics of
the data type being accessed. The <acronym>GIN</acronym> layer itself
takes care of concurrency, logging and searching the tree structure.
</para>

<para>
All it takes to get a <acronym>GIN</acronym> access method working
is to implement four user-defined methods, which define the behavior of
keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
with generality, code reuse, and a clean interface.
</para>

</sect1>

<sect1 id="gin-implementation">
<title>Implementation</title>

<para>
Internally, <acronym>GIN</acronym> consists of a B-tree index constructed
over keys, where each key is an element of the indexed value
(an element of an array, for example) and where each tuple in a leaf page is
either a pointer to a B-tree over heap pointers (PT, posting tree), or a
list of heap pointers (PL, posting list) if the list is small enough.
</para>

<para>
There are four methods that an index operator class for
<acronym>GIN</acronym> must provide (prototypes are in pseudocode):
</para>

<variablelist>
<varlistentry>
<term>int compare( Datum a, Datum b )</term>
<listitem>
<para>
Compares keys (not indexed values!) and returns an integer less than
zero, zero, or greater than zero, indicating whether the first key is
less than, equal to, or greater than the second.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
<listitem>
<para>
Returns an array of keys of the value to be indexed. The number of
returned keys must be stored in nkeys.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Datum* extractQuery(Datum query, uint32 *nkeys,
StrategyNumber n)</term>
<listitem>
<para>
Returns an array of keys of the query to be executed. The number of
returned keys must be stored in nkeys. n contains the
strategy number of the operation (see <xref linkend="xindex-strategies">).
Depending on n, query may be of a different type.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>bool consistent( bool check[], StrategyNumber n, Datum query)</term>
<listitem>
<para>
Returns TRUE if the indexed value satisfies the query qualifier with
strategy n (or may satisfy it, if the operator is marked RECHECK in the
operator class).
Each element of the check array is TRUE if the indexed value has the
corresponding key of the query: if (check[i] == TRUE) then the i-th key of
the query is present in the indexed value.
</para>
</listitem>
</varlistentry>

</variablelist>
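
<para>
How the four methods cooperate can be sketched with a toy model for
arrays of integers. This is a Python illustration only, not the
<productname>PostgreSQL</productname> C interface: the strategy numbers
<literal>CONTAINS</literal> and <literal>OVERLAPS</literal> and the helper
<literal>matches</literal> are hypothetical, and the key count that the
pseudocode returns through nkeys is simply the length of the returned list.
</para>

<programlisting><![CDATA[
```python
CONTAINS, OVERLAPS = 1, 2  # hypothetical strategy numbers for this sketch


def compare(a, b):
    """Order two keys: negative, zero, or positive."""
    return (a > b) - (a < b)


def extract_value(input_value):
    """Return the distinct keys of an indexed value (elements of a list)."""
    return sorted(set(input_value))


def extract_query(query, strategy):
    """Return the keys to look up; both toy strategies use the same keys."""
    return sorted(set(query))


def consistent(check, strategy, query):
    """check[i] is True when the i-th query key occurs in the indexed value."""
    if strategy == CONTAINS:
        return all(check)
    if strategy == OVERLAPS:
        return any(check)
    raise ValueError("unknown strategy")


def matches(indexed_value, query, strategy):
    """Hypothetical driver: build the check array, then ask consistent()."""
    value_keys = extract_value(indexed_value)
    query_keys = extract_query(query, strategy)
    check = [any(compare(q, k) == 0 for k in value_keys) for q in query_keys]
    return consistent(check, strategy, query)


print(matches([1, 2, 3], [2, 3], CONTAINS))  # True
print(matches([1, 2, 3], [3, 4], CONTAINS))  # False
print(matches([1, 2, 3], [3, 4], OVERLAPS))  # True
```
]]></programlisting>

<para>
In this sketch only <literal>consistent</literal> knows what a strategy
means; key extraction and key comparison are independent of the operation
being accelerated.
</para>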

</sect1>

<sect1 id="gin-tips">
<title>GIN tips and tricks</title>

<variablelist>
<varlistentry>
<term>Create vs insert</term>
<listitem>
<para>
In most cases, insertion into a <acronym>GIN</acronym> index is slow because
many GIN keys may be inserted for each table row. So, when loading data
in bulk, it may be useful to drop the index and recreate it
after the data is loaded into the table.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>gin_fuzzy_search_limit</term>
<listitem>
<para>
The primary goal of developing <acronym>GIN</acronym> indexes was to
support highly scalable full-text search in
<productname>PostgreSQL</productname>, and a full-text search often
returns a very large set of results. Reading those
tuples from disk and sorting them can take a long time, which is
unacceptable in production. (Note that the index search itself is very
fast.)
</para>
<para>
Such queries usually contain very frequent words, so the results are not
very helpful. To make such queries practical,
<acronym>GIN</acronym> has a configurable soft upper limit on the size
of the returned set, determined by the
<varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0
(no limit) by default.
</para>
<para>
If a non-zero search limit is set, then the returned set is a subset of
the whole result set, chosen at random.
</para>
<para>
"Soft" means that the actual number of returned results may differ
slightly from the specified limit, depending on the query and the quality
of the system's random number generator.
</para>
</listitem>
</varlistentry>
</variablelist>
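
<para>
The idea of a soft limit that returns a random subset can be illustrated
with a small simulation. The sketch below is an assumption made for
illustration and is not the algorithm <acronym>GIN</acronym> actually uses:
it keeps each result independently with probability limit/total, so the
returned count is near the limit but rarely equal to it. The function name
<literal>soft_limit</literal> is hypothetical.
</para>

<programlisting><![CDATA[
```python
import random


def soft_limit(results, limit, rng):
    """Keep each result with probability limit/len(results); 0 means no limit."""
    if limit == 0 or len(results) <= limit:
        return list(results)
    keep_probability = limit / len(results)
    return [r for r in results if rng.random() < keep_probability]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
all_matches = list(range(100_000))
subset = soft_limit(all_matches, 1000, rng)
print(len(subset))  # close to 1000, but usually not exactly 1000
```
]]></programlisting>

<para>
With a limit of 0 the function returns every result, matching the
default setting described above.
</para>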

</sect1>

<sect1 id="gin-limit">
<title>Limitations</title>

<para>
<acronym>GIN</acronym> doesn't support full index scans because they
would be extremely inefficient: since each value has many keys,
each heap pointer would be returned several times.
</para>

<para>
When extractQuery returns zero keys, <acronym>GIN</acronym> will
emit an error: the meaning of an empty query differs between operator
classes and strategies (for example, any array contains the empty array,
but no array overlaps the empty array), so <acronym>GIN</acronym> cannot
choose a reasonable answer.
</para>

<para>
<acronym>GIN</acronym> searches keys only by equality matching. This may
be improved in the future.
</para>
</sect1>
<sect1 id="gin-examples">
<title>Examples</title>

<para>
The <productname>PostgreSQL</productname> source distribution includes
<acronym>GIN</acronym> operator classes for one-dimensional arrays of all
internal types. The following
<filename>contrib</> modules also contain <acronym>GIN</acronym>
operator classes:
</para>

<variablelist>
<varlistentry>
<term>intarray</term>
<listitem>
<para>Enhanced support for int4[]</para>
</listitem>
</varlistentry>

<varlistentry>
<term>tsearch2</term>
<listitem>
<para>Support for inverted text indexing. This is much faster for very
large, mostly-static sets of documents.
</para>
</listitem>
</varlistentry>

</variablelist>
</sect1>
</chapter>