mirror of https://github.com/postgres/postgres
Thanks to Christopher Kings-Lynne <chris.kingslynne@gmail.com> for initial version and Jeff Davis <pgsql@j-davis.com> for inspection
REL8_2_STABLE
parent
4eef745fb1
commit
0ca9907ce4
@ -0,0 +1,231 @@ |
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.1 2006/09/14 11:16:27 teodor Exp $ --> |
||||
|
||||
<chapter id="GIN"> |
||||
<title>GIN Indexes</title> |
||||
|
||||
<indexterm> |
||||
<primary>index</primary> |
||||
<secondary>GIN</secondary> |
||||
</indexterm> |
||||
|
||||
<sect1 id="gin-intro"> |
||||
<title>Introduction</title> |
||||
|
||||
<para> |
||||
<acronym>GIN</acronym> stands for Generalized Inverted Index. It is
an index structure storing a set of (key, posting list) pairs, where
a <quote>posting list</quote> is the set of rows in which the key occurs.
A row may contain many keys, and so may appear in many posting lists.
||||
</para> |
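<para>
The (key, posting list) structure described above can be pictured with a
small in-memory sketch. This is only an illustration: the fixed-size arrays,
the integer keys and row ids, and every <literal>toy_</literal> name are
assumptions made for the example, not part of <acronym>GIN</acronym> or its
on-disk format.
</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

/* Hypothetical toy capacities; a real index has no such fixed limits. */
#define TOY_MAX_KEYS 16
#define TOY_MAX_ROWS 16

/* One (key, posting list) pair. */
typedef struct
{
    int         key;                    /* a key extracted from a row */
    int         rows[TOY_MAX_ROWS];     /* posting list: ids of rows with key */
    size_t      nrows;
} ToyEntry;

typedef struct
{
    ToyEntry    entries[TOY_MAX_KEYS];
    size_t      nentries;
} ToyInvertedIndex;

/* Return the entry for key, or NULL if the key does not occur. */
static ToyEntry *
toy_lookup(ToyInvertedIndex *idx, int key)
{
    for (size_t i = 0; i < idx->nentries; i++)
        if (idx->entries[i].key == key)
            return &idx->entries[i];
    return NULL;
}

/*
 * Index one row: the row id is appended to the posting list of each of
 * the row's keys.  Returns 0 on success, -1 if a toy capacity is exceeded.
 */
static int
toy_insert_row(ToyInvertedIndex *idx, int row_id,
               const int *keys, size_t nkeys)
{
    for (size_t i = 0; i < nkeys; i++)
    {
        ToyEntry   *e = toy_lookup(idx, keys[i]);

        if (e == NULL)
        {
            if (idx->nentries == TOY_MAX_KEYS)
                return -1;
            e = &idx->entries[idx->nentries++];
            e->key = keys[i];
            e->nrows = 0;
        }
        /* A key repeated within the same row is recorded only once. */
        if (e->nrows > 0 && e->rows[e->nrows - 1] == row_id)
            continue;
        if (e->nrows == TOY_MAX_ROWS)
            return -1;
        e->rows[e->nrows++] = row_id;
    }
    return 0;
}
```
]]></programlisting>

<para>
Inserting a row with keys 1, 2, 3 and another with keys 2, 3 leaves the
posting lists for keys 2 and 3 holding both row ids, while the posting list
for key 1 holds only the first.
</para>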
||||
|
||||
<para> |
||||
It is generalized in the sense that a <acronym>GIN</acronym> index |
||||
does not need to be aware of the operation that it accelerates. |
||||
Instead, it uses custom strategies defined for particular data types. |
||||
</para> |
||||
|
||||
<para> |
||||
One advantage of <acronym>GIN</acronym> is that it allows the development |
||||
of custom data types with the appropriate access methods, by |
||||
an expert in the domain of the data type, rather than a database expert. |
||||
This is much the same advantage as using <acronym>GiST</acronym>. |
||||
</para> |
||||
|
||||
<para> |
||||
The <acronym>GIN</acronym> |
||||
implementation in <productname>PostgreSQL</productname> is primarily |
||||
maintained by Teodor Sigaev and Oleg Bartunov, and there is more |
||||
information on their |
||||
<ulink url="http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin">website</ulink>. |
||||
</para> |
||||
|
||||
</sect1> |
||||
|
||||
<sect1 id="gin-extensibility"> |
||||
<title>Extensibility</title> |
||||
|
||||
<para> |
||||
The <acronym>GIN</acronym> interface has a high level of abstraction,
requiring the access method implementer to implement only the semantics of
the data type being accessed. The <acronym>GIN</acronym> layer itself
takes care of concurrency, logging and searching the tree structure.
||||
</para> |
||||
|
||||
<para> |
||||
All it takes to get a <acronym>GIN</acronym> access method working
is to implement four user-defined methods, which define the behavior of
keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
with generality, code reuse, and a clean interface.
||||
</para> |
||||
|
||||
</sect1> |
||||
|
||||
<sect1 id="gin-implementation"> |
||||
<title>Implementation</title> |
||||
|
||||
<para> |
||||
Internally, <acronym>GIN</acronym> consists of a B-tree index constructed
over keys, where each key is an element of the indexed value
(an element of an array, for example) and where each tuple in a leaf page is
either a pointer to a B-tree over heap pointers (PT, posting tree), or a
list of heap pointers (PL, posting list) if the list is small enough.
||||
</para> |
||||
|
||||
<para> |
||||
There are four methods that an index operator class for |
||||
<acronym>GIN</acronym> must provide (prototypes are in pseudocode): |
||||
</para> |
||||
|
||||
<variablelist> |
||||
<varlistentry> |
||||
<term>int compare( Datum a, Datum b )</term> |
||||
<listitem> |
||||
<para> |
||||
Compares keys (not indexed values!) and returns an integer less than |
||||
zero, zero, or greater than zero, indicating whether the first key is |
||||
less than, equal to, or greater than the second. |
||||
</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
|
||||
<varlistentry> |
||||
<term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term> |
||||
<listitem> |
||||
<para> |
||||
Returns an array of keys of the value to be indexed. nkeys should
be set to the number of returned keys.
||||
</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
|
||||
<varlistentry> |
||||
<term>Datum* extractQuery(Datum query, uint32 *nkeys,
StrategyNumber n)</term>
||||
<listitem> |
||||
<para> |
||||
Returns an array of keys of the query to be executed. n contains the
strategy number of the operation (see <xref linkend="xindex-strategies">).
Depending on n, query may be of a different type.
||||
</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
|
||||
<varlistentry> |
||||
<term>bool consistent( bool check[], StrategyNumber n, Datum query)</term> |
||||
<listitem> |
||||
<para> |
||||
Returns TRUE if the indexed value satisfies the query qualifier with
strategy n (or may satisfy it, if the operator is marked RECHECK in the
operator class). Each element of the check array is TRUE if the indexed
value has the corresponding key of the query: if (check[i] == TRUE), the
i-th key of the query is present in the indexed value.
||||
</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
|
||||
</variablelist> |
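<para>
To make the division of labor concrete, here is a minimal sketch of the four
methods for a toy integer-array type. It is written to compile outside
<productname>PostgreSQL</productname>, so <literal>Datum</literal> and
<literal>StrategyNumber</literal> are redefined as simple stand-ins, the
strategy numbers are invented for the example, and every
<literal>toy_</literal> name is hypothetical. The
<literal>toy_matches</literal> helper only imitates the role of the
<acronym>GIN</acronym> layer by filling in the check array; a real index
looks the query keys up in the tree instead of re-extracting keys from the
stored value.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins so the sketch compiles outside PostgreSQL. */
typedef intptr_t Datum;
typedef uint16_t StrategyNumber;

/* Toy indexed value: an array of ints; each element is one key. */
typedef struct
{
    uint32_t    nelems;
    const int  *elems;
} ToyIntArray;

/* Toy strategy numbers, chosen for this sketch only. */
enum { TOY_OVERLAP = 1, TOY_CONTAINS = 2 };

/* compare: orders two keys (array elements), not whole arrays. */
static int
toy_compare(Datum a, Datum b)
{
    return (a > b) - (a < b);
}

/* Shared helper: one key per element; the key count goes to *nkeys. */
static Datum *
toy_extract(const ToyIntArray *arr, uint32_t *nkeys)
{
    Datum      *keys;

    assert(arr != NULL && nkeys != NULL);
    *nkeys = 0;
    if (arr->nelems == 0)
        return NULL;
    keys = malloc(arr->nelems * sizeof(Datum));
    if (keys == NULL)
        return NULL;
    for (uint32_t i = 0; i < arr->nelems; i++)
        keys[i] = (Datum) arr->elems[i];
    *nkeys = arr->nelems;
    return keys;
}

/* extractValue: keys of the value being indexed. */
static Datum *
toy_extractValue(Datum inputValue, uint32_t *nkeys)
{
    return toy_extract((const ToyIntArray *) inputValue, nkeys);
}

/* extractQuery: keys of the query; both toy strategies take an array. */
static Datum *
toy_extractQuery(Datum query, uint32_t *nkeys, StrategyNumber n)
{
    (void) n;
    return toy_extract((const ToyIntArray *) query, nkeys);
}

/* consistent: "contains" needs every query key, "overlap" needs any. */
static bool
toy_consistent(const bool check[], StrategyNumber n, Datum query)
{
    const ToyIntArray *q = (const ToyIntArray *) query;
    bool        any = false;
    bool        all = true;

    for (uint32_t i = 0; i < q->nelems; i++)
    {
        any = any || check[i];
        all = all && check[i];
    }
    return (n == TOY_CONTAINS) ? all : any;
}

/* Stand-in for the GIN layer: build check[] and ask toy_consistent. */
static bool
toy_matches(const ToyIntArray *value, const ToyIntArray *query,
            StrategyNumber n)
{
    uint32_t    nv, nq;
    Datum      *vkeys = toy_extractValue((Datum) value, &nv);
    Datum      *qkeys = toy_extractQuery((Datum) query, &nq, n);
    bool       *check = calloc(query->nelems ? query->nelems : 1, sizeof(bool));
    bool        result;

    assert(check != NULL);
    for (uint32_t i = 0; i < nq; i++)
        for (uint32_t j = 0; j < nv; j++)
            if (toy_compare(qkeys[i], vkeys[j]) == 0)
                check[i] = true;
    result = toy_consistent(check, n, (Datum) query);
    free(check);
    free(vkeys);
    free(qkeys);
    return result;
}
```
]]></programlisting>

<para>
Note that <literal>toy_consistent</literal> never sees the indexed value
itself, only which query keys were found for it. This is what lets the
generic layer stay unaware of the operation being accelerated.
</para>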
||||
|
||||
</sect1> |
||||
|
||||
<sect1 id="gin-tips"> |
||||
<title>GIN tips and tricks</title>
||||
|
||||
<variablelist> |
||||
<varlistentry> |
||||
<term>Create vs insert</term> |
||||
<listitem> |
||||
<para> |
||||
In most cases, insertion into a <acronym>GIN</acronym> index is slow because
many GIN keys may be inserted for each table row. So, when loading data
in bulk, it may be useful to drop the index and recreate it
after the data is loaded into the table.
||||
</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
|
||||
<varlistentry> |
||||
<term>gin_fuzzy_search_limit</term> |
||||
<listitem> |
||||
<para> |
||||
The primary goal in developing <acronym>GIN</acronym> indexes was
support for highly scalable full-text search in
<productname>PostgreSQL</productname>, and a full-text search often
returns a very large set of results. Reading that many
tuples from disk and sorting them can take a long time, which is
unacceptable in production. (Note that the index search itself is very
fast.)
||||
</para> |
||||
<para> |
||||
Such queries usually contain very frequent words, so the results are not
very helpful. To facilitate execution of such queries,
<acronym>GIN</acronym> has a configurable soft upper limit on the size
of the returned set, determined by the
<varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by
default (no limit).
||||
</para> |
||||
<para> |
||||
If a non-zero search limit is set, then the returned set is a subset of |
||||
the whole result set, chosen at random. |
||||
</para> |
||||
<para> |
||||
"Soft" means that the actual number of returned results could slightly |
||||
differ from the specified limit, depending on the query and the quality |
||||
of the system's random number generator. |
||||
</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
</variablelist>
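<para>
One simple way to obtain a random subset with a soft size limit is to keep
each candidate independently with probability limit / total. The sketch
below illustrates that idea only; it is not claimed to be how
<acronym>GIN</acronym> implements
<varname>gin_fuzzy_search_limit</varname>. It assumes the total result count
is known in advance and is smaller than 2^32, and the function names and the
xorshift generator are choices made for the example.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32: a small deterministic generator; state must be non-zero. */
static uint32_t
toy_rand(uint32_t *state)
{
    uint32_t    x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Copy a random subset of in[0..n) into out (capacity n), aiming for
 * about "limit" items.  limit == 0 means no limit.  Returns the number
 * of items kept, which may be somewhat above or below the limit.
 */
static size_t
toy_soft_limit(const int *in, size_t n, size_t limit,
               uint32_t seed, int *out)
{
    uint32_t    state = seed ? seed : 1u;
    size_t      kept = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
    {
        /* Keep item i with probability limit / n (always, if unlimited). */
        if (limit == 0 || limit >= n ||
            (uint64_t) toy_rand(&state) * n < (uint64_t) limit * UINT32_MAX)
            out[kept++] = in[i];
    }
    return kept;
}
```
]]></programlisting>

<para>
Because every candidate is decided independently, the size of the returned
subset varies around the limit from run to run, which matches the
<quote>soft</quote> behavior described above.
</para>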
||||
|
||||
</sect1> |
||||
|
||||
<sect1 id="gin-limit"> |
||||
<title>Limitations</title> |
||||
|
||||
<para> |
||||
<acronym>GIN</acronym> doesn't support full index scans because they would
be extremely inefficient: since there are often many keys per value,
each heap pointer would be returned several times.
||||
</para> |
||||
|
||||
<para> |
||||
When extractQuery returns zero keys, <acronym>GIN</acronym> will
emit an error. The meaning of an empty query differs between operator
classes and strategies (for example, any array contains the empty array,
but no array overlaps the empty array), so <acronym>GIN</acronym> cannot
choose a reasonable answer.
||||
</para> |
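<para>
The array example can be checked with two small predicates. This is a
self-contained sketch using plain C arrays; the <literal>toy_</literal>
names are hypothetical and are not <productname>PostgreSQL</productname>
functions. For a query array with no elements,
<quote>contains</quote> is vacuously true while <quote>overlaps</quote> is
always false, so no single answer suits both strategies.
</para>

<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <stddef.h>

/* True if x occurs in a[0..na). */
static bool
toy_has(const int *a, size_t na, int x)
{
    for (size_t i = 0; i < na; i++)
        if (a[i] == x)
            return true;
    return false;
}

/* a contains b: every element of b occurs in a (true when b is empty). */
static bool
toy_contains(const int *a, size_t na, const int *b, size_t nb)
{
    for (size_t i = 0; i < nb; i++)
        if (!toy_has(a, na, b[i]))
            return false;
    return true;
}

/* a overlaps b: some element of b occurs in a (false when b is empty). */
static bool
toy_overlaps(const int *a, size_t na, const int *b, size_t nb)
{
    for (size_t i = 0; i < nb; i++)
        if (toy_has(a, na, b[i]))
            return true;
    return false;
}
```
]]></programlisting>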
||||
|
||||
<para> |
||||
<acronym>GIN</acronym> searches keys only by equality matching. This may
be improved in the future.
||||
</para> |
||||
</sect1> |
||||
<sect1 id="gin-examples"> |
||||
<title>Examples</title> |
||||
|
||||
<para> |
||||
The <productname>PostgreSQL</productname> source distribution includes |
||||
<acronym>GIN</acronym> classes for one-dimensional arrays of all internal |
||||
types. The following |
||||
<filename>contrib</> modules also contain <acronym>GIN</acronym> |
||||
operator classes: |
||||
</para> |
||||
|
||||
<variablelist> |
||||
<varlistentry> |
||||
<term>intarray</term> |
||||
<listitem> |
||||
<para>Enhanced support for int4[]</para> |
||||
</listitem> |
||||
</varlistentry> |
||||
|
||||
<varlistentry> |
||||
<term>tsearch2</term> |
||||
<listitem> |
||||
<para>Support for inverted text indexing. This is much faster for very |
||||
large, mostly-static sets of documents. |
||||
</para> |
||||
</listitem> |
||||
</varlistentry>

</variablelist>

</sect1>
||||
|
||||
</chapter> |
||||