Editorial overhaul for text search documentation. Organize the info
more clearly, improve a lot of unclear descriptions, add some missing
material.  We still need a migration guide though.
REL8_3_STABLE
Tom Lane 18 years ago
parent 6cb00e67ef
commit dfc6f130b4
  1. 329
      doc/src/sgml/datatype.sgml
  2. 1252
      doc/src/sgml/func.sgml
  3. 3232
      doc/src/sgml/textsearch.sgml

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.210 2007/10/13 23:06:26 tgl Exp $ --> <!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.211 2007/10/21 20:04:37 tgl Exp $ -->
<chapter id="datatype"> <chapter id="datatype">
<title id="datatype-title">Data Types</title> <title id="datatype-title">Data Types</title>
@ -237,13 +237,13 @@
<row> <row>
<entry><type>tsquery</type></entry> <entry><type>tsquery</type></entry>
<entry></entry> <entry></entry>
<entry>full text search query</entry> <entry>text search query</entry>
</row> </row>
<row> <row>
<entry><type>tsvector</type></entry> <entry><type>tsvector</type></entry>
<entry></entry> <entry></entry>
<entry>full text search document</entry> <entry>text search document</entry>
</row> </row>
<row> <row>
@ -3232,73 +3232,46 @@ SELECT * FROM test;
</para> </para>
</sect1> </sect1>
<sect1 id="datatype-uuid"> <sect1 id="datatype-textsearch">
<title><acronym>UUID</acronym> Type</title> <title>Text Search Types</title>
<indexterm zone="datatype-uuid"> <indexterm zone="datatype-textsearch">
<primary>UUID</primary> <primary>full text search</primary>
<secondary>data types</secondary>
</indexterm> </indexterm>
<para> <indexterm zone="datatype-textsearch">
The data type <type>uuid</type> stores Universally Unique <primary>text search</primary>
Identifiers (UUID) as per RFC 4122, ISO/IEC 9834-8:2005, and <secondary>data types</secondary>
related standards. (Some systems refer to this data type as </indexterm>
globally unique
identifier/GUID<indexterm><primary>GUID</primary></indexterm>
instead.) Such an identifier is a 128-bit quantity that is
generated by a suitable algorithm so that it is very unlikely to
be generated by anyone else in the known universe using the same
algorithm. Therefore, for distributed systems, these identifiers
provide a better uniqueness guarantee than that which can be
achieved using sequence generators, which are only unique within a
single database.
</para>
<para> <para>
A UUID is written as a sequence of lower-case hexadecimal digits, <productname>PostgreSQL</productname> provides two data types that
in several groups separated by hyphens, specifically a group of 8 are designed to support full text search, which is the activity of
digits followed by three groups of 4 digits followed by a group of searching through a collection of natural-language <firstterm>documents</>
12 digits, for a total of 32 digits representing the 128 bits. An to locate those that best match a <firstterm>query</>.
example of a UUID in this standard form is: The <type>tsvector</type> type represents a document in a form suited
<programlisting> for text search, while the <type>tsquery</type> type similarly represents
a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11 a query.
</programlisting> <xref linkend="textsearch"> provides a detailed explanation of this
PostgreSQL also accepts the following alternative forms for input: facility, and <xref linkend="functions-textsearch"> summarizes the
use of upper-case digits, the standard format surrounded by related functions and operators.
braces, and omitting the hyphens. Examples are:
<programlisting>
A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
a0eebc999c0b4ef8bb6d6bb9bd380a11
</programlisting>
Output is always in the standard form.
</para> </para>
<para> <sect2 id="datatype-tsvector">
To generate UUIDs, the contrib module <literal>uuid-ossp</literal> <title><type>tsvector</type></title>
provides functions that implement the standard algorithms.
Alternatively, UUIDs could be generated by client applications or
other libraries invoked through a server-side function.
</para>
</sect1>
<sect1 id="datatype-textsearch"> <indexterm>
<title>Full Text Search</title> <primary>tsvector (data type)</primary>
</indexterm>
<variablelist>
<varlistentry>
<term><firstterm>tsvector</firstterm></term>
<listitem>
<para> <para>
<type>tsvector</type> A <type>tsvector</type> value is a sorted list of distinct
<indexterm><primary>tsvector</primary></indexterm> is a data type <firstterm>lexemes</>, which are words that have been
that represents a document and is optimized for full text searching. <firstterm>normalized</> to make different variants of the same word look
In the simplest case, <type>tsvector</type> is a sorted list of alike (see <xref linkend="textsearch"> for details). Sorting and
lexemes, so even without indexes full text searches perform better duplicate-elimination are done automatically during input, as shown in
than standard <literal>~</literal> and <literal>LIKE</literal> this example:
operations:
<programlisting> <programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector; SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
@ -3307,17 +3280,30 @@ SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat' 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
</programlisting> </programlisting>
Notice, that <literal>space</literal> is also a lexeme: (As the example shows, the sorting is first by length and then
alphabetically, but that detail is seldom important.) To represent
lexemes containing whitespace, surround them with quotes:
<programlisting> <programlisting>
SELECT 'space '' '' is a lexeme'::tsvector; SELECT $$the lexeme ' ' contains spaces$$::tsvector;
tsvector tsvector
---------------------------------- -------------------------------------------
'a' 'is' ' ' 'space' 'lexeme' 'the' ' ' 'lexeme' 'spaces' 'contains'
</programlisting> </programlisting>
Each lexeme, optionally, can have positional information which is used for (We use dollar-quoted string literals in this example and the next one,
<varname>proximity ranking</varname>: to avoid confusing matters by having to double quote marks within the
literals.) Embedded quotes can be handled by doubling them:
<programlisting>
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
tsvector
------------------------------------------------
'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
</programlisting>
Optionally, integer <firstterm>position(s)</>
can be attached to any or all of the lexemes:
<programlisting> <programlisting>
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector; SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
@ -3326,87 +3312,182 @@ SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::ts
'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</programlisting> </programlisting>
Each lexeme position also can be labeled as <literal>A</literal>, A position normally indicates the source word's location in the
<literal>B</literal>, <literal>C</literal>, <literal>D</literal>, document. Positional information can be used for
where <literal>D</literal> is the default. These labels can be used to group <firstterm>proximity ranking</firstterm>. Position values can
lexemes into different <emphasis>importance</emphasis> or range from 1 to 16383; larger numbers are silently clamped to 16383.
<emphasis>rankings</emphasis>, for example to reflect document structure. Duplicate position entries are discarded.
Actual values can be assigned at search time and used during the calculation </para>
of the document rank. This is very useful for controlling search results.
</para>
<para> <para>
The concatenation operator, e.g. <literal>tsvector || tsvector</literal>, Lexemes that have positions can further be labeled with a
can "construct" a document from several parts. The order is important if <firstterm>weight</>, which can be <literal>A</literal>,
<type>tsvector</type> contains positional information. Of course, <literal>B</literal>, <literal>C</literal>, or <literal>D</literal>.
it is also possible to build a document using different tables: <literal>D</literal> is the default and hence is not shown on output:
<programlisting> <programlisting>
SELECT 'fat:1 cat:2'::tsvector || 'fat:1 rat:2'::tsvector; SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
?column? tsvector
--------------------------- ----------------------------
'cat':2 'fat':1,3 'rat':4 'a':1A 'cat':5 'fat':2B,4C
</programlisting>
SELECT 'fat:1 rat:2'::tsvector || 'fat:1 cat:2'::tsvector; Weights are typically used to reflect document structure, for example
?column? by marking title words differently from body words. Text search
--------------------------- ranking functions can assign different priorities to the different
'cat':4 'fat':1,3 'rat':2 weight markers.
</para>
<para>
It is important to understand that the
<type>tsvector</type> type itself does not perform any normalization;
it assumes that the words it is given are normalized appropriately
for the application. For example,
<programlisting>
select 'The Fat Rats'::tsvector;
tsvector
--------------------
'Fat' 'The' 'Rats'
</programlisting> </programlisting>
</para> For most English-text-searching applications the above words would
be considered non-normalized, but <type>tsvector</type> doesn't care.
Raw document text should usually be passed through
<function>to_tsvector</> to normalize the words appropriately
for searching:
</listitem> <programlisting>
SELECT to_tsvector('english', 'The Fat Rats');
to_tsvector
-----------------
'fat':2 'rat':3
</programlisting>
</varlistentry> Again, see <xref linkend="textsearch"> for more detail.
</para>
<varlistentry> </sect2>
<term><firstterm>tsquery</firstterm></term>
<listitem>
<para> <sect2 id="datatype-tsquery">
<type>tsquery</type> <title><type>tsquery</type></title>
<indexterm><primary>tsquery</primary></indexterm> is a data type
for textual queries which supports the boolean operators <indexterm>
<literal>&amp;</literal> (AND), <literal>|</literal> (OR), and <primary>tsquery (data type)</primary>
parentheses. A <type>tsquery</type> consists of lexemes (optionally </indexterm>
labeled by letters) with boolean operators in between:
<para>
A <type>tsquery</type> value stores lexemes that are to be
searched for, and combines them using the boolean operators
<literal>&amp;</literal> (AND), <literal>|</literal> (OR), and
<literal>!</> (NOT). Parentheses can be used to enforce grouping
of the operators:
<programlisting> <programlisting>
SELECT 'fat &amp; cat'::tsquery; SELECT 'fat &amp; rat'::tsquery;
tsquery tsquery
--------------- ---------------
'fat' &amp; 'cat' 'fat' &amp; 'rat'
SELECT 'fat &amp; (rat | cat)'::tsquery;
tsquery
---------------------------
'fat' &amp; ( 'rat' | 'cat' )
SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
tsquery
------------------------
'fat' &amp; 'rat' &amp; !'cat'
</programlisting>
In the absence of parentheses, <literal>!</> (NOT) binds most tightly,
and <literal>&amp;</literal> (AND) binds more tightly than
<literal>|</literal> (OR).
</para>
<para>
Optionally, lexemes in a <type>tsquery</type> can be labeled with
one or more weight letters, which restricts them to match only
<type>tsvector</> lexemes with one of those weights:
<programlisting>
SELECT 'fat:ab &amp; cat'::tsquery; SELECT 'fat:ab &amp; cat'::tsquery;
tsquery tsquery
------------------ ------------------
'fat':AB &amp; 'cat' 'fat':AB &amp; 'cat'
</programlisting> </programlisting>
</para>
Labels can be used to restrict the search region, which allows the <para>
development of different search engines using the same full text index. Quoting rules for lexemes are the same as described above for
</para> lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
any required normalization of words must be done before putting
<para> them into the <type>tsquery</> type. The <function>to_tsquery</>
<type>tsqueries</type> can be concatenated using <literal>&amp;&amp;</literal> (AND) function is convenient for performing such normalization:
and <literal>||</literal> (OR) operators:
<programlisting> <programlisting>
SELECT 'a &amp; b'::tsquery &amp;&amp; 'c | d'::tsquery; SELECT to_tsquery('Fat:ab & Cats');
?column? to_tsquery
--------------------------- ------------------
'a' &amp; 'b' &amp; ( 'c' | 'd' ) 'fat':AB & 'cat'
SELECT 'a &amp; b'::tsquery || 'c|d'::tsquery;
?column?
---------------------------
'a' &amp; 'b' | ( 'c' | 'd' )
</programlisting> </programlisting>
</para>
</para> </sect2>
</listitem>
</varlistentry> </sect1>
</variablelist>
<sect1 id="datatype-uuid">
<title><acronym>UUID</acronym> Type</title>
<indexterm zone="datatype-uuid">
<primary>UUID</primary>
</indexterm>
<para>
The data type <type>uuid</type> stores Universally Unique Identifiers
(UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards.
(Some systems refer to this data type as globally unique identifier, or
GUID,<indexterm><primary>GUID</primary></indexterm> instead.) Such an
identifier is a 128-bit quantity that is generated by an algorithm chosen
to make it very unlikely that the same identifier will be generated by
anyone else in the known universe using the same algorithm. Therefore,
for distributed systems, these identifiers provide a better uniqueness
guarantee than that which can be achieved using sequence generators, which
are only unique within a single database.
</para>
<para>
A UUID is written as a sequence of lower-case hexadecimal digits,
in several groups separated by hyphens, specifically a group of 8
digits followed by three groups of 4 digits followed by a group of
12 digits, for a total of 32 digits representing the 128 bits. An
example of a UUID in this standard form is:
<programlisting>
a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
</programlisting>
<productname>PostgreSQL</productname> also accepts the following
alternative forms for input:
use of upper-case digits, the standard format surrounded by
braces, and omitting the hyphens. Examples are:
<programlisting>
A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
a0eebc999c0b4ef8bb6d6bb9bd380a11
</programlisting>
Output is always in the standard form.
</para>
<para>
<productname>PostgreSQL</productname> provides storage and comparison
functions for UUIDs, but the core database does not include any
function for generating UUIDs, because no single algorithm is well
suited for every application. The contrib module
<filename>contrib/uuid-ossp</filename> provides functions that implement
several standard algorithms.
Alternatively, UUIDs could be generated by client applications or
other libraries invoked through a server-side function.
</para>
</sect1> </sect1>
<sect1 id="datatype-xml"> <sect1 id="datatype-xml">

File diff suppressed because it is too large

File diff suppressed because it is too large