Editorial overhaul for text search documentation. Organize the info more
clearly, improve a lot of unclear descriptions, add some missing material.
We still need a migration guide though.
parent 6cb00e67ef
commit dfc6f130b4
3 changed files with 2632 additions and 2181 deletions
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.210 2007/10/13 23:06:26 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.211 2007/10/21 20:04:37 tgl Exp $ -->
 <chapter id="datatype">
  <title id="datatype-title">Data Types</title>
@@ -237,13 +237,13 @@
      <row>
       <entry><type>tsquery</type></entry>
       <entry></entry>
-       <entry>full text search query</entry>
+       <entry>text search query</entry>
      </row>
      <row>
       <entry><type>tsvector</type></entry>
       <entry></entry>
-       <entry>full text search document</entry>
+       <entry>text search document</entry>
      </row>
      <row>
@@ -3232,73 +3232,46 @@ SELECT * FROM test;
   </para>
  </sect1>
-  <sect1 id="datatype-uuid">
+  <sect1 id="datatype-textsearch">
-   <title><acronym>UUID</acronym> Type</title>
+   <title>Text Search Types</title>
-   <indexterm zone="datatype-uuid">
+   <indexterm zone="datatype-textsearch">
-    <primary>UUID</primary>
+    <primary>full text search</primary>
    <secondary>data types</secondary>
   </indexterm>
-   <para>
+   <indexterm zone="datatype-textsearch">
-    The data type <type>uuid</type> stores Universally Unique
+    <primary>text search</primary>
-    Identifiers (UUID) as per RFC 4122, ISO/IEC 9834-8:2005, and
+    <secondary>data types</secondary>
-    related standards.  (Some systems refer to this data type as
+   </indexterm>
    globally unique
    identifier/GUID<indexterm><primary>GUID</primary></indexterm>
    instead.)  Such an identifier is a 128-bit quantity that is
    generated by a suitable algorithm so that it is very unlikely to
    be generated by anyone else in the known universe using the same
    algorithm.  Therefore, for distributed systems, these identifiers
    provide a better uniqueness guarantee than that which can be
    achieved using sequence generators, which are only unique within a
    single database.
   </para>
   <para>
-    A UUID is written as a sequence of lower-case hexadecimal digits,
+    <productname>PostgreSQL</productname> provides two data types that
-    in several groups separated by hyphens, specifically a group of 8
+    are designed to support full text search, which is the activity of
-    digits followed by three groups of 4 digits followed by a group of
+    searching through a collection of natural-language <firstterm>documents</>
-    12 digits, for a total of 32 digits representing the 128 bits.  An
+    to locate those that best match a <firstterm>query</>.
-    example of a UUID in this standard form is:
+    The <type>tsvector</type> type represents a document in a form suited
-<programlisting>
+    for text search, while the <type>tsquery</type> type similarly represents
-a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
+    a query.
-</programlisting>
+    <xref linkend="textsearch"> provides a detailed explanation of this
-    PostgreSQL also accepts the following alternative forms for input:
+    facility, and <xref linkend="functions-textsearch"> summarizes the
-    use of upper-case digits, the standard format surrounded by
+    related functions and operators.
    braces, and omitting the hyphens.  Examples are:
 <programlisting>
 A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
 {a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
 a0eebc999c0b4ef8bb6d6bb9bd380a11
 </programlisting>
    Output is always in the standard form.
   </para>
-   <para>
+   <sect2 id="datatype-tsvector">
-    To generate UUIDs, the contrib module <literal>uuid-ossp</literal>
+    <title><type>tsvector</type></title>
    provides functions that implement the standard algorithms.
    Alternatively, UUIDs could be generated by client applications or
    other libraries invoked through a server-side function.
   </para>
  </sect1>
-  <sect1 id="datatype-textsearch">
+    <indexterm>
-   <title>Full Text Search</title>
+     <primary>tsvector (data type)</primary>
-
+    </indexterm>
   <variablelist>
    <varlistentry>
    <term><firstterm>tsvector</firstterm></term>
     <listitem>
-      <para>
+    <para>
-       <type>tsvector</type>
+     A <type>tsvector</type> value is a sorted list of distinct
-       <indexterm><primary>tsvector</primary></indexterm> is a data type
+     <firstterm>lexemes</>, which are words that have been
-       that represents a document and is optimized for full text searching.
+     <firstterm>normalized</> to make different variants of the same word look
-       In the simplest case, <type>tsvector</type> is a sorted list of
+     alike (see <xref linkend="textsearch"> for details).  Sorting and
-       lexemes, so even without indexes full text searches perform better
+     duplicate-elimination are done automatically during input, as shown in
-       than standard <literal>~</literal> and <literal>LIKE</literal>
+     this example:
       operations:
 <programlisting>
 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
@@ -3307,17 +3280,30 @@ SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
 </programlisting>
-       Notice, that <literal>space</literal> is also a lexeme:
+     (As the example shows, the sorting is first by length and then
     alphabetically, but that detail is seldom important.)  To represent
     lexemes containing whitespace, surround them with quotes:
 <programlisting>
-SELECT 'space ''    '' is a lexeme'::tsvector;
+SELECT $$the lexeme '    ' contains spaces$$::tsvector;
-             tsvector
+                 tsvector                  
----------------------------------
+-------------------------------------------
- 'a' 'is' '    ' 'space' 'lexeme'
+ 'the' '    ' 'lexeme' 'spaces' 'contains'
 </programlisting>
-       Each lexeme, optionally, can have positional information which is used for
+     (We use dollar-quoted string literals in this example and the next one,
-       <varname>proximity ranking</varname>:
+     to avoid confusing matters by having to double quote marks within the
     literals.)  Embedded quotes can be handled by doubling them:
 <programlisting>
 SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector                    
 ------------------------------------------------
 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
 </programlisting>
     Optionally, integer <firstterm>position(s)</>
     can be attached to any or all of the lexemes:
 <programlisting>
 SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
@@ -3326,87 +3312,182 @@ SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::ts
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
 </programlisting>
-       Each lexeme position also can be labeled as <literal>A</literal>,
+     A position normally indicates the source word's location in the
-       <literal>B</literal>, <literal>C</literal>, <literal>D</literal>,
+     document.  Positional information can be used for
-       where <literal>D</literal> is the default. These labels can be used to group
+     <firstterm>proximity ranking</firstterm>.  Position values can
-       lexemes into different <emphasis>importance</emphasis> or
+     range from 1 to 16383; larger numbers are silently clamped to 16383.
-       <emphasis>rankings</emphasis>, for example to reflect document structure.
+     Duplicate position entries are discarded.
-       Actual values can be assigned at search time and used during the calculation
+    </para>
       of the document rank.  This is very useful for controlling search results.
      </para>
-      <para>
+    <para>
-       The concatenation operator, e.g. <literal>tsvector || tsvector</literal>,
+     Lexemes that have positions can further be labeled with a
-       can "construct" a document from several parts. The order is important if
+     <firstterm>weight</>, which can be <literal>A</literal>,
-       <type>tsvector</type> contains positional information. Of course,
+     <literal>B</literal>, <literal>C</literal>, or <literal>D</literal>.
-       it is also possible to build a document using different tables:
+     <literal>D</literal> is the default and hence is not shown on output:
 <programlisting>
-SELECT 'fat:1 cat:2'::tsvector || 'fat:1 rat:2'::tsvector;
+SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
-         ?column?
+          tsvector          
---------------------------
+----------------------------
- 'cat':2 'fat':1,3 'rat':4
+ 'a':1A 'cat':5 'fat':2B,4C
 </programlisting>
-SELECT 'fat:1 rat:2'::tsvector || 'fat:1 cat:2'::tsvector;
+     Weights are typically used to reflect document structure, for example
-         ?column?
+     by marking title words differently from body words.  Text search
---------------------------
+     ranking functions can assign different priorities to the different
- 'cat':4 'fat':1,3 'rat':2
+     weight markers.
    </para>
    <para>
     It is important to understand that the
     <type>tsvector</type> type itself does not perform any normalization;
     it assumes that the words it is given are normalized appropriately
     for the application.  For example,
 <programlisting>
 select 'The Fat Rats'::tsvector;
      tsvector      
 --------------------
 'Fat' 'The' 'Rats'
 </programlisting>
-      </para>
+     For most English-text-searching applications the above words would
     be considered non-normalized, but <type>tsvector</type> doesn't care.
     Raw document text should usually be passed through
     <function>to_tsvector</> to normalize the words appropriately
     for searching:
-     </listitem>
+<programlisting>
 SELECT to_tsvector('english', 'The Fat Rats');         
   to_tsvector   
 -----------------
 'fat':2 'rat':3
 </programlisting>
-    </varlistentry>
+     Again, see <xref linkend="textsearch"> for more detail.
    </para>
-    <varlistentry>
+   </sect2>
    <term><firstterm>tsquery</firstterm></term>
     <listitem>
-      <para>
+   <sect2 id="datatype-tsquery">
-       <type>tsquery</type>
+    <title><type>tsquery</type></title>
-       <indexterm><primary>tsquery</primary></indexterm> is a data type
+
-       for textual queries which supports the boolean operators
+    <indexterm>
-       <literal>&amp;</literal> (AND), <literal>|</literal> (OR), and
+     <primary>tsquery (data type)</primary>
-       parentheses.  A <type>tsquery</type> consists of lexemes (optionally
+    </indexterm>
-       labeled by letters) with boolean operators in between:
+
    <para>
     A <type>tsquery</type> value stores lexemes that are to be
     searched for, and combines them using the boolean operators
     <literal>&amp;</literal> (AND), <literal>|</literal> (OR), and
     <literal>!</> (NOT).  Parentheses can be used to enforce grouping
     of the operators:
 <programlisting>
-SELECT 'fat &amp; cat'::tsquery;
+ SELECT 'fat &amp; rat'::tsquery;
    tsquery    
 ---------------
- 'fat' &amp; 'cat'
+ 'fat' &amp; 'rat'
 SELECT 'fat &amp; (rat | cat)'::tsquery;
          tsquery          
 ---------------------------
 'fat' &amp; ( 'rat' | 'cat' )
 SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
        tsquery         
 ------------------------
 'fat' &amp; 'rat' &amp; !'cat'
 </programlisting>
     In the absence of parentheses, <literal>!</> (NOT) binds most tightly,
     and <literal>&amp;</literal> (AND) binds more tightly than
     <literal>|</literal> (OR).
    </para>
    <para>
     Optionally, lexemes in a <type>tsquery</type> can be labeled with
     one or more weight letters, which restricts them to match only
     <type>tsvector</> lexemes with one of those weights:
 <programlisting>
 SELECT 'fat:ab &amp; cat'::tsquery;
    tsquery
 ------------------
 'fat':AB &amp; 'cat'
 </programlisting>
    </para>
-       Labels can be used to restrict the search region, which allows the
+    <para>
-       development of different search engines using the same full text index.
+     Quoting rules for lexemes are the same as described above for
-      </para>
+     lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
-
+     any required normalization of words must be done before putting
-      <para>
+     them into the <type>tsquery</> type.  The <function>to_tsquery</>
-       <type>tsqueries</type> can be concatenated using <literal>&amp;&amp;</literal> (AND)
+     function is convenient for performing such normalization:
       and <literal>||</literal> (OR) operators:
 <programlisting>
-SELECT 'a &amp; b'::tsquery &amp;&amp; 'c | d'::tsquery;
+SELECT to_tsquery('Fat:ab & Cats');
-         ?column?
+    to_tsquery    
---------------------------
+------------------
- 'a' &amp; 'b' &amp; ( 'c' | 'd' )
+ 'fat':AB & 'cat'
 SELECT 'a &amp; b'::tsquery || 'c|d'::tsquery;
         ?column?
 ---------------------------
 'a' &amp; 'b' | ( 'c' | 'd' )
 </programlisting>
    </para>
-      </para>
+   </sect2>
-     </listitem>
+
-    </varlistentry>
+  </sect1>
-   </variablelist>
+
  <sect1 id="datatype-uuid">
   <title><acronym>UUID</acronym> Type</title>
   <indexterm zone="datatype-uuid">
    <primary>UUID</primary>
   </indexterm>
   <para>
    The data type <type>uuid</type> stores Universally Unique Identifiers
    (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards.
    (Some systems refer to this data type as globally unique identifier, or
    GUID,<indexterm><primary>GUID</primary></indexterm> instead.)  Such an
    identifier is a 128-bit quantity that is generated by an algorithm chosen
    to make it very unlikely that the same identifier will be generated by
    anyone else in the known universe using the same algorithm.  Therefore,
    for distributed systems, these identifiers provide a better uniqueness
    guarantee than that which can be achieved using sequence generators, which
    are only unique within a single database.
   </para>
   <para>
    A UUID is written as a sequence of lower-case hexadecimal digits,
    in several groups separated by hyphens, specifically a group of 8
    digits followed by three groups of 4 digits followed by a group of
    12 digits, for a total of 32 digits representing the 128 bits.  An
    example of a UUID in this standard form is:
 <programlisting>
 a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
 </programlisting>
    <productname>PostgreSQL</productname> also accepts the following
    alternative forms for input:
    use of upper-case digits, the standard format surrounded by
    braces, and omitting the hyphens.  Examples are:
 <programlisting>
 A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
 {a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
 a0eebc999c0b4ef8bb6d6bb9bd380a11
 </programlisting>
    Output is always in the standard form.
   </para>
   <para>
    <productname>PostgreSQL</productname> provides storage and comparison
    functions for UUIDs, but the core database does not include any
    function for generating UUIDs, because no single algorithm is well
    suited for every application.  The contrib module
    <filename>contrib/uuid-ossp</filename> provides functions that implement 
    several standard algorithms.
    Alternatively, UUIDs could be generated by client applications or
    other libraries invoked through a server-side function.
   </para>
  </sect1>
  <sect1 id="datatype-xml">
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml