Editorial overhaul for text search documentation. Organize the info more
clearly, improve a lot of unclear descriptions, add some missing material.
We still need a migration guide though.
18 years ago · dfc6f130b4
parent 6cb00e67ef
commit dfc6f130b4
3 changed files with 2632 additions and 2181 deletions
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.210 2007/10/13 23:06:26 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.211 2007/10/21 20:04:37 tgl Exp $ -->

 <chapter id="datatype">
  <title id="datatype-title">Data Types</title>
@@ -237,13 +237,13 @@
      <row>
       <entry><type>tsquery</type></entry>
       <entry></entry>
-       <entry>full text search query</entry>
+       <entry>text search query</entry>
      </row>

      <row>
       <entry><type>tsvector</type></entry>
       <entry></entry>
-       <entry>full text search document</entry>
+       <entry>text search document</entry>
      </row>

      <row>
@@ -3232,73 +3232,46 @@ SELECT * FROM test;
   </para>
  </sect1>

-  <sect1 id="datatype-uuid">
-   <title><acronym>UUID</acronym> Type</title>
+  <sect1 id="datatype-textsearch">
+   <title>Text Search Types</title>

-   <indexterm zone="datatype-uuid">
-    <primary>UUID</primary>
+   <indexterm zone="datatype-textsearch">
+    <primary>full text search</primary>
+    <secondary>data types</secondary>
   </indexterm>

-   <para>
-    The data type <type>uuid</type> stores Universally Unique
-    Identifiers (UUID) as per RFC 4122, ISO/IEC 9834-8:2005, and
-    related standards.  (Some systems refer to this data type as
-    globally unique
-    identifier/GUID<indexterm><primary>GUID</primary></indexterm>
-    instead.)  Such an identifier is a 128-bit quantity that is
-    generated by a suitable algorithm so that it is very unlikely to
-    be generated by anyone else in the known universe using the same
-    algorithm.  Therefore, for distributed systems, these identifiers
-    provide a better uniqueness guarantee than that which can be
-    achieved using sequence generators, which are only unique within a
-    single database.
-   </para>
-
-   <para>
-    A UUID is written as a sequence of lower-case hexadecimal digits,
-    in several groups separated by hyphens, specifically a group of 8
-    digits followed by three groups of 4 digits followed by a group of
-    12 digits, for a total of 32 digits representing the 128 bits.  An
-    example of a UUID in this standard form is:
-<programlisting>
-a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
-</programlisting>
-    PostgreSQL also accepts the following alternative forms for input:
-    use of upper-case digits, the standard format surrounded by
-    braces, and omitting the hyphens.  Examples are:
-<programlisting>
-A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
-{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
-a0eebc999c0b4ef8bb6d6bb9bd380a11
-</programlisting>
-    Output is always in the standard form.
-   </para>
+   <indexterm zone="datatype-textsearch">
+    <primary>text search</primary>
+    <secondary>data types</secondary>
+   </indexterm>

   <para>
-    To generate UUIDs, the contrib module <literal>uuid-ossp</literal>
-    provides functions that implement the standard algorithms.
-    Alternatively, UUIDs could be generated by client applications or
-    other libraries invoked through a server-side function.
+    <productname>PostgreSQL</productname> provides two data types that
+    are designed to support full text search, which is the activity of
+    searching through a collection of natural-language <firstterm>documents</>
+    to locate those that best match a <firstterm>query</>.
+    The <type>tsvector</type> type represents a document in a form suited
+    for text search, while the <type>tsquery</type> type similarly represents
+    a query.
+    <xref linkend="textsearch"> provides a detailed explanation of this
+    facility, and <xref linkend="functions-textsearch"> summarizes the
+    related functions and operators.
   </para>
-  </sect1>
-
-  <sect1 id="datatype-textsearch">
-   <title>Full Text Search</title>

-   <variablelist>
+   <sect2 id="datatype-tsvector">
+    <title><type>tsvector</type></title>

-    <varlistentry>
-    <term><firstterm>tsvector</firstterm></term>
-     <listitem>
+    <indexterm>
+     <primary>tsvector (data type)</primary>
+    </indexterm>

-      <para>
-       <type>tsvector</type>
-       <indexterm><primary>tsvector</primary></indexterm> is a data type
-       that represents a document and is optimized for full text searching.
-       In the simplest case, <type>tsvector</type> is a sorted list of
-       lexemes, so even without indexes full text searches perform better
-       than standard <literal>~</literal> and <literal>LIKE</literal>
-       operations:
+    <para>
+     A <type>tsvector</type> value is a sorted list of distinct
+     <firstterm>lexemes</>, which are words that have been
+     <firstterm>normalized</> to make different variants of the same word look
+     alike (see <xref linkend="textsearch"> for details).  Sorting and
+     duplicate-elimination are done automatically during input, as shown in
+     this example:

 <programlisting>
 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
@@ -3307,17 +3280,30 @@ SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
 </programlisting>

-       Notice, that <literal>space</literal> is also a lexeme:
+     (As the example shows, the sorting is first by length and then
+     alphabetically, but that detail is seldom important.)  To represent
+     lexemes containing whitespace, surround them with quotes:
+
+<programlisting>
+SELECT $$the lexeme '    ' contains spaces$$::tsvector;
+                 tsvector                  
+-------------------------------------------
+ 'the' '    ' 'lexeme' 'spaces' 'contains'
+</programlisting>
+
+     (We use dollar-quoted string literals in this example and the next one,
+     to avoid confusing matters by having to double quote marks within the
+     literals.)  Embedded quotes can be handled by doubling them:

 <programlisting>
-SELECT 'space ''    '' is a lexeme'::tsvector;
-             tsvector
----------------------------------
- 'a' 'is' '    ' 'space' 'lexeme'
+SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
+                    tsvector                    
+------------------------------------------------
+ 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
 </programlisting>

-       Each lexeme, optionally, can have positional information which is used for
-       <varname>proximity ranking</varname>:
+     Optionally, integer <firstterm>position(s)</>
+     can be attached to any or all of the lexemes:

 <programlisting>
 SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
@@ -3326,87 +3312,182 @@ SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::ts
 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
 </programlisting>

-       Each lexeme position also can be labeled as <literal>A</literal>,
-       <literal>B</literal>, <literal>C</literal>, <literal>D</literal>,
-       where <literal>D</literal> is the default. These labels can be used to group
-       lexemes into different <emphasis>importance</emphasis> or
-       <emphasis>rankings</emphasis>, for example to reflect document structure.
-       Actual values can be assigned at search time and used during the calculation
-       of the document rank.  This is very useful for controlling search results.
-      </para>
+     A position normally indicates the source word's location in the
+     document.  Positional information can be used for
+     <firstterm>proximity ranking</firstterm>.  Position values can
+     range from 1 to 16383; larger numbers are silently clamped to 16383.
+     Duplicate position entries are discarded.
+    </para>

-      <para>
-       The concatenation operator, e.g. <literal>tsvector || tsvector</literal>,
-       can "construct" a document from several parts. The order is important if
-       <type>tsvector</type> contains positional information. Of course,
-       it is also possible to build a document using different tables:
+    <para>
+     Lexemes that have positions can further be labeled with a
+     <firstterm>weight</>, which can be <literal>A</literal>,
+     <literal>B</literal>, <literal>C</literal>, or <literal>D</literal>.
+     <literal>D</literal> is the default and hence is not shown on output:

 <programlisting>
-SELECT 'fat:1 cat:2'::tsvector || 'fat:1 rat:2'::tsvector;
-         ?column?
---------------------------
- 'cat':2 'fat':1,3 'rat':4
+SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
+          tsvector          
+----------------------------
+ 'a':1A 'cat':5 'fat':2B,4C
+</programlisting>

-SELECT 'fat:1 rat:2'::tsvector || 'fat:1 cat:2'::tsvector;
-         ?column?
---------------------------
- 'cat':4 'fat':1,3 'rat':2
+     Weights are typically used to reflect document structure, for example
+     by marking title words differently from body words.  Text search
+     ranking functions can assign different priorities to the different
+     weight markers.
+    </para>
+
+    <para>
+     It is important to understand that the
+     <type>tsvector</type> type itself does not perform any normalization;
+     it assumes that the words it is given are normalized appropriately
+     for the application.  For example,
+
+<programlisting>
+SELECT 'The Fat Rats'::tsvector;
+      tsvector      
+--------------------
+ 'Fat' 'The' 'Rats'
 </programlisting>

-      </para>
+     For most English-text-searching applications the above words would
+     be considered non-normalized, but <type>tsvector</type> doesn't care.
+     Raw document text should usually be passed through
+     <function>to_tsvector</> to normalize the words appropriately
+     for searching:

-     </listitem>
+<programlisting>
+SELECT to_tsvector('english', 'The Fat Rats');
+   to_tsvector   
+-----------------
+ 'fat':2 'rat':3
+</programlisting>

-    </varlistentry>
+     Again, see <xref linkend="textsearch"> for more detail.
+    </para>

-    <varlistentry>
-    <term><firstterm>tsquery</firstterm></term>
-     <listitem>
+   </sect2>

-      <para>
-       <type>tsquery</type>
-       <indexterm><primary>tsquery</primary></indexterm> is a data type
-       for textual queries which supports the boolean operators
-       <literal>&amp;</literal> (AND), <literal>|</literal> (OR), and
-       parentheses.  A <type>tsquery</type> consists of lexemes (optionally
-       labeled by letters) with boolean operators in between:
+   <sect2 id="datatype-tsquery">
+    <title><type>tsquery</type></title>
+
+    <indexterm>
+     <primary>tsquery (data type)</primary>
+    </indexterm>
+
+    <para>
+     A <type>tsquery</type> value stores lexemes that are to be
+     searched for, and combines them using the boolean operators
+     <literal>&amp;</literal> (AND), <literal>|</literal> (OR), and
+     <literal>!</> (NOT).  Parentheses can be used to enforce grouping
+     of the operators:

 <programlisting>
-SELECT 'fat &amp; cat'::tsquery;
-    tsquery
+SELECT 'fat &amp; rat'::tsquery;
+    tsquery    
 ---------------
- 'fat' &amp; 'cat'
+ 'fat' &amp; 'rat'
+
+SELECT 'fat &amp; (rat | cat)'::tsquery;
+          tsquery          
+---------------------------
+ 'fat' &amp; ( 'rat' | 'cat' )
+
+SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
+        tsquery         
+------------------------
+ 'fat' &amp; 'rat' &amp; !'cat'
+</programlisting>
+
+     In the absence of parentheses, <literal>!</> (NOT) binds most tightly,
+     and <literal>&amp;</literal> (AND) binds more tightly than
+     <literal>|</literal> (OR).
+    </para>
+
+    <para>
+     Optionally, lexemes in a <type>tsquery</type> can be labeled with
+     one or more weight letters, which restricts them to match only
+     <type>tsvector</> lexemes with one of those weights:
+
+<programlisting>
 SELECT 'fat:ab &amp; cat'::tsquery;
     tsquery
 ------------------
 'fat':AB &amp; 'cat'
 </programlisting>
+    </para>

-       Labels can be used to restrict the search region, which allows the
-       development of different search engines using the same full text index.
-      </para>
-
-      <para>
-       <type>tsqueries</type> can be concatenated using <literal>&amp;&amp;</literal> (AND)
-       and <literal>||</literal> (OR) operators:
+    <para>
+     Quoting rules for lexemes are the same as described above for
+     lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
+     any required normalization of words must be done before putting
+     them into the <type>tsquery</> type.  The <function>to_tsquery</>
+     function is convenient for performing such normalization:

 <programlisting>
-SELECT 'a &amp; b'::tsquery &amp;&amp; 'c | d'::tsquery;
-         ?column?
---------------------------
- 'a' &amp; 'b' &amp; ( 'c' | 'd' )
-
-SELECT 'a &amp; b'::tsquery || 'c|d'::tsquery;
-         ?column?
---------------------------
- 'a' &amp; 'b' | ( 'c' | 'd' )
+SELECT to_tsquery('Fat:ab &amp; Cats');
+    to_tsquery    
+------------------
+ 'fat':AB &amp; 'cat'
 </programlisting>
+    </para>

-      </para>
-     </listitem>
-    </varlistentry>
-   </variablelist>
+   </sect2>
+
+  </sect1>
+
+  <sect1 id="datatype-uuid">
+   <title><acronym>UUID</acronym> Type</title>

+   <indexterm zone="datatype-uuid">
+    <primary>UUID</primary>
+   </indexterm>
+
+   <para>
+    The data type <type>uuid</type> stores Universally Unique Identifiers
+    (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards.
+    (Some systems refer to this data type as globally unique identifier, or
+    GUID,<indexterm><primary>GUID</primary></indexterm> instead.)  Such an
+    identifier is a 128-bit quantity that is generated by an algorithm chosen
+    to make it very unlikely that the same identifier will be generated by
+    anyone else in the known universe using the same algorithm.  Therefore,
+    for distributed systems, these identifiers provide a better uniqueness
+    guarantee than that which can be achieved using sequence generators, which
+    are only unique within a single database.
+   </para>
+
+   <para>
+    A UUID is written as a sequence of lower-case hexadecimal digits,
+    in several groups separated by hyphens, specifically a group of 8
+    digits followed by three groups of 4 digits followed by a group of
+    12 digits, for a total of 32 digits representing the 128 bits.  An
+    example of a UUID in this standard form is:
+<programlisting>
+a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
+</programlisting>
+    <productname>PostgreSQL</productname> also accepts the following
+    alternative forms for input:
+    use of upper-case digits, the standard format surrounded by
+    braces, and omitting the hyphens.  Examples are:
+<programlisting>
+A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
+{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
+a0eebc999c0b4ef8bb6d6bb9bd380a11
+</programlisting>
+    Output is always in the standard form.
+   </para>
+
+   <para>
+    <productname>PostgreSQL</productname> provides storage and comparison
+    functions for UUIDs, but the core database does not include any
+    function for generating UUIDs, because no single algorithm is well
+    suited for every application.  The contrib module
+    <filename>contrib/uuid-ossp</filename> provides functions that implement 
+    several standard algorithms.
+    Alternatively, UUIDs could be generated by client applications or
+    other libraries invoked through a server-side function.
+   </para>
  </sect1>

  <sect1 id="datatype-xml">
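
Reviewer's sketch (not part of the commit): the tsvector input rules described in the datatype.sgml changes above can be modeled in a few lines of Python. Those rules are: distinct lexemes, sorted first by length and then alphabetically, positions clamped to 16383, duplicate positions discarded, and weight D omitted on output. This is only an illustration under stated assumptions. The function name model_tsvector is hypothetical, and quoted lexemes (those containing whitespace or quotes) are not parsed. When one position appears twice with different weights, the sketch keeps the first weight seen, which the documentation text does not specify. It does not reproduce PostgreSQL's actual implementation.

```python
import re

MAX_POSITION = 16383  # upper limit quoted in the documentation text

# One position entry: digits followed by an optional weight letter A-D.
_POSITION_RE = re.compile(r"(\d+)([A-Da-d]?)")


def model_tsvector(text):
    """Return the display form of a simplified, unquoted tsvector literal."""
    entries = {}  # lexeme -> {position: weight}
    for token in text.split():
        lexeme, _, pos_part = token.partition(":")
        positions = entries.setdefault(lexeme, {})  # duplicates collapse here
        if not pos_part:
            continue
        for item in pos_part.split(","):
            match = _POSITION_RE.fullmatch(item)
            if match is None:
                raise ValueError(f"bad position entry: {item!r}")
            # Larger position numbers are clamped to the maximum.
            pos = min(int(match.group(1)), MAX_POSITION)
            weight = match.group(2).upper() or "D"  # D is the default weight
            # Assumption: on a duplicate position, keep the first weight seen.
            positions.setdefault(pos, weight)

    parts = []
    # Sort by length first, then alphabetically, as the text describes.
    for lexeme in sorted(entries, key=lambda lx: (len(lx), lx)):
        out = "'" + lexeme.replace("'", "''") + "'"
        positions = entries[lexeme]
        if positions:
            out += ":" + ",".join(
                f"{pos}{'' if weight == 'D' else weight}"
                for pos, weight in sorted(positions.items())
            )
        parts.append(out)
    return " ".join(parts)


if __name__ == "__main__":
    print(model_tsvector("a fat cat sat on a mat and ate a fat rat"))
    print(model_tsvector("a:1A fat:2B,4C cat:5D"))
```

Under those assumptions, model_tsvector("a:1A fat:2B,4C cat:5D") returns 'a':1A 'cat':5 'fat':2B,4C, which matches the sample output shown in the patch.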
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml