<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<link type="text/css" rel="stylesheet" href="tsearch2-ref_files/tsearch.txt"><title>tsearch2 reference</title></head>
<body>
<h1 align="center">The tsearch2 Reference</h1>
<p align="center">
Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
</p><p>
This Reference documents the user types and functions
of the tsearch2 module for PostgreSQL.
An introduction to the module is provided
by the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
a companion document to this one.
You can retrieve a beta copy of the tsearch2 module from the
<a href="http://www.sai.msu.su/%7Emegera/postgres/gist/">GiST for PostgreSQL</a>
page -- look under the section entitled <i>Development History</i>
for the current version.
</p><h2><a name="vq">Vectors and Queries</a></h2>
<a name="vq">Vectors and queries both store lexemes,
but for different purposes.
A <tt>tsvector</tt> stores the lexemes
of the words that are parsed out of a document,
and can also remember the position of each word.
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
</a><p>
<a name="vq">Any of the following functions with a <tt><i>configuration</i></tt> argument
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
to select a configuration;
if the option is omitted, then the current configuration is used.
For more information on the current configuration,
read the next section on Configurations.
</a></p><h3><a name="vq">Vector Operations</a></h3>
<dl><dt>
<a name="vq"> <tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
<i>document</i> TEXT) RETURNS tsvector</tt>
</a></dt><dd>
<a name="vq"> Parses a document into tokens,
reduces the tokens to lexemes,
and returns a <tt>tsvector</tt> which lists the lexemes
together with their positions in the document.
For the best description of this process,
see the section on </a><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
in the accompanying tsearch2 Guide.
</dd><dt>
<tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt>
</dt><dd>
Return a vector which lists the same lexemes
as the given <tt><i>vector</i></tt>,
but which lacks any information
about where in the document each lexeme appeared.
While the returned vector is thus useless for relevance ranking,
it will usually be much smaller.
</dd><dt>
<tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt>
</dt><dd>
This function returns a copy of the input vector
in which every location has been labelled
with either the <tt><i>letter</i></tt>
<tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
or the default label <tt>'D'</tt>
(which is the default with which new vectors are created,
and as such is usually not displayed).
These labels are retained when vectors are concatenated,
allowing words from different parts of a document
to be weighted differently by ranking functions.
</dd><dt>
<tt><i>vector1</i> || <i>vector2</i></tt>
</dt><dt class="br">
<tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector)
RETURNS tsvector</tt>
</dt><dd>
Returns a vector which combines the lexemes and position information
in the two vectors given as arguments.
Position weight labels (described in the previous paragraph)
are retained intact during the concatenation.
This has at least two uses.
First,
if some sections of your document
need to be parsed with different configurations than others,
you can parse them separately
and concatenate the resulting vectors into one.
Second,
you can weight words from some sections of your document
more heavily than those from others by:
parsing the sections into separate vectors;
assigning the vectors different position labels
with the <tt>setweight()</tt> function;
concatenating them into a single vector;
and then providing a <tt><i>weights</i></tt> argument
to the <tt>rank()</tt> function
that assigns different weights to positions with different labels.
</dd><dt>
<tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt>
</dt><dd>
Returns the number of lexemes stored in the vector.
</dd><dt>
<tt><i>text</i>::tsvector RETURNS tsvector</tt>
</dt><dd>
Casting text to a <tt>tsvector</tt>
allows you to inject lexemes directly into a vector,
with whatever positions and position weights you choose to specify.
The <tt><i>text</i></tt> should be formatted
as the vector would be printed in the output of a <tt>SELECT</tt>.
See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
section in the Guide for details.
</dd></dl>
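<p>
As a rough illustration of how these vector operations combine,
the following sketch builds a weighted vector from two pieces of text.
The sample strings are made up for this example,
and it assumes the <tt>default</tt> configuration is installed;
the exact lexemes returned depend on your dictionaries.
</p><pre>-- Parse a title and a body separately, label the title positions 'A',
-- and concatenate the two vectors into one.
select setweight(to_tsvector('default', 'The tsearch2 Reference'), 'A')
       || to_tsvector('default', 'Vectors and queries both store lexemes');

-- The same body lexemes with position information removed:
-- smaller, but no longer useful for ranking.
select strip(to_tsvector('default', 'Vectors and queries both store lexemes'));</pre>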
<h3>Query Operations</h3>
<dl><dt>
<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
<i>querytext</i> text) RETURNS tsquery</tt>
</dt><dd>
Parses a query,
which should be single words separated by the boolean operators
"<tt>&amp;</tt>"&nbsp;and,
"<tt>|</tt>"&nbsp;or,
and&nbsp;"<tt>!</tt>"&nbsp;not,
which can be grouped using parentheses.
Each word is reduced to a lexeme using the current
or specified configuration.
</dd><dt>
<tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
</dt><dd>
Returns a textual representation of the given query.
</dd><dt>
<tt><i>text</i>::tsquery RETURNS tsquery</tt>
</dt><dd>
Casting text to a <tt>tsquery</tt>
allows you to inject lexemes directly into a query,
with whatever position weight flags you choose to specify.
The <tt><i>text</i></tt> should be formatted
as the query would be printed in the output of a <tt>SELECT</tt>.
See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
section in the Guide for details.
</dd></dl>
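<p>
A query is normally tested against a vector with the <tt>@@</tt> match operator,
which this Reference does not otherwise describe.
A minimal sketch, assuming the <tt>default</tt> configuration
and using sample text invented for the example:
</p><pre>-- Build a query from two words joined by the boolean "and" operator.
select to_tsquery('default', 'vectors &amp; queries');

-- Match a parsed document against a query that uses grouping and negation.
select to_tsvector('default', 'Vectors and queries both store lexemes')
       @@ to_tsquery('default', '(vector | query) &amp; !ranking');</pre>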
<h2><a name="configurations">Configurations</a></h2>
A configuration specifies all of the equipment necessary
to transform a document into a <tt>tsvector</tt>:
the parser that breaks its text into tokens,
and the dictionaries which then transform each token into a lexeme.
Every call to <tt>to_tsvector()</tt> (described above)
uses a configuration to perform its processing.
Three configurations come with tsearch2:
<ul>
<li><b>default</b> -- Indexes words and numbers,
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
and the <i>simple</i> dictionary for all others.
</li><li><b>default_russian</b> -- Indexes words and numbers,
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
and the <i>ru_stem</i> Russian Snowball dictionary for all others.
</li><li><b>simple</b> -- Processes both words and numbers
with the <i>simple</i> dictionary,
which neither discards any stop words nor alters them.
</li></ul>
The tsearch2 module initially chooses your current configuration
by looking for your current locale in the <tt>locale</tt> field
of the <tt>pg_ts_cfg</tt> table described below.
You can manipulate the current configuration yourself with these functions:
<dl><dt>
<tt>set_curcfg( <i>id</i> INT <em>|</em> <i>ts_name</i> TEXT
) RETURNS VOID</tt>
</dt><dd>
Set the current configuration used by <tt>to_tsvector</tt>
and <tt>to_tsquery</tt>.
</dd><dt>
<tt>show_curcfg() RETURNS INT4</tt>
</dt><dd>
Returns the integer <tt>id</tt> of the current configuration.
</dd></dl>
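<p>
For example, a session might switch configurations and confirm the change as sketched below.
The configuration names are those listed above;
the value returned by <tt>show_curcfg()</tt> will vary between installations.
</p><pre>-- See which configurations exist and which locale selects each one.
select ts_name, prs_name, locale from pg_ts_cfg;

-- Make default_russian the current configuration, then check it.
select set_curcfg('default_russian');
select show_curcfg();

-- to_tsvector() now uses default_russian when no configuration is given.
select to_tsvector('Tsearch module for PostgreSQL');</pre>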
<p>
Each configuration is defined by a record in the <tt>pg_ts_cfg</tt> table:
</p><pre>create table pg_ts_cfg (
    id int not null primary key,
    ts_name text not null,
    prs_name text not null,
    locale text
);</pre>
The <tt>id</tt> and <tt>ts_name</tt> are unique values
which identify the configuration;
the <tt>prs_name</tt> specifies which parser the configuration uses.
Once this parser has split document text into tokens,
the type of each resulting token --
or, more specifically, the type's <tt>tok_alias</tt>
as specified in the parser's <tt>token_type()</tt> table --
is searched for together with the configuration's <tt>ts_name</tt>
in the <tt>pg_ts_cfgmap</tt> table:
<pre>create table pg_ts_cfgmap (
    ts_name text not null,
    tok_alias text not null,
    dict_name text[],
    primary key (ts_name,tok_alias)
);</pre>
Those tokens whose types are not listed are discarded.
The remaining tokens are assigned integer positions,
starting with 1 for the first token in the document,
and turned into lexemes with the help of the dictionaries
whose names are given in the <tt>dict_name</tt> array for their type.
These dictionaries are tried in order,
stopping either with the first one to return a lexeme for the token,
or discarding the token if no dictionary returns a lexeme for it.
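<p>
To see how a configuration will treat each token type,
the mapping can be inspected directly.
The query below is only a sketch: it assumes the <tt>default</tt> configuration and parser,
and relies on the <tt>alias</tt> and <tt>descr</tt> columns
returned by <tt>token_type()</tt> as described under Parsers.
</p><pre>-- List the dictionaries tried, in order, for each token type
-- that the default configuration indexes.
select m.tok_alias, t.descr, m.dict_name
  from pg_ts_cfgmap m
  join token_type('default') t on t.alias = m.tok_alias
 where m.ts_name = 'default'
 order by m.tok_alias;</pre>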
<h2><a name="testing">Testing</a></h2>
The <tt>ts_debug</tt> function allows easy testing of your <b>current</b> configuration.
You can test another configuration by first selecting it with the <tt>set_curcfg</tt> function.
<p>
Example:
</p><pre>apod=# select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');
 ts_name | tok_type | description |   token    | dict_name |   tsvector
---------+----------+-------------+------------+-----------+--------------
 default | lword    | Latin word  | Tsearch    | {en_stem} | 'tsearch'
 default | lword    | Latin word  | module     | {en_stem} | 'modul'
 default | lword    | Latin word  | for        | {en_stem} |
 default | lword    | Latin word  | PostgreSQL | {en_stem} | 'postgresql'
 default | version  | VERSION     | 7.3.3      | {simple}  | '7.3.3'
</pre>
Here:
<br>
<ul>
<li>ts_name - configuration name
</li><li>tok_type - token type
</li><li>description - human readable name of tok_type
</li><li>token - parser's token
</li><li>dict_name - dictionary used for the token
</li><li>tsvector - final result</li></ul>
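<p>
To test a configuration other than the current one,
a session could select it first and then repeat the call.
This is a sketch using the <tt>simple</tt> configuration described above;
the output is omitted because it depends on the installation.
</p><pre>-- Switch the current configuration, then inspect how it handles the text.
select set_curcfg('simple');
select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');

-- Switch back when finished.
select set_curcfg('default');</pre>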
<h2><a name="parsers">Parsers</a></h2>
Each parser is defined by a record in the <tt>pg_ts_parser</tt> table:
<pre>create table pg_ts_parser (
    prs_name text not null,
    prs_start regprocedure not null,
    prs_nexttoken regprocedure not null,
    prs_end regprocedure not null,
    prs_headline regprocedure not null,
    prs_lextype regprocedure not null,
    prs_comment text
);</pre>
The <tt>prs_name</tt> uniquely identifies the parser,
while <tt>prs_comment</tt> usually describes its name and version
for the reference of users.
The other items identify the low-level functions
which make the parser operate,
and are only of interest to someone writing a parser of their own.
<p>
The tsearch2 module comes with one parser named <tt>default</tt>
which is suitable for parsing most plain text and HTML documents.
</p><p>
Each <tt><i>parser</i></tt> argument below
must designate a parser with <tt><i>prs_name</i></tt>;
the current parser is used when this argument is omitted.
</p><dl><dt>
<tt>CREATE FUNCTION set_curprs(<i>parser</i>) RETURNS VOID</tt>
</dt><dd>
Selects a current parser
which will be used when any of the following functions
are called without a parser as an argument.
</dd><dt>
<tt>CREATE FUNCTION token_type(
<em>[</em> <i>parser</i> <em>]</em>
) RETURNS SETOF tokentype</tt>
</dt><dd>
Returns a table which defines and describes
each kind of token the parser may produce as output.
For each token type the table gives the <tt>tokid</tt>
with which the parser will label each token of that type,
the <tt>alias</tt> which names the token type,
and a short description <tt>descr</tt> for the user to read.
</dd><dt>
<tt>CREATE FUNCTION parse(
<em>[</em> <i>parser</i>, <em>]</em> <i>document</i> TEXT
) RETURNS SETOF tokenout</tt>
</dt><dd>
Parses the given document and returns a series of records,
one for each token produced by parsing.
Each token includes a <tt>tokid</tt> giving its type
and a <tt>lexem</tt> which gives its content.
</dd></dl>
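<p>
The parser functions can be combined to see how a document is tokenized.
The following sketch assumes the <tt>default</tt> parser
and uses a sample string invented for the example:
</p><pre>-- List the parsers that are installed.
select prs_name, prs_comment from pg_ts_parser;

-- Show every token type the default parser can produce.
select * from token_type('default');

-- Parse a string and show the alias of each token's type
-- by joining on the numeric tokid.
select t.alias, p.*
  from parse('default', 'Tsearch module for PostgreSQL 7.3.3') p
  join token_type('default') t on t.tokid = p.tokid;</pre>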
<h2><a name="dictionaries">Dictionaries</a></h2>
Dictionaries take textual tokens as input,
usually those produced by a parser,
and return lexemes which are usually some reduced form of the token.
Among the dictionaries which come installed with tsearch2 are:
<ul>
<li><b>simple</b> simply folds uppercase letters to lowercase
before returning the word.
</li><li><b>en_stem</b> runs an English Snowball stemmer on each word
that attempts to reduce the various forms of a verb or noun
to a single recognizable form.
</li><li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
</li></ul>
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
<pre>CREATE TABLE pg_ts_dict (
    dict_name text not null,
    dict_init regprocedure,
    dict_initoption text,
    dict_lexize regprocedure not null,
    dict_comment text
);</pre>
The <tt>dict_name</tt>
serves as a unique identifier for the dictionary.
The meaning of the <tt>dict_initoption</tt> varies among dictionaries,
but for the built-in Snowball dictionaries
it specifies a file from which stop words should be read.
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
The other fields are internal function identifiers
useful only to developers trying to implement their own dictionaries.
<p>
The argument named <tt><i>dictionary</i></tt>
in each of the following functions
should be <tt>dict_name</tt>
identifying which dictionary should be used for the operation;
if omitted then the current dictionary is used.
</p><dl><dt>
<tt>CREATE FUNCTION set_curdict(<i>dictionary</i>) RETURNS VOID</tt>
</dt><dd>
Selects a current dictionary for use by functions
that do not select a dictionary explicitly.
</dd><dt>
<tt>CREATE FUNCTION lexize(
<em>[</em> <i>dictionary</i>, <em>]</em> <i>word</i> text)
RETURNS TEXT[]</tt>
</dt><dd>
Reduces a single word to a lexeme.
Note that lexemes are arrays of zero or more strings,
since in some languages there might be several base words
from which an inflected form could arise.
</dd></dl>
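<p>
A quick way to see what a dictionary does is to call <tt>lexize()</tt> directly.
This sketch assumes the dictionaries listed above are installed;
results are not shown because they depend on the stemmer and stop-word files in use.
</p><pre>-- List the installed dictionaries and their initialization options.
select dict_name, dict_initoption, dict_comment from pg_ts_dict;

-- Reduce one word with an explicitly named dictionary.
select lexize('en_stem', 'modules');

-- Select a current dictionary, then omit the dictionary argument.
select set_curdict('simple');
select lexize('PostgreSQL');</pre>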
<h2><a name="ranking">Ranking</a></h2>
Ranking attempts to measure how relevant documents are to particular queries
by inspecting the number of times each search word appears in the document,
and whether different search terms occur near each other.
Note that this information is only available in unstripped vectors --
ranking functions will only return a useful result
for a <tt>tsvector</tt> which still has position information!
<p>
Both of these ranking functions
take an integer <i>normalization</i> option
that specifies whether a document's length should impact its rank.
This is often desirable,
since a hundred-word document with five instances of a search word
is probably more relevant than a thousand-word document with five instances.
The option can have the values:
</p><ul>
<li><tt>0</tt> (the default) ignores document length.
</li><li><tt>1</tt> divides the rank by the logarithm of the length.
</li><li><tt>2</tt> divides the rank by the length itself.
</li></ul>
The two ranking functions currently available are:
<dl><dt>
<tt>CREATE FUNCTION rank(<br>
<em>[</em> <i>weights</i> float4[], <em>]</em>
<i>vector</i> tsvector, <i>query</i> tsquery,
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
) RETURNS float4</tt>
</dt><dd>
This is the ranking function from the old version of OpenFTS,
and offers the ability to weight word instances more heavily
depending on how you have classified them.
The <i>weights</i> specify how heavily to weight each category of word:
<pre>{<i>D-weight</i>, <i>C-weight</i>, <i>B-weight</i>, <i>A-weight</i>}</pre>
If no weights are provided, then these defaults are used:
<pre>{0.1, 0.2, 0.4, 1.0}</pre>
Often weights are used to mark words from special areas of the document,
like the title or an initial abstract,
and make them more or less important than words in the document body.
</dd><dt>
<tt>CREATE FUNCTION rank_cd(<br>
<em>[</em> <i>K</i> int4, <em>]</em>
<i>vector</i> tsvector, <i>query</i> tsquery,
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
) RETURNS float4</tt>
</dt><dd>
This function computes the cover density ranking
for the given document <i>vector</i> and <i>query</i>,
as described in Clarke, Cormack, and Tudhope's
"<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
in the 1999 <i>Information Processing and Management</i>.
The value <i>K</i> is one of the values from their formula,
and defaults to&nbsp;<i>K</i>=4.
The examples in their paper use <i>K</i>=16;
we can roughly describe the term
as stating how far apart two search terms can fall
before the formula begins penalizing them for lack of proximity.
</dd></dl>
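<p>
The sketch below shows how the weights and normalization arguments fit together.
The sample text is invented, the <tt>default</tt> configuration is assumed,
and the weights array simply restates the defaults given above.
</p><pre>-- Rank a document whose title words are labelled 'A',
-- dividing the rank by the logarithm of the document length (option 1).
select rank('{0.1, 0.2, 0.4, 1.0}'::float4[],
            setweight(to_tsvector('default', 'The tsearch2 Reference'), 'A')
            || to_tsvector('default', 'Vectors and queries both store lexemes'),
            to_tsquery('default', 'reference &amp; lexemes'),
            1);

-- Cover density ranking with K=4 and no length normalization (option 0).
select rank_cd(4,
               to_tsvector('default', 'Vectors and queries both store lexemes'),
               to_tsquery('default', 'vectors &amp; lexemes'),
               0);</pre>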
<h2><a name="headlines">Headlines</a></h2>
<dl><dt>
<tt>CREATE FUNCTION headline(<br>
<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
<i>document</i> text, <i>query</i> tsquery,
<em>[</em> <i>options</i> text <em>]</em><br>
) RETURNS text</tt>
</dt><dd>
Every form of the <tt>headline()</tt> function
accepts a <tt>document</tt> along with a <tt>query</tt>,
and returns one or more ellipsis-separated excerpts from the document
in which terms from the query are highlighted.
The configuration with which to parse the document
can be specified by either its <i>id</i> or <i>ts_name</i>;
if none is specified then the current configuration is used instead.
<p>
An <i>options</i> string, if provided, should be a comma-separated list
of one or more '<i>option</i><tt>=</tt><i>value</i>' pairs.
The available options are:
</p><ul>
<li><tt>StartSel</tt>, <tt>StopSel</tt> --
the strings with which query words appearing in the document
should be delimited to distinguish them from other excerpted words.
</li><li><tt>MaxWords</tt>, <tt>MinWords</tt> --
limits on the longest and shortest headlines you will accept.
</li><li><tt>ShortWord</tt> --
this prevents your headline from beginning or ending
with a word which has this many characters or fewer.
The default value of <tt>3</tt> should eliminate most English
conjunctions and articles.
</li></ul>
Any unspecified options receive these defaults:
<pre>StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3
</pre>
</dd></dl>
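<p>
As an illustration, the call below asks for a short excerpt with custom delimiters.
It is a sketch only: the document text is invented,
the <tt>default</tt> configuration is assumed,
and options left out of the string keep the defaults listed above.
</p><pre>select headline('default',
                'This Reference documents the user types and functions of the tsearch2 module for PostgreSQL.',
                to_tsquery('default', 'functions &amp; module'),
                'StartSel=&lt;em&gt;, StopSel=&lt;/em&gt;, MaxWords=10, MinWords=5');</pre>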
</body></html>