|
|
|
@ -1,95 +1,106 @@ |
|
|
|
|
Tsearch2 - full text search extension for PostgreSQL |
|
|
|
|
|
|
|
|
|
[10][Online version] of this document is available |
|
|
|
|
|
|
|
|
|
This module is sponsored by Delta-Soft Ltd., Moscow, Russia. |
|
|
|
|
|
|
|
|
|
Notice: This version is fully incompatible with old tsearch (V1), |
|
|
|
|
which was deprecated in 7.4 and obsoleted in 8.0. |
|
|
|
|
|
|
|
|
|
The Tsearch2 contrib module contains an implementation of a new data |
|
|
|
|
type tsvector - a searchable data type with indexed access. In a |
|
|
|
|
nutshell, tsvector is a set of unique words along with their |
|
|
|
|
positional information in the document, organized in a special |
|
|
|
|
structure optimized for fast access and lookup. Actually, each word |
|
|
|
|
entry, besides its position in the document, could have a weight |
|
|
|
|
attribute, describing importance of this word (at a specific) position |
|
|
|
|
in document. A set of bit-signatures of a fixed length, representing |
|
|
|
|
tsvectors, are stored in a search tree (developed using PostgreSQL |
|
|
|
|
GiST), which provides online update of full text index and fast query |
|
|
|
|
lookup. The module provides indexed access methods, queries, |
|
|
|
|
operations and supporting routines for the tsvector data type and easy |
|
|
|
|
conversion of text data to tsvector. Table driven configuration allows |
|
|
|
|
creation of custom configuration optimized for specific searches using |
|
|
|
|
[1]Online version of this document is available |
|
|
|
|
|
|
|
|
|
Tsearch2 - is the full text engine, fully integrated into PostgreSQL |
|
|
|
|
RDBMS. |
|
|
|
|
|
|
|
|
|
Main features |
|
|
|
|
|
|
|
|
|
* Full online update |
|
|
|
|
* Supports multiple table driven configurations |
|
|
|
|
* flexible and rich linguistic support (dictionaries, stop words), |
|
|
|
|
thesaurus |
|
|
|
|
* full multibyte (UTF-8) support |
|
|
|
|
* Sophisticated ranking functions with support of proximity and |
|
|
|
|
structure information (rank, rank_cd) |
|
|
|
|
* Index support (GiST and Gin) with concurrency and recovery support |
|
|
|
|
* Rich query language with query rewriting support |
|
|
|
|
* Headline support (text fragments with highlighted search terms) |
|
|
|
|
* Ability to plug-in custom dictionaries and parsers |
|
|
|
|
* Template generator for tsearch2 dictionaries with [2]snowball |
|
|
|
|
stemmer support |
|
|
|
|
* It is mature (5 years of development) |
|
|
|
|
|
|
|
|
|
Tsearch2, in a nutshell, provides FTS operator (contains) for the new |
|
|
|
|
data types, representing document (tsvector) and query (tsquery). |
|
|
|
|
Table driven configuration allows creation of custom searches using |
|
|
|
|
standard SQL commands. |
|
|
|
|
|
|
|
|
|
Configuration allows you to: |
|
|
|
|
* specify the type of lexemes to be indexed and the way they are |
|
|
|
|
processed. |
|
|
|
|
* specify dictionaries to be used along with stop words recognition. |
|
|
|
|
* specify the parser used to process a document. |
|
|
|
|
|
|
|
|
|
See [11]Documentation Roadmap for links to documentation. |
|
|
|
|
|
|
|
|
|
tsvector is a searchable data type, representing document. It is a set |
|
|
|
|
of unique words along with their positional information in the |
|
|
|
|
document, organized in a special structure optimized for fast access |
|
|
|
|
and lookup. Each entry could be labelled to reflect its importance in |
|
|
|
|
document. |
|
|
|
|
|
|
|
|
|
tsquery is a data type for textual queries with support of boolean |
|
|
|
|
operators. It consists of lexemes (optionally labelled) with boolean |
|
|
|
|
operators between. |
|
|
|
|
|
|
|
|
|
Table driven configuration allows to specify: |
|
|
|
|
* parser, which used to break document onto lexemes |
|
|
|
|
* what lexemes to index and the way they are processed |
|
|
|
|
* dictionaries to be used along with stop words recognition. |
|
|
|
|
|
|
|
|
|
OpenFTS vs Tsearch2 |
|
|
|
|
|
|
|
|
|
OpenFTS is a middleware between application and database, so it uses |
|
|
|
|
tsearch2 as a storage, while database engine is used as a query executor |
|
|
|
|
(searching). Everything else (parsing of documents, query processing, |
|
|
|
|
linguistics) carry outs on client side. That's why OpenFTS has its own |
|
|
|
|
configuration table (fts_conf) and works with its own set of dictionaries. |
|
|
|
|
OpenFTS is more flexible, because it could be used in multi-server |
|
|
|
|
architecture with separated machines for repository of documents |
|
|
|
|
(documents could be stored in file system), database and query engine. |
|
|
|
|
[3]OpenFTS is a middleware between application and database. OpenFTS |
|
|
|
|
uses tsearch2 as a storage and database engine as a query executor |
|
|
|
|
(searching). Everything else, i.e. parsing of documents, query |
|
|
|
|
processing, linguistics, carry outs on client side. That's why OpenFTS |
|
|
|
|
has its own configuration table (fts_conf) and works with its own set |
|
|
|
|
of dictionaries. OpenFTS is more flexible, because it could be used in |
|
|
|
|
multi-server architecture with separate machines for repository of |
|
|
|
|
documents (documents could be stored in filesystem), database and |
|
|
|
|
query engine. |
|
|
|
|
|
|
|
|
|
See [4]Documentation Roadmap for links to documentation. |
|
|
|
|
|
|
|
|
|
Authors |
|
|
|
|
|
|
|
|
|
* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia |
|
|
|
|
* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia |
|
|
|
|
|
|
|
|
|
* Teodor Sigaev <teodor@sigaev.ru>, Moscow,Moscow University,Russia |
|
|
|
|
|
|
|
|
|
Contributors |
|
|
|
|
|
|
|
|
|
* Robert John Shepherd and Andrew J. Kopciuch submitted |
|
|
|
|
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch |
|
|
|
|
* Robert John Shepherd and Andrew J. Kopciuch submitted |
|
|
|
|
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch |
|
|
|
|
v2) |
|
|
|
|
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
|
|
|
|
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2 |
|
|
|
|
Reference" and proposed new naming convention for tsearch V2 |
|
|
|
|
|
|
|
|
|
Features Added with Tsearch2 |
|
|
|
|
|
|
|
|
|
* Relevance ranking of search results |
|
|
|
|
* Table driven configuration |
|
|
|
|
* Morphology support (ispell dictionaries, snowball stemmers) |
|
|
|
|
* Headline support (text fragments with highlighted search terms) |
|
|
|
|
* Ability to plug-in custom dictionaries and parsers |
|
|
|
|
* Synonym dictionary |
|
|
|
|
* Generator of templates for dictionaries (built-in snowball stemmer |
|
|
|
|
support) |
|
|
|
|
* Statistics of indexed words is available |
|
|
|
|
|
|
|
|
|
Sponsors |
|
|
|
|
|
|
|
|
|
* ABC Startsiden - compound words support |
|
|
|
|
* University of Mannheim for UTF-8 support (in 8.2) |
|
|
|
|
* jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized |
|
|
|
|
Inverted index (in 8.2) |
|
|
|
|
* Georgia Public Library Service and LibLime, Inc. for Thesaurus |
|
|
|
|
dictionary |
|
|
|
|
* PostGIS community - GiST Concurrency and Recovery |
|
|
|
|
|
|
|
|
|
The authors are grateful to the Russian Foundation for Basic Research |
|
|
|
|
and Delta-Soft Ltd., Moscow, Russia for support. |
|
|
|
|
|
|
|
|
|
Limitations |
|
|
|
|
|
|
|
|
|
* Lexeme should be not longer than 2048 bytes |
|
|
|
|
* The number of lexemes is limited by 2^32. Note, that actual |
|
|
|
|
capacity of tsvector is depends on whether positional information |
|
|
|
|
is stored or not. |
|
|
|
|
* tsvector - the size is limited by approximately 2^20 bytes. |
|
|
|
|
* tsquery - the number of entries (lexemes and operations) < 32768 |
|
|
|
|
* Positional information |
|
|
|
|
+ maximal position of lexeme < 2^14 (16384) |
|
|
|
|
+ lexeme could have maximum 256 positions |
|
|
|
|
|
|
|
|
|
* Length of lexeme < 2K |
|
|
|
|
* Length of tsvector (lexemes + positions) < 1Mb |
|
|
|
|
* The number of lexemes < 4^32 |
|
|
|
|
* 0< Positional information < 16383 |
|
|
|
|
* No more than 256 positions per lexeme |
|
|
|
|
* The number of nodes ( lexemes + operations) in tsquery < 32768 |
|
|
|
|
|
|
|
|
|
References |
|
|
|
|
|
|
|
|
|
* GiST development site - |
|
|
|
|
[12]http://www.sai.msu.su/~megera/postgres/gist |
|
|
|
|
* OpenFTS home page - [13]http://openfts.sourceforge.net/ |
|
|
|
|
[6]http://www.sai.msu.su/~megera/postgres/gist |
|
|
|
|
* GiN development - [7]http://www.sigaev.ru/gin/ |
|
|
|
|
* OpenFTS home page - [8]http://openfts.sourceforge.net/ |
|
|
|
|
* Mailing list - |
|
|
|
|
[14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen |
|
|
|
|
eral |
|
|
|
|
|
|
|
|
|
[15]Documentation Roadmap |
|
|
|
|
|
|
|
|
|
[9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene |
|
|
|
|
ral |
|
|
|
|
|
|
|
|
|
Documentation Roadmap |
|
|
|
|
|
|
|
|
|
* Several docs are available from docs/ subdirectory |
|
|
|
@ -97,113 +108,103 @@ Documentation Roadmap |
|
|
|
|
+ "Tsearch2 Guide" by Brandon Rhodes |
|
|
|
|
+ "Tsearch2 Reference" by Brandon Rhodes |
|
|
|
|
* Readme.gendict in gendict/ subdirectory |
|
|
|
|
+ [16][Gendict tutorial] |
|
|
|
|
|
|
|
|
|
Online version of documentation is always available from Tsearch V2 |
|
|
|
|
home page - |
|
|
|
|
[17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
|
|
|
|
|
|
|
|
|
+ Also, check [10]Gendict tutorial |
|
|
|
|
* Check [11]tsearch2 Wiki pages for various documentation |
|
|
|
|
|
|
|
|
|
Support |
|
|
|
|
|
|
|
|
|
Authors urgently recommend people to use [18][openfts-general] or |
|
|
|
|
[19][pgsql-general] mailing lists for questions and discussions. |
|
|
|
|
|
|
|
|
|
Caution |
|
|
|
|
Authors urgently recommend people to use [12]openfts-general or |
|
|
|
|
[13]pgsql-general mailing lists for questions and discussions. |
|
|
|
|
|
|
|
|
|
In spite of apparent easy full text searching with our tsearch module |
|
|
|
|
(authors hope it's so), any serious search engine require profound |
|
|
|
|
study of various aspects, such as stop words, dictionaries, special |
|
|
|
|
parsers. Tsearch module was designed to facilitate both those cases. |
|
|
|
|
|
|
|
|
|
Development History |
|
|
|
|
|
|
|
|
|
Latest news |
|
|
|
|
|
|
|
|
|
To the PostgreSQL 8.2 release we added: |
|
|
|
|
* multibyte (UTF-8) support |
|
|
|
|
* Thesaurus dictionary |
|
|
|
|
* Query rewriting |
|
|
|
|
* rank_cd relevation function now support different weights of |
|
|
|
|
lexemes |
|
|
|
|
* GiN support adds scalability of tsearch2 |
|
|
|
|
|
|
|
|
|
Pre-tsearch era |
|
|
|
|
Development of OpenFTS began in 2000 after realizing that we |
|
|
|
|
needed a search engine optimized for online updates and able to |
|
|
|
|
access metadata from the database. This is essential for online |
|
|
|
|
Development of OpenFTS began in 2000 after realizing that we |
|
|
|
|
need a search engine optimized for online updates with access |
|
|
|
|
to metadata from the database. This is essential for online |
|
|
|
|
news agencies, web portals, digital libraries, etc. Most search |
|
|
|
|
engines available utilize an inverted index which is very fast |
|
|
|
|
for searching but very slow for online updates. Incremental |
|
|
|
|
updates of an inverted index is a complex engineering task |
|
|
|
|
while we needed something light, free and with the ability to |
|
|
|
|
access metadata from the database. The last requirement is very |
|
|
|
|
important because in a real life application a search engine |
|
|
|
|
should always consult metadata ( topic, permissions, date |
|
|
|
|
range, version, etc.). We extensively use PostgreSQL as a |
|
|
|
|
database backend and have no intention to move from it, so the |
|
|
|
|
problem was to find a data structure and a fast way to access |
|
|
|
|
it. PostgreSQL has rather unique data type for storing sets |
|
|
|
|
(think about words) - arrays, but lacks index access to them. A |
|
|
|
|
document is parsed into lexemes, which are identified in |
|
|
|
|
various ways (e.g. stemming, morphology, dictionary), and as a |
|
|
|
|
result is reduced to an array of integer numbers. During our |
|
|
|
|
research we found a paper of Joseph Hellerstein which |
|
|
|
|
introduced an interesting data structure suitable for sets - |
|
|
|
|
RD-tree (Russian Doll tree). It looked very attractive, but |
|
|
|
|
implementing it in PostgreSQL seemed difficult because of our |
|
|
|
|
ignorance of database internals. Further research lead us to |
|
|
|
|
the idea to use GiST for implementing RD-tree, but at that time |
|
|
|
|
the GiST code had for a long while remained untouched and |
|
|
|
|
contained several bugs. After work on improving GiST for |
|
|
|
|
version 7.0.3 of PostgreSQL was done, we were able to implement |
|
|
|
|
RD-Tree and use it for index access to arrays of integers. This |
|
|
|
|
implementation was ideally suited for small arrays and |
|
|
|
|
eliminated complex joins, but was practically useless for |
|
|
|
|
indexing large arrays. The next improvement came from an idea |
|
|
|
|
to represent a document by a single bit-signature, a so-called |
|
|
|
|
superimposed signature (see "Index Structures for Databases |
|
|
|
|
Containing Data Items with Set-valued Attributes", 1997, Sven |
|
|
|
|
Helmer for details). We developeded the contrib/intarray module |
|
|
|
|
and used it for full text indexing. |
|
|
|
|
|
|
|
|
|
engines available utilize an inverted index which is very fast |
|
|
|
|
for searching but very slow for online updates. Incremental |
|
|
|
|
updates of an inverted index is a complex engineering task |
|
|
|
|
while we needed something light, free and with the ability to |
|
|
|
|
access metadata from the database. The last requirement was |
|
|
|
|
very important because in a real life application search engine |
|
|
|
|
should always consult metadata ( topic, permissions, date |
|
|
|
|
range, version, etc.). We extensively use PostgreSQL as a |
|
|
|
|
database backend and have no intention to move from it, so the |
|
|
|
|
problem was to find a data structure and a fast way to access |
|
|
|
|
it. PostgreSQL has rather unique data type for storing sets |
|
|
|
|
(think about words) - arrays, but lacks index access to them. |
|
|
|
|
During our research we found a paper of Joseph Hellerstein, who |
|
|
|
|
introduced an interesting data structure suitable for sets - |
|
|
|
|
RD-tree (Russian Doll tree). Further research lead us to the |
|
|
|
|
idea to use GiST for implementing RD-tree, but at that time the |
|
|
|
|
GiST code was intouched for a long time and contained several |
|
|
|
|
bugs. After work on improving GiST for version 7.0.3 of |
|
|
|
|
PostgreSQL was done, we were able to implement RD-Tree and use |
|
|
|
|
it for index access to arrays of integers. This implementation |
|
|
|
|
was ideally suited for small arrays and eliminated complex |
|
|
|
|
joins, but was practically useless for indexing large arrays. |
|
|
|
|
The next improvement came from an idea to represent a document |
|
|
|
|
by a single bit-signature, a so-called superimposed signature |
|
|
|
|
(see "Index Structures for Databases Containing Data Items with |
|
|
|
|
Set-valued Attributes", 1997, Sven Helmer for details). We |
|
|
|
|
developeded the contrib/intarray module and used it for full |
|
|
|
|
text indexing. |
|
|
|
|
|
|
|
|
|
tsearch v1 |
|
|
|
|
It was inconvenient to use integer id's instead of words, so we |
|
|
|
|
introduced a new data type called 'txtidx' - a searchable data |
|
|
|
|
type (textual) with indexed access. This was a first step of |
|
|
|
|
our work on an implementation of a built-in PostgreSQL full |
|
|
|
|
introduced a new data type called 'txtidx' - a searchable data |
|
|
|
|
type (textual) with indexed access. This was a first step of |
|
|
|
|
our work on an implementation of a built-in PostgreSQL full |
|
|
|
|
text search engine. Even though tsearch v1 had many features of |
|
|
|
|
a search engine it lacked configuration support and relevance |
|
|
|
|
ranking. People were encouraged to use OpenFTS, which provided |
|
|
|
|
relevance ranking based on coordinate information and flexible |
|
|
|
|
configuration. OpenFTS v.0.34 is the last version based on |
|
|
|
|
a search engine it lacked configuration support and relevance |
|
|
|
|
ranking. People were encouraged to use OpenFTS, which provided |
|
|
|
|
relevance ranking based on positional information and flexible |
|
|
|
|
configuration. OpenFTS v.0.34 is the last version based on |
|
|
|
|
tsearch v1. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
tsearch V2 |
|
|
|
|
People recognized tsearch as a powerful tool for full text |
|
|
|
|
searching and insisted on adding ranking support, better |
|
|
|
|
configurability, etc. We already thought about moving most of |
|
|
|
|
the features of OpenFTS to tsearch, and in the early 2003 we |
|
|
|
|
decided to work on a new version of tsearch - tsearch v2. We've |
|
|
|
|
abandoned auxiliary index tables which were used by OpenFTS to |
|
|
|
|
store coordinate information and modified the txtidx type to |
|
|
|
|
store them internally. Also, we've added table-driven |
|
|
|
|
configuration, support of ispell dictionaries, snowball |
|
|
|
|
stemmers and the ability to specify which types of lexemes to |
|
|
|
|
index. Also, it's now possible to generate headlines of |
|
|
|
|
documents with highlighted search terms. These changes make |
|
|
|
|
tsearch more user friendly and turn it into a really powerful |
|
|
|
|
full text search engine. After announcing the alpha version, we |
|
|
|
|
received a proposal from Brandon Rhodes to rename tsearch |
|
|
|
|
functions to be more consistent. So, we have renamed txtidx |
|
|
|
|
type to tsvector and other things as well. |
|
|
|
|
|
|
|
|
|
To allow users of tsearch v1 smooth upgrade, we named the module as |
|
|
|
|
tsearch2. |
|
|
|
|
|
|
|
|
|
Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave |
|
|
|
|
people could download it from OpenFTS CVS (see link from [20][OpenFTS |
|
|
|
|
page] |
|
|
|
|
People recognized tsearch as a powerful tool for full text |
|
|
|
|
searching and insisted on adding ranking support, better |
|
|
|
|
configurability, etc. We already thought about moving most of |
|
|
|
|
the features of OpenFTS to tsearch, and in the early 2003 we |
|
|
|
|
decided to work on a new version of tsearch. We abandoned |
|
|
|
|
auxiliary index tables which were used by OpenFTS to store |
|
|
|
|
positional information and modified the txtidx type to store |
|
|
|
|
them internally. We added table-driven configuration, support |
|
|
|
|
of ispell dictionaries, snowball stemmers and the ability to |
|
|
|
|
specify which types of lexemes to index. Now, it's possible to |
|
|
|
|
generate headlines of documents with highlighted search terms. |
|
|
|
|
These changes make tsearch more user friendly and turn it into |
|
|
|
|
a really powerful full text search engine. Brandon Rhodes |
|
|
|
|
proposed to rename tsearch functions for consistency and we |
|
|
|
|
renamed txtidx type to tsvector and other things as well. To |
|
|
|
|
allow users of tsearch v1 smooth upgrade, we named the module |
|
|
|
|
as tsearch2. Since version 0.35 OpenFTS uses tsearch2. |
|
|
|
|
|
|
|
|
|
References |
|
|
|
|
|
|
|
|
|
10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
|
|
|
|
11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap |
|
|
|
|
12. http://www.sai.msu.su/~megera/postgres/gist |
|
|
|
|
13. http://openfts.sourceforge.net/ |
|
|
|
|
14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
|
|
|
|
15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap |
|
|
|
|
16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict |
|
|
|
|
17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ |
|
|
|
|
18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
|
|
|
|
19. http://archives.postgresql.org/pgsql-general/ |
|
|
|
|
20. http://openfts.sourceforge.net/ |
|
|
|
|
1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html |
|
|
|
|
2. http://snowball.tartarus.org/ |
|
|
|
|
3. http://openfts.sourceforge.net/ |
|
|
|
|
4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm |
|
|
|
|
5. http:www.jfg-networks.com/ |
|
|
|
|
6. http://www.sai.msu.su/~megera/postgres/gist |
|
|
|
|
7. http://www.sigaev.ru/gin/ |
|
|
|
|
8. http://openfts.sourceforge.net/ |
|
|
|
|
9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
|
|
|
|
10. http://www.sai.msu.su/~megera/wiki/Gendict |
|
|
|
|
11. http://www.sai.msu.su/~megera/wiki/Tsearch2 |
|
|
|
|
12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general |
|
|
|
|
13. http://archives.postgresql.org/pgsql-general/ |
|
|
|
|