postgres/contrib/tsearch2
Tom Lane b053c53248 Fix erroneous parsing of tsquery input "... & !(subexpression) | ..." 15 years ago
data tsearch2 module 22 years ago
docs Add description of new features 19 years ago
expected Backpatch: Fix tsvector_out() and tsquery_out() to escape backslesh, add test of that. 18 years ago
gendict Fix a passel of recently-committed violations of the rule 'thou shalt 19 years ago
ispell Fix header's size of structs defines in ispell. 18 years ago
my2ispell Fix convertion for 'PFX flag N num' 19 years ago
snowball pgindent run for 8.2. 19 years ago
sql Backpatch: Fix tsvector_out() and tsquery_out() to escape backslesh, add test of that. 18 years ago
stopword Snowball multibyte. It's a pity, but snowball sources is very diferent for multibyte and 20 years ago
wordparser Prevent recursion during parse of email-like string with multiple '@'. 17 years ago
.gitignore Some more gitignore cleanups: cover contrib and PL regression test outputs. 15 years ago
Makefile Rename the uninstall scripts for contrib/lo and contrib/tsearch2 to 19 years ago
README.tsearch2 Fix two typos. 19 years ago
common.c pgindent run for 8.2. 19 years ago
common.h pgindent run for 8.2. 19 years ago
crc32.c Add CVS tag lines to files that were lacking them. 20 years ago
crc32.h Add CVS tag lines to files that were lacking them. 20 years ago
dict.c pgindent run for 8.2. 19 years ago
dict.h pgindent run for 8.2. 19 years ago
dict_ex.c Fix bug http://archives.postgresql.org/pgsql-bugs/2006-10/msg00258.php. 19 years ago
dict_ispell.c Add CVS tag lines to files that were lacking them. 20 years ago
dict_snowball.c Fix bug http://archives.postgresql.org/pgsql-bugs/2006-10/msg00258.php. 19 years ago
dict_syn.c Fix core dump of ispell for case of non-successfull initialization. 19 years ago
dict_thesaurus.c Fix memory reallocation condition 19 years ago
ginidx.c pgindent run for 8.2. 19 years ago
gistidx.c pgindent run for 8.2. 19 years ago
gistidx.h Add CVS tag lines to files that were lacking them. 20 years ago
prs_dcfg.c pgindent run for 8.2. 19 years ago
query.c Fix erroneous parsing of tsquery input "... & !(subexpression) | ..." 15 years ago
query.h pgindent run for 8.2. 19 years ago
query_cleanup.c This patch makes the error message strings throughout the backend 20 years ago
query_cleanup.h pgindent run for 8.2. 19 years ago
query_gist.c Fix broken compare function for tsquery_ops. Per Tom's report. 17 years ago
query_rewrite.c Fix incorrect cleanup of tsquery in ts_rewrite(). Per bug #4933 by 16 years ago
query_support.c Re-run pgindent, fixing a problem where comment lines after a blank 20 years ago
query_util.c Re-run pgindent, fixing a problem where comment lines after a blank 20 years ago
query_util.h Allow /contrib include files to compile on their own. 19 years ago
rank.c Fix indefinit loop in rank_cd for some corner queries. 18 years ago
snmap.c For some reason access/tupmacs.h has been #including utils/memutils.h, 20 years ago
snmap.h 1 add namespaces as Tom suggest http://www.pgsql.ru/db/mw/msg.html?mid=1987703 21 years ago
stopword.c Fix bug http://archives.postgresql.org/pgsql-bugs/2006-10/msg00258.php. 19 years ago
thesaurus Allow do not lexize words in substitution. 19 years ago
ts_cfg.c Fix URL generation in headline. Only tag lexeme will be replaced by space. 17 years ago
ts_cfg.h Fix URL generation in headline. Only tag lexeme will be replaced by space. 17 years ago
ts_lexize.c pgindent run for 8.2. 19 years ago
ts_locale.c Fix usage of char2wchar/wchar2char. Changes: 17 years ago
ts_locale.h Fix localization support for multibyte encoding and C locale. 19 years ago
ts_stat.c pgindent run for 8.2. 19 years ago
ts_stat.h Allow /contrib include files to compile on their own. 19 years ago
tsearch.sql.in Rename contrib contains/contained-by operators to @> and <@, per discussion. 19 years ago
tsvector.c Backpatch: Fix tsvector_out() and tsquery_out() to escape backslesh, add test of that. 18 years ago
tsvector.h pgindent run for 8.2. 19 years ago
tsvector_op.c Alphabetically order reference to include files, "S"-"Z". 19 years ago
untsearch.sql.in Rename contrib contains/contained-by operators to @> and <@, per discussion. 19 years ago
wparser.c Fix caching of unsuccessful initialization of parser or configuration. 19 years ago
wparser.h pgindent run. 22 years ago
wparser_def.c Fix URL generation in headline. Only tag lexeme will be replaced by space. 17 years ago

README.tsearch2

Tsearch2 - full text search extension for PostgreSQL

An [1]online version of this document is available.

Tsearch2 is a full text search engine, fully integrated into the
PostgreSQL RDBMS.

Main features

* Full online update
* Support for multiple table-driven configurations
* Flexible and rich linguistic support (dictionaries, stop words,
thesaurus)
* Full multibyte (UTF-8) support
* Sophisticated ranking functions with support for proximity and
structure information (rank, rank_cd)
* Index support (GiST and Gin) with concurrency and recovery support
* Rich query language with query rewriting support
* Headline support (text fragments with highlighted search terms)
* Ability to plug in custom dictionaries and parsers
* Template generator for tsearch2 dictionaries with [2]snowball
stemmer support
* It is mature (5 years of development)

Tsearch2, in a nutshell, provides an FTS operator (contains) for two
new data types, which represent a document (tsvector) and a query
(tsquery). Table-driven configuration allows custom searches to be
created using standard SQL commands.

tsvector is a searchable data type that represents a document. It is
a set of unique words along with their positional information in the
document, organized in a special structure optimized for fast access
and lookup. Each entry can be labelled to reflect its importance in
the document.
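
The idea of "unique words plus positions" can be illustrated with a
small Python sketch. This is only a toy model for explanation, not the
tsearch2 storage format; the function name build_vector and the sample
sentence are made up for the example, and labels are left out.

```python
from collections import defaultdict


def build_vector(words):
    """Map each unique word to the list of its 1-based positions."""
    positions = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        positions[word].append(pos)
    # Keep the entries sorted by word, so that a lookup could use
    # binary search (a stand-in for "optimized for fast lookup").
    return dict(sorted(positions.items()))


vector = build_vector("a fat cat sat on a mat".split())
```

Here the word "a" occurs twice, so it is stored once with two
positions, while every other word carries a single position.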

tsquery is a data type for textual queries with support for boolean
operators. It consists of lexemes (optionally labelled) with boolean
operators between them.
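
As a rough illustration of how a boolean query relates to a document,
the Python sketch below evaluates a query tree against a set of
lexemes. The tuple encoding of the query and the function name matches
are assumptions made for this example; they do not describe the
internal tsquery representation.

```python
def matches(query, lexemes):
    """Evaluate a nested-tuple boolean query against a set of lexemes."""
    op = query[0]
    if op == "lex":
        return query[1] in lexemes
    if op == "not":
        return not matches(query[1], lexemes)
    if op == "and":
        return matches(query[1], lexemes) and matches(query[2], lexemes)
    if op == "or":
        return matches(query[1], lexemes) or matches(query[2], lexemes)
    raise ValueError(f"unknown operator: {op!r}")


document = {"fat", "cat", "sat", "mat"}
# Hypothetical query meaning: fat AND NOT rat
query = ("and", ("lex", "fat"), ("not", ("lex", "rat")))
found = matches(query, document)
```

The document contains "fat" and does not contain "rat", so this query
matches it.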

Table-driven configuration allows you to specify:
* the parser used to break a document into lexemes
* which lexemes to index and how they are processed
* the dictionaries to be used, along with stop word recognition.
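
The three configuration points above can be pictured as a small
pipeline. The Python sketch below is a simplified model under stated
assumptions: the parser, the token type names, the stop word list and
the CONFIG mapping are all invented for the example, whereas tsearch2
keeps the real configuration in database tables.

```python
import re

STOP_WORDS = {"a", "the", "on"}  # assumed stop word list


def parse(text):
    """Toy parser: yield (token_type, token) pairs."""
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        yield ("number" if token.isdigit() else "word", token)


def simple_dict(token):
    """Toy dictionary: lowercase the token and drop stop words."""
    lexeme = token.lower()
    return None if lexeme in STOP_WORDS else lexeme


# Hypothetical configuration: token type -> dictionary to apply.
# Token types missing from the mapping are not indexed at all.
CONFIG = {"word": simple_dict}


def to_lexemes(text, config=CONFIG):
    """Run the parser, then the configured dictionary per token type."""
    lexemes = []
    for token_type, token in parse(text):
        dictionary = config.get(token_type)
        if dictionary is None:
            continue  # this token type is not indexed
        lexeme = dictionary(token)
        if lexeme is not None:
            lexemes.append(lexeme)
    return lexemes


result = to_lexemes("The cat sat on 2 mats")
```

In this sketch the number is skipped because no dictionary is
configured for its token type, and "The" and "on" are removed as stop
words.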

OpenFTS vs Tsearch2

[3]OpenFTS is middleware between the application and the database.
OpenFTS uses tsearch2 for storage and the database engine as the
query executor (searching). Everything else, i.e. parsing of
documents, query processing and linguistics, is carried out on the
client side. That is why OpenFTS has its own configuration table
(fts_conf) and works with its own set of dictionaries. OpenFTS is
more flexible, because it can be used in a multi-server architecture
with separate machines for the document repository (documents can be
stored in the filesystem), the database and the query engine.

See [4]Documentation Roadmap for links to documentation.

Authors

* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia

Contributors

* Robert John Shepherd and Andrew J. Kopciuch submitted
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
v2)
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
Reference" and proposed the new naming convention for tsearch V2

Sponsors

* ABC Startsiden - compound words support
* University of Mannheim for UTF-8 support (in 8.2)
* jfg:networks ([5]http://www.jfg-networks.com/) for Gin - Generalized
Inverted index (in 8.2)
* Georgia Public Library Service and LibLime, Inc. for Thesaurus
dictionary
* PostGIS community - GiST Concurrency and Recovery

The authors are grateful to the Russian Foundation for Basic Research
and Delta-Soft Ltd., Moscow, Russia for support.

Limitations

* Length of lexeme < 2K
* Length of tsvector (lexemes + positions) < 1 MB
* The number of lexemes < 4^32
* 0 < positional information < 16383
* No more than 256 positions per lexeme
* The number of nodes (lexemes + operations) in tsquery < 32768

References

* GiST development site -
[6]http://www.sai.msu.su/~megera/postgres/gist
* GiN development - [7]http://www.sigaev.ru/gin/
* OpenFTS home page - [8]http://openfts.sourceforge.net/
* Mailing list -
[9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general

Documentation Roadmap

* Several docs are available in the docs/ subdirectory
+ "Tsearch V2 Introduction" by Andrew Kopciuch
+ "Tsearch2 Guide" by Brandon Rhodes
+ "Tsearch2 Reference" by Brandon Rhodes
* Readme.gendict in the gendict/ subdirectory
+ Also, check the [10]Gendict tutorial
* Check the [11]tsearch2 Wiki pages for various documentation

Support

The authors strongly recommend using the [12]openfts-general or
[13]pgsql-general mailing lists for questions and discussions.

Development History

Latest news

For the PostgreSQL 8.2 release we added:
* multibyte (UTF-8) support
* Thesaurus dictionary
* Query rewriting
* the rank_cd relevance function now supports different weights of
lexemes
* GiN support, which improves the scalability of tsearch2

Pre-tsearch era
Development of OpenFTS began in 2000, after we realized that we
needed a search engine optimized for online updates with access
to metadata from the database. This is essential for online
news agencies, web portals, digital libraries, etc. Most search
engines available use an inverted index, which is very fast
for searching but very slow for online updates. Incrementally
updating an inverted index is a complex engineering task,
while we needed something light, free and able to access
metadata from the database. The last requirement was very
important because in a real-life application the search engine
should always consult metadata (topic, permissions, date
range, version, etc.). We use PostgreSQL extensively as a
database backend and have no intention of moving away from it,
so the problem was to find a data structure and a fast way to
access it. PostgreSQL has a rather unique data type for storing
sets (think about words), arrays, but lacks index access to
them. During our research we found a paper by Joseph
Hellerstein, who introduced an interesting data structure
suitable for sets: the RD-tree (Russian Doll tree). Further
research led us to the idea of using GiST to implement the
RD-tree, but at that time the GiST code had been untouched for
a long time and contained several bugs. After the work on
improving GiST for version 7.0.3 of PostgreSQL was done, we
were able to implement the RD-tree and use it for index access
to arrays of integers. This implementation was ideally suited
to small arrays and eliminated complex joins, but was
practically useless for indexing large arrays. The next
improvement came from the idea of representing a document by a
single bit-signature, a so-called superimposed signature (see
"Index Structures for Databases Containing Data Items with
Set-valued Attributes", 1997, Sven Helmer for details). We
developed the contrib/intarray module and used it for full
text indexing.
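
The superimposed signature idea mentioned above can be sketched in a
few lines of Python. This is a generic illustration of the technique,
not the code used in contrib/intarray or tsearch2: the 64-bit width,
the use of CRC32 as the hash and all function names are assumptions
chosen for the example.

```python
import zlib

SIGNATURE_BITS = 64  # assumed width, chosen only for this example


def word_bit(word):
    """Hash a word to a single bit of the signature."""
    return 1 << (zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS)


def signature(words):
    """Superimpose (bitwise OR) the bits of all words into one integer."""
    sig = 0
    for word in words:
        sig |= word_bit(word)
    return sig


def may_contain(doc_sig, query_words):
    """True if every query bit is set in the document signature.

    A True result can be a false positive (different words may share
    a bit), so a real system has to recheck the document itself.
    A False result is always definite.
    """
    query_sig = signature(query_words)
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["fat", "cat", "sat", "mat"])
```

The appeal of this representation is that a whole document collapses
into one fixed-size value that is cheap to compare, at the price of
occasional false positives.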

tsearch v1
It was inconvenient to use integer IDs instead of words, so we
introduced a new data type called 'txtidx' - a searchable data
type (textual) with indexed access. This was the first step of
our work on a built-in PostgreSQL full text search engine.
Even though tsearch v1 had many features of a search engine, it
lacked configuration support and relevance ranking. People were
encouraged to use OpenFTS, which provided relevance ranking
based on positional information and flexible configuration.
OpenFTS v.0.34 is the last version based on tsearch v1.

tsearch V2
People recognized tsearch as a powerful tool for full text
searching and insisted on adding ranking support, better
configurability, etc. We had already thought about moving most
of the features of OpenFTS to tsearch, and in early 2003 we
decided to work on a new version of tsearch. We abandoned the
auxiliary index tables which OpenFTS used to store positional
information and modified the txtidx type to store it
internally. We added table-driven configuration, support for
ispell dictionaries and snowball stemmers, and the ability to
specify which types of lexemes to index. It is now possible to
generate headlines of documents with highlighted search terms.
These changes make tsearch more user friendly and turn it into
a really powerful full text search engine. Brandon Rhodes
proposed renaming the tsearch functions for consistency, and we
renamed the txtidx type to tsvector, among other things. To
give users of tsearch v1 a smooth upgrade path, we named the
module tsearch2. Since version 0.35 OpenFTS uses tsearch2.

References

1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
2. http://snowball.tartarus.org/
3. http://openfts.sourceforge.net/
4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
5. http://www.jfg-networks.com/
6. http://www.sai.msu.su/~megera/postgres/gist
7. http://www.sigaev.ru/gin/
8. http://openfts.sourceforge.net/
9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
10. http://www.sai.msu.su/~megera/wiki/Gendict
11. http://www.sai.msu.su/~megera/wiki/Tsearch2
12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
13. http://archives.postgresql.org/pgsql-general/