#!/usr/sbin/indexer -d
###########################################################################
# This is a sample indexer config file.
# To start using it please edit and rename to indexer.conf.
# You can also make this file executable and run it directly.
# You may want to keep the original indexer.conf-dist for future reference.
# Use '#' to comment out lines.
# All command names are case insensitive (DBAddr=DBADDR=dbaddr).
# You may use the '\' character to continue the current command on the
# next line when required.
#
# You may include another configuration file at any place in indexer.conf
# using the "Include" command.
# A path starting with "/" is absolute:
#Include /etc/mnogosearch/inc1.conf
# Otherwise the path is relative:
#Include inc1.conf
###########################################################################


###########################################################################
# Section 1.
# Global parameters.
###########################################################################

# DBAddr
# Options (type, host, database name, port, user and password)
# to connect to the SQL database.
# Should be used before any other commands.
# Has global effect for the whole config file.
# Format:
#DBAddr <DBType>:[//[DBUser[:DBPass]@]DBHost[:DBPort]]/DBName/[?dbmode=mode]
#
# ODBC notes:
#   Use DBName to specify the ODBC data source name (DSN).
#   DBHost does not matter, use "localhost".
#
# Currently supported DBType values are
# mysql, pgsql, mssql, oracle, ibase, db2, mimer, sqlite.
#
# MySQL users can specify the path to a Unix socket when connecting to localhost:
# mysql://foo:bar@localhost/mnogosearch/?socket=/tmp/mysql.sock
#
# If you are using PostgreSQL and do not specify a hostname,
# e.g. pgsql://user:password@/dbname/
# then PostgreSQL will not work via TCP, but will use a Unix socket.
#
# You may also select the database mode of word storage.
# When "single" is specified, all words are stored in the same table.
# If "multi" is selected, words will be located in different tables.
# "multi" mode is usually faster but requires more tables.
# Default mode is "single".

DBAddr mysql://db_user:db_password@db_host/db_name/?dbmode=single

######################################################################
# VarDir /var/lib/mnogosearch
# You may choose alternative working directory for
# search results cache:
#
#VarDir /var/lib/mnogosearch

######################################################################
# NewsExtensions yes/no
# Whether to enable news extensions.
# Default value is no.
#NewsExtensions no

#######################################################################
#SyslogFacility
# This is used if indexer was compiled with syslog support and if you
# don't like the default value. Argument is the same as used in syslog.conf
# file. For list of possible facilities see syslog.conf(5)
#SyslogFacility local7

#######################################################################
# LocalCharset
# Defines the charset which will be used to store data in the database.
# All other character sets will be converted into the given charset.
# Take a look into mnoGoSearch documentation for detailed explanation
# how to choose a LocalCharset depending on languages used on your site(s).
# This command should be used once and takes global effect for the config file.
# Only most popular charsets used in Internet are written here.
# Take a look into the documentation to check the whole list of
# supported charsets.
# Default LocalCharset is iso-8859-1 (latin1).
#
# Western Europe: German, Finnish, French, Swedish
LocalCharset iso-8859-1
#LocalCharset windows-1252

# Central Europe: Czech, Slovenian, Slovak, Hungarian, Polish
#LocalCharset iso-8859-2
#LocalCharset windows-1250

# Baltic: Lithuanian, Estonian, Latvian
#LocalCharset iso-8859-4
#LocalCharset iso-8859-13
#LocalCharset windows-1257

# Cyrillic: Russian, Serbian, Ukrainian, Belarussian, Macedonian, Bulgarian
#LocalCharset koi8-r
#LocalCharset iso-8859-5
#LocalCharset x-mac-cyrillic
#LocalCharset windows-1251

# Arabic
#LocalCharset iso-8859-6
#LocalCharset windows-1256

# Greek
#LocalCharset iso-8859-7
#LocalCharset windows-1253

# Hebrew
#LocalCharset iso-8859-8
#LocalCharset windows-1255

# Turkish
#LocalCharset iso-8859-9
#LocalCharset windows-1254

# Vietnamese
#LocalCharset viscii
#LocalCharset windows-1258

# Chinese
#LocalCharset gb2312
#LocalCharset BIG5

# Korean
#LocalCharset EUC-KR

# Japanese
#LocalCharset Shift-JIS

# Full UNICODE
#LocalCharset UTF-8

#LocalCharset iso-8859-1
#LocalCharset windows-1252

#######################################################################
#ForceIISCharset1251 yes/no
# This option is useful for users who deal with Cyrillic content and broken
# (or misconfigured?) Microsoft IIS web servers, which tend not to report
# the charset correctly. This is a really dirty hack, but if this option is
# turned on, all servers that identify themselves as 'Microsoft' or 'IIS'
# are assumed to serve content in the Windows-1251 charset.
# This command should be used only once in the configuration file and takes
# global effect.
# Default: no
#ForceIISCharset1251 no

###########################################################################
#CrossWords yes/no
# Whether to build the CrossWords index.
# Default value is no.
#CrossWords no
CrossWords yes

###########################################################################
# StopwordFile
# Load stop words from the given text file. You may specify either an absolute
# file name or a name relative to the mnoGoSearch /etc directory.
# You may use several StopwordFile commands.
#
#StopwordFile stopwords/en.sl
Include stopwords.conf

###########################################################################
# LangMapFile
# Load language map for charset and language guesser from the given file.
# You may specify either an absolute file name or a name relative
# to mnoGoSearch /etc directory. You may use several LangMapFile commands.
#
#LangMapFile langmap/en.ascii.lm
Include langmap.conf

#######################################################################
# Word lengths. You may change default length range of words
# stored in the database. By default, words with the length in the
# range from 1 to 32 are stored.
#
#MinWordLength 1
#MaxWordLength 32

#######################################################################
# MaxDocSize bytes
# Default value 1048576 (1 Mb)
# Takes global effect for whole config file
MaxDocSize 10485760

#######################################################################
# URLSelectCacheSize num
# Default value 128
# Select targets to index at once.
#URLSelectCacheSize 1024

#######################################################################
# WordCacheSize bytes
# Default value 8388608 (8 Mb)
# Defines maximal in-memory words cache size.
# Note: cache is allocated for every DBAddr, so if you have 3 DBAddr
# commands and WordCacheSize is 10Mb, it can take up to 30Mb of memory.
#WordCacheSize 8388608

#######################################################################
# HTTPHeader
# You may add your own headers to the indexer HTTP request.
# Do not use the "If-Modified-Since" and "Accept-Charset" headers;
# these are composed by indexer itself.
# "User-Agent: mnoGoSearch/version" is sent too, but you may override it.
# The command has global effect for the whole configuration file.
#
#HTTPHeader "User-Agent: My_Own_Agent"
#HTTPHeader "Accept-Language: ru, en"
HTTPHeader "Accept-Language: fr, nl, en, de, es"
#HTTPHeader "From: webmaster@mysite.com"


# Set server.active to inactive for all server table records
# before loading new ones.
#FlushServerTable


#######################################################################
# ServerTable
# Load servers with all their parameters from the table specified in the argument.
# See an example of the server and srvinfo table structure in
# create/(your_database)/create.txt
#
#ServerTable mysql://user:pass@host/dbname/tablename


##########################################################################
# LoadChineseList
# Load the Chinese word frequency list.
# By default the GB2312 charset and the mandarin.freq dictionary are used.
#LoadChineseList


##########################################################################
# LoadThaiList
# Load the Thai word frequency list.
# By default the tis-620 charset and the thai.freq dictionary are used.
#LoadThaiList


##########################################################################
# Section 2.
# URL control configuration.


##########################################################################
#Allow [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ]
# Use this to allow URLs that match (or do not match) the given argument.
# The first three optional parameters describe the type of comparison.
# Default values are Match, NoCase, String.
# Use "NoCase" or "Case" to choose case insensitive or case sensitive
# comparison.
# Use "Regex" to choose regular expression comparison.
# Use "String" to choose string with wildcards comparison.
# Wildcards are '*' for any number of characters and '?'
# for one character.
# Note that '?' and '*' have special meaning in the "String" match type. Please
# use "Regex" to describe documents with '?' and '*' signs in the URL.
# "String" match is much faster than "Regex". Use "String" where
# possible.
# You may use several arguments for one "Allow" command.
# You may use this command any number of times.
# Takes global effect for the config file.
# Note that mnoGoSearch automatically adds one "Allow regex .*"
# command after reading the config file. This means that everything
# that is not disallowed is allowed.

# Examples
# Allow everything:
#Allow *
# Allow everything but .php .cgi .pl extensions case insensitively using regex:
#Allow NoMatch Regex \.php$|\.cgi$|\.pl$
# Allow .HTM extension case sensitively:
#Allow Case *.HTM


##########################################################################
#Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ]
# Use this to disallow URLs that match (or do not match) the given argument.
# The meaning of the first three optional parameters is exactly the same
# as for the "Allow" command.
# You can use several arguments for one "Disallow" command.
# Takes global effect for the config file.
#
# Examples:
# Disallow URLs that are not in udm.net domains using "string" match:
#Disallow NoMatch *.udm.net/*
# Disallow any except known extensions and directory index using "regex" match:
#Disallow NoMatch Regex \/$|\.htm$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
# Exclude cgi-bin and non-parsed-headers using "string" match:
#Disallow */cgi-bin/* *.cgi */nph-*
# Exclude anything with a '?' sign in the URL. Note that the '?' sign has a
# special meaning in "string" match, so we have to use "regex" match here:
#Disallow Regex \?
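#
# One more pattern, offered as an untested sketch and not part of the stock
# sample: session identifiers in the query string often produce duplicate
# URLs. The parameter name "PHPSESSID" is an assumption chosen for
# illustration; substitute whatever your site uses. The bracket expression
# [?&] matches a literal '?' or '&', so "regex" match is needed here for the
# same reason as in the '?' example:
#Disallow Regex [?&]PHPSESSID=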

Disallow Match *whoisonline*
Disallow Match *myagenda*
Disallow Match *&rand=*
Disallow Match */chat/*
Disallow Match */auth/*
Disallow Match */online/*
Disallow Match */user/*
Disallow Match */admin/*
Disallow Match */group/*
Disallow Match *delete*
Disallow Match *del*
Disallow Match *remove*
Disallow Match *example_document.html*

# Exclude some known extensions using fast "String" match:
Disallow *.b *.sh *.md5 *.rpm
Disallow *.arj *.tar *.zip *.tgz *.gz *.z *.bz2
Disallow *.lha *.lzh *.rar *.zoo *.ha *.tar.Z
Disallow *.gif *.jpg *.jpeg *.bmp *.tiff *.tif *.xpm *.xbm *.pcx
Disallow *.vdo *.mpeg *.mpe *.mpg *.avi *.movie *.mov *.wmv
Disallow *.mid *.mp3 *.rm *.ram *.wav *.aiff *.ra
Disallow *.vrml *.wrl *.png *.ico *.psd *.dat
Disallow *.exe *.com *.cab *.dll *.bin *.class *.ex_
#Disallow *.xls *.doc
Disallow *.tex *.texi *.texinfo
# Disallow *.rtf *.pdf *.ps *.eps
Disallow *.cdf
Disallow *.ai *.ppt *.hqx
Disallow *.cpt *.bms *.oda *.tcl
Disallow *.o *.a *.la *.so
Disallow *.pat *.pm *.m4 *.am *.css
Disallow *.map *.aif *.sit *.sea
Disallow *.m3u *.qt

# Exclude Apache directory list in different sort order using "string" match:
Disallow *D=A *D=D *M=A *M=D *N=A *N=D *S=A *S=D

# More complicated case. RAR .r00-.r99, ARJ a00-a99 files
# and UNIX shared libraries. We use "Regex" match type here:
Disallow Regex \.r[0-9][0-9]$ \.a[0-9][0-9]$ \.so\.[0-9]$


##########################################################################
#CheckOnly [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ]
# The meaning of the first three optional parameters is exactly the same
# as for the "Allow" command.
# Indexer will use the HEAD instead of the GET HTTP method for URLs that
# match (or do not match) the given arguments. This means that the file
# will only be checked for existence and will not be downloaded.
# Useful for zip, exe, arj and other binary files.
# Note that you can instead disallow those files with the Disallow commands
# given above.
# You may use several arguments for one "CheckOnly" command.
# Useful, for example, for searching through the URL names rather than
# the contents (a la FTP-search).
# Takes global effect for the config file.
#
# Check some known non-text extensions using "string" match:
#CheckOnly *.b *.sh *.md5
#CheckOnly *.arj *.tar *.zip *.tgz *.gz
#CheckOnly *.lha *.lzh *.rar *.zoo *.tar*.Z
#CheckOnly *.gif *.jpg *.jpeg *.bmp *.tiff
#CheckOnly *.vdo *.mpeg *.mpe *.mpg *.avi *.movie
#CheckOnly *.mid *.mp3 *.rm *.ram *.wav *.aiff
#CheckOnly *.vrml *.wrl *.png
#CheckOnly *.exe *.cab *.dll *.bin *.class
#CheckOnly *.tex *.texi *.xls *.doc *.texinfo
#CheckOnly *.rtf *.pdf *.cdf *.ps
#CheckOnly *.ai *.eps *.ppt *.hqx
#CheckOnly *.cpt *.bms *.oda *.tcl
#CheckOnly *.rpm *.m3u *.qt *.mov
#CheckOnly *.map *.aif *.sit *.sea
#
# or check ANY except known text extensions using "regex" match:
#CheckOnly NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
#CheckOnly NoMatch Regex &rand=[0-9][0-9][0-9][0-9]$|myagenda\.php.*$|whoisonline\.php.*$


##########################################################################
#HrefOnly [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ]
# The meaning of the first three optional parameters is exactly the same
# as for the "Allow" command.
#
# Use this to scan an HTML page for "href" tags without indexing the contents
# of the page, for URLs that match (or do not match) the given argument.
# The command has global effect for the whole configuration file.
#
# When indexing large mail list archives, for example, the index and thread
# index pages (like mail.10.html, thread.21.html, etc.)
# should be scanned for links but shouldn't be indexed:
#
#HrefOnly */mail*.html */thread*.html
HrefOnly Match *dk_sid=*
HrefOnly Match *indexer_login.php*
HrefOnly Match */your.domain.com/index.php*
HrefOnly Match */document.php*
HrefOnly Match */courses/*/index.php
HrefOnly Match */courses/*/
HrefOnly Match */document/headerpage.php*
HrefOnly Match */document/slideshow.php*


##########################################################################
#CheckMp3 [Match|NoMatch] [NoCase|Case] [String|Regex] [ ...]
# The meaning of the first three optional parameters is exactly the same
# as for the "Allow" command.
# If a URL matches the given rules, indexer will download only a small part
# of the document and try to find MP3 tags in it. On success, indexer
# will parse the MP3 tags; otherwise it will download the whole document and
# parse it as usual.
# Notes:
# This works only with servers that support the HTTP/1.1 protocol.
# The "Range: bytes" header is used to download the MP3 tag.
#CheckMp3 *.bin *.mp3


#######################################################################
#CheckMP3Only [Match|NoMatch] [NoCase|Case] [String|Regex] [ ...]
# The meaning of the first three optional parameters is exactly the same
# as for the "Allow" command.
# If a URL matches the given rules, indexer will, as with the CheckMp3
# command, download only a small part of the document and try to find MP3
# tags. On success, indexer will parse the MP3 tags; otherwise it will NOT
# download the whole document.
#CheckMP3Only *.bin *.mp3


# How to combine Allow, Disallow, CheckOnly, HrefOnly commands.
#
# indexer compares URLs against all these command arguments in the
# order of their appearance in the indexer.conf file.
# When indexer finds that a URL matches some rule, it decides what
# to do with this URL: allow it, disallow it, or use HEAD instead
# of the GET method. So you may arrange the Allow, Disallow,
# CheckOnly and HrefOnly commands in whatever order suits your needs.
# If none of these commands is given, mnoGoSearch will allow everything
# by default.
#
# There are many possible combinations. Samples of two of them are here:
#
# Sample of the first useful combination.
# Disallow known non-text extensions (zip, wav etc.),
# then allow everything else. This sample is uncommented above (note that
# there is actually no "Allow *" command; it is added automatically after
# indexer.conf is loaded).
#
# Sample of the second combination.
# Allow some known text extensions (html, txt) and directory index ( / ),
# then disallow everything else:
#
#Allow *.html *.txt */
#Disallow *


# HoldBadHrefs