%% LyX 1.5.3 created this file. For more info, see http://www.lyx.org/.
|
|
%% Do not edit unless you really know what you are doing.
|
|
\documentclass[a4paper,english,10pt]{article}
|
|
\usepackage{amssymb}
|
|
\usepackage{pslatex}
|
|
\usepackage[T1]{fontenc}
|
|
\usepackage[dvips]{graphicx}
|
|
\usepackage{url}
|
|
\usepackage{fancyhdr}
|
|
\usepackage{varioref}
|
|
\usepackage{prettyref}
|
|
\date{}
|
|
|
|
\begin{document}
|
|
|
|
\title{{\huge Phishing signatures creation HOWTO}}
|
|
\author{T\"or\"ok Edwin}
|
|
\maketitle
|
|
|
|
%TODO: define a LaTeX command, instead of using \textsc{RealURL} each time
|
|
|
|
\section{Database file format}
|
|
|
|
\subsection{PDB format}
|
|
This file contains URLs/hosts that are targets of phishing attempts.
It contains lines in the following format:
\begin{verbatim}
R[Filter]:RealURL:DisplayedURL[:FuncLevelSpec]
H[Filter]:DisplayedHostname[:FuncLevelSpec]
\end{verbatim}
|
|
|
|
\begin{description}
|
|
\item [{R}] regular expression, for the concatenated URL
|
|
\item [{H}] matches the \verb+DisplayedHostname+ as a simple pattern (literally, no regular expression)
|
|
\begin{itemize}
|
|
\item the pattern can match either the full hostname
|
|
\item or a subdomain of the specified hostname
|
|
\item to avoid false matches in case of subdomain matches, the engine checks that there is a dot(\verb+.+) or a space(\verb+ +) before the matched portion
|
|
\end{itemize}
|
|
\item [{Filter}] is ignored for R and H for compatibility reasons
|
|
\item [{\textsc{RealURL}}] is the URL the user is sent to, for example the \emph{href} attribute of an HTML anchor (\emph{<a> tag})
\item [{\textsc{DisplayedURL}}] is the URL description displayed to the user, i.e. where it is \emph{claimed} they are sent, for example the contents of an HTML anchor (\emph{<a> tag})
|
|
\item [{DisplayedHostname}] is the hostname portion of the \textsc{DisplayedURL}
|
|
\item [{FuncLevelSpec}] an (optional) functionality level, 2 formats are possible:
|
|
\begin{itemize}
|
|
\item \verb+minlevel+ all engines having functionality level $\geq$ \verb+minlevel+ will load this line
\item \verb+minlevel-maxlevel+ engines with functionality level $\geq$ \verb+minlevel+, and $<$ \verb+maxlevel+ will load this line
|
|
\end{itemize}
|
|
\end{description}
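
The rules for \verb+H+ entries above can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: the function name \verb+host_matches+ is hypothetical, and the sketch assumes that the pattern is compared against the end of the displayed hostname.

\begin{verbatim}
```python
def host_matches(pattern: str, displayed_hostname: str) -> bool:
    """Literal match of the full hostname, or of a subdomain of `pattern`."""
    if not pattern or not displayed_hostname.endswith(pattern):
        return False
    if len(displayed_hostname) == len(pattern):
        return True  # the pattern is the full hostname
    # Subdomain match: the character right before the matched portion must
    # be a dot or a space, so "notamazon.com" does not match "amazon.com".
    return displayed_hostname[-len(pattern) - 1] in ". "
```
\end{verbatim}

With the pattern \verb+amazon.com+, this sketch accepts \verb+amazon.com+ and \verb+www.amazon.com+, and rejects \verb+notamazon.com+.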
|
|
|
|
\subsection{GDB format}
|
|
This file contains URL hashes in the following format:
|
|
\begin{verbatim}
S:P:HostPrefix[:FuncLevelSpec]
S:F:Sha256hash[:FuncLevelSpec]
S1:P:HostPrefix[:FuncLevelSpec]
S1:F:Sha256hash[:FuncLevelSpec]
S2:P:HostPrefix[:FuncLevelSpec]
S2:F:Sha256hash[:FuncLevelSpec]
S:W:Sha256hash[:FuncLevelSpec]
\end{verbatim}
|
|
|
|
\begin{description}
|
|
\item [{S:}]
|
|
These are hashes for Google Safe Browsing - malware sites, and should not be used for other purposes.
|
|
\item [{S2:}]
|
|
These are hashes for Google Safe Browsing - phishing sites, and should not be used for other purposes.
|
|
\item [{S1:}]
|
|
Hashes for blacklisting phishing sites.
|
|
Virus name: Phishing.URL.Blacklisted
|
|
\item [{S:W}]
|
|
Locally whitelisted hashes.
|
|
\item [{HostPrefix}]
|
|
4-byte prefix of the sha256 hash of the last 2 or 3 components of the hostname.
|
|
If prefix doesn't match, no further lookups are performed.
|
|
\item [{Sha256hash}]
|
|
sha256 hash of the canonicalized URL, or a sha256 hash of its prefix/suffix according to the Google Safe Browsing ``Performing Lookups'' rules. There should be a corresponding \verb+:P:HostPrefix+ entry for the hash to be taken into consideration.
|
|
\end{description}
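
As a rough illustration of the \verb+HostPrefix+ field, the following Python sketch computes the 4-byte prefix for the last 2 and the last 3 components of a hostname. The function name \verb+host_prefix_candidates+ is hypothetical, and the exact byte string that the engine hashes (for example, whether a trailing slash is appended as in the Safe Browsing host keys) is not specified in this document, so hashing the bare lower-cased components is an assumption.

\begin{verbatim}
```python
import hashlib


def host_prefix_candidates(hostname: str) -> list[tuple[str, str]]:
    """Return (host key, hex of the first 4 SHA-256 bytes) for the last
    2 and the last 3 components of `hostname`."""
    parts = hostname.lower().split(".")
    candidates = []
    for count in (2, 3):
        if len(parts) >= count:
            key = ".".join(parts[-count:])
            # Assumption: the bare key is hashed, with no trailing slash.
            digest = hashlib.sha256(key.encode("utf-8")).digest()
            candidates.append((key, digest[:4].hex()))
    return candidates
```
\end{verbatim}

For \verb+a.b.example.com+ the sketch yields candidates for \verb+example.com+ and \verb+b.example.com+, each with an 8-character hexadecimal prefix.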
|
|
|
|
To see which hash/URL matched, look at the \verb+clamscan --debug+ output for the following strings:
\verb+Looking up hash+, \verb+prefix matched+, and \verb+Hash matched+.

Local whitelisting of .gdb entries can be done by creating a local.gdb file, and
adding a line \verb+S:W:<HASH>+.
|
|
|
|
\subsection{WDB format}
|
|
This file contains whitelisted URL pairs.
It contains lines in the following format:
\begin{verbatim}
X:RealURL:DisplayedURL[:FuncLevelSpec]
M:RealHostname:DisplayedHostname[:FuncLevelSpec]
\end{verbatim}
|
|
|
|
\begin{description}
|
|
\item [{X}] regular expression, for the \emph{entire URL}, not just the hostname
|
|
\begin{itemize}
|
|
\item The regular expression is by default anchored to start-of-line and end-of-line, as if you have used \verb+^RegularExpression$+
|
|
\item A trailing \verb+/+ is automatically added both to the regex, and the input string to avoid false matches
|
|
\item The regular expression matches the \emph{concatenation} of the \textsc{RealURL}, a colon(\verb+:+), and the \textsc{DisplayedURL} as a single string. It doesn't separately match \textsc{RealURL} and \textsc{DisplayedURL}!
|
|
\end{itemize}
|
|
\item [{M}] matches hostname, or subdomain of it, see notes for {H} above
|
|
\end{description}
|
|
|
|
\subsection{Hints}
|
|
|
|
\begin{itemize}
|
|
\item empty lines are ignored
|
|
\item the colons are mandatory
|
|
\item Don't leave extra spaces on the end of a line!
|
|
\item if any of the lines don't conform to this format, clamav will abort with a Malformed Database Error
|
|
\item see section \vref{sub:Extraction-of-realURL,} for more details on \textsc{realURL/displayedURL}
|
|
\end{itemize}
|
|
|
|
|
|
\subsection{Examples of PDB signatures}
|
|
To check for phishing mails that target amazon.com, or subdomains of amazon.com:
\begin{verbatim}
H:amazon.com
\end{verbatim}

To do the same, but for amazon.co.uk:
\begin{verbatim}
H:amazon.co.uk
\end{verbatim}
|
|
|
|
To limit the signatures to certain engine versions:
\begin{verbatim}
H:amazon.co.uk:20-30
H:amazon.co.uk:20-
H:amazon.co.uk:0-20
\end{verbatim}
First line: engine versions 20, 21, ..., 29 can load it

Second line: engine versions >= 20 can load it

Third line: engine versions < 20 can load it

In a real situation, you'd probably use the second form, for example when your signature relies on a feature
that is not available in earlier versions, or when earlier versions have bugs affecting your signature. Neither is the case here; the above examples
are for illustrative purposes only.
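
The three forms above can be checked with a small Python sketch. The function name \verb+engine_loads+ is hypothetical, and treating an empty specification as ``no restriction'' is an assumption based on the field being optional.

\begin{verbatim}
```python
def engine_loads(spec: str, engine_level: int) -> bool:
    """Decide whether an engine at `engine_level` loads a line whose
    FuncLevelSpec is `spec` ("min", "min-" or "min-max")."""
    if not spec:
        return True  # assumption: a missing spec means no restriction
    minlevel, dash, maxlevel = spec.partition("-")
    if engine_level < int(minlevel):
        return False
    if dash and maxlevel:
        return engine_level < int(maxlevel)  # upper bound is exclusive
    return True
```
\end{verbatim}

For example, \verb+engine_loads("20-30", 29)+ is true while \verb+engine_loads("20-30", 30)+ is false, matching the explanation of the first line.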
|
|
|
|
\subsection{Examples of WDB signatures}
|
|
To allow links that mix Amazon's country-specific domains and amazon.com between the \textsc{RealURL} and the \textsc{DisplayedURL}:
\begin{verbatim}
X:.+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?:.+\.amazon\.com([/?].*)?:17-
\end{verbatim}
|
|
Explanation of this signature:
|
|
\begin{description}
|
|
\item [{X:}] this is a regular expression
|
|
\item [{:17-}] load signature only for engines with functionality level >= 17 (recommended for type X)
|
|
\end{description}
|
|
|
|
The regular expression is the following (X:, :17- stripped, and a / appended):
\begin{verbatim}
.+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?:.+\.amazon\.com([/?].*)?/
\end{verbatim}

Explanation of this regular expression (note that it is a single regular expression, and not 2 regular
expressions split at the {:}).
|
|
\begin{itemize}
|
|
\item \verb;.+; any subdomain of
|
|
\item \verb;\.amazon\.; domain we are whitelisting (\textsc{RealURL} part)
|
|
\item \verb;(at|ca|co\.uk|co\.jp|de|fr); country-domains: at, ca, co.uk, co.jp, de, fr
|
|
\item \verb;([/?].*)?; recommended way to end the real URL part of a whitelist entry; this protects against embedded URLs (evilurl.example.com/amazon.co.uk/)
|
|
\item \verb;:; \textsc{RealURL} and \textsc{DisplayedURL} are concatenated via a {:}, so match a literal {:} here
|
|
\item \verb;.+; any subdomain of
|
|
\item \verb;\.amazon\.com; whitelisted DisplayedURL
|
|
\item \verb;([/?].*)?; recommended way to end displayed url part, to protect against embedded URLs
|
|
\item \verb;/; automatically added to further protect against embedded URLs
|
|
\end{itemize}
|
|
|
|
When you whitelist an entry, make sure you check that both domains are owned by the same entity.
What this whitelist entry allows is:
links claiming to point to amazon.com (\textsc{DisplayedURL}), but really going to a country-specific domain of Amazon (\textsc{RealURL}).
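
The way an \verb+X+ entry is applied can be approximated in Python as follows. This is a sketch under stated assumptions, not the engine's code: the helper name \verb+wdb_x_matches+ is hypothetical, and Python's \verb+re+ module stands in for the POSIX regex core (the two agree for the syntax used in this example).

\begin{verbatim}
```python
import re

AMAZON = (r".+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?"
          r":.+\.amazon\.com([/?].*)?")


def wdb_x_matches(signature_regex: str, real_url: str, displayed_url: str) -> bool:
    """Match RealURL:DisplayedURL as one string, anchored at both ends,
    with a trailing '/' appended to both the regex and the input."""
    subject = real_url + ":" + displayed_url + "/"
    anchored = "(?:" + signature_regex + ")/"
    return re.fullmatch(anchored, subject) is not None
```
\end{verbatim}

With this sketch, the pair \verb+www.amazon.de+ / \verb+www.amazon.com+ matches, while the embedded URL \verb+evilurl.example.com/amazon.co.uk/+ paired with \verb+www.amazon.com+ does not.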
|
|
|
|
\subsection{Example for how the URL extractor works}
|
|
Consider the following HTML file:
|
|
\begin{verbatim}
<html>
<a href="http://1.realurl.example.com/">
1.displayedurl.example.com
</a>
<a href="http://2.realurl.example.com">
2 d<b>i<p>splayedurl.e</b>xa<i>mple.com
</a>
<a href="http://3.realurl.example.com">
3.nested.example.com
<a href="http://4.realurl.example.com">
4.displayedurl.example.com
</a>
</a>
<form action="http://5.realurl.example.com">
sometext
<img src="http://5.displayedurl.example.com/img0.gif"/>
<a href="http://5.form.nested.displayedurl.example.com">
5.form.nested.link-displayedurl.example.com
</a>
</form>
<a href="http://6.realurl.example.com">
6.displ
<img src="6.displayedurl.example.com/img1.gif"/>
ayedurl.example.com
</a>
<a href="http://7.realurl.example.com">
<iframe src="http://7.displayedurl.example.com">
</a>
\end{verbatim}
|
|
|
|
The phishing engine extracts the following \textsc{RealURL/DisplayedURL} pairs from it:
\begin{verbatim}
http://1.realurl.example.com/
1.displayedurl.example.com

http://2.realurl.example.com
2displayedurl.example.com

http://3.realurl.example.com
3.nested.example.com

http://4.realurl.example.com
4.displayedurl.example.com

http://5.realurl.example.com
http://5.displayedurl.example.com/img0.gif

http://5.realurl.example.com
http://5.form.nested.displayedurl.example.com

http://5.form.nested.displayedurl.example.com
5.form.nested.link-displayedurl.example.com

http://6.realurl.example.com
6.displayedurl.example.com

http://6.realurl.example.com
6.displayedurl.example.com/img1.gif
\end{verbatim}
|
|
|
|
|
|
\subsection{How matching works}
|
|
|
|
\subsubsection{RealURL, displayedURL concatenation\label{sub:RealURL,-displayedURL-concatenation}}
|
|
|
|
The phishing detection module processes pairs of \textsc{RealURL/DisplayedURL}.
Matching against daily.wdb and daily.pdb is done as follows: the \textsc{RealURL} is concatenated with a \verb+:+ and the \textsc{DisplayedURL}, and the resulting \emph{line} is matched against the lines in daily.wdb/daily.pdb.
|
|
|
|
So if you have this line in daily.wdb:
|
|
\begin{verbatim}
M:www.google.ro:www.google.com
\end{verbatim}
|
|
|
|
and this href: \verb+<a href='http://www.google.ro'>www.google.com</a>+
|
|
then it will be whitelisted, but: \verb+<a href='http://images.google.com'>www.google.com</a>+
|
|
will not.
|
|
|
|
\subsubsection{What happens when a match is found}
|
|
|
|
In the case of the whitelist, a match means that the \textsc{RealURL/DisplayedURL}
|
|
combination is considered \textsc{clean}, and no further checks are
|
|
performed on it.
|
|
|
|
In the case of the domainlist, a match means that the \textsc{RealURL/displayedURL}
|
|
is going to be checked for phishing attempts.
|
|
|
|
Furthermore you can restrict what checks are to be performed by specifying the 3-digit hexnumber.
|
|
|
|
\subsubsection{Extraction of \textsc{realURL}, \textsc{displayedURL} from HTML tags\label{sub:Extraction-of-realURL,}}
|
|
|
|
The html parser extracts pairs of \textsc{realURL}/\textsc{displayedURL}
|
|
based on the following rules.
|
|
|
|
In version 0.93: After URLs have been extracted, they are normalized, and cut after the hostname.
|
|
\verb+http://test.example.com/path/somecgi?queryparameters+ becomes \verb+http://test.example.com/+
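
The ``cut after the hostname'' step can be illustrated with Python's standard \verb+urllib.parse+ module. This sketch covers only the truncation shown above, not the rest of the normalization, and the function name \verb+cut_after_hostname+ is hypothetical.

\begin{verbatim}
```python
from urllib.parse import urlsplit


def cut_after_hostname(url: str) -> str:
    """Drop the path, query and fragment, keeping scheme and host."""
    parts = urlsplit(url)
    # netloc also carries any port or user info present in the URL;
    # how the engine treats those is not covered by this sketch.
    return f"{parts.scheme}://{parts.netloc}/"
```
\end{verbatim}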
|
|
|
|
\begin{description}
|
|
\item [{a}] (anchor) the \emph{href} is the \textsc{realURL}, its \emph{contents}
|
|
is the \textsc{displayedURL}
|
|
|
|
\begin{description}
|
|
\item [{contents}] is the tag-stripped contents of the <a> tags, so for
|
|
example <b> tags are stripped (but not their contents)
|
|
\end{description}
|
|
nesting another <a> tag within an <a> tag (besides being invalid
HTML) is treated as a </a><a..
|
|
|
|
\item [{form}] the \emph{action} attribute is the \textsc{realURL}, and a
|
|
nested <a> tag is the \textsc{displayedURL}
|
|
\item [{img/area}] if nested within an \emph{<a>} tag, the \textsc{realURL}
is the \emph{href} of the a tag, and the \emph{src/dynsrc/area} is
the \textsc{displayedURL} of the img

if nested within a \emph{form} tag, then the action attribute of
the \emph{form} tag is the \textsc{realURL}

\item [{iframe}] if nested within an \emph{<a>} tag the \emph{src} attribute
is the \textsc{displayedURL}, and the \emph{href} of its parent \emph{a} tag
is the \textsc{realURL}

if nested within a \emph{form} tag, then the action attribute of
the \emph{form} tag is the \textsc{realURL}
|
|
|
|
\end{description}
|
|
|
|
\subsubsection{Example}
|
|
|
|
Consider this html file:
|
|
|
|
\begin{quote}
|
|
\emph{<a href=''evilurl''>www.paypal.com</a>}
|
|
|
|
\emph{<a href=''evilurl2'' title=''www.ebay.com''>click here to
|
|
sign in</a>}
|
|
|
|
\emph{<form action=''evilurl\_form''>}
|
|
|
|
\emph{Please sign in to <a href=''cgi.ebay.com''>Ebay</a> using
|
|
this form}
|
|
|
|
\emph{<input type='text' name='username'>Username</input>}
|
|
|
|
\emph{....}
|
|
|
|
\emph{</form>}
|
|
|
|
\emph{<a href=''evilurl''><img src=''images.paypal.com/secure.jpg''></a>}
|
|
\end{quote}
|
|
The resulting \textsc{realURL/displayedURL} pairs will be (note that
|
|
one tag can generate multiple pairs):
|
|
|
|
\begin{itemize}
|
|
\item evilurl / www.paypal.com
|
|
\item evilurl2 / click here to sign in
|
|
\item evilurl2 / www.ebay.com
|
|
\item evilurl\_form / cgi.ebay.com
|
|
\item cgi.ebay.com / Ebay
|
|
\item evilurl / images.paypal.com/secure.jpg
|
|
\end{itemize}
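
The extraction rules can be approximated with Python's standard \verb+html.parser+ module. This is a simplified sketch, not the parser used by clamav: the names \verb+PairExtractor+ and \verb+extract_pairs+ are hypothetical, only the \emph{a}, \emph{form}, \emph{img} and \emph{iframe} rules are covered, and the handling of the \emph{title} attribute is inferred from the example above rather than from the rules.

\begin{verbatim}
```python
from html.parser import HTMLParser


class PairExtractor(HTMLParser):
    """Collect (realURL, displayedURL) pairs from a subset of tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self.href = None    # href of the currently open <a>, if any
        self.title = None   # title attribute of that <a>, if any
        self.text = []      # tag-stripped contents of that <a>
        self.action = None  # action of the enclosing <form>, if any

    def _close_anchor(self):
        if self.href is not None:
            shown = " ".join("".join(self.text).split())
            if shown:
                self.pairs.append((self.href, shown))
            if self.title:
                self.pairs.append((self.href, self.title))
        self.href, self.title, self.text = None, None, []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "a":
            self._close_anchor()  # a nested <a> acts like </a><a ...>
            self.href = attrs.get("href")
            self.title = attrs.get("title")
            if self.action and self.href:
                self.pairs.append((self.action, self.href))
        elif tag in ("img", "iframe") and attrs.get("src"):
            real = self.href or self.action
            if real:
                self.pairs.append((real, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_anchor()
        elif tag == "form":
            self.action = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def extract_pairs(html: str):
    parser = PairExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs
```
\end{verbatim}

Fed with the HTML of the example above (written with ordinary double quotes), the sketch returns one pair per item of the list, in the same order.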
|
|
|
|
\subsection{Simple patterns\label{sec:Simple-patterns}}
|
|
|
|
Simple patterns are matched literally, i.e. if you say:
|
|
|
|
\begin{quote}
|
|
www.google.com
|
|
\end{quote}
|
|
it is going to match \emph{www.google.com}, and only that. The \emph{.
(dot)} character has no special meaning (see the section on regexes
\vref{sec:Regular-expressions} for how the \emph{.(dot)} character
behaves there).
|
|
|
|
|
|
\subsection{Regular expressions\label{sec:Regular-expressions}}
|
|
|
|
POSIX regular expressions are supported, and you can consider that
|
|
internally it is wrapped by \emph{\textasciicircum{}}, and \emph{\$.}
|
|
In other words, this means that the regular expression has to match
|
|
the entire concatenated (see section \vref{sub:RealURL,-displayedURL-concatenation}
|
|
for details on concatenation) url.
|
|
|
|
It is recommended that you read section \vref{sec:Introduction-to-regular}
to learn how to write regular expressions, and then come back and
read this for hints.

Be advised that clamav contains an internal, very basic regex matcher
to reduce the load on the regex matching core. Thus it is recommended
that you avoid using regex syntax not supported by it at the very
beginning of regexes (at least the first few characters).
|
|
|
|
Currently the clamav regex matcher supports:
|
|
|
|
\begin{itemize}
|
|
\item . (dot) character
|
|
\item $\backslash$ (escaping special characters)
|
|
\item | (pipe) alternatives
|
|
\item {[}] (character classes)
|
|
\item () (parentheses for grouping, but no group extraction is performed)
|
|
\item other non-special characters
|
|
\end{itemize}
|
|
Thus the following are not supported:
|
|
|
|
\begin{itemize}
|
|
\item + repetition
|
|
\item {*} repetition
|
|
\item \{\} repetition
|
|
\item backreferences
|
|
\item lookaround
|
|
\item other {}``advanced'' features not listed in the supported list ;)
|
|
\end{itemize}
|
|
This, however, shouldn't discourage you from using the ``not directly
supported'' features, because if the internal engine encounters
unsupported syntax, it passes it on to the POSIX regex core (beginning
from the first unsupported token; everything before that is still
processed by the internal matcher). An example might make this
clearer:
|
|
|
|
\emph{www$\backslash$.google$\backslash$.(com|ro|it) ({[}a-zA-Z])+$\backslash$.google$\backslash$.(com|ro|it)}
|
|
|
|
Everything up to \emph{({[}a-zA-Z])+} is processed internally; that
parenthesis (and everything beyond it) is processed by the POSIX core.
|
|
|
|
Examples of url pairs that match:
|
|
|
|
\begin{itemize}
|
|
\item \emph{www.google.ro images.google.ro}
|
|
\item www.google.com images.google.ro
|
|
\end{itemize}
|
|
Example of url pairs that don't match:
|
|
|
|
\begin{itemize}
|
|
\item www.google.ro images1.google.ro
|
|
\item images.google.com image.google.com
|
|
\end{itemize}
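
The matching and non-matching pairs listed above can be verified with a short Python sketch. Python's \verb+re+ module is used here as a stand-in for the POSIX regex core (an assumption that holds for this particular pattern), and \verb+pair_matches+ is a hypothetical helper; the pair is written with a space separator, as in the example.

\begin{verbatim}
```python
import re

PATTERN = r"www\.google\.(com|ro|it) ([a-zA-Z])+\.google\.(com|ro|it)"


def pair_matches(pair: str) -> bool:
    """The regex is implicitly anchored, so the whole pair must match."""
    return re.fullmatch(PATTERN, pair) is not None
```
\end{verbatim}

The pair \verb+www.google.ro images1.google.ro+ fails because the digit is not in the character class, and \verb+images.google.com image.google.com+ fails because the first URL does not start with \verb+www+.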
|
|
|
|
\subsection{Flags\label{sec:Flags}}
|
|
|
|
Flags are a binary OR of the following numbers:
|
|
|
|
\begin{description}
|
|
\item [{HOST\_SUFFICIENT}] 1
|
|
\item [{DOMAIN\_SUFFICIENT}] 2
|
|
\item [{DO\_REVERSE\_LOOKUP}] 4
|
|
\item [{CHECK\_REDIR}] 8
|
|
\item [{CHECK\_SSL}] 16
|
|
\item [{CHECK\_CLOAKING}] 32
|
|
\item [{CLEANUP\_URL}] 64
|
|
\item [{CHECK\_DOMAIN\_REVERSE}] 128
|
|
\item [{CHECK\_IMG\_URL}] 256
|
|
\item [{DOMAINLIST\_REQUIRED}] 512
|
|
\end{description}
|
|
The names of the constants are self-explanatory.
|
|
|
|
These constants are defined in libclamav/phishcheck.h, you can check
|
|
there for the latest flags.
|
|
|
|
There is a default set of flags that are enabled, these are currently:
\begin{verbatim}
(CLEANUP_URL|CHECK_SSL|CHECK_CLOAKING|CHECK_IMG_URL)
\end{verbatim}
SSL checking is currently performed only for \emph{a} tags.
|
|
|
|
You must decide for each line in the domainlist if you want to filter
any flags (that is, you don't want certain checks to be done), then
calculate the binary OR of those constants, and convert the result
into a 3-digit hexadecimal number. For example, you decide that DOMAIN\_SUFFICIENT
shouldn't be used for ebay.com, and you don't want to check images
either, so you come up with this flag number: $2|256=258$ (decimal) $=102$ (hexadecimal).
|
|
|
|
So you add this line to the domainlist (daily.pdb):
|
|
|
|
\begin{itemize}
|
|
\item R102~www.ebay.com~.+
|
|
\end{itemize}
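
The arithmetic can be reproduced with a few lines of Python. The constant values are copied from the list of flags in this section; the dictionary \verb+FLAGS+ and the helper \verb+filter_hex+ are hypothetical names used only for this illustration.

\begin{verbatim}
```python
FLAGS = {
    "HOST_SUFFICIENT": 1,
    "DOMAIN_SUFFICIENT": 2,
    "DO_REVERSE_LOOKUP": 4,
    "CHECK_REDIR": 8,
    "CHECK_SSL": 16,
    "CHECK_CLOAKING": 32,
    "CLEANUP_URL": 64,
    "CHECK_DOMAIN_REVERSE": 128,
    "CHECK_IMG_URL": 256,
    "DOMAINLIST_REQUIRED": 512,
}


def filter_hex(*names: str) -> str:
    """Binary OR of the named flags, as a 3-digit hexadecimal string."""
    value = 0
    for name in names:
        value |= FLAGS[name]
    return f"{value:03x}"
```
\end{verbatim}

For the ebay.com example, \verb+filter_hex("DOMAIN_SUFFICIENT", "CHECK_IMG_URL")+ gives \verb+102+; the default set of flags listed earlier evaluates to \verb+170+.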
|
|
|
|
\section{Introduction to regular expressions\label{sec:Introduction-to-regular}}
|
|
|
|
Recommended reading:
|
|
|
|
\begin{itemize}
|
|
\item http://www.regular-expressions.info/quickstart.html
|
|
\item http://www.regular-expressions.info/tutorial.html
|
|
\item regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7\&topic=regex
|
|
\end{itemize}
|
|
|
|
\subsection{Special characters}
|
|
|
|
\begin{description}
|
|
\item [{{[}}] the opening square bracket - it marks the beginning of a
character class, see section \vref{sub:Character-classes}
\item [{$\backslash$}] the backslash - escapes special characters,
see section \vref{sub:Escaping}
\item [{\^{ }}] the caret - matches the beginning of a line (not needed
in clamav regexes, this is implied)
\item [{\$}] the dollar sign - matches the end of a line (not needed in
clamav regexes, this is implied)
\item [{\.{ }}] the period or dot - matches \emph{any} character
\item [{|}] the vertical bar or pipe symbol - matches either of the tokens
on its left and right side, see section \vref{sub:Alternation}
\item [{?}] the question mark - optionally matches the left-side token,
see section \vref{sub:Optional-matching,-and}
\item [{{*}}] the asterisk or star - matches 0 or more occurrences of the
left-side token, see section \vref{sub:Optional-matching,-and}
\item [{+}] the plus sign - matches 1 or more occurrences of the left-side
token, see section \vref{sub:Optional-matching,-and}
\item [{(}] the opening round bracket - marks the beginning of a group,
see section \vref{sub:Groups}
\item [{)}] the closing round bracket - marks the end of a group, see section \vref{sub:Groups}
|
|
\end{description}
|
|
|
|
\subsection{Character classes\label{sub:Character-classes}}
|
|
|
|
|
|
\subsection{Escaping\label{sub:Escaping}}
|
|
|
|
Escaping has two purposes:
|
|
|
|
\begin{itemize}
|
|
\item it allows you to actually match the special characters themselves,
|
|
for example to match the literal \emph{+}, you would write \emph{$\backslash$+}
|
|
\item it also allows you to match non-printable characters, such as the
|
|
tab (\emph{$\backslash$t}), newline (\emph{$\backslash$n}),
|
|
..
|
|
\end{itemize}
|
|
However, since non-printable characters are not valid inside a URL,
you won't have a reason to use them.
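
The effect of escaping the dot can be seen in a short Python sketch. Python's \verb+re+ module is used for illustration only; the variable names are hypothetical.

\begin{verbatim}
```python
import re

unescaped = re.compile(r"www.google.com")    # each dot matches any character
escaped = re.compile(r"www\.google\.com")    # each dot matches a literal dot

# The unescaped pattern also accepts hostnames it was not meant to match.
loose_hit = unescaped.fullmatch("wwwXgoogle.com") is not None
strict_hit = escaped.fullmatch("wwwXgoogle.com") is not None

# re.escape() produces the escaped form of a literal string.
auto_escaped = re.escape("www.google.com")
```
\end{verbatim}

Here \verb+loose_hit+ is true and \verb+strict_hit+ is false, which is why literal dots in hostnames should be escaped in regular expressions.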
|
|
|
|
|
|
\subsection{Alternation\label{sub:Alternation}}
|
|
|
|
|
|
\subsection{Optional matching, and repetition\label{sub:Optional-matching,-and}}
|
|
|
|
|
|
\subsection{Groups\label{sub:Groups}}
|
|
|
|
Groups are usually used together with repetition, or alternation.
|
|
For example: \emph{(com|it)+} means: match 1 or more repetitions of
|
|
\emph{com} or \emph{it,} that is it matches: com, it, comcom, comcomcom,
|
|
comit, itit, ititcom,... you get the idea.
|
|
|
|
Groups can also be used to extract substring, but this is not supported
|
|
by the clam engine, and not needed either in this case.
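
The \emph{(com|it)+} example can be tried out in Python. The \verb+re+ module is used here as an illustration of the syntax, not as a statement about the clamav matcher, and \verb+group_matches+ is a hypothetical helper.

\begin{verbatim}
```python
import re

GROUP = re.compile(r"(com|it)+")


def group_matches(text: str) -> bool:
    """True if `text` consists of one or more repetitions of com or it."""
    return GROUP.fullmatch(text) is not None
```
\end{verbatim}

The helper accepts \verb+com+, \verb+it+, \verb+comcom+, \verb+comit+ and \verb+ititcom+, and rejects strings such as \verb+co+ or the empty string.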
|
|
|
|
|
|
\section{How to create database files}
|
|
|
|
|
|
\subsection{How to create and maintain the whitelist (daily.wdb)}
|
|
|
|
If the phishing code claims that a certain mail is phishing, but it is
not, you have 2 choices:
|
|
|
|
\begin{itemize}
|
|
\item examine your rules in daily.pdb, and fix them if necessary (see section \vref{sub:How-to-create})
|
|
\item add it to the whitelist (discussed here)
|
|
\end{itemize}
|
|
Let's assume you are having problems because of links like this in
a mail:

\begin{verbatim}
<a href="http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX">
http://www.bcentral.it/
</a>
\end{verbatim}
|
|
After investigating those sites further, you decide they are no threat,
|
|
and create a line like this in daily.wdb:
|
|
|
|
\begin{quote}
|
|
R http://www$\backslash$.bcentral$\backslash$.it/.+ http://69$\backslash$.0$\backslash$.241$\backslash$.57/bCentral/L$\backslash$.asp$\backslash$?L=.+
|
|
\end{quote}
|
|
Note: urls like the above can be used to track unique mail recipients,
|
|
and thus know if somebody actually reads mails (so they can send more
|
|
spam). However since this site required no authentication information,
|
|
it is safe from a phishing point of view.
|
|
|
|
|
|
\subsection{How to create and maintain the domainlist (daily.pdb)\label{sub:How-to-create}}
|
|
|
|
When not using \verb+--phish-scan-alldomains+ (production environments, for
example), you need to decide which URLs you are going to check.
|
|
|
|
Although at a first glance it might seem a good idea to check everything,
|
|
it would produce false positives. Particularly newsletters, ads, etc.
|
|
are likely to use URLs that look like phishing attempts.
|
|
|
|
Let's assume that you've recently seen many phishing attempts claiming
they come from PayPal. Thus you need to add PayPal to daily.pdb:
|
|
|
|
\begin{quote}
|
|
R .+ .+$\backslash$.paypal$\backslash$.com
|
|
\end{quote}
|
|
The above line will block (detect as phishing) mails that contain
URLs that claim to lead to PayPal, but in fact do not.

Be careful not to create regexes that match too broad a range of
URLs, though.
|
|
|
|
|
|
\subsection{Dealing with false positives, and undetected phishing mails}
|
|
|
|
|
|
\subsubsection{False positives}
|
|
|
|
Whenever you see a false positive (mail that is detected as phishing,
but is not), you need to examine \emph{why} clamav decided that it is
phishing. You can do this easily by building clamav with debugging
(\verb+./configure --enable-experimental --enable-debug+), and then running
a tool:
|
|
|
|
\begin{quote}
|
|
\$contrib/phishing/why.py phishing.eml
|
|
\end{quote}
|
|
This will show the URL that triggers the phish verdict, and the reason
why that URL is considered a phishing attempt.
|
|
|
|
Once you know the reason, you might need to modify daily.pdb (if one
of your rules in there is too broad), or you need to add the URL
to daily.wdb. If you think the algorithm is incorrect, please file
a bug report on bugzilla.clamav.net, including the output of \emph{why.py}.
|
|
|
|
|
|
\subsubsection{Undetected phish mails}
|
|
|
|
Using why.py doesn't help here unfortunately (it will say: clean),
|
|
so all you can do is:
|
|
|
|
\begin{quote}
|
|
\verb+$clamscan/clamscan --phish-scan-alldomains undetected.eml+
|
|
\end{quote}
|
|
Then see if the mail is detected; if it is, you need to add an appropriate
line to daily.pdb (see section \vref{sub:How-to-create}).
|
|
|
|
If the mail is not detected, then try using:
|
|
|
|
\begin{quote}
|
|
\verb+$clamscan/clamscan --debug undetected.eml|less+
|
|
\end{quote}
|
|
|
|
Then see what urls are being checked, see if any of them is in a
|
|
whitelist, see if all urls are detected, etc.
|
|
|
|
|
|
\end{document}
|
|
|