Add a new logical signature subsignature type for matching on images
with image fuzzy hashes.
Image fuzzy hash subsigantures follow this format:
fuzzy_img#<hash>#<dist>
In this initial implementation, the hamming distance (dist) is ignored
and only exact fuzzy hash matches will alert.
Fuzzy hash matching is only performed for supported image types.
Also: removed some excessive debug log messages on start-up.
Fixed an issue where the signature name (virname) is being allocated and
stored for every subsignature or even ever sub-pattern in an AC-pattern
(i.e. NDB sig or LDB subsig) containing a `{n-m}` or `*` wildcard.
This fix is only for LDB subsigs though. NDB signatures are still
allocaing one virname per sub-pattern.
This fix was required because I needed a place to store the virname with
fuzzy-hash subsignatures. Storing it in the fuzzy-hash subsig
metadatathe way AC-pattern, PCRE, and BComp subsigs were doing it
wouldn't work because it would cross the C-Rust FFI boundary and giving
pointers to Rust allocated stuff is dicey. Not to mention native Rust
strings are different thatn C strings. Anyways, the correct thing to do
was to store the virname with the actual logical signature.
TODO: Keep track of NDB signatures in the same way and store the virname
for NDB sigs there instead of in AC-patterns so that we can get rid of
the virname field in the AC-pattern struct.
This commit addresses the signature load time issue in the following steps:
1. Loaded list items are allocated but left unattached; only a node reference is set on them for further processing. This is done with no increase of memory usage. See changes in insert_list and matcher-ac.h
2. Before the tries are built, the whole list of entries is sorted by node, then by pattern, then by partno. This requires O(N log(N)) time.
3. The list is processed linearly, one node at a time and the `next_same` chain is built. Each next_same chain head is also extracted. This requires O(N) time.
4. The list of heads is sorted by partno. This requires O(M log(M)) time on average with M<=N.
5. The list of heads is processed linearly and the `next` chain is built. This has O(M) complexity.
And improves scantime performance, by adding checks to:
1. Place longer lists earlier in the trie.
2. Keep close patterns close, rather than scattering them further apart.
This reduced memory cache faults to improve load and scan time performance.
In the LDB there is (one or more) special subsignature ${min-max}MACROID$,
which means:
must match any signature from group MACROID (for current filetype),
and the match must occur at a distance of min-max from the start(!) of the
previous logical subsignature match.
It also has the sideeffect of making the previous subsignature considered a
match only if both that and the macro matches. The offset of first match for
the previous logical subsig will be the offset where the {min-max} distance is
satisfied.
The macro logical subsignature will have a count of 0 (if it didn't match
together with the previous subsig), or a count of 1 if it did.
The matches can occur anywhere (even in
different ac scan buffers), since I don't call cli_ac_scanbuff I just use the
offset of first match (which we have for the bytecode anyway).
There can be at most 32 macro groups, signatures are added to a macro group by
using $MACROID$ as offset.
For example pdb entries could be converted to PDB:3:$0:<hexsig of domainname>
if we assign macro id 0 to PDB (and we can assign 31 more macro ids to
whatever).
Example:
test.ldb:
TestMacro;Target:0;0&1;616161;${3-4}12$
test.ndb:
D:0:$12:6262
D:0:$12:6363
D:0:$11:6262
test.dat:
aaaaxccdd
test-nomatch.dat:
aaaaxxxccdd