Signatures that start with "PUA.", "Heuristics.", or "BC.Heuristics."
are perceived to be less serious, or more likely to have false
positives, than other signatures that we would think of us as "strong
indicators".
At present, only a subset of "Heuristics." signatures, such as those
added by the phishing module, are added as "potentially unwanted".
Unless you're using heuristic-precedence mode, these "potentially
unwanted" indicators are recorded but not reported unless no other
signature alerts. This behavior should apply to all signatures that
start with "PUA." and "Heuristics.". We already do a string match
comparison on the signature name to apply that behavior to bytecode
matches that start with "BC.Heuristics.".
I moved that string comparison logic used for "BC.Heuristics." into the
main `cl_append_virus()` function and extended it to cover the other two
cases.
I also replaced all hardcoded calls to append "Heuristics." signatures
to append using the `cli_append_potentially_unwanted()` function, so we
can skip the string compare in these cases. That function will of course
append them as strong indicators if heuristic-precedence mode is
enabled.
If initial file type recognition comes back as an SFX type, which may
happen for small files that do not get recognized as any other file type
and contain a zip entry somewhere in the middle, then the type will be
set to that SFX type. This is a problem because later on when we go to
do embedded file type recognition, we explicitly skip SFX types, in
addition to TAR's and other types that are parsed elsewhere and have a
high embedded file type recognition FP-rate because they aren't
compressed.
This commit prohibits that initial FTM check from selecting an SFX type.
The SFX type will be rediscovered in `scanraw()` where the type is
handled/parsed.
Added inline documentation and did some general cleanup of the
`cli_scan_buff`, and then updated the function comment now that I
understand the function a little better.
While doing this, I found that the calls to cli_ac_initdata were being
done regardless of whether or not logically initialized matcher data was
required or used. But the call to free that matcher data was only being
done when AC-data was not provided by the caller. This would be a leak.
I fixed this by only initalizing the AC data when AC data is not
provided by the caller.
Significantly tidy the `cli_scan_fmap()` function, and add comments to
explain how it all works.
Add SHA1 and SHA256 digest variables to the FMAP structure in addition
to the existing MD5. Add a function to set the hash so that when we
calculate the hashes for hash matching, we save them for subsequent FP
matching. This enabled me to remove the extra "hash-only" FP check from
`cli_scan_fmap()`. This will also make it easier to switch the clean
cache hash algorithm to SHA256 in the future.
Remove extra allmatch checks that are no longer required.
Add a new header to prevent #include order issues.
Removed the allmatch checks that are no longer necessary now that it is
done when appending alerts.
Cleaned up error handling in the process to use the `goto done;` method.
I believe this fixes a minor bug wherein a detection in CAB or CHM
metadata scanning would end the scan early in allmatch mode.
Based on changes observed in the commits:
- 08fef61fea
- 1aa8768db2
... I believe that the iptr variable was intended to be used and was
accidentally not used.
In addition, adding the `ULL` suffix in the second location to match the
first, because it appears to have been accidentally omited.
The bytecode source files largely use `int` instead of the appropriate
`cl_errot_t` for clamav status codes, as well for boolean variables.
This hides warnings that would indicate bugs, and makes it harder to
read the code.
I haven't gone as in depth as with some other code cleanups. This
largely just replaces function interfactes and ret variables that use
`int` with `cl_error_t`. I also swapped a couple of `int`s to `bool`s.
While doing so I found that the `cli_bytecode_context_setpdf()` function
was incorrectly placed in the `bytecode_api.c` file instead of the next
to similar functions (`cli_bytecode_context_setpe`, etc.) in bytecode.c.
It's not an API function, so I moved it to the correct location.
I also eliminated a couple of compiler warnings:
- LLVM's CFG.h header emits a warning about a multi-line comment, so
that crops up with using LLVM for the bytecode runtime.
I disabled the warning through CMake.
- C doesn't like using the `inline` keyword on cli_dbgmsg in the
declaration in `bytecode2llvm.c` because we're compiling the bytecode
runtimes as a separate object file from the rest of libclamav.
It doesn't appear to be a functional issue, but I swapped that file
over to use `cli_dbgmsg_no_inline()` instead, just in case.
I would hope link-time-optimization will inline it anyways.
When potentially unwanted indicators are found, they are not reported
right away, except in heuristic-precedence mode. At the end of the
scan, we then report all of the heuristic/PUA alerts. In my initial
overhaul of the allmatch mode, I introduced an issue where it is never
reporting more than 1 heuristic/PUA alert.
This commit fixes that by reporting all potentially-unwanted indicators
at the end. Note that instead of using `cli_virus_found_cb()` which
would report the top layer file descriptor as well in the callback,
I'm reporting `-1` for the file descriptor instead, because we don't
know when the alert was added. If it was at a deeper layer in the file,
we would not longer have access to it at the end of the scan.
Also removed excessive debug statements '... infected with ...' that use
`cli_get_last_virus()` because it doesn't really reflect all of the
matches in allmatch mode, because the match would've been reported
previously (so it is excessive), and because I don't like the word
'infected'. :)
Also removed message saying 'Infected with ...'.
It's not really infected, and with allmatch it'll only print the last
match anyways.
And fix issue where magic_scan() critical error may be ignored.
When using --heuristic-precendence, the heuristic alerts in the image
parsers much be reported as regular matches, which means that when not
in all-match mode, the scan should abort. Right now, those matches
aren't being recorded so scanning continues in archives containing more
than just the malformed image file.
This fix returns the "CL_VIRUS" status code when heuristic-precendence
is enabled and allmatch is not enabled.
I found mixed types and multiple bugs in the hashtable/map/set code, and
very little documentation.
The most documentation available is the bytecode compiler users manual.
Although I also found one discrepancy there with the return value for
the BC API map_remove function that calls cli_map_removekey() and so put
in an issue with the compiler project for the documentation.
Most notably is that this hashtab.c had a lot of functions returning
negative enum values instead of returning the enums and then having the
caller evaluate the return code to return a negative/0/1 result.
This commit fixes all of that, and adds in a bunch of documentation to
explain the purpose and behavior of each function and structure provided
by hashtab.c/.h.
Specific bugs that I know I fixed outside of code quality improvements:
- cli_hashset_toarray() was returning CL_ENULLARG / CL_EMEM on failure,
when the caller is expecting a ssize_t to indicate how big of an array
is allocated. It now returns -1 on failure.
I also found that an attempt was made to have the same API that takes a
mempool pointer even if mempool is disabled. I preserved that, but made
it so the macro is in all-caps so it's more obvious what is going on.
Rework the append_virus mechanism to store evidence (strong indicators,
pua indicators, and eventually weak indicators) in vectors. When
appending a "virus", we will return CLEAN when in allmatch-mode, and
simply add the indicator to the appropriate vector.
Later we can check if there were any alerts to return a vector by
summing the lengths of the strong and pua indicator vectors.
This does away with storing the latest "virname" in the scan context.
Instead, we can query for the last indicator in the evidence, giving
priority to strong indicators.
When heuristic-precendence is enabled, add PUA as Strong instead of
as PotentiallyUnwanted. This way, they will be treated equally and
reported in order in allmatch mode.
Also document reason for disabling cache with metadata JSON enabled
A copy-paste error in the markdown support for building the macOS-
installer caused an issue where markdown module may be accidentally
used instead of pytest. Strangely, this still worked okay up until we
converted the clamscan tests into a directory of tests, at which point
the `python3 -m markdown` invocation was upset that it was trying to
execute a directory instead of a .py file.
This commit removes the line that overwrites the python test command
with using the markdown module.
Adds a test set authored by Andrew Williams that validates correct
allmatch behavior using as many features as possible to alert on a
test.exe program. Source for building the test.exe program is provided,
for those who are curious what it is and what it does, or in case it
needs to be re-built for some reason.
In addition to adding a test that verifies each of the sigs that should
alert, do alert, this adds a test to verify that if an authenticode
trust signature is added, none of the signatures alert. That test is
presently failing (expected failure, so the tests all pass) and should
be updated when the certificate verification bug is fixed.
Adds a set of clamscan tests to verify trusting files by hash at the
current (normalized, in this case), the parent (non-normalized), and
grandparent (zipped) layers.
This test is added because of work done in the `cli_scan_fmap()`
function that trimmed out extra hash FP-checks.
The `clamscan_test.py` file is getting way too long.
Created a new `unit_tests/clamscan` directory and separated all tests
into separate test files.
I also fixed an issue with the clamscan `ign2` test:
The `ign2` test wasn't written correctly and was actually testing
detection despite using the `-d` parameter to try to ignore a signature.
There is a minor bug where `ign2` files may be loaded after other files
when using the `-d` option. It is only guaranteed to be loaded first if
you load all the sigs from the same directory. I fixed the test.
In the future, we should make it so all database files are sorted in a
list before load time regardless of where they're sourced from.
Add new tests to verify correct behaviour combining:
- a "potentially unwanted" indicator
- a regular alert (strong indicator)
The test file is a zip that:
1. first has a malformed PNG file used to trigger the heuristic alert
when using `--alert-broken-media`.
2. second has the clam.exe program used to trigger the normal alert.
Included are tests with this file for a number of situations:
1. Test the heuristic alert must not alert because neither allmatch
is specified, nor --heuristic-scan-precedence.
2. Test the heuristic alert must alert because we don't use the sig
for the clam.exe file.
3. Test the heuristic alert must alert first because
--heuristic-scan-precedence is enabled.
We won't see the other alert because it's not allmatch mode.
4. Test we use --allmatch but we don't use --heuristic-scan-precedence.
That means the NDB sig should alert first, even though the heuristic
is encountered first.
Note the verify_output() uses STRICT_ORDER.
5. Test we use --allmatch AND we use --heuristic-scan-precedence.
That means the heuristic is encountered first and should be treated
equally, so it should alert first.
Note the verify_output() uses STRICT_ORDER.
There is a bug where because embedded file type recognition is during
the raw-scan (i.e. pattern match of a given layer), if it finds a
malware sig match in the raw-scan it will lose the metadata about
embedded file types found, and will fail to extract them.
In all-match mode, we want it to keep extracting and scanning despite
matches, so this is a bug.
This test verifies that it will match on a signature, and also find the
embedded zip, extract it, and match on the stuff within.
Add tests to verify an alert on the base file in addition to embedded
file type recognition (for ZIPSFX extraction) and then subsequent
detection of content extracted from the embedded zip.
Add tests:
- Test an import hash with wildcard size when all-match mode is
disabled.
- Test that bytecode rules will run after content match alerts in
all-match mode.
Print bytecode signature names in debug log when running, and not just
the bytecode signature id number.
Place bytecode signature-created temp files in the correct temp
sub-directory, and give it a name prefix that is more identifiable.
The current implementation sets a "next layer attributes" flag field
in the scan context. This may introduce bugs if accidentally not cleared
during error handling, causing that attribute to be applied to a
different layer than intended.
This commit resolves that by adding an attribute flag to the major
internal scan functions and removing the "next layer attributes" from
the scan context. This attributes flag shares the same flag fields as
the attributes flag in the new file inspection callback and the flags
are defined in `clamav.h`.
libclamav callbacks can be used to access embedded file content at each
layer of extraction during the course of a scan. The existing callbacks
only provide access to the file descriptor and a guess at the file type.
This patch adds a new callback for the purposes of file/archive
inspection that provides additional insight into the embedded file.
This includes:
- ancestors: an array of parent file names
- parent file size: the size of the direct parent layer
- file name: current layer's filename, if any.
- file buffer (pointer)
- file size: size of file buffer
- file type: just a guess at the current file's type
- file descriptor: may be -1 if the layer is in-memory only.
- layer attributes: a flag field. see LAYER_ATTRIBUTE_* defines in clamav.h
Two new example apps are added that are automatically built when
compiling under CMake:
- ex2 demonstrates the prescan callback.
- ex3 demonstrates the new file inspection callback.
The examples are now installed if enabled, so you can test them in the
Docker image, and so that they'll be colocated with the DLLs so you can
test them on Windows. The installed examples should also be able to find
the UnRAR library at run time, without having to set LD_LIBRARY_PATH.
This commit also sets the fmap->name in an fmap-scan using the basname
of the provided filename if the caller provided the filename and the
provided fmap does not have the name set.
Some basic testing is needed for the new cl_cvdunpack() API, so this
commit adds basic unit tests for that.
For reasons unknown, a number of cl_* API's have stubs for unit tests
that weren't filled out. The CVD load/verify ones in particular
required access to a signed CVD. We actually ship a very basic signed
CVD with the databases now, so I added tests for those while I was at it.
In the interest of using the public API's as much as possible for our
own applications (dog-fooding the API), this commit swaps sigtool and
freshclam `cli_cvdunpack()` calls to `cl_cvdunpack()`.
Add `cl_cvdunpack()` function to the public API.
This new API has an option to disable verification, but otherwise it
will attempt to verify that the CVD is correctly signed.
The CheckFmapFeatures module is failing because HAVE_SYS_STAT_H and
HAVE_SYS_TYPES_H are not defined within the check_c_source_compiles
environment.
This commit adds them in the same way as is done for CheckFDPassing
module.