The C-Rust FFI code is needlessly complex. Now that we are calling into
magic_scan from Rust, we can simply hand off the <style> block contents
to Rust code to handle extraction and scanning.
Primarily this commit fixes an issue with the size of the parameters
passed to cli_checklimits(). The parameters were "unsigned long", which
varies in size depending on platform.
I've switched them to uint64_t / u64.
While working on this, I observed some concerning warnigns on Windows,
and some less serious ones, primarily regarding inconsistencies with
`const` parameters.
Finally, in `scanmem.c`, there is a warning regarding use of `wchar_t *`
with `GetModuleFileNameEx()` instead of `GetModuleFileNameExW()`.
This made me realize this code assumes we're not defining `UNICODE`,
which would have such macros use the 'A' variant.
I have fixed it the best I can, although I'm still a little
uncomfortable with some of this code that uses `char` or `wchar_t`
instead of TCHAR.
I also remove the `if (GetModuleFileNameEx) {` conditional, because this
macro/function will always be defined. The original code was checking a
function pointer, and so this was a bug when integrating into ClamAV.
Regarding the changes to `rijndael.c`, I found that this module assumes
`unsigned long` == 32bits. It does not.
I have corrected it to use `uint32_t`.
The fmap structure has some stuff that differs in size in memory between
Linux and Windows, and between 32bit and 64bit architectures.
Notably, `time_t` appears to be defined by the Rust bindgen module as
`ulong` which may be either 8 bytes or 4 bytes, depending architecture
(thanks, C). To resolve this, we'll store time as a uint64_t instead.
The other problem in the fmap structure is the windows file and map
handles should always be exist, and may only be used in Windows, for
consistency in sizing of the structure.
In order to generate Rust bindings for C code, Rust's bindgen module
needs to know where to find all headers included by the API.
If they're all inside the project or inside the standard include path
(e.g. /usr/include and /usr/local/include) that's fine. But for third-
party C library headers from outside the standard include path, that's
a problem.
We didn't really notice this problem when generating on Unix systems
until we switched to use OpenSSL 3.1 and tested on systems that have
the OpenSSL 1.1.1 dev package installed.
The ability to find headers outside the project path is also needed to
generate bindings on Windows, if desired.
This commit solves the problem by passing include directories for the
ClamAV::libclamav CMake build target to the Rust build via the
CARGO_INCLUDE_DIRECTORIES environment variable.
Then, in the `libclamav_rust/build.rs` script, where we run bindgen,
we split that `;` separated string into invididual paths and add each
to the bindgen builder.
Includes rudimentary support for getting slices from FMap's and for
interacting with libclamav's context structure.
For now will use a Cisco-Talos org fork of the onenote_parser
until the feature to read open a onenote section from a slice (instead
of from a filepath) is added to the upstream.
HTML files with <style> blocks containing non-utf8 sequences are causing
warnings when processing them to extract base64 encoded images.
To resolve this, we can use the to_string_lossy() method that may
allocate and sanitize a copy of the content if the non-utf8 characters
are encountered.
Resolves: https://github.com/Cisco-Talos/clamav/issues/1082
I'm unsure why, but building with cmke -D MAINTAINER_MODE=ON is failing
right now. Updating to a newer version of bindgen appears to resolve the
issue.
I was able to update it by changing the version specified in
libclamav_rust/Cargo.toml, and then running `cargo update -p bindgen`
Not that I expect anyone else to be running maintainer-mode, but I did
also confirm using `cargo-msrv` that the minimum supported version of
rust did not change as a result of this commit.
When processing UTF-8 HTML code, the image extraction logic may panic if
the string contains a multi-byte grapheme that includes a '(', ')',
whitespace, or one of the other characters used to split the text when
searching for the base64 image content.
The panic is because the `split_at()` method will panic if you try to
split in the middle of a unicode grapheme.
This commit fixes the issue by processing the HTML string one grapheme
at a time instead of one character (byte) at a time.
The `grapheme_indices()` method is used to get the correct position of
the start of each grapheme for splitting the string.
The CLOSE command is failing to create a file when appending changes if
the file does not already exist. This prevents adding new files to a
database with a CDIFF and caused failures applying the test-3.cdiff file
in the freshclam feature tests.
Also improved the error message to show which command, specifically, is
failing (not just the line number).
Any cdiff or script using the UNLINK operation will fail to delete the
file claiming "No DB open for action UNLINK".
The UNLINK operation appears to be trying to delete a currently open
database, when in fact it should ensure no database is open before
deleting the local file given by the single "db_name" parameter.
I found that the `url(data:` type does not matter to a browser.
In addition, whitespace may be placed in a few locations and the browser
will ignore it.
This commit accounts for this, and updates the test accordingly.
This commit adds a feature to find, decode, and scan each image found
within HTML <style> tags where the image data is embedded in `url()`
function parameters a base64 blob
In C in the html normalization process we extract style tag contents
to new buffer for processing. We call into a new feature in Rust code to
find and decode each image (if there are multiple).
Once extracted, the images are scanned as contained files of unknown
type, and file type identifcation will determine the actual type.
If the image decoders (i.e. jpeg, tiff, png, etc.) fail to load an image
due to a panic, the application will crash.
This commit attempts to catch those panics and handle the error.
When potentially unwanted indicators are found, they are not reported
right away, except in heuristic-precedence mode. At the end of the
scan, we then report all of the heuristic/PUA alerts. In my initial
overhaul of the allmatch mode, I introduced an issue where it is never
reporting more than 1 heuristic/PUA alert.
This commit fixes that by reporting all potentially-unwanted indicators
at the end. Note that instead of using `cli_virus_found_cb()` which
would report the top layer file descriptor as well in the callback,
I'm reporting `-1` for the file descriptor instead, because we don't
know when the alert was added. If it was at a deeper layer in the file,
we would not longer have access to it at the end of the scan.
Also removed excessive debug statements '... infected with ...' that use
`cli_get_last_virus()` because it doesn't really reflect all of the
matches in allmatch mode, because the match would've been reported
previously (so it is excessive), and because I don't like the word
'infected'. :)
Rework the append_virus mechanism to store evidence (strong indicators,
pua indicators, and eventually weak indicators) in vectors. When
appending a "virus", we will return CLEAN when in allmatch-mode, and
simply add the indicator to the appropriate vector.
Later we can check if there were any alerts to return a vector by
summing the lengths of the strong and pua indicator vectors.
This does away with storing the latest "virname" in the scan context.
Instead, we can query for the last indicator in the evidence, giving
priority to strong indicators.
When heuristic-precendence is enabled, add PUA as Strong instead of
as PotentiallyUnwanted. This way, they will be treated equally and
reported in order in allmatch mode.
Also document reason for disabling cache with metadata JSON enabled
The `clamscan_test.py` file is getting way too long.
Created a new `unit_tests/clamscan` directory and separated all tests
into separate test files.
I also fixed an issue with the clamscan `ign2` test:
The `ign2` test wasn't written correctly and was actually testing
detection despite using the `-d` parameter to try to ignore a signature.
There is a minor bug where `ign2` files may be loaded after other files
when using the `-d` option. It is only guaranteed to be loaded first if
you load all the sigs from the same directory. I fixed the test.
In the future, we should make it so all database files are sorted in a
list before load time regardless of where they're sourced from.
The current implementation sets a "next layer attributes" flag field
in the scan context. This may introduce bugs if accidentally not cleared
during error handling, causing that attribute to be applied to a
different layer than intended.
This commit resolves that by adding an attribute flag to the major
internal scan functions and removing the "next layer attributes" from
the scan context. This attributes flag shares the same flag fields as
the attributes flag in the new file inspection callback and the flags
are defined in `clamav.h`.
libclamav callbacks can be used to access embedded file content at each
layer of extraction during the course of a scan. The existing callbacks
only provide access to the file descriptor and a guess at the file type.
This patch adds a new callback for the purposes of file/archive
inspection that provides additional insight into the embedded file.
This includes:
- ancestors: an array of parent file names
- parent file size: the size of the direct parent layer
- file name: current layer's filename, if any.
- file buffer (pointer)
- file size: size of file buffer
- file type: just a guess at the current file's type
- file descriptor: may be -1 if the layer is in-memory only.
- layer attributes: a flag field. see LAYER_ATTRIBUTE_* defines in clamav.h
Two new example apps are added that are automatically built when
compiling under CMake:
- ex2 demonstrates the prescan callback.
- ex3 demonstrates the new file inspection callback.
The examples are now installed if enabled, so you can test them in the
Docker image, and so that they'll be colocated with the DLLs so you can
test them on Windows. The installed examples should also be able to find
the UnRAR library at run time, without having to set LD_LIBRARY_PATH.
This commit also sets the fmap->name in an fmap-scan using the basname
of the provided filename if the caller provided the filename and the
provided fmap does not have the name set.
Right now, each Rust-based target added in CMake is being built in its
own directory under the build path. This causes Rust to build each
module from scratch, meaning any dependencies they have in common are
built twice.
The solution in this commit is to specify the top level build directory
as the target directory for every Rust build and test.
Note this also changes where the `clamav_rust.h` file is generated. It
is now also placed in the top-level build directory, instead of under the
`build/libclamav_rust` directory. That's a bit of a side-effect, and
could be rectified if needed, but it appears to have no ill-effects.
It's the same location that we drop the clamav-types.h file, so I think
it's fine, for now. Note that `clamav_rust.h` is not a public header,
it's just so libclamav functions can call into libclamav_rust functions.
Precompile the test executable during the main build, after libclamav
has built. This will make it so the compile time does not count against
the test time.
It also, unfortunately, make the main build take way longer.
Removed 3 duplicate variables from the test ENVIRONMENT variable.
* Windows: Fix utf8 filename support issue
The function used to verify the real path for the file being
scanned fails on some utf8 filenames, like:
file_with_ümlaut.eicar.com
The scan continues despite the failure, but the error reported from
clamd and warnings from clamscan are confusing because it looks like
the scan failed even if a verdict shows up later.
The fix here is to convert the path from utf8 to utf16 and then use
the CreateFileW() API to grab the file handle for the real path check.
Resolves: https://github.com/Cisco-Talos/clamav/issues/418
Resolves: CLAM-1782
* Windows: Fix utf8 libclamav logging issues
Print to the log using rust eprint() instead of fputs()
Rust's eprint() function supports full utf8, while fputs() gets
confused and prints stuff like:
file_with_├╝mlaut.eicar.com
instead of:
file_with_ümlaut.eicar.com
We must pass the LLVM library dependencies to the libclamav_rust
build.rs script so it links the libclamav_rust unit test executable with
LLVM.
Also:
- We can remove the libtinfo dependency that was hardcoded for the LLVM
3.6 support, and must remove it for the build to work on Alpine, macOS.
- Also, increased the libcheck default timeout from 60s to 300s after
experiencing a failure while testing this.
- Also made one of the valgrind suppressions more generic to account for
inline optimization differences observed in testing this.