scan_common must be passed either an fmap (map) or a file
descriptor (desc) corresponding to the file being scanned.
In the case where map is NULL, scan_common will create an
fmap in order to execute the BC_PRECLASS bytecode hook, but
this fmap wasn't being unmapped afterward.
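A minimal sketch of the intended lifecycle is shown below. The helper names
and signatures are simplified stand-ins, not the actual libclamav prototypes;
the point is that an fmap created locally from the descriptor must be
released once the hook has run.

```c
#include <stddef.h>

/* Stand-ins for libclamav's fmap type and helpers (illustrative only). */
typedef struct fmap_stub fmap_t;
fmap_t *fmap_from_desc(int desc);       /* hypothetical: build an fmap from a descriptor */
void    release_fmap(fmap_t *map);      /* stands in for funmap() */
int     run_preclass_hook(fmap_t *map); /* stands in for the BC_PRECLASS hook */

int scan_common_sketch(int desc, fmap_t *map)
{
    fmap_t *local_map = NULL;
    int ret;

    if (map == NULL) {
        /* No fmap supplied: create one from the descriptor. */
        local_map = fmap_from_desc(desc);
        if (local_map == NULL)
            return -1;
        map = local_map;
    }

    ret = run_preclass_hook(map);

    /* The fix: unmap the fmap only if it was created here. */
    if (local_map != NULL)
        release_fmap(local_map);

    return ret;
}
```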
Fixed a copy-paste bug where duplicated fmap names were assigned to the
parent instead of the dup/child fmap.
Fixed file descriptor initialization issue in the HTML normalizer.
Disables run-time warning messages emitted by libxml2 when parsing
HTML email content for the JSON metadata feature.
Fixed a compile-time warning caused by libjson-c API changes from int to
size_t.
At present many parsers create tmp subdirectories to store extracted
files. For parsers like the vba parser, this is required as the
directory is later scanned. For other parsers, these subdirectories are
probably not helpful now that we provide recursive sub-dirs when
--leave-temps is enabled. It's not quite as simple as removing the extra
subdirectories, however. Certain parsers, like autoit, don't create
sufficiently unique filenames and would cause file name collisions when
--leave-temps is not enabled.
The best thing to do would be to make sure each parser uses unique
filenames and doesn't rely on cli_magic_scan_dir() to scan extracted
content before removing the extra subdirectory. In the meantime, this
commit gives the extra subdirectories meaningful names to improve
readability.
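As a rough illustration of the "meaningful names" idea, a parser can prepend
its own prefix to the tmp subdirectory it creates; the prefix and helper
below are hypothetical, not the exact names used in libclamav.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Build a tmp subdirectory whose name carries a parser-specific prefix
 * (e.g. "vba-tmp") instead of an anonymous random name. */
static int make_named_tmp_subdir(const char *tmpdir, const char *prefix,
                                 char *out, size_t outlen)
{
    if (snprintf(out, outlen, "%s/%s", tmpdir, prefix) >= (int)outlen)
        return -1;           /* path would be truncated */
    return mkdir(out, 0700); /* 0 on success, -1 on failure */
}
```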
This commit also:
- Provides the 'bmp' prefix for extracted PE icons.
- Removes empty tmp subdirs when extracting rtf files, to eliminate
clutter.
- The PDF parser sometimes creates tmp files when decompressing streams
before it knows if there is actually any content to decompress. This
resulted in a large number of empty files. While it would be best to
avoid creating empty files in the first place, that's not quite as simple
as it sounds. This commit does the next best thing and deletes the
tmp files if nothing was actually extracted, even if --leave-temps is
enabled.
- Removes the "scantemp" prefix for unnamed fmaps scanned with
cli_magic_scan(). The 5-character hashes given to tmp files with
prefixes resulted in occasional file name collisions when extracting
certain file types with thousands of embedded files.
- The VBA and TAR parsers mistakenly used NAME_MAX instead of PATH_MAX,
resulting in truncated file paths and failed extraction when
--leave-temps is enabled and a lot of recursion is in play. This commit
switches them from NAME_MAX to PATH_MAX.
HTML normalization creates a tmp directory for storing rfc2397-style
links. The vast majority of HTML does not make use of rfc2397, and thus
an excess of empty tmp directories is generated. This commit alters the
behavior to create the rfc2397 directory only when it is required and
does not already exist.
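A sketch of the lazy-creation behavior, assuming a hypothetical helper (this
is not the actual htmlnorm code): the directory is created the first time a
data: ("rfc2397") link is seen, and only if it does not already exist.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>

/* Create the rfc2397 output directory on first use only. */
static int ensure_rfc2397_dir(const char *dirname, int *created)
{
    if (*created)
        return 0; /* already created for this scan */

    if (mkdir(dirname, 0700) != 0 && errno != EEXIST)
        return -1; /* genuine failure */

    *created = 1;
    return 0;
}
```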
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function now
includes the type of input being scanned, whether it's "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type rather than use
file type magic for type recognition.
| current name | new name |
| ------------------------- | --------------------------------- |
| magic_scandesc() | cli_magic_scan() |
| cli_magic_scandesc_type() | <delete> |
| cli_magic_scandesc() | cli_magic_scan_desc() |
| cli_base_scandesc() | cli_magic_scan_desc_type() |
| cli_partition_scandesc() | <delete> |
| cli_map_scandesc() | magic_scan_nested_fmap_type() |
| cli_map_scan() | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc() | cli_magic_scan_buff() |
| cli_scanbuff() | cli_scan_buff() |
| cli_scandesc() | cli_scan_desc() |
| cli_fmap_scandesc() | cli_scan_fmap() |
| cli_scanfile() | cli_magic_scan_file() |
| cli_scandir() | cli_magic_scan_dir() |
| cli_filetype2() | cli_determine_fmap_type() |
| cli_filetype() | cli_compare_ftm_file() |
| cli_partitiontype() | cli_compare_ftm_partition() |
| cli_scanraw() | scanraw() |
The metadata properties JSON structure isn't recording file types found
embedded within a file such as self-extracting (SFX) types and office
document types (DOCX, PPTX, etc). This presents a problem...
At present there's no way to know if the current file has ended and a
new file is found tacked on to the end of the first file. If there
were, we could simply check if the type found by the raw-scan exists
within the first file, or after.
If the type is found within the first file and it is an archive type,
then it's reasonable to conclude we're either observing zip headers (for
SFXZIP detections) or other files that are not compressed.
If the type ISN'T found within the first file, then we definitely have a
whole new file to parse, and we should do so with cli_magic_scan()
rather than only using these embedded type scanners.
At present we can't ignore SFXZIP detections even if the original file
type is a ZIP because we may have found two ZIPs appended together to
evade detection (a legitimate trick). As a consequence, we will
effectively parse every zip entry twice. The same issue applies to
types found within non-compressed archives.
This commit adds an EmbeddedObjects list to the metadata JSON object so
that the existence of these types is noted.
Additionally, this commit removes the two-part int64 cli_jsonint64()
implementation as json_object_new_int64() should be available
everywhere and the macro to detect such support was never set.
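As an illustration of the idea (the key names and structure here are
assumptions, not necessarily the exact metadata keys libclamav emits), an
embedded type can be appended to an "EmbeddedObjects" array using the
json-c API directly, including json_object_new_int64() for offsets:

```c
#include <json-c/json.h>
#include <stdint.h>
#include <stdio.h>

/* Record an embedded file type in an "EmbeddedObjects" array. */
static void record_embedded_type(json_object *metadata,
                                 const char *type, int64_t offset)
{
    json_object *list = NULL;

    if (!json_object_object_get_ex(metadata, "EmbeddedObjects", &list)) {
        list = json_object_new_array();
        json_object_object_add(metadata, "EmbeddedObjects", list);
    }

    json_object *entry = json_object_new_object();
    json_object_object_add(entry, "FileType", json_object_new_string(type));
    /* json_object_new_int64() is used directly now that the two-part
     * cli_jsonint64() fallback has been removed. */
    json_object_object_add(entry, "Offset", json_object_new_int64(offset));
    json_object_array_add(list, entry);
}

int main(void)
{
    json_object *metadata = json_object_new_object();
    record_embedded_type(metadata, "CL_TYPE_ZIPSFX", 4096);
    printf("%s\n",
           json_object_to_json_string_ext(metadata, JSON_C_TO_STRING_PRETTY));
    json_object_put(metadata);
    return 0;
}
```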
A way is needed to record scanned file names for two purposes:
1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.
2. File names are useful when interpreting tmp directory output when
using the --leave-temps option.
This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.
To store the names in fmaps, an optional name argument has been added to
any internal scan APIs that create fmaps, and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required. The zip and gpt parsers required some modification to record
file names. The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.
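The sketch below illustrates the idea of carrying an optional name in the
fmap header; the struct and helper are illustrative stand-ins, not the real
libclamav definitions.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for an fmap header with an optional name. */
typedef struct example_fmap {
    /* ... existing fmap fields ... */
    char *name; /* original file name, or NULL if none is known */
} example_fmap;

static int example_fmap_set_name(example_fmap *m, const char *name)
{
    if (name == NULL) {
        m->name = NULL; /* e.g. in-memory buffers with no file name */
        return 0;
    }
    m->name = strdup(name);
    return (m->name != NULL) ? 0 : -1;
}
```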
Also:
- Added recursive extraction to the tmp directory when the
--leave-temps option is enabled. When not enabled, the tmp directory
structure remains flat so as to reduce the likelihood of exceeding
MAX_PATH. The current tmp directory is stored in the scan context.
- Made the cli_scanfile() internal API non-static and added it to
scanners.h so it would be accessible outside of scanners.c in order to
remove code duplication within libmspack.c.
- Added function comments to scanners.h and matcher.h
- Converted the TDB-type macros and LSIG-type macros to enums for improved
type safety.
- Converted more return status variables from `int` to `cl_error_t` for
improved type safety, and corrected ooxml file typing functions so
they use `cli_file_t` exclusively rather than mixing types with
`cl_error_t`.
- Restructured the magic_scandesc() function to use goto's for error
handling and removed the early_ret_from_magicscan() macro and
magic_scandesc_cleanup() function. This makes the code easier to
read and made it easier to add the recursive tmp directory cleanup to
magic_scandesc().
- Corrected zip, egg, rar filename extraction issues.
- Removed use of extra sub-directory layer for zip, egg, and rar file
extraction. For Zip, this also involved changing the extracted
filenames to be randomly generated rather than using the "zip.###"
file name scheme.
This commit improves the layout of the tmp file output and the JSON
metadata output when using the --leave-temps and --gen-json options.
For all scans, each scan target will get a unique tmp sub-directory. If
using --leave-temps, that subdir will include the basename of the
original file to make it easier to identify. Additionally, when using
the --leave-temps option, all extracted objects will be written to
recursive subdirectories, with filename prefixes where available. When
not using the --leave-temps option, the
layout of the tmp sub-directory will remain flat, so as to alleviate the
possibility of exceeding PATH_MAX.
The JSON metadata generated by the --gen-json option is now generated
for all file types, not just a select few. The format is also
pretty-printed for readability and now includes filenames and file paths
when available.
Also:
- Added missing ALLMATCH check when determining if bytecode hooks should
be run.
- Added cl_engine_get_str API to windows libclamav symbol export file.
A missing return statement in png.c for a function that should return a
status code is resulting in undefined behavior.
In this patch I also added ".PNG" to one of the new heuristic signatures
to match the others.
Add missing size checks to validate size data parsed from a VBA file.
This fixes a possible buffer overflow read that was caught by oss-fuzz
before it made it into any release.
Fix for an out-of-bounds read in the PDF parser when initializing
aes crypto routines that may result in a crash.
Bug found by OSS-Fuzz.
Also added checks for the arc4 init routine to mitigate the risk of a
similar issue.
Fix for an out-of-bounds read in the ARJ parser accidentally introduced
when adding text normalization and bound checking when parsing filename
and comment fields from file headers.
On some systems, the VirusEvent feature doubles the amount of RAM being used
because fork() duplicates the loaded signature database in the new process.
This commit changes fork() to vfork() so that VirusEvent won't fail if these systems
don't have enough memory.
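A minimal sketch of the fork()-to-vfork() change, assuming a shell-based
command invocation (illustrative, not clamd's exact VirusEvent code): with
vfork(), the child borrows the parent's address space until it calls exec,
so the loaded signature database is not duplicated. waitpid() handling and
other details are omitted.

```c
#include <sys/types.h>
#include <unistd.h>

/* Run an external command without duplicating the parent's memory. */
static pid_t run_virusevent_sketch(const char *command, char *const envp[])
{
    pid_t pid = vfork();

    if (pid == 0) {
        /* Child: only exec*() or _exit() may be called before the exec. */
        execle("/bin/sh", "sh", "-c", command, (char *)NULL, envp);
        _exit(1); /* exec failed */
    }
    return pid; /* parent may waitpid() on this */
}
```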
Fixes a shell compatibility issue with string comparisons in the
clamonacc and libclamav-only M4 files:
test(1) uses `=` for string equality. (`==` is a bashism)
XLM is a macro language in Excel that was used before VBA (before
1996). It is still parsed and executed by modern Excel and is gaining
popularity with malware authors.
This patch adds rudimentary support for detecting and extracting
Excel 4.0 (XLM) macros.
The code is based on Didier Stevens' plugin_biff for oletools.py.
Fixes a bug in the PtrVerifier pass when using LLVM >= v3.5 for the
bytecode signature runtime.
LLVM 3.5 changed the meaning of "use" and introduced "user". This fix
swaps out "use" keywords for "user" so the code functions correctly when
using LLVM 3.5+.
Add the credit card-only DLP option "StructuredCCOnly" to the win32
sample clamd config.
Also update NEWS.md to credit John Schember and Alexander Sulfrian for
the DLP CC-only mode contribution.
An integer overflow causes an out-of-bounds read that results in
a crash. The crash may occur when using the optional
Data-Loss-Prevention (DLP) feature to block content that contains credit
card numbers. This commit fixes the issue by using a signed index variable.
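The sketch below illustrates the class of bug, not the actual dlp.c code:
with an unsigned loop index, the "went below zero" condition can never be
observed, so the index wraps and the read goes out of bounds; a signed
index terminates correctly.

```c
#include <stddef.h>
#include <sys/types.h> /* ssize_t */

/* Count trailing digits, scanning backwards through the buffer. */
static int count_trailing_digits(const char *buf, size_t len)
{
    int digits = 0;

    /* With "size_t i", the condition i >= 0 is always true and the loop
     * would wrap past zero; ssize_t makes termination well-defined. */
    for (ssize_t i = (ssize_t)len - 1; i >= 0; i--) {
        if (buf[i] < '0' || buf[i] > '9')
            break;
        digits++;
    }
    return digits;
}
```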
Add Data-Loss-Prevention option to detect credit cards only, excluding
debit and private label cards where possible.
You can select the credit card-only DLP mode for clamscan with the
`--structured-cc-mode` command-line option.
You can select the credit card-only DLP mode for clamd with the
`StructuredCCOnly` clamd.conf config option.
This patch also adds credit card matching for additional vendors:
- Mastercard 2016
- China Union Pay
- Discover 2009
Adds LZMA and BZip2 decompression routines to the bytecode API.
The ability to decompress LZMA and BZip2 streams is particularly
useful for bytecode signatures that extend clamav executable
unpacking capabilities.
Of note, the LZMA format is not well standardized. This API
expects the stream to start with the LZMA_Alone header.
Also fixed a bug in LZMA dictionary size setting.
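For reference, the LZMA_Alone (.lzma) header is 13 bytes: one properties
byte, a 4-byte little-endian dictionary size, and an 8-byte little-endian
uncompressed size (all 0xFF bytes meaning "unknown"). The parsing sketch
below is illustrative, not the bytecode API's implementation.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t  properties;        /* (pb * 5 + lp) * 9 + lc, must be < 225 */
    uint32_t dict_size;
    uint64_t uncompressed_size; /* UINT64_MAX == unknown / stream-terminated */
} lzma_alone_header;

static int parse_lzma_alone_header(const uint8_t *buf, size_t len,
                                   lzma_alone_header *hdr)
{
    if (len < 13 || buf[0] >= 225)
        return -1;

    hdr->properties = buf[0];
    hdr->dict_size = (uint32_t)buf[1] | ((uint32_t)buf[2] << 8) |
                     ((uint32_t)buf[3] << 16) | ((uint32_t)buf[4] << 24);
    hdr->uncompressed_size = 0;
    for (int i = 0; i < 8; i++)
        hdr->uncompressed_size |= (uint64_t)buf[5 + i] << (8 * i);
    return 0;
}
```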
- Existing VBA extraction code uses undocumented cache structures.
This code uses the documented way of accessing VBA projects.
- Adds additional detail to the dumped information:
Project name, Project doc string, ...
All VBA projects are dumped into a single file.
- Malware authors are currently evading detection by spreading
malicious code over several projects. It is hard to write
signatures if only part of the malicious code is visible.
Fixes an fmap leak in the bytecode switch_input() API. The
switch_input() API provides a way to read from an extracted file instead
of reading from the current file. The issue is that the current
implementation fails to free the fmap created to read from the extracted
file on cleanup or when switching back to the original fmap. In
addition, it fails to use the cli_bytecode_context_setfile() function
to restore the file_size in the context for the current fmap.
Fixes a couple of fmap leaks in the unit tests.
Specifically, this fixes the use of cli_map_scandesc().
The cli_map_scandesc() function used to override the current fmap
settings with a new size and offset, performing a scan of the embedded
content. This broke the ability to iterate backwards through the fmap
recursion array when an alert occurs to check each map's hash for
whitelist matches.
In order to fix this issue, it needed to be possible to duplicate an
fmap header for the scan of the embedded file without duplicating the
actual map/data. This wasn't feasible with the posix fmap handle
implementation where the fmap header, bitmap array, and memory map
were all contiguous. This commit makes it possible by extracting the
fmap header and bitmap array from the mmap region and instead using
pointers for both the bitmap array and the mmap/data. As a result, the
posix fmap handle implementation ended up working more like the existing
Windows implementation.
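The struct layouts below are illustrative only, not the real fmap
definitions; they show why separating the header from the bitmap and data
makes it possible to duplicate just the header for a nested scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Before: header, paged-bitmap, and mapped data shared one contiguous
 * mmap'd allocation, so a header could not be duplicated without
 * duplicating the data. */
struct fmap_contiguous_sketch {
    size_t        len;
    unsigned char trailer[]; /* bitmap followed by the mapped bytes */
};

/* After: the header holds pointers, so a second header can reference the
 * same bitmap and data for a nested scan at a different offset/length. */
struct fmap_split_sketch {
    size_t    len;
    size_t    nested_offset;
    uint32_t *bitmap; /* separate allocation */
    void     *data;   /* the mmap'd (or malloc'd) bytes */
};
```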
In addition to the above changes, this commit fixes:
- fmap recursion tracking for cli_scandesc()
- a recursion tracking issue in cli_scanembpe() error handling
Signature alerts on content extracted into a new fmap such as normalized
HTML resulted in checking FP signatures against the fmap's hash value
that was initialized to all zeroes, and never computed.
This patch enables FP signatures of normalized HTML files or other
content that is extracted to a new fmap to work. This patch doesn't
resolve the issue that most people will write FP signatures targeting
the original file, not the normalized file, and thus won't really see a
benefit from this bug-fix.
Additional work is needed to traverse the fmap recursion lists and
FP-check all parent fmaps when an alert occurs. In addition, the HTML
normalization method of temporarily overriding the ctx->fmap instead of
increasing the recursion depth and doing ctx->fmap++/-- will need to be
corrected for fmap reverse recursion traversal to work.
If the clamd.conf enables the LocalSocket option and sets the unix
socket file in a directory that does not exist, clamd creates the
missing directory but with invalid 000 permissions bits, causing socket
creation to fail.
This patch sets the umask temporarily to allow creation of the
directory with drwxrw-rw- (766) permissions.
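A minimal sketch of the fix (not the exact clamd code): clear the umask
around the mkdir() call so the requested mode isn't masked down to 000,
then restore it.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* mkdir() without the current umask masking the requested mode. */
static int mkdir_with_mode(const char *path, mode_t mode)
{
    mode_t old_umask = umask(0);
    int    ret       = mkdir(path, mode);

    umask(old_umask);
    return ret;
}
```

For example, mkdir_with_mode(sockdir, 0766) would yield the drwxrw-rw-
permissions mentioned above.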
ClamAV doesn't handle the compressed attribute for hfs+ file catalog
entries.
This patch adds support for FLATE compressed files.
To accomplish this, we had to find and parse the root/header node
of the attributes file, if one exists. Then, parse the attribute map
to check if the compressed attribute exists. If compressed, parse the
compression header to determine how to decompress it. Support is
included for both inline compressed files as well as compressed
resource forks.
Inflating inline compressed files is straightforward.
Inflating a compressed resource fork requires more work:
- Find location and size of the resource.
- Parse the resource block table.
- Inflate and write each block to a temporary file to be scanned.
Additional changes needed for this work:
- Make hfsplus_fetch_node work for both catalog and attributes.
- Figure out node size.
- Handle nodes that span several blocks.
- If the attributes are missing, or invalid, extraction continues.
This behavior is to support malformed files which would also
extract on macOS and perhaps other systems.
This patch also:
- Adds filename extraction for the hfs+ parser.
- Skips embedded file type detection for GPT image file types. This
prevents double extraction of embedded files, or misclassification
of GPT images as MHTML, for example. This resolves bb12335.
The PDF parser currently prints verbose error messages when attempting
to shrink a buffer down to actual data length after decoding if it turns
out that the decoded stream was empty (0 bytes). Aside from the
verbose error messages, there's no real behavioral issue.
This commit fixes the issue by checking if any bytes were decoded before
attempting to shrink the buffer.
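A sketch of the check (not the actual pdf.c code): only attempt to shrink
the decode buffer when at least one byte was decoded, so the zero-length
shrink path never runs.

```c
#include <stdlib.h>

/* Shrink the decode buffer to the actual decoded length, if any. */
static unsigned char *shrink_decoded(unsigned char *buf, size_t decoded_len)
{
    unsigned char *smaller;

    if (decoded_len == 0)
        return buf; /* nothing decoded; caller discards the empty buffer */

    smaller = realloc(buf, decoded_len);
    return smaller ? smaller : buf;
}
```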
Scans performed in the RTF SCAN_CLEANUP macro by the state.cb_end()
callback function never save the return value and thus fail to record a
detection. This patch sets `ret` so the detection isn't lost.
Fixed a leak where host and port were not being properly cleaned up.
Cleaned up error handling for the make_connection_real function.
Added various null param checks.
A problem existed in which specifying --enable-libclamav-only would fail
if curl was not installed on the system.
This fix puts a check in place to ensure the curl check code is not run
if the option is turned on.
In the future, if curl becomes required in libclamav, this check will
need to be removed.
The newer freshclam uses libcurl for downloads and downloads the
updates via https. There are systems which don't have a "default CA
store" but instead the administrator maintains a CA-bundle of certs
they trust.
This patch allows users to specify their own CA cert path by
setting the environment variable CURL_CA_BUNDLE to the path of their
choice.
Patch courtesy of Sebastian A. Siewior
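Illustrative of the behavior described (not the actual freshclam code):
read CURL_CA_BUNDLE from the environment and, if set, point libcurl at
that bundle instead of the default CA store.

```c
#include <curl/curl.h>
#include <stdlib.h>

/* Honor the CURL_CA_BUNDLE environment variable, if present. */
static void apply_ca_bundle_env(CURL *curl)
{
    const char *bundle = getenv("CURL_CA_BUNDLE");

    if (bundle != NULL && *bundle != '\0')
        curl_easy_setopt(curl, CURLOPT_CAINFO, bundle);
}
```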