HTML normalization creates a tmp directory for storing rfc2397-style
links. The vast majority of HTML does not make use of rfc2397, so an
excess of empty tmp directories is generated. This commit alters the
behavior to create the rfc2397 directory only when required, if it does
not already exist.
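A minimal sketch of the lazy-creation approach (the helper name and
signature are illustrative, not ClamAV's actual code):

```c
#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>

/* Build the rfc2397 subdirectory path and create it only on first use. */
static int ensure_rfc2397_dir(const char *tmpdir, char *out, size_t outlen)
{
    snprintf(out, outlen, "%s/rfc2397", tmpdir);
    if (mkdir(out, 0700) != 0 && errno != EEXIST)
        return -1; /* failed for a reason other than "already exists" */
    return 0;
}
```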
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function name now
includes the type of input being scanned, whether it's "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type, rather than using
file type magic for type recognition.
| current name | new name |
| ------------------------- | --------------------------------- |
| magic_scandesc() | cli_magic_scan() |
| cli_magic_scandesc_type() | <delete> |
| cli_magic_scandesc() | cli_magic_scan_desc() |
| cli_base_scandesc() | cli_magic_scan_desc_type() |
| cli_partition_scandesc() | <delete> |
| cli_map_scandesc() | magic_scan_nested_fmap_type() |
| cli_map_scan() | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc() | cli_magic_scan_buff() |
| cli_scanbuff() | cli_scan_buff() |
| cli_scandesc() | cli_scan_desc() |
| cli_fmap_scandesc() | cli_scan_fmap() |
| cli_scanfile() | cli_magic_scan_file() |
| cli_scandir() | cli_magic_scan_dir() |
| cli_filetype2() | cli_determine_fmap_type() |
| cli_filetype() | cli_compare_ftm_file() |
| cli_partitiontype() | cli_compare_ftm_partition() |
| cli_scanraw() | scanraw() |
The metadata properties JSON structure isn't recording file types found
embedded within a file, such as self-extracting (SFX) types and office
document types (DOCX, PPTX, etc.). This presents a problem...
At present there's no way to know if the current file has ended and a
new file is found tacked on to the end of the first file. If there
were, we could simply check whether the type found by the raw scan
exists within the first file, or after it.
If within the first file, and the type is an archive, then it's
reasonable to conclude we're either observing zip headers (for SFXZIP
detections) or other files that are not compressed.
If the type ISN'T found within the first file, then we definitely have
a whole new file to parse, and we should do so with cli_magic_scan()
rather than only using these embedded-type scanners.
At present we can't ignore SFXZIP detections even if the original file
type is a ZIP because we may have found two ZIPs appended together to
evade detection (a legitimate trick). As a consequence, we will
effectively parse every zip entry twice. The same issue applies to
types found within non-compressed archives.
This commit adds an EmbeddedObjects list to the metadata JSON object so
that the existence of these types is noted.
Additionally, this commit removes the two-part int64 cli_jsonint64()
implementation, as json_object_new_int64() should be available
everywhere and the macro to detect such support was never set.
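With that assumption, the wrapper reduces to a direct call into json-c.
A minimal sketch (the function name and return convention here are
simplified, not ClamAV's exact API):

```c
#include <json-c/json.h>
#include <stdint.h>

/* Attach a 64-bit integer to a JSON object in one call, rather than
 * splitting the value into two 32-bit halves. */
int jsonint64_sketch(struct json_object *obj, const char *key, int64_t val)
{
    struct json_object *jval = json_object_new_int64(val);
    if (!jval)
        return -1;
    json_object_object_add(obj, key, jval);
    return 0;
}
```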
A way is needed to record scanned file names for two purposes:
1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.
2. File names are useful when interpreting tmp directory output when
using the --leave-temps option.
This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.
To store the names in fmaps, an optional name argument has been added to
all internal scan APIs that create fmaps, and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required. The zip and gpt parsers required some modification to record
file names. The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.
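Illustrative only: the shape of the change, with simplified names that
are not ClamAV's exact definitions.

```c
#include <stdlib.h>
#include <string.h>

typedef struct example_fmap {
    /* ... existing fmap members ... */
    char *name; /* original file name, or NULL if none was available */
} example_fmap;

/* Callers pass NULL when no file name is known or required. */
static void fmap_set_name(example_fmap *map, const char *name)
{
    map->name = name ? strdup(name) : NULL;
}
```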
Also:
- Added recursive extraction to the tmp directory when the
--leave-temps option is enabled. When not enabled, the tmp directory
structure remains flat so as to reduce the likelihood of exceeding
MAX_PATH. The current tmp directory is stored in the scan context.
- Made the cli_scanfile() internal API non-static and added it to
scanners.h so it would be accessible outside of scanners.c in order to
remove code duplication within libmspack.c.
- Added function comments to scanners.h and matcher.h
- Converted the TDB-type macros and LSIG-type macros to enums for improved
type safety.
- Converted more return status variables from `int` to `cl_error_t` for
improved type safety, and corrected ooxml file typing functions so
they use `cli_file_t` exclusively rather than mixing types with
`cl_error_t`.
- Restructured the magic_scandesc() function to use gotos for error
handling and removed the early_ret_from_magicscan() macro and the
magic_scandesc_cleanup() function. This makes the code easier to
read and makes it easier to add the recursive tmp directory cleanup to
magic_scandesc().
- Corrected zip, egg, rar filename extraction issues.
- Removed use of extra sub-directory layer for zip, egg, and rar file
extraction. For Zip, this also involved changing the extracted
filenames to be randomly generated rather than using the "zip.###"
file name scheme.
This commit improves the layout of the tmp file output and the JSON
metadata output when using the --leave-temps and --gen-json options.
For all scans, each scan target will get a unique tmp sub-directory. If
using --leave-temps, that subdir will include the basename of the
original file to make it easier to identify. Additionally, when using
the --leave-temps option, all extracted objects will be written into
recursive subdirectories, with filename prefixes where available. When
not using the --leave-temps option, the layout of the tmp sub-directory
will remain flat, so as to reduce the possibility of exceeding PATH_MAX.
The JSON metadata generated by the --gen-json option is now generated
for all file types, not just a select few. The format is also
pretty-printed for readability and now includes filenames and file paths
when available.
Also:
- Added missing ALLMATCH check when determining if bytecode hooks should
be run.
- Added cl_engine_get_str API to windows libclamav symbol export file.
A missing return statement in png.c, in a function that should return a
status code, results in undefined behavior.
In this patch, I also added ".PNG" to one of the new heuristic signatures
to match the others.
Add missing size checks to validate size data parsed from a VBA file.
This fixes a possible buffer overflow read that was caught by oss-fuzz
before it made it into any release.
Fix for an out-of-bounds read in the PDF parser when initializing
aes crypto routines that may result in a crash.
Bug found by OSS-Fuzz.
Also added checks for the arc4 init routine to mitigate the risk of a
similar issue.
Fix for an out-of-bounds read in the ARJ parser, accidentally introduced
when adding text normalization and bounds checking for the filename and
comment fields parsed from file headers.
XLM is a macro language in Excel that was used before VBA (before
1996). It is still parsed and executed by modern Excel and is gaining
popularity with malware authors.
This patch adds rudimentary support for detecting and extracting
Excel 4.0 (XLM) macros.
The code is based on Didier Stevens' plugin_biff for oledump.py.
Fixes a bug in the PtrVerifier pass when using LLVM >= v3.5 for the
bytecode signature runtime.
LLVM 3.5 changed the meaning of "use" and introduced "user". This fix
swaps out "use" keywords for "user" so the code functions correctly when
using LLVM 3.5+.
An integer overflow causes an out-of-bounds read that results in
a crash. The crash may occur when using the optional
Data-Loss-Prevention (DLP) feature to block content that contains credit
card numbers. This commit fixes the issue by using a signed index variable.
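A minimal sketch of the failure mode (not the actual DLP code): when
scanning backwards over a candidate digit run with an unsigned index,
`i--` past zero wraps around instead of ending the loop, producing an
out-of-bounds read. A signed index terminates as intended.

```c
#include <stddef.h>
#include <ctype.h>

/* Find the first digit of the run ending at buf[pos]. */
static size_t run_start(const unsigned char *buf, size_t pos)
{
    ptrdiff_t i; /* signed: the i >= 0 test below actually works */
    for (i = (ptrdiff_t)pos; i >= 0 && isdigit(buf[i]); i--)
        ;
    return (size_t)(i + 1);
}
```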
Add Data-Loss-Prevention option to detect credit cards only, excluding
debit and private label cards where possible.
You can select the credit card-only DLP mode for clamscan with the
`--structured-cc-mode` command-line option.
You can select the credit card-only DLP mode for clamd with the
`StructuredCCOnly` clamd.conf config option.
This patch also adds credit card matching for additional vendors:
- Mastercard 2016
- China Union Pay
- Discover 2009
Adds LZMA and BZip2 decompression routines to the bytecode API.
The ability to decompress LZMA and BZip2 streams is particularly
useful for bytecode signatures that extend clamav executable
unpacking capabilities.
Of note, the LZMA format is not well standardized. This API
expects the stream to start with the LZMA_Alone header.
Also fixed a bug in LZMA dictionary size setting.
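For reference, a sketch of the 13-byte LZMA_Alone header the API expects
at the start of the stream (field layout per the LZMA SDK; the struct
and parsing code here are illustrative, not ClamAV's):

```c
#include <stdint.h>
#include <string.h>

struct lzma_alone_header {
    uint8_t  properties;        /* lc, lp, pb packed as (pb * 5 + lp) * 9 + lc */
    uint32_t dict_size;         /* little-endian dictionary size */
    uint64_t uncompressed_size; /* little-endian; UINT64_MAX = unknown */
};

static int parse_lzma_alone(const uint8_t *buf, size_t len,
                            struct lzma_alone_header *hdr)
{
    if (len < 13)
        return -1;
    hdr->properties = buf[0];
    memcpy(&hdr->dict_size, buf + 1, 4);         /* assumes little-endian host */
    memcpy(&hdr->uncompressed_size, buf + 5, 8); /* assumes little-endian host */
    return 0;
}
```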
- Existing VBA extraction code uses undocumented cache structures.
This code uses the documented way of accessing VBA projects.
- Adds additional detail to the dumped information:
Project name, Project doc string, ...
All VBA projects are dumped into a single file.
- Malware authors are currently evading detection by spreading
malicious code over several projects. It is hard to write
signatures if only part of the malicious code is visible.
Fixes an fmap leak in the bytecode switch_input() API. The
switch_input() API provides a way to read from an extracted file instead
of reading from the current file. The issue is that the current
implementation fails to free the fmap created to read from the extracted
file on cleanup or when switching back to the original fmap. In
addition, it fails to use the cli_bytecode_context_setfile() function
to restore the file_size in the context for the current fmap.
Fixes a couple of fmap leaks in the unit tests. Specifically, this fixes
the use of cli_map_scandesc().
The cli_map_scandesc() function used to override the current fmap
settings with a new size and offset, performing a scan of the embedded
content. This broke the ability to iterate backwards through the fmap
recursion array when an alert occurs to check each map's hash for
whitelist matches.
In order to fix this issue, it needed to be possible to duplicate an
fmap header for the scan of the embedded file without duplicating the
actual map/data. This wasn't feasible with the POSIX fmap handle
implementation, where the fmap header, bitmap array, and memory map
were all contiguous. This commit makes it possible by extracting the
fmap header and bitmap array from the mmap region, using instead a
pointer for both the bitmap array and the mmap/data. As a result, the
POSIX fmap handle implementation ended up working more similarly to the
existing Windows implementation.
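An illustrative sketch of the layout change, with simplified names that
are not ClamAV's exact definitions: because the header now holds
pointers rather than living in one contiguous mmap'd region, a nested
scan can duplicate just the header (with a new offset and size) while
sharing the underlying map.

```c
#include <stdlib.h>
#include <stdint.h>

typedef struct example_fmap {
    uint64_t offset;  /* view offset into the shared data */
    size_t   len;     /* view length */
    uint64_t *bitmap; /* page-tracking bitmap: a pointer, no longer inline */
    void     *data;   /* mmap'd region: shared, never duplicated */
} example_fmap;

/* Duplicate only the header; the bitmap and data stay shared. */
static example_fmap *fmap_duplicate(const example_fmap *src,
                                    uint64_t offset, size_t len)
{
    example_fmap *dup = malloc(sizeof(*dup));
    if (!dup)
        return NULL;
    *dup = *src;
    dup->offset = src->offset + offset;
    dup->len    = len;
    return dup;
}
```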
In addition to the above changes, this commit fixes:
- fmap recursion tracking for cli_scandesc()
- a recursion tracking issue in cli_scanembpe() error handling
Signature alerts on content extracted into a new fmap such as normalized
HTML resulted in checking FP signatures against the fmap's hash value
that was initialized to all zeroes, and never computed.
This patch will enable FP signatures of normalized HTML files or
other content that is extracted to a new fmap to work. This patch
doesn't resolve the issue that most users will write FP signatures
targeting the original file, not the normalized file, and thus won't
really see benefit from this bug-fix.
Additional work is needed to traverse the fmap recursion lists and
FP-check all parent fmaps when an alert occurs. In addition, the HTML
normalization method of temporarily overriding the ctx->fmap instead of
increasing the recursion depth and doing ctx->fmap++/-- will need to be
corrected for fmap reverse recursion traversal to work.
ClamAV doesn't handle the compressed attribute for HFS+ file catalog
entries.
This patch adds support for FLATE compressed files.
To accomplish this, we had to find and parse the root/header node
of the attributes file, if one exists. Then, parse the attribute map
to check if the compressed attribute exists. If compressed, parse the
compression header to determine how to decompress it. Support is
included for both inline compressed files as well as compressed
resource forks.
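For reference, a sketch of the compression header carried in the
"com.apple.decmpfs" attribute (fields are little-endian; the struct name
is illustrative and on-disk packing directives are omitted). The
compression_type field distinguishes inline data stored in the attribute
itself from data stored in a compressed resource fork.

```c
#include <stdint.h>

struct decmpfs_header {
    uint32_t magic;             /* 'cmpf' */
    uint32_t compression_type;  /* e.g. 3 = zlib inline, 4 = zlib resource fork */
    uint64_t uncompressed_size; /* size of the file once inflated */
};
```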
Inflating inline compressed files is straightforward.
Inflating a compressed resource fork requires more work:
- Find location and size of the resource.
- Parse the resource block table.
- Inflate and write each block to a temporary file to be scanned.
Additional changes needed for this work:
- Make hfsplus_fetch_node work for both catalog and attributes.
- Figure out node size.
- Handle nodes that span several blocks.
- If the attributes are missing, or invalid, extraction continues.
This behavior is to support malformed files which would also
extract on macOS and perhaps other systems.
This patch also:
- Adds filename extraction for the hfs+ parser.
- Skips embedded file type detection for GPT image file types. This
prevents double extraction of embedded files, or misclassification
of GPT images as MHTML, for example. This resolves bb12335.
The PDF parser currently prints verbose error messages when attempting
to shrink a buffer down to the actual data length after decoding, if it
turns out that the decoded stream was empty (0 bytes). Aside from the
verbose error messages, there's no real behavioral issue.
This commit fixes the issue by checking if any bytes were decoded before
attempting to shrink the buffer.
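A minimal sketch of the check (names hypothetical, not the actual PDF
parser code):

```c
#include <stdlib.h>
#include <stdint.h>

/* Shrink the decode buffer to the decoded length, but only when the
 * decoder actually produced bytes; an empty stream is left alone. */
static uint8_t *shrink_decoded(uint8_t *buf, size_t decoded_len)
{
    uint8_t *tmp;
    if (decoded_len == 0)
        return buf;
    tmp = realloc(buf, decoded_len);
    return tmp ? tmp : buf; /* on failure, keep the original buffer */
}
```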
Scans performed in the RTC SCAN_CLEANUP macro by the state.cb_end()
callback function never save the return value and thus fail to record a
detection. This patch sets `ret` so the detection isn't lost.
These opcodes specify a function or keyword by number
instead of by name. The corresponding lookup tables
still have a few entries without names, but the majority
of them have been determined and verified.
The PROFILE_HASHTABLE preprocessor definition can be set at build
time and is intended to be used to enable profiling capabilities
for developers working with hash table and set data structure
profiling. This hashtable profiling functionality was added into
the code a while back and isn't currently functional, but would
ultimately be nice to have. This commit is a first step towards
getting it working.
When PROFILE_HASHTABLE is set, it causes several counters used for
collecting performance metrics to be inserted into the core hashtable
structures. When PROFILE_HASHTABLE is not set, however, these
counters are omitted, and the other members of the structure only
ever contain constant data. I'm guessing that at some point, as an
optimization in the latter case, ClamAV began declaring the hashtable
structures `const`, causing gcc (and maybe other compilers) to put
the structures in the read-only data section. Thus, the code
crashes when PROFILE_HASHTABLE is defined and the counters in the
read-only data section try to get incremented. The fix for this is
to just not mark these structures as `const` if PROFILE_HASHTABLE
is defined.
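A sketch of the fix (the macro name is illustrative): only declare the
tables const when the profiling counters are compiled out, so that
incrementing a counter can never fault on a read-only data section.

```c
#ifdef PROFILE_HASHTABLE
#define HTABLE_CONST /* counters are mutated, so the table must be writable */
#else
#define HTABLE_CONST const /* no counters: the compiler may use .rodata */
#endif

struct example_hashtable {
    unsigned capacity;
#ifdef PROFILE_HASHTABLE
    unsigned long lookups, collisions; /* profiling counters */
#endif
};

static HTABLE_CONST struct example_hashtable example_table = {1024};
```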
Disable line wrap when printing the progress bar so that small terminal
windows do not see excessive lines printed.
Reduce the number of characters in the progress bar to accommodate
80-character-wide terminals.
Correctly display the number of kibibytes (KiB) in the progress bar.
Previously it was showing the number of MiB but printing "KiB".
Removed a problematic call to convert file descriptors to filepaths.
Added filename and tempfile names to scandesc calls in clamd.
Added a general scan option to treat the scan engine as unprivileged,
meaning that the scan engine will not have read access to the file.
Added a check to drop a temp file for RARs when we don't have
read access to the filepath provided (i.e., unprivileged is set, or the
access() check fails).
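An illustrative sketch of that decision (helper name hypothetical):

```c
#include <unistd.h>

/* Fall back to dumping a temp file when the RAR library can't be handed
 * a readable path: no path at all, unprivileged mode, or access() fails. */
static int need_tempfile(const char *filepath, int unprivileged)
{
    return !filepath || unprivileged || access(filepath, R_OK) != 0;
}
```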