The size of the UE buffer for the new Adobe Reader X encryption support
was not properly recorded and may result in reading too far into the UE
buffer. This patch checks the size of the UE buffer and rejects it if
the length is not 32, as it does with the other AES256 CBC method.
290424 Missing break in switch - In hash_match: Missing break
statement between cases in switch statement
290414 Resource leak - In cli_scanishield_msi: Leak of memory or
pointers to system resources. Memory leak in a fail case
288197 Resource leak - In decrypt_any: Leak of memory or pointers
to system resources. Memory leak in a fail case
290426 Resource leak - In cli_magic_scan: Leak of memory or pointers
to system resources. Leaked a file prefix when running with
--save-temps
192923 Resource leak - In cli_scanrar: Leak of memory or pointers to
system resources. Leaked a file descriptor if a virus was found in
a RAR file comment
225146 Resource leak - In cli_scanegg: Leak of memory or pointers
to system resources. Leaked a file descriptor if unable to write
a comment file to disk
290425 Resource leak - In scan_common: Leak of memory or pointers
to system resources. Memory leaks in various fail cases.
Also changes cli_scanrar to write out the file comment only if
--leave-temps is specified and scan the buffer (like what is done
in cli_scanegg) instead of writing the file out, scanning that,
and then deleting the file if --leave-temps is not specified.
The unit tests stopped working when correcting an issue with a
switch statement that determined what type of signature had matched
on a Google SafeBrowsing GDB rule. Looking into the unit tests, it
looks like the code had always assumed that the test cases would be
detected by a malware test rule in unit_tests/input/daily.gdb, but
now some of the tests get matched on the phishing test rule.
I updated the test logic to be more clear, and added tests for both
cases now.
Fix some memory leaks in libclamav/scanners.c
PDFs may contain Javascript actions in objects. Those may actually be
indirect references to other objects where the Javascript resides rather
than storing the Javascript in the current object. The thing is,
sometimes the indirect object is empty (no actual script), in which case
the PDF may not have any active content.
This commit changes the Javascript detection logic so it only records
the stats/JSON after detecting the "Javascript" and "JS" tags in the
object (indirect or direct) where the Javascript is supposed to reside.
This moves the logic from an action callback when the object dictionary
key "Javascript" is detected over into the object extraction logic,
after all of the objects have been parsed.
At present many parsers create tmp subdirectories to store extracted
files. For parsers like the vba parser, this is required as the
directory is later scanned. For other parsers, these subdirectories are
probably not helpful now that we provide recursive sub-dirs when
--leave-temps is enabled. It's not quite as simple as removing the extra
subdirectories, however. Certain parsers, like autoit, don't create very
unique filenames and would result in file name collisions when
--leave-temps is not enabled.
The best thing to do would be to make sure each parser uses unique
filenames and doesn't rely on cli_magic_scan_dir() to scan extracted
content before removing the extra subdirectory. In the meantime, this
commit gives the extra subdirectories meaningful names to improve
readability.
This commit also:
- Provides the 'bmp' prefix for extracted PE icons.
- Removes empty tmp subdirs when extracting rtf files, to eliminate
clutter.
- The PDF parser sometimes creates tmp files when decompressing streams
before it knows if there is actually any content to decompress. This
resulted in a large number of empty files. While it would be best to
avoid creating empty files in the first place, that's not quite as
as it sounds. This commit does the next best thing and deletes the
tmp files if nothing was actually extracted, even if --leave-temps is
enabled.
- Removes the "scantemp" prefix for unnamed fmaps scanned with
cli_magic_scan(). The 5-character hashes given to tmp files with
prefixes resulted in occasional file name collisions when extracting
certain file types with thousands of embedded files.
- The VBA and TAR parsers mistakenly used NAME_MAX instead of PATH_MAX,
resulting in truncated file paths and failed extraction when
--leave-temps is enabled and a lot of recursion is in play. This commit
switches them from NAME_MAX to PATH_MAX.
Many of the core scanning functions' names no longer represent their
specific purpose or arguments. This commit aims to make the names more
intuitive. Names are now prefixed with "magic" if they involve
file-typing and file-type parsing. In addition, each function now
includes the type of input being scanned whether its "desc", "fmap", or
"buff". Some of the APIs also now specify "type" to indicate that a type
other than "ANY" may be passed in to select the type rather than use
file type magic for type recognition.
| current name | new name |
| ------------------------- | --------------------------------- |
| magic_scandesc() | cli_magic_scan() |
| cli_magic_scandesc_type() | <delete> |
| cli_magic_scandesc() | cli_magic_scan_desc() |
| cli_base_scandesc() | cli_magic_scan_desc_type() |
| cli_partition_scandesc() | <delete> |
| cli_map_scandesc() | magic_scan_nested_fmap_type() |
| cli_map_scan() | cli_magic_scan_nested_fmap_type() |
| cli_mem_scandesc() | cli_magic_scan_buff() |
| cli_scanbuff() | cli_scan_buff() |
| cli_scandesc() | cli_scan_desc() |
| cli_fmap_scandesc() | cli_scan_fmap() |
| cli_scanfile() | cli_magic_scan_file() |
| cli_scandir() | cli_magic_scan_dir() |
| cli_filetype2() | cli_determine_fmap_type() |
| cli_filetype() | cli_compare_ftm_file() |
| cli_partitiontype() | cli_compare_ftm_partition() |
| cli_scanraw() | scanraw() |
A way is needed to record scanned file names for two purposes:
1. File names (and extensions) must be stored in the json metadata
properties recorded when using the --gen-json clamscan option. Future
work may use this to compare file extensions with detected file types.
2. File names are useful when interpretting tmp directory output when
using the --leave-temps option.
This commit enables file name retention for later use by storing file
names in the fmap header structure, if a file name exists.
To store the names in fmaps, an optional name argument has been added to
any internal scan API's that create fmaps and every call to these APIs
has been modified to pass a file name or NULL if a file name is not
required. The zip and gpt parsers required some modification to record
file names. The NSIS and XAR parsers fail to collect file names at all
and will require future work to support file name extraction.
Also:
- Added recursive extraction to the tmp directory when the
--leave-temps option is enabled. When not enabled, the tmp directory
structure remains flat so as to prevent the likelihood of exceeding
MAX_PATH. The current tmp directory is stored in the scan context.
- Made the cli_scanfile() internal API non-static and added it to
scanners.h so it would be accessible outside of scanners.c in order to
remove code duplication within libmspack.c.
- Added function comments to scanners.h and matcher.h
- Converted a TDB-type macros and LSIG-type macros to enums for improved
type safey.
- Converted more return status variables from `int` to `cl_error_t` for
improved type safety, and corrected ooxml file typing functions so
they use `cli_file_t` exclusively rather than mixing types with
`cl_error_t`.
- Restructured the magic_scandesc() function to use goto's for error
handling and removed the early_ret_from_magicscan() macro and
magic_scandesc_cleanup() function. This makes the code easier to
read and made it easier to add the recursive tmp directory cleanup to
magic_scandesc().
- Corrected zip, egg, rar filename extraction issues.
- Removed use of extra sub-directory layer for zip, egg, and rar file
extraction. For Zip, this also involved changing the extracted
filenames to be randomly generated rather than using the "zip.###"
file name scheme.
Fix for an out-of-bounds read in the PDF parser when initializing
aes crypto routines that may result in a crash.
Bug found by OSS-Fuzz.
Also added checks for the arc4 init routine to mitigate the risk of a
similar issue.
XLM is a macro language in Excel that was used before VBA (before
1996). It is still parsed and executed by modern Excel and is gaining
popularity with malware authors.
This patch adds rudimentary support for detecting and extracting
Excel 4.0 (XLM) macros.
The code is based on Didier Steven's plugin_biff for oletools.py.
Fix for minor memory leak in fmap_dump_to_file().
Fix to PDF object stream logic, accounting for a realloc() issue when the only pdf object stream fails to parse, and for when pdf objects in a stream appear to extend further than the size of the stream.
Fix for memory leak cleaning up PDF object stream buffer in error condition.
Fix to bug in pdf_decodestream wherein objects were found in an object stream, but the object stream could later be free'd if max scansize was exceeded, resulting in a NULL dereference.
General cleanup of pdf_decodestream/pdf_decodestream_internal exit code logic.
Updated libclamav documentation detailing new scan options structure.
Renamed references to 'algorithmic' detection to 'heuristic' detection. Renaming references to 'properties' to 'collect metadata'.
Renamed references to 'scan all' to 'scan all match'.
Renamed a couple of 'Hueristic.*' signature names as 'Heuristics.*' signatures (plural) to match majority of other heuristics.