clamav

Commit Graph

Author	SHA1	Message	Date
Val Snyder	8a77214c82	Add CL_TYPE_AI_MODEL and associated file type magic signatures This is just preliminary support for identifying an assortment of different AI model files. So far, this detects the following types: - GGML GGUF (.gguf) - ONNX AI (.onnx) - TensorFlow Lite (.tflite) Additional types to consider: - SafeTensors (.safetensors) - TensorFlow (.pb, .ckpt, .tfrecords) - Keras (.keras) - pickle (.pkl) - numpy (.npy, .npz) - coreml (.coreml) - PyTorch (.pt, .pth, .bin, .mar, .pte, .pt2, .ptl) Outside of being able to differentiate by file type, the scanner will treat CL_TYPE_AI_MODEL the same as CL_TYPE_BINARY_DATA. We're not adding parsers to further process these files, for now.	2 months ago
Val Snyder	7ff29b8c37	Bump copyright dates for 2025	3 months ago
Andy Ragusa	79f2a5f2f6	Add parser for ALZ archives	1 year ago
Micah Snyder	3ae9c1e434	Add LHA/LZH archive support File type magic signatures chosen based on the extensions supported by Rust delharc crate. See: https://docs.rs/delharc/latest/delharc/	1 year ago
Micah Snyder	9cb28e51e6	Bump copyright dates for 2024	1 year ago
Micah Snyder	fd11f1b468	Add CL_TYPE_PYTHON_COMPILED and associated file type magic signatures It may be necessary to differentiate between *.pyc and other binary types in case additional processing is needed. Outside of being able to differentiate the by file type, the scanner will treat CL_TYPE_PYTHON_COMPILED the same as CL_TYPE_BINARY_DATA. That is - we're not adding parser at this time to further break down .pyc files.	1 year ago
Micah Snyder	3b2f8c044a	Support for extracting attachments from OneNote section files Includes rudimentary support for getting slices from FMap's and for interacting with libclamav's context structure. For now will use a Cisco-Talos org fork of the onenote_parser until the feature to read open a onenote section from a slice (instead of from a filepath) is added to the upstream.	1 year ago
Andy Ragusa	b4f0836236	Add support for UDF files Add support for specifically for Beginning Extended Area Descriptor (BEA01) type of UDF files.	2 years ago
Micah Snyder	6eebecc303	Bump copyright for 2023	2 years ago
Micah Snyder	9d8af639c3	Improve file type detection for some small files Some small files that appear to include archive content (zip entries, RAR archives, etc) later in the file are no longer set to that SFX type initially. However, the current way we skip those drops the initial type detection on the floor, meaning that something like a small image may end up as CL_TYPE_ANY instead of CL_TYPE_PNG. This commit fixes that by changing where in the process the SFX types are ignored.	3 years ago
Micah Snyder	223c690958	Use correct enum types for cli_memcpy internal API The `cli_memcpy` function is a version of memcpy that will catch exceptions on Windows if the memcpy fails because the memory-mapped file fails to read (like if a network drive is disconnected or something). It presently returns an `int`. Since it is only used in one spot, I've swapped it over to a `cl_error_t`, and created a unique variable so we're not piggybacking with the `cli_file_t` enum used to check the result of a filetyping scan.	3 years ago
Micah Snyder	73088d261b	Fix issue detecting embedded zips attached to small files If initial file type recognition comes back as an SFX type, which may happen for small files that do not get recognized as any other file type and contain a zip entry somewhere in the middle, then the type will be set to that SFX type. This is a problem because later on when we go to do embedded file type recognition, we explicitly skip SFX types, in addition to TAR's and other types that are parsed elsewhere and have a high embedded file type recognition FP-rate because they aren't compressed. This commit prohibits that initial FTM check from selecting an SFX type. The SFX type will be rediscovered in `scanraw()` where the type is handled/parsed.	3 years ago
micasnyd	140c88aa4e	Bump copyright for 2022 Includes minor format corrections.	3 years ago
Micah Snyder	db013a2bfd	libclamav: Fix scan recursion tracking Scan recursion is the process of identifying files embedded in other files and then scanning them, recursively. Internally this process is more complex than it may sound because a file may have multiple layers of types before finding a new "file". At present we treat the recursion count in the scanning context as an index into both our fmap list AND our container list. These two lists are conceptually a part of the same thing and should be unified. But what's concerning is that the "recursion level" isn't actually incremented or decremented at the same time that we add a layer to the fmap or container lists but instead is more touchy-feely, increasing when we find a new "file". To account for this shadiness, the size of the fmap and container lists has always been a little longer than our "max scan recursion" limit so we don't accidentally overflow the fmap or container arrays (!). I've implemented a single recursion-stack as an array, similar to before, which includes a pointer to each fmap at each layer, along with the size and type. Push and pop functions add and remove layers whenever a new fmap is added. A boolean argument when pushing indicates if the new layer represents a new buffer or new file (descriptor). A new buffer will reset the "nested fmap level" (described below). This commit also provides a solution for an issue where we detect embedded files more than once during scan recursion. For illustration, imagine a tarball named foo.tar.gz with this structure: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| └── baz.exe \| PE \| 2 \| 1 \| But suppose baz.exe embeds a ZIP archive and a 7Z archive, like this: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| baz.exe \| PE \| 0 \| 0 \| \| ├── sfx.zip \| ZIP \| 1 \| 1 \| \| │ └── hello.txt \| ASCII \| 2 \| 0 \| \| └── sfx.7z \| 7Z \| 1 \| 1 \| \| └── world.txt \| ASCII \| 2 \| 0 \| (A) If we scan for embedded files at any layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| ├── foo.tar \| TAR \| 1 \| 0 \| \| │ ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ │ └── hola.txt \| ASCII \| 3 \| 0 \| \| │ ├── baz.exe \| PE \| 2 \| 1 \| \| │ │ ├── sfx.zip \| ZIP \| 3 \| 1 \| \| │ │ │ └── hello.txt \| ASCII \| 4 \| 0 \| \| │ │ └── sfx.7z \| 7Z \| 3 \| 1 \| \| │ │ └── world.txt \| ASCII \| 4 \| 0 \| \| │ ├── sfx.zip \| ZIP \| 2 \| 1 \| \| │ │ └── hello.txt \| ASCII \| 3 \| 0 \| \| │ └── sfx.7z \| 7Z \| 2 \| 1 \| \| │ └── world.txt \| ASCII \| 3 \| 0 \| \| ├── sfx.zip \| ZIP \| 1 \| 1 \| \| └── sfx.7z \| 7Z \| 1 \| 1 \| (A) is bad because it scans content more than once. Note that for the GZ layer, it may detect the ZIP and 7Z if the signature hits on the compressed data, which it might, though extracting the ZIP and 7Z will likely fail. The reason the above doesn't happen now is that we restrict embedded type scans for a bunch of archive formats to include GZ and TAR. (B) If we scan for embedded files at the foo.tar layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| ├── baz.exe \| PE \| 2 \| 1 \| \| ├── sfx.zip \| ZIP \| 2 \| 1 \| \| │ └── hello.txt \| ASCII \| 3 \| 0 \| \| └── sfx.7z \| 7Z \| 2 \| 1 \| \| └── world.txt \| ASCII \| 3 \| 0 \| (B) is almost right. But we can achieve it easily enough only scanning for embedded content in the current fmap when the "nested fmap level" is 0. The upside is that it should safely detect all embedded content, even if it may think the sfz.zip and sfx.7z are in foo.tar instead of in baz.exe. The biggest risk I can think of affects ZIPs. SFXZIP detection is identical to ZIP detection, which is why we don't allow SFXZIP to be detected if insize of a ZIP. If we only allow embedded type scanning at fmap-layer 0 in each buffer, this will fail to detect the embedded ZIP if the bar.exe was not compressed in foo.zip and if non-compressed files extracted from ZIPs aren't extracted as new buffers: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.zip \| ZIP \| 0 \| 0 \| \| └── bar.exe \| PE \| 1 \| 1 \| \| └── sfx.zip \| ZIP \| 2 \| 2 \| Provided that we ensure all files extracted from zips are scanned in new buffers, option (B) should be safe. (C) If we scan for embedded files at the baz.exe layer, we may detect: \| description \| type \| rec level \| nested fmap level \| \| ------------------------- \| ----- \| --------- \| ----------------- \| \| foo.tar.gz \| GZ \| 0 \| 0 \| \| └── foo.tar \| TAR \| 1 \| 0 \| \| ├── bar.zip \| ZIP \| 2 \| 1 \| \| │ └── hola.txt \| ASCII \| 3 \| 0 \| \| └── baz.exe \| PE \| 2 \| 1 \| \| ├── sfx.zip \| ZIP \| 3 \| 1 \| \| │ └── hello.txt \| ASCII \| 4 \| 0 \| \| └── sfx.7z \| 7Z \| 3 \| 1 \| \| └── world.txt \| ASCII \| 4 \| 0 \| (C) is right. But it's harder to achieve. For this example we can get it by restricting 7ZSFX and ZIPSFX detection only when scanning an executable. But that may mean losing detection of archives embedded elsewhere. And we'd have to identify allowable container types for each possible embedded type, which would be very difficult. So this commit aims to solve the issue the (B)-way. Note that in all situations, we still have to scan with file typing enabled to determine if we need to reassign the current file type, such as re-identifying a Bzip2 archive as a DMG that happens to be Bzip2- compressed. Detection of DMG and a handful of other types rely on finding data partway through or near the ned of a file before reassigning the entire file as the new type. Other fixes and considerations in this commit: - The utf16 HTML parser has weak error handling, particularly with respect to creating a nested fmap for scanning the ascii decoded file. This commit cleans up the error handling and wraps the nested scan with the recursion-stack push()/pop() for correct recursion tracking. Before this commit, each container layer had a flag to indicate if the container layer is valid. We need something similar so that the cli_recursion_stack_get_() functions ignore normalized layers. Details... Imagine an LDB signature for HTML content that specifies a ZIP container. If the signature actually alerts on the normalized HTML and you don't ignore normalized layers for the container check, it will appear as though the alert is in an HTML container rather than a ZIP container. This commit accomplishes this with a boolean you set in the scan context before scanning a new layer. Then when the new fmap is created, it will use that flag to set similar flag for the layer. The context flag is reset those that anything after this doesn't have that flag. The flag allows the new recursion_stack_get() function to ignore normalized layers when iterating the stack to return a layer at a requested index, negative or positive. Scanning normalized extracted/normalized javascript and VBA should also use the 'layer is normalized' flag. - This commit also fixes Heuristic.Broken.Executable alert for ELF files to make sure that: A) these only alert if cli_append_virus() returns CL_VIRUS (aka it respects the FP check). B) all broken-executable alerts for ELF only happen if the SCAN_HEURISTIC_BROKEN option is enabled. - This commit also cleans up the error handling in cli_magic_scan_dir(). This was needed so we could correctly apply the layer-is-normalized-flag to all VBA macros extracted to a directory when scanning the directory. - Also fix an issue where exceeding scan maximums wouldn't cause embedded file detection scans to abort. Granted we don't actually want to abort if max filesize or max recursion depth are exceeded... only if max scansize, max files, and max scantime are exceeded. Add 'abort_scan' flag to scan context, to protect against depending on correct error propagation for fatal conditions. Instead, setting this flag in the scan context should guarantee that a fatal condition deep in scan recursion isn't lost which result in more stuff being scanned instead of aborting. This shouldn't be necessary, but some status codes like CL_ETIMEOUT never used to be fatal and it's easier to do this than to verify every parser only returns CL_ETIMEOUT and other "fatal status codes" in fatal conditions. - Remove duplicate is_tar() prototype from filestypes.c and include is_tar.h instead. - Presently we create the fmap hash when creating the fmap. This wastes a bit of CPU if the hash is never needed. Now that we're creating fmap's for all embedded files discovered with file type recognition scans, this is a much more frequent occurence and really slows things down. This commit fixes the issue by only creating fmap hashes as needed. This should not only resolve the perfomance impact of creating fmap's for all embedded files, but also should improve performance in general. - Add allmatch check to the zip parser after the central-header meta match. That way we don't multiple alerts with the same match except in allmatch mode. Clean up error handling in the zip parser a tiny bit. - Fixes to ensure that the scan limits such as scansize, filesize, recursion depth, # of embedded files, and scantime are always reported if AlertExceedsMax (--alert-exceeds-max) is enabled. - Fixed an issue where non-fatal alerts for exceeding scan maximums may mask signature matches later on. I changed it so these alerts use the "possibly unwanted" alert-type and thus only alert if no other alerts were found or if all-match or heuristic-precedence are enabled. - Added the "Heuristics.Limits.Exceeded." events to the JSON metadata when the --gen-json feature is enabled. These will show up once under "ParseErrors" the first time a limit is exceeded. In the present implementation, only one limits-exceeded events will be added, so as to prevent a malicious or malformed sample from filling the JSON buffer with millions of events and using a tonne of RAM.	4 years ago
Micah Snyder (micasnyd)	b9ca6ea103	Update copyright dates for 2021 Also fixes up clang-format.	4 years ago
Micah Snyder	4cce1fcd20	GIF, PNG bugfixes; Add AlertBrokenMedia option Added a new scan option to alert on broken media (graphics) file formats. This feature mitigates the risk of malformed media files intended to exploit vulnerabilities in other software. At present media validation exists for JPEG, TIFF, PNG, and GIF files. To enable this feature, set `AlertBrokenMedia yes` in clamd.conf, or use the `--alert-broken-media` option when using `clamscan`. These options are disabled by default for now. Application developers may enable this scan option by enabling `CL_SCAN_HEURISTIC_BROKEN_MEDIA` for the `heuristic` scan option bit field. Fixed PNG parser logic bugs that caused an excess of parsing errors and fixed a stack exhaustion issue affecting some systems when scanning PNG files. PNG file type detection was disabled via signature database update for 0.103.0 to mitigate effects from these bugs. Fixed an issue where PNG and GIF files no longer work with Target:5 (graphics) signatures if detected as CL_TYPE_PNG/GIF rather than as CL_TYPE_GRAPHICS. Target types now support up to 10 possible file types to make way for additional graphics types in future releases. Scanning JPEG, TIFF, PNG, and GIF files will no longer return "parse" errors when file format validation fails. Instead, the scan will alert with the "Heuristics.Broken.Media" signature prefix and a descriptive suffix to indicate the issue, provided that the "alert broken media" feature is enabled. GIF format validation will no longer fail if the GIF image is missing the trailer byte, as this appears to be a relatively common issue in otherwise functional GIF files. Added a TIFF dynamic configuration (DCONF) option, which was missing. This will allow us to disable TIFF format validation via signature database update in the event that it proves to be problematic. This feature already exists for many other file types. Added CL_TYPE_JPEG and CL_TYPE_TIFF types.	4 years ago
Micah Snyder	9b9999d778	Rename core scanning functions Many of the core scanning functions' names no longer represent their specific purpose or arguments. This commit aims to make the names more intuitive. Names are now prefixed with "magic" if they involve file-typing and file-type parsing. In addition, each function now includes the type of input being scanned whether its "desc", "fmap", or "buff". Some of the APIs also now specify "type" to indicate that a type other than "ANY" may be passed in to select the type rather than use file type magic for type recognition. \| current name \| new name \| \| ------------------------- \| --------------------------------- \| \| magic_scandesc() \| cli_magic_scan() \| \| cli_magic_scandesc_type() \| <delete> \| \| cli_magic_scandesc() \| cli_magic_scan_desc() \| \| cli_base_scandesc() \| cli_magic_scan_desc_type() \| \| cli_partition_scandesc() \| <delete> \| \| cli_map_scandesc() \| magic_scan_nested_fmap_type() \| \| cli_map_scan() \| cli_magic_scan_nested_fmap_type() \| \| cli_mem_scandesc() \| cli_magic_scan_buff() \| \| cli_scanbuff() \| cli_scan_buff() \| \| cli_scandesc() \| cli_scan_desc() \| \| cli_fmap_scandesc() \| cli_scan_fmap() \| \| cli_scanfile() \| cli_magic_scan_file() \| \| cli_scandir() \| cli_magic_scan_dir() \| \| cli_filetype2() \| cli_determine_fmap_type() \| \| cli_filetype() \| cli_compare_ftm_file() \| \| cli_partitiontype() \| cli_compare_ftm_partition() \| \| cli_scanraw() \| scanraw() \|	5 years ago
Micah Snyder	005cbf5a37	Record names of extracted files A way is needed to record scanned file names for two purposes: 1. File names (and extensions) must be stored in the json metadata properties recorded when using the --gen-json clamscan option. Future work may use this to compare file extensions with detected file types. 2. File names are useful when interpretting tmp directory output when using the --leave-temps option. This commit enables file name retention for later use by storing file names in the fmap header structure, if a file name exists. To store the names in fmaps, an optional name argument has been added to any internal scan API's that create fmaps and every call to these APIs has been modified to pass a file name or NULL if a file name is not required. The zip and gpt parsers required some modification to record file names. The NSIS and XAR parsers fail to collect file names at all and will require future work to support file name extraction. Also: - Added recursive extraction to the tmp directory when the --leave-temps option is enabled. When not enabled, the tmp directory structure remains flat so as to prevent the likelihood of exceeding MAX_PATH. The current tmp directory is stored in the scan context. - Made the cli_scanfile() internal API non-static and added it to scanners.h so it would be accessible outside of scanners.c in order to remove code duplication within libmspack.c. - Added function comments to scanners.h and matcher.h - Converted a TDB-type macros and LSIG-type macros to enums for improved type safey. - Converted more return status variables from `int` to `cl_error_t` for improved type safety, and corrected ooxml file typing functions so they use `cli_file_t` exclusively rather than mixing types with `cl_error_t`. - Restructured the magic_scandesc() function to use goto's for error handling and removed the early_ret_from_magicscan() macro and magic_scandesc_cleanup() function. This makes the code easier to read and made it easier to add the recursive tmp directory cleanup to magic_scandesc(). - Corrected zip, egg, rar filename extraction issues. - Removed use of extra sub-directory layer for zip, egg, and rar file extraction. For Zip, this also involved changing the extracted filenames to be randomly generated rather than using the "zip.###" file name scheme.	5 years ago
Mickey Sola	ffd0f1357f	png - fixup PR based on feedback	5 years ago
Aldo Mazzeo	f366b7c703	Transforming the PNG checker into a PNG exploit seeker	5 years ago
Mickey Sola	f55acb436e	gif - update PR based on feedback; add dconf option for gif scanning	5 years ago
Aldo Mazzeo	153a87a74b	Making the GIF parser more tolerant and supporting GIF overlays	5 years ago
Micah Snyder	206dbaefe8	Update copyright dates for 2020	5 years ago
Micah Snyder	ad6e0f70cb	Adds unzip parser code readability improvements; doxygen function comments.	6 years ago
Micah Snyder	ee40795fe2	Converted mpool calls to macros when USE_MPOOL is defined to clearly differentiate between function and macro behavior.	6 years ago
Micah Snyder	0450e68551	Added new EGG archive extraction feature, written from scratch based on ESTsoft's EGG archive specification. EGG extraction support includes deflate, bzip2, and lzma decompression. AZO (LZO?) decompression not yet supported. Solid archives not yet supported. Split archives may have some limited success. This commit also includes updates to autoconf iconv.m4 file enable detection of libiconv in alternative install locations.	6 years ago
Micah Snyder	52cddcbcfd	Updating and cleaning up copyright notices.	6 years ago
Micah Snyder	72fd33c8b2	clang-format'd using new .clang-format rules.	6 years ago
Micah Snyder	38fe8b69a0	Added .clang-format style rules, clam-format script to automate formatting of ClamAV code, and preparing select files so that clang-format does not alter carefully formatted sections.	6 years ago
Micah Snyder (micasnyd)	56bb195e07	bb12102: adding CL_TYPE_LNK for Windows Shortcut Files.	7 years ago
Steven Morgan	aedd18ac32	bb11586 - change CL_TYPE_EPS to CL_TYPE_PS.	9 years ago
Steven Morgan	e98acd72db	bb11586 - add file type CL_TYPE_EPS for raw scan matching of PostScript files.	9 years ago
Kevin Lin	ef48d7cbeb	MHTML: added filetype and switch case	9 years ago
Kevin Lin	ce174c71d5	added 'CustomXML' as trigger for likely OOXML	9 years ago
Kevin Lin	8250b1a72f	fix ooxml filetype detection	9 years ago
Kevin Lin	8059ffb786	clean up and boost accuracy to detecting OOXML documents	9 years ago
Kevin Lin	c6f7be5536	ooxml_hwp: add support for filetyping and preclassification	10 years ago
Kevin Lin	6cd5a9dc4e	hwpole2: new filetype and handler for hwp embedded ole2 files	10 years ago
Kevin Lin	904fe15510	add HMPML filetype, tab fixes in filetype.c	10 years ago
Steven Morgan	bb6379cab9	bb11454 - correct if/else bracketing for ooxml file typing.	10 years ago
Kevin Lin	146fbb29ad	add HWP 3.x internal filetypes	10 years ago
Steven Morgan	fd1e2dd3ad	bb11438 - strengthen file typing for OOXML.	10 years ago
Mickey Sola	46a35abe56	mass update of copyright headers	10 years ago
Kevin Lin	4cdcd47de8	added enums for Word 2003 XML and Excel 2003 XML files	10 years ago
Kevin Lin	b5641e9c79	bb#11205 - fixed implicit declaration	11 years ago
Kevin Lin	c8c80ddfd9	bb#11145 - added function to determine ooxml filetype unzip: adjusted dir/file searching mechanism	11 years ago
Kevin Lin	f290ffd350	mbr: increased clarity on boot section scan ftype: increased clarity on unsupported file systems	11 years ago
Shawn Webb	30a7509744	Add proof-of-concept XDP support. This feature requires libxml2 support. This commit bumps FLEVEL and introduces a new filetype based on the expected XML namespace for XDP files.	11 years ago
Shawn Webb	cd94be7a52	Silence a bunch of compiler warnings in libclamav	11 years ago
Shawn Webb	60d8d2c352	Move all the crypto API to clamav.h	11 years ago

1 2 3

123 Commits (8a77214c8292f5f7ce94dee7b4f363b5c14d91af)