The PDF parser currently prints verbose error messages when attempting
to shrink a buffer down to actual data length after decoding if it turns
out that the decoded stream was empty (0 bytes). Apart from the verbose
error messages, there is no real behavioral issue.
This commit fixes the issue by checking if any bytes were decoded before
attempting to shrink the buffer.
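
The check described above can be sketched as follows. This is only an
illustration: shrink_decoded_buffer() is a hypothetical helper, not a
libclamav function, and it uses plain realloc() in place of whatever
wrapper the parser really calls.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of libclamav): shrink *buf from its
 * allocated capacity down to `used` bytes.
 * Returns 0 on success, -1 if realloc() fails (buffer left untouched). */
static int shrink_decoded_buffer(unsigned char **buf, size_t used)
{
    unsigned char *tmp;

    if (used == 0) {
        /* Nothing was decoded. A zero-size realloc() is not portable and
         * may return NULL, which looks like an allocation failure, so
         * skip the shrink. The caller decides what to do with the
         * empty buffer. */
        return 0;
    }

    tmp = realloc(*buf, used);
    if (tmp == NULL)
        return -1;

    *buf = tmp;
    return 0;
}
```

With this guard, an empty decoded stream leaves the buffer pointer
unchanged and reports no error.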
Users have reported slow scan speeds of some PDF documents. The scan
speed was very slow on Windows in particular. Investigation indicated
significant time spent in cli_realloc.
Performance is particularly bad when a relatively large data stream is
decompressed using small chunk sizes and the final buffer is reallocated
to a larger buffer size each time a chunk is added.
This commit replaces BUFSIZ, which varies from 256 B to 8192 B, with
INFLATE_CHUNK_SIZE, set to 256kB, the chunk size recommended by the zlib
documentation for efficient inflate performance. The output buffer is
shrunk (reallocated) down to the final decoded buffer length so as not
to waste memory when many small buffers must be decompressed.
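
To illustrate why the chunk size matters, here is a simplified model of
the grow-then-shrink pattern. It is a sketch under assumptions:
decode_in_chunks() is a hypothetical stand-in that copies bytes instead
of calling zlib's inflate(), and it counts only the reallocations that
grow the buffer.

```c
#include <stdlib.h>
#include <string.h>

#define INFLATE_CHUNK_SIZE (256 * 1024) /* 256 kB, as described above */

/* Hypothetical stand-in for a decoder: copies `src` into a growing
 * output buffer in pieces of at most `chunk` bytes, enlarging the buffer
 * by one chunk whenever it is full, then shrinking it to the exact
 * length. Returns the buffer (caller frees) or NULL on failure or empty
 * input. *n_reallocs receives the number of growth reallocations. */
static unsigned char *decode_in_chunks(const unsigned char *src, size_t src_len,
                                       size_t chunk, size_t *out_len,
                                       size_t *n_reallocs)
{
    unsigned char *out = NULL, *tmp;
    size_t cap = 0, used = 0;

    *out_len = 0;
    *n_reallocs = 0;
    if (chunk == 0)
        return NULL;

    while (used < src_len) {
        size_t n = src_len - used;
        if (n > chunk)
            n = chunk;

        if (cap - used < n) { /* no room left: grow by one chunk */
            tmp = realloc(out, cap + chunk);
            if (tmp == NULL) {
                free(out);
                return NULL;
            }
            out = tmp;
            cap += chunk;
            (*n_reallocs)++;
        }
        memcpy(out + used, src + used, n);
        used += n;
    }

    /* Shrink to the final length, but only if something was decoded. */
    if (used > 0 && used < cap) {
        tmp = realloc(out, used);
        if (tmp != NULL)
            out = tmp; /* on failure, keep the larger buffer */
    }
    *out_len = used;
    return out;
}
```

In this model a 1 MiB stream needs 4096 growth reallocations with a
256-byte chunk but only 4 with INFLATE_CHUNK_SIZE, and a 100-byte stream
still ends up in a 100-byte buffer after the final shrink.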
A followup fix should provide a standard way to do zlib decompression
across libclamav where a linked list of decompressed chunks is
assembled and then the final output buffer is allocated at the end.
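
One possible shape for that followup is sketched below. None of these
names exist in libclamav; struct chunk, struct chunk_list and the
chunk_list_*() functions are hypothetical and only illustrate the idea
of collecting chunks first and allocating the output once.

```c
#include <stdlib.h>
#include <string.h>

/* One decompressed chunk; the payload is stored inline (C99 flexible
 * array member). */
struct chunk {
    struct chunk *next;
    size_t len;
    unsigned char data[];
};

struct chunk_list {
    struct chunk *head, *tail;
    size_t total; /* sum of all chunk lengths */
};

/* Append a copy of `data` to the list. Returns 0 on success, -1 on
 * allocation failure. */
static int chunk_list_append(struct chunk_list *l, const void *data, size_t len)
{
    struct chunk *c = malloc(sizeof(*c) + len);
    if (c == NULL)
        return -1;
    c->next = NULL;
    c->len  = len;
    if (len > 0)
        memcpy(c->data, data, len);
    if (l->tail != NULL)
        l->tail->next = c;
    else
        l->head = c;
    l->tail = c;
    l->total += len;
    return 0;
}

/* Free every chunk and reset the list to empty. */
static void chunk_list_free(struct chunk_list *l)
{
    struct chunk *c = l->head;
    while (c != NULL) {
        struct chunk *next = c->next;
        free(c);
        c = next;
    }
    l->head = l->tail = NULL;
    l->total = 0;
}

/* Allocate exactly `total` bytes once and copy all chunks into it.
 * Returns NULL if the list is empty or the allocation fails. */
static unsigned char *chunk_list_flatten(const struct chunk_list *l, size_t *out_len)
{
    unsigned char *out, *p;
    const struct chunk *c;

    *out_len = 0;
    if (l->total == 0)
        return NULL;
    out = malloc(l->total);
    if (out == NULL)
        return NULL;
    p = out;
    for (c = l->head; c != NULL; c = c->next) {
        memcpy(p, c->data, c->len);
        p += c->len;
    }
    *out_len = l->total;
    return out;
}
```

The trade-off is one extra copy at the end in exchange for never
reallocating the growing output buffer while decoding.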
Fix for minor memory leak in fmap_dump_to_file().
Fix to PDF object stream logic, accounting for a realloc() issue when the only PDF object stream fails to parse, and for when PDF objects in a stream appear to extend further than the size of the stream.
Fix for memory leak cleaning up PDF object stream buffer in error condition.
Fix to bug in pdf_decodestream wherein objects were found in an object stream, but the object stream could later be freed if max scansize was exceeded, resulting in a NULL dereference.
General cleanup of pdf_decodestream/pdf_decodestream_internal exit code logic.
Updated libclamav documentation detailing new scan options structure.
Renamed references to 'algorithmic' detection to 'heuristic' detection. Renamed references to 'properties' to 'collect metadata'.
Renamed references to 'scan all' to 'scan all match'.
Renamed a couple of 'Heuristic.*' signature names as 'Heuristics.*' signatures (plural) to match the majority of other heuristics.