Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add support for collecting and recording per-file metadata (#63)
This commit adds blob metadata to Nosey Parker. The scan command now collects and records some basic metadata about blobs (size in bytes, guessed mime type, guessed charset). The guessed metadata is based on path names, and at present only works on plain file inputs and not blobs found in Git history (see #16). If Nosey Parker is built with the libmagic feature, blob metadata is collected an recorded using an additional content-based mechanism that uses libmagic, which collects this information even for blobs found in Git history that do not have pathnames. This feature slows down scanning time something like 6-10x, and requires additional system-installed libraries to build, and so is not enabled by default. When scanning, by default, the metadata is collected and recorded only for blobs that have rule matches within them. The collection of blob metadata can be controlled slightly by the new `--record-all-blobs <BOOL>` command-line option; a true value causes all discovered blobs to have metadata collected and recorded, not just those with rule matches. The report command makes use of the newly collected metadata. In all output formats, the metadata is included. Additionally in this pull request: the performance of scanning on certain match-heavy workloads has been improved as much as 2x. This was achieved through using fewer sqlite transactions in the datastore implementation.
- Loading branch information