Add support for collecting and recording per-file metadata (#63)

This commit adds blob metadata to Nosey Parker.

The `scan` command now collects and records basic metadata about blobs: size in bytes, guessed mime type, and guessed charset. The guessed metadata is based on path names, so at present it is available only for plain file inputs and not for blobs found in Git history (see #16).

If Nosey Parker is built with the `libmagic` feature, blob metadata is also collected and recorded using a content-based mechanism that uses libmagic, which works even for blobs found in Git history that do not have path names. This feature slows scanning by roughly 6-10x and requires additional system-installed libraries to build, so it is not enabled by default.

When scanning, metadata is by default collected and recorded only for blobs that contain rule matches. The new `--record-all-blobs <BOOL>` command-line option adjusts this: a true value causes metadata to be collected and recorded for all discovered blobs, not just those with rule matches.

The `report` command includes the newly collected metadata in all output formats.

Additionally in this pull request: scanning performance on certain match-heavy workloads has been improved by as much as 2x. This was achieved by using fewer sqlite transactions in the datastore implementation.
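As a rough illustration of why path-based guessing cannot cover blobs from Git history, here is a minimal sketch. It is not Nosey Parker's actual implementation: the `BlobMetadata` struct, the `guess_mime_from_path` and `collect_metadata` functions, and the tiny extension table are all hypothetical, chosen only to show that a blob without a path name still gets a size but no guessed mime type.

```rust
use std::path::Path;

/// Hypothetical metadata record; the field names are illustrative only.
#[derive(Debug, PartialEq)]
struct BlobMetadata {
    num_bytes: usize,
    mime_essence: Option<&'static str>,
}

/// Guess a mime type from the path's extension using a small hard-coded
/// table (an assumption for this sketch, not the real lookup mechanism).
fn guess_mime_from_path(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "txt" => Some("text/plain"),
        "json" => Some("application/json"),
        "html" | "htm" => Some("text/html"),
        "png" => Some("image/png"),
        _ => None,
    }
}

/// Size is always known from the content; the mime type is guessed only
/// when a path is available (e.g. not for blobs found in Git history).
fn collect_metadata(path: Option<&Path>, content: &[u8]) -> BlobMetadata {
    BlobMetadata {
        num_bytes: content.len(),
        mime_essence: path.and_then(guess_mime_from_path),
    }
}

fn main() {
    let with_path = collect_metadata(Some(Path::new("config/settings.JSON")), b"{}");
    let without_path = collect_metadata(None, b"{}");
    println!("{:?}", with_path);
    println!("{:?}", without_path);
}
```

A content-based mechanism such as libmagic fills the gap in the second case, because it inspects the bytes rather than the name.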
praetorian-inc · Jun 30, 2023 · commit 3626257
1 parent 18224b8

Showing 35 changed files with 1,136 additions and 515 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -29,6 +29,12 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 - The Git repository cloning behavior in the `scan` command can now be controlled with the new `--git-clone-mode MODE` parameter.
 
+- In the `scan` command, basic blob metadata is recorded in the datastore for each discovered blob, including blob size in bytes and guessed mime type and charset when available.
+  A path-based mechanism is used to guess mime type; at present, this only works for plain file inputs (i.e., not for blobs found in Git history).
+  Optionally, if the `libmagic` Cargo feature is enabled, libmagic (the guts of the `file` command-line program) is used to guess mime type and charset based on content for blobs from all sources.
+  This metadata is recorded for each blob in which matches are found, but this behavior can be enabled for all blobs using the new `--record-all-blobs true` parameter.
+  This newly-recorded metadata is included in output of the `report` command.
+
 
 ### Changes
 - Existing rules were modified to reduce both false positives and false negatives:
@@ -42,7 +48,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - When a Git repository is cloned, the default behavior is to match `git clone --bare` instead of `git clone --mirror`.
   This new default behavior results in cloning potentially less content, but avoids cloning content from forks from repositories hosted on GitHub.
 
-- The command-line help has been refined for clarity
+- The command-line help has been refined for clarity.
+
+- Scanning performance has been improved on particular workloads by as much as 2x by recording matches to the datastore in larger batches.
+  This is particularly relevant to heavy multithreaded scanning workloads where the inputs have many matches.
 
 
 ### Fixes