diff --git a/README.md b/README.md index e744df2..84e579f 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,7 @@ paperless-ngx-postprocessor allows you to automatically set titles, ASNs, and cr * Setup rulesets to choose which documents are postprocessed and which are ignored, based on metadata like correspondent, document_type, storage_path, tags, and more * For each ruleset, extract metadata using [Python regular expressions](https://docs.python.org/3/library/re.html#regular-expression-syntax) * Use [Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/) to specify new values for archive serial number, title, and created date, using the values from your regular expression +* Optionally use [Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/) to validate document metadata, and add a tag to documents that have invalid metadata (e.g. to catch parsing errors) * Optionally apply a tag to documents that are changed during postprocessing, so you can keep track of which documents have changed * Optionally make backups of changes, so you can restore document metadata back to the way it was before postprocessing * Optionally run on one or more existing documents, if you need to adjust the metadata of documents that have already been consumed by Paperless-ngx @@ -65,6 +66,8 @@ Last but not least, create rulesets in the `paperless-postprocessor-ngx/rulesets paperless-ngx-postprocessor works by reading rulesets from all the `.yml` files in the `rulesets.d` folder, seeing if the contents of the document match any of the rulesets, extracting values from the document's contents using a regular expression, and then writing new values for the metadata based on the document's preexisting metadata and any values extracted using the regular expression. +You can also provide an optional validation rule to catch documents whose metadata doesn't get set properly. + ### An example An example helps illustrate this. Say you have the following ruleset: @@ -75,6 +78,7 @@ Some Ruleset Name: metadata_postprocessing: source: '{{ source | title }}' # This applies the Jinja 'title' filter, capitalizing each word title: '{{created_year}}-{{created_month}}-{{created_day}} -- {{correspondent}} -- {{document_type}} (from {{ source }})' + validation_rule: '{{ created_date_object == last_date_object_of_month(created_date_object) }}' ``` First paperless-ngx-postprocessor will get a local copy of the document's preexisting metadata. For a full list of the preexisting metadata you can use for matching and postprocessing, see [below](#available-metadata). @@ -94,9 +98,11 @@ Finally after all the rules are processed, paperless-ngx-postprocessor will take If any of those differ from the values the document's metadata had when we started, then paperless-ngx-postprocessor will push the new values to paperless-ngx, and processing is complete. +After all of those values have been pushed, paperless-ngx-postprocessor will then try to evaluate the `validation_rule` field. In this case, the validation rule evaluates to `True` if the document's created date is the last day of the month. + ### Some caveats -In order to make parsing dates easier, paperless-postprocessor-ngx will "normalize" and error-check the `created_year`, `created_month`, and `created_day` fields after the initial values are extracted using the regular expression, and after every individual postprocessing rule. 
+In order to make parsing dates easier, paperless-ngx-postprocessor will "normalize" and error-check the `created_year`, `created_month`, and `created_day` fields after the initial values are extracted using the regular expression, and after every individual postprocessing rule.
Normalization is as follows:
* `created_day` will be turned into a zero-padded two-digit string (e.g. `09`).
@@ -117,6 +123,14 @@ In addition to the [default Jinja filters](https://jinja.palletsprojects.com/en/
  * Matches using `re.match()`. Only returns `True` or `False`. For details see the [official python documentation](https://docs.python.org/3/library/re.html#re.match).
* `regex_sub(pattern, repl)`
  * Substitutes using `re.sub()`. For details see the [official python documentation](https://docs.python.org/3/library/re.html#re.sub).
+* `date(year, month, day)`
+  * Creates a [Python `date` object](https://docs.python.org/3/library/datetime.html#date-objects) for the given date. This allows easier date manipulation inside Jinja templates.
+* `timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)`
+  * Creates a [Python `timedelta` object](https://docs.python.org/3/library/datetime.html#timedelta-objects). This allows easier date manipulation inside Jinja templates.
+* `last_date_object_of_month(date_object)`
+  * Takes a Python `date` object, extracts its month, and returns a new `date` object that corresponds to the last day of that month.
+* `num_documents(**constraints)`
+  * Queries Paperless-ngx to see how many documents satisfy all of the `constraints`. For more information see the `num_documents()` section below.
These can be used like this:
```
@@ -171,6 +185,86 @@ Each of the rules will match any and every document (since their `match` field i
6. Since fields persist across rulesets, and `bar` was set in the `First Ruleset`, title will be set to `uppercase foo is YOU_FOUND_ME`.
7. This title will then be used to finally update paperless-ngx.
+### The `num_documents()` filter
+
+The `num_documents()` filter is primarily intended for validation rules. It returns the number of documents that match *all* of the given constraints. Each of the constraints must be specified by keyword. Valid arguments are:
+* `correspondent` - The name of the correspondent
+* `document_type` - The name of the document type
+* `storage_path` - The name of the storage path
+* `asn` - The archive serial number
+* `title` - The title of the document
+* `added_year` - The added year (as an `int`)
+* `added_month` - The added month (as an `int`)
+* `added_day` - The added day (as an `int`)
+* `added_date_object` - The added date as a Python `date` object. This is essentially a shortcut for specifying all of `added_year`, `added_month`, and `added_day`.
+* `added_range` - Finds documents added within a given range. The value should be a tuple containing two `date` objects, e.g. `(start_date, end_date)`. If either date is `None`, then that side of the limit is ignored. The limits are exclusive, so `(date(2063, 4, 1), None)` will find documents added on or after April 2, 2063, and will not match any documents added on April 1.
+* `created_year` - The created year (as an `int`)
+* `created_month` - The created month (as an `int`)
+* `created_day` - The created day (as an `int`)
+* `created_date_object` - The created date as a Python `date` object. This is essentially a shortcut for specifying all of `created_year`, `created_month`, and `created_day`.
+* `created_range` - Finds documents created within a given range. The value should be a tuple containing two `date` objects, e.g. `(start_date, end_date)`. If either date is `None`, then that side of the limit is ignored. The limits are exclusive, so `(date(2063, 4, 1), None)` will find documents created on or after April 2, 2063, and will not match any documents created on April 1.
+
+Some examples will help explain how to use `num_documents()`.
+
+### Example validation rules
+
+Say you have documents whose creation dates should only be the end of the month (e.g. a bank statement). To catch documents whose creation date isn't the end of the month, you could use:
+```yaml
+validation_rule: "{{ created_date_object == last_date_object_of_month(created_date_object) }}"
+```
+
+Say you have documents that should only be created on Sundays. Then you could use [the Python `date` object's `weekday()` method](https://docs.python.org/3/library/datetime.html#datetime.date.weekday):
+```yaml
+validation_rule: "{{ created_date_object.weekday() == 6 }}"
+```
+
+Say you have documents that should be unique, i.e. there should be at most one document with a given correspondent, document type, storage path, etc. on a given day. You could use the `num_documents` custom Jinja filter:
+```yaml
+validation_rule: "{{ num_documents(correspondent=correspondent, document_type=document_type, storage_path=storage_path, created_date_object=created_date_object) == 1 }}"
+```
+(Note that you have to specify all of those selectors, since `num_documents()` looks at *all* documents, *not* just those that would otherwise match the current ruleset's `match` rule.)
+
+Or you can get even fancier: say you want at most one document from a particular correspondent in a given calendar week, starting on Sunday. Then we need an expression that will give us the Saturday before, since the range for `created_range` is exclusive. This little one-liner does just that, using the Python `timedelta` object:
+```yaml
+{% set week_start = created_date_object - timedelta(days=(((created_date_object.weekday()+1) % 7) + 1)) %}
+```
+
+And then the Sunday after is just 8 days later:
+```yaml
+{% set week_end = week_start + timedelta(days=8) %}
+```
+
+Putting it all together, we get a validation rule like:
+```yaml
+validation_rule: >-
+  {% set week_start = created_date_object - timedelta(days=(((created_date_object.weekday()+1) % 7) + 1)) %}
+  {% set week_end = week_start + timedelta(days=8) %}
+  {{ num_documents(correspondent=correspondent, created_range=(week_start, week_end)) == 1 }}
+```
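+
+As a complete worked example, a ruleset that applies this weekly-uniqueness check to a hypothetical correspondent (the ruleset and correspondent names are just placeholders) might look like:
+```yaml
+At most one per week from The Bank:
+  match: '{{ correspondent == "The Bank" }}'
+  validation_rule: >-
+    {% set week_start = created_date_object - timedelta(days=(((created_date_object.weekday()+1) % 7) + 1)) %}
+    {% set week_end = week_start + timedelta(days=8) %}
+    {{ num_documents(correspondent=correspondent, created_range=(week_start, week_end)) == 1 }}
+```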
+
+#### Exceptions
+
+Sometimes you'll want to exclude some documents from validation. To do so, you'll need to adjust the `match` rule to exclude them. In that case, it's recommended that you split the processing and the validation into separate rulesets. E.g. to ignore documents 123 and 456 when doing validation, this:
+```yaml
+Some rulename:
+  match: '{{ SOME_FILTER }}'
+  metadata_postprocessing:
+    some_var: '{{ SOME_POSTPROCESSING_RULE }}'
+  validation_rule: '{{ SOME_VALIDATION_RULE }}'
+```
+
+becomes this:
+```yaml
+Some rulename for postprocessing:
+  match: '{{ SOME_FILTER }}'
+  metadata_postprocessing:
+    some_var: '{{ SOME_POSTPROCESSING_RULE }}'
+---
+Some rulename for validation:
+  match: '{{ SOME_FILTER and document_id not in [123, 456] }}'
+  validation_rule: '{{ SOME_VALIDATION_RULE }}'
+```
+
+
## Formal ruleset definition

### Ruleset syntax
@@ -184,18 +278,21 @@ Ruleset Name:
    METADATA_FIELDNAME_1: METADATA_TEMPLATE_1
    ...
    METADATA_FIELDNAME_N: METADATA_TEMPLATE_N
+  validation_rule: VALIDATION_TEMPLATE
```
where
* `MATCH_TEMPLATE` is a Jinja template. If it evaluates to `True`, the ruleset will match and postprocessing will continue.
* `metadata_regex` is optional. If specified, `REGEX` is a Python regular expression. Any named groups in `REGEX` will be saved and their values can be used in the postprocessing rules in this ruleset.
* `metadata_postprocessing` is optional. If not specified, then paperless-ngx-postprocessor will update the document's metadata based only on the fields extracted from the regular expression.
* `METADATA_FIELDNAME_X` is the name of a metadata field to update, and `METADATA_TEMPLATE_X` is a Jinja template that will be evaluated using the metadata so far. You can have as many metadata fields as you like.
+* `validation_rule` is optional. If specified, paperless-ngx-postprocessor will evaluate the `VALIDATION_TEMPLATE` Jinja template. If it evaluates to `False` and the `INVALID_TAG` is set, then the `INVALID_TAG` will be added to the document. (If `validation_rule` is omitted, no validation check is done; see the example below.)
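+
+For example, here is a minimal concrete ruleset that follows this syntax (the ruleset name, match rule, regex, and templates are all purely illustrative):
+```yaml
+Parse statement dates:
+  match: '{{ correspondent == "The Bank" }}'
+  metadata_regex: '(?P<created_year>\d{4})-(?P<created_month>\d{2})-(?P<created_day>\d{2})'
+  metadata_postprocessing:
+    title: '{{ created_year }}-{{ created_month }}-{{ created_day }} -- {{ correspondent }}'
+  validation_rule: '{{ created_year | int >= 2000 }}'
+```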
### Available metadata:

The metadata available for matching and postprocessing mostly matches [the metadata available in paperless-ngx for filename handling](https://paperless-ngx.readthedocs.io/en/latest/advanced_usage.html#file-name-handling).

The following fields are read-only. They keep the same value through postprocessing as they had before postprocessing started. (If you try to overwrite them with new values, those values will be ignored.)
+* `document_id`: The document ID.
* `correspondent`: The name of the correspondent, or `None`.
* `document_type`: The name of the document type, or `None`.
* `tag_list`: A list object containing the names of all tags assigned to the document.
@@ -204,6 +301,8 @@
* `added_year`: Year added only (as a `str`, not an `int`).
* `added_month`: Month added only, number 01-12 (as a `str`, not an `int`).
* `added_day`: Day added only, number 01-31 (as a `str`, not an `int`).
+* `added_date`: The date the document was added in `YYYY-MM-DD` format.
+* `added_date_object`: A Python [date object](https://docs.python.org/3/library/datetime.html#date-objects) for the date the document was added.

The following fields are available for matching, and can be overwritten by values extracted from the regular expression (e.g. by using a named group with the field name) or by postprocessing rules.
* `asn`: The archive serial number of the document, or `None`.
@@ -215,6 +314,7 @@ The following fields are read-only, but will be updated automatically after every step by the values given in the `created_year`, `created_month`, and `created_day` fields.
* `created`: The full date (ISO format) the document was created.
* `created_date`: The date the document was created in `YYYY-MM-DD` format.
+* `created_date_object`: A Python [date object](https://docs.python.org/3/library/datetime.html#date-objects) for the date the document was created.

## Configuration

@@ -224,6 +324,7 @@ paperless-ngx-postprocessor can be configured using the following environment va
* `PNGX_POSTPROCESSOR_DRY_RUN=`: If set to `True`, paperless-ngx-postprocessor will not actually push any changes to paperless-ngx. (default: `False`)
-* `PNGX_POSTPROCESSOR_BACKUP=`: Backup file to write any changed values to. If no filename is given, one will be automatically generated based on the current date and time. If the path is a directory, the automatically generated file will be stored in that directory. (default: `False`)
+* `PNGX_POSTPROCESSOR_BACKUP=`: Backup file to write any changed values to. If the string `DEFAULT` is given, one will be automatically generated based on the current date and time. If the path is a directory, the automatically generated file will be stored in that directory. (default: `False`)
* `PNGX_POSTPROCESSOR_POSTPROCESSING_TAG=`: A tag to apply if any changes are made during postprocessing. (default: `None`)
+* `PNGX_POSTPROCESSOR_INVALID_TAG=`: A tag to apply if the document fails any validation rules (see the example below). (default: `None`)
* `PNGX_POSTPROCESSOR_RULESETS_DIR=`: The config directory (within the Docker container) containing the rulesets for postprocessing. (default: `/usr/src/paperless-ngx-postprocessor/rulesets.d`)
* `PNGX_POSTPROCESSOR_PAPERLESS_API_URL=`: The full URL to access the Paperless-ngx REST API (within the Docker container). (default: `http://localhost:8000/api`)
* `PNGX_POSTPROCESSOR_PAPERLESS_SRC_DIR=`: The directory containing the source for the running instance of paperless-ngx (within the Docker container). If this is set incorrectly, postprocessor will not be able to automagically acquire the auth token. (default: `/usr/src/paperless/src`)
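+
+For example, to set both tags, your `docker-compose.env` (if you use one) might include lines like these (a minimal sketch; the tag names are just examples, and the tags should already exist in paperless-ngx):
+```
+PNGX_POSTPROCESSOR_POSTPROCESSING_TAG=post-processed
+PNGX_POSTPROCESSOR_INVALID_TAG=invalid-metadata
+```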
@@ -271,6 +372,8 @@ Note that to run the management script from the docker host, you need to provide
./paperlessngx_postprocessor.py --auth-token THE_AUTH_TOKEN [specific command here]
```
+You'll probably also need to specify other configuration options (like the rulesets directory and the API URL), since paperless-ngx-postprocessor won't automatically read them from Paperless-ngx's `docker-compose.env` file.
+
### Running inside or outside the docker container

Note that no matter where you run it, `paperlessngx_postprocessor.py` will try to use sensible defaults to figure out how to access the Paperless-ngx API. If you have a custom configuration, you may need to specify additional configuration options to `paperlessngx_postprocessor.py`. See [Configuration](#configuration) above for more details.
@@ -279,10 +382,10 @@ In terms of how the script works in management mode, it runs post-processing on
For example to re-run postprocessing on all documents with `correspondent` `The Bank`, you would do the following (including the auth token if running this command from the Docker host):
```bash
-./paperlessngx_postprocessor.py [--auth-token THE_AUTH_TOKEN] correspondent "The Bank"
+./paperlessngx_postprocessor.py [--auth-token THE_AUTH_TOKEN] [OTHER OPTIONS] process --correspondent "The Bank"
```
-You can choose all documents of a particular `correspondent` or `document_type` or `storage_path`, all documents with a specific `tag`, or just all documents (using `all`), or a specific document using its `document_id`. Note that you cannot combine selectors on the command line: e.g it's not possible to select all documents that match both a given `document_type` and `tag` simultaneously on the command line.
+You can choose all documents of a particular `correspondent`, `document_type`, `storage_path`, `tag`, and many other selectors, by `document_id`, or even all documents. For details on how to specify documents, do `./paperlessngx_postprocessor.py process --help`. Note that as of version 2.0.0, you **can** combine selectors on the command line.
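+
+For example, to re-run postprocessing on all of The Bank's documents that also carry the (hypothetical) tag `statement` and were created in 2063, you could combine selectors like this:
+```bash
+./paperlessngx_postprocessor.py [--auth-token THE_AUTH_TOKEN] process --correspondent "The Bank" --tag "statement" --created-year 2063
+```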
The command line interface supports all of the same options that you can set via the environment variables listed in the [Configuration section above](#configuration). To see how to specify them, use the command line interface's built-in help:
```bash
@@ -313,10 +416,18 @@ To restore backup to undo changes, do:
If you want to see what the restore will do, you can open up the backup file in a text editor. Inside is just a yaml document with all of the document IDs and what their fields should be restored to.

-### Upgrading
+## Upgrading
+### Upgrading `paperless-ngx`
If you are running paperless-ngx in a Docker container, you will need to redo [setup step two](#2-run-the-one-time-setup-script-inside-the-paperless-ngx-docker-container) after any time you upgrade paperless-ngx.

+### Upgrading `paperless-ngx-postprocessor`
+In the directory where you checked out `paperless-ngx-postprocessor`, just do a `git pull`.
+
+#### Upgrading from v1 to v2
+- Rulesets for v2 are a superset of those for v1, so no changes should be necessary.
+- The command line interface has undergone breaking changes, so if you had any scripts that ran the management script (outside of running the standard post-consumption script), they'll need to be updated (see the example below).
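+
+For example, a v1 invocation that postprocessed a single document maps onto the v2 interface like this (the document ID is just illustrative):
+```bash
+# v1 (no longer works)
+./paperlessngx_postprocessor.py document_id 1234
+# v2
+./paperlessngx_postprocessor.py process --document-id 1234
+```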
+
## FAQ
### Will this work with paperless or paperless-ng?
diff --git a/paperlessngx_postprocessor.py b/paperlessngx_postprocessor.py
index be62eff..232bba2 100755
--- a/paperlessngx_postprocessor.py
+++ b/paperlessngx_postprocessor.py
@@ -4,88 +4,130 @@ import logging
import sys
import yaml
+import os
from paperlessngx_postprocessor import Config, PaperlessAPI, Postprocessor
if __name__ == "__main__":
    logging.basicConfig(format="[%(asctime)s] [%(levelname)s] [%(module)s] %(message)s")#, level=logging.DEBUG)
-    config = Config()
+    config = Config(Config.general_options())
    arg_parser = argparse.ArgumentParser(description="Apply postprocessing to documents in Paperless-ngx",
                                         #formatter_class=argparse.ArgumentDefaultsHelpFormatter,
                                         epilog="See https://github.com/jgillula/paperless-ngx-postprocessor#readme for more information and detailed examples.")
    for option_name in config.options_spec.keys():
        arg_parser.add_argument("--" + option_name.replace("_","-"), **config.options_spec[option_name].argparse_args)
-
-    selector_options = ["document_id", "correspondent", "document_type", "tag", "storage_path", "all", "restore"]
-    arg_parser.add_argument("selector", metavar="SELECTOR", type=str, choices=selector_options, help="Selector to specify which document(s) to postprocess (or that you want to restore from a backup file). Choose one of {{{}}}".format(", ".join(selector_options)))
-    arg_parser.add_argument("item_id_or_name", nargs='?', type=str, help="document_id or name of the correspondent/document_type/tag/storage_path of the documents to postprocess, or filename of the backup file to restore. Required for all selectors except 'all'.")
+
+    # arg_parser.add_argument("--select", metavar=("ADDITIONAL_SELECTOR", "ITEM_NAME"), nargs=2, action="append", help="Additional optional selectors to apply to narrow the set of documents to apply postprocessing to. Ignored if SELECTOR is one of {all, document_id, restore}. ADDITIONAL_SELECTOR must be one of {correspondent, document_type, tag, storage_path}.")
+
+    subparsers = arg_parser.add_subparsers(dest="mode", title='Modes', help="Use 'process [ARGS]' to choose which documents to process, or 'restore FILENAME' to restore a backup file.")
+
+    process_subparser = subparsers.add_parser("process", usage=f"{os.path.basename(__file__)} [OPTIONS] process [SELECTORS]", description='Process documents where all the [SELECTORS] match (i.e. the selectors are combined with a logical AND). At least one selector is required. If --all or --document-id is given, all the other selectors are ignored.')
+    selector_group = process_subparser.add_argument_group(title="SELECTORS")
+    selector_config = Config(Config.selector_options(), use_environment_variables=False)
+    for option_name in selector_config.options_spec.keys():
+        selector_group.add_argument("--" + option_name.replace("_","-"), **selector_config.options_spec[option_name].argparse_args)
+
+    restore_subparser = subparsers.add_parser("restore", usage=f"{os.path.basename(__file__)} [OPTIONS] restore FILENAME")
+    restore_subparser.add_argument("filename", metavar="FILENAME", type=str, help="Filename of the backup file to restore.")
    cli_options = vars(arg_parser.parse_args())
    config.update_options(cli_options)
+    selector_config.update_options(cli_options)
-    config["selector"] = cli_options["selector"]
-    config["item_id_or_name"] = cli_options["item_id_or_name"]
+    # config["selector"] = cli_options["selector"]
+    # config["item_id_or_name"] = cli_options["item_id_or_name"]
-    logging.getLogger().setLevel(config["verbose"])
-    logging.debug(f"Running {sys.argv[0]} with config {config}")
-
-    if config["selector"] != "all" and config["item_id_or_name"] is None:
-        if config["selector"] == "restore":
-            logging.error(f"A filename is required to backup from.")
-        else:
-            logging.error(f"An item ID or name is required when postprocessing documents by {config['selector']}, but none was provided.")
+    config["mode"] = cli_options["mode"]
+    config["filename"] = cli_options.get("filename")
-    if config["selector"] == "restore" and config["backup"] is not None:
-        logging.critical("Can't restore and do a backup simultaneously. Please choose one or the other.")
+    logger = logging.getLogger("paperlessngx_postprocessor")
+    logger.setLevel(config["verbose"])
+    logger.debug(f"Running {sys.argv[0]} with config {config} and {selector_config}")
+
+    # if config["selector"] != "all" and config["item_id_or_name"] is None:
+    #     if config["selector"] == "restore":
+    #         logging.error(f"A filename is required to backup from.")
+    #     else:
+    #         logging.error(f"An item ID or name is required when postprocessing documents by {config['selector']}, but none was provided.")
+
+    if config["mode"] == "restore" and config["backup"] is not None:
+        logger.critical("Can't restore and do a backup simultaneously. Please choose one or the other.")
        sys.exit(1)
    if config["dry_run"]:
-        logging.info("Doing a dry run. No changes will be made.")
+        # Force at least INFO level by choosing whichever level is lower, the given level or INFO (lower means more verbose)
+        logger.setLevel(min(logging.getLevelName(config["verbose"]), logging.getLevelName("INFO")))
+        logger.info("Doing a dry run. 
No changes will be made.") api = PaperlessAPI(config["paperless_api_url"], auth_token = config["auth_token"], paperless_src_dir = config["paperless_src_dir"], - logger=logging.getLogger()) + logger=logger) postprocessor = Postprocessor(api, config["rulesets_dir"], postprocessing_tag = config["postprocessing_tag"], + invalid_tag = config["invalid_tag"], dry_run = config["dry_run"], - logger=logging.getLogger()) + skip_validation = config["skip_validation"], + logger=logger) documents = [] - if config["selector"] == "restore": - logging.info(f"Restoring backup from {config['item_id_or_name']}") - with open(config["item_id_or_name"], "r") as backup_file: + if config["mode"] == "restore": + logger.info(f"Restoring backup from {config['filename']}") + with open(config["filename"], "r") as backup_file: yaml_documents = list(yaml.safe_load_all(backup_file)) - logging.info(f" Restoring {len(yaml_documents)} documents") + logger.info(f" Restoring {len(yaml_documents)} documents") for yaml_document in yaml_documents: document_id = yaml_document['id'] yaml_document.pop("id") current_document = api.get_document_by_id(document_id) - logging.info(f"Restoring document {document_id}") + logger.info(f"Restoring document {document_id}") for key in yaml_document: - logging.info(f" {key}: '{current_document.get(key)}' --> '{yaml_document[key]}'") + logger.info(f" {key}: '{current_document.get(key)}' --> '{yaml_document[key]}'") if not config["dry_run"]: api.patch_document(document_id, yaml_document) sys.exit(0) - elif config["selector"] == "all": - documents = api.get_all_documents() - logging.info(f"Postprocessing all {len(documents)} documents") - elif config["selector"] == "document_id": - documents.append(api.get_document_by_id(config["item_id_or_name"])) - elif config["selector"] in ["correspondent", "document_type", "tag", "storage_path"]: - documents = api.get_documents_by_selector_name(config["selector"], config["item_id_or_name"]) + elif config["mode"] == "process": + if selector_config["all"]: + documents = api.get_all_documents() + logger.info(f"Postprocessing all {len(documents)} documents") + elif not(any(selector_config.values())): + logger.error("No SELECTORS provided. 
Please specify at least one SELECTOR.") + sys.exit(1) + elif selector_config.get("document_id"): + documents.append(api.get_document_by_id(selector_config.get("document_id"))) + else: + documents = api.get_documents_by_field_names(**selector_config.options()) + + # Filter out any null documents, and then warn if no documents are left + documents = list(filter(lambda doc: doc, documents)) if len(documents) == 0: - logging.warning(f"No documents found with {config['selector']} \'{config['item_id_or_name']}\'") + logger.warning(f"No documents found") + sys.exit(0) else: - logging.info(f"Postprocessing {len(documents)} documents with {config['selector']} \'{config['item_id_or_name']}\'") - - backup_documents = postprocessor.postprocess(documents) + logger.info(f"Processing {len(documents)} documents.") + # documents.append(api.get_ + + # elif config["selector"] == "all": + # documents = api.get_all_documents() + # logger.info(f"Postprocessing all {len(documents)} documents") + # elif config["selector"] == "document_id": + # documents.append(api.get_document_by_id(config["item_id_or_name"])) + # elif config["selector"] in ["correspondent", "document_type", "tag", "storage_path"]: + # fields = {config["selector"]: config["item_id_or_name"]} + # documents = api.get_documents_by_field_names() + # # documents = api.get_documents_by_selector_name(config["selector"], config["item_id_or_name"]) + # # if len(documents) == 0: + # # logger.warning(f"No documents found with {config['selector']} \'{config['item_id_or_name']}\'") + # # else: + # # logger.info(f"Postprocessing {len(documents)} documents with {config['selector']} \'{config['item_id_or_name']}\'") + + backup_documents = postprocessor.postprocess(documents) - if len(backup_documents) > 0 and config["backup"] is not None: - logging.debug(f"Writing backup to {config['backup']}") - with open(config["backup"], "w") as backup_file: - backup_file.write(yaml.dump_all(backup_documents)) + if len(backup_documents) > 0 and config["backup"] is not None: + logger.debug(f"Writing backup to {config['backup']}") + with open(config["backup"], "w") as backup_file: + backup_file.write(yaml.dump_all(backup_documents)) diff --git a/paperlessngx_postprocessor/config.py b/paperlessngx_postprocessor/config.py index 36173f0..5db3ca2 100644 --- a/paperlessngx_postprocessor/config.py +++ b/paperlessngx_postprocessor/config.py @@ -1,5 +1,6 @@ import os -from datetime import datetime +import dateutil.parser +from datetime import datetime, date from pathlib import Path class Config: @@ -10,62 +11,180 @@ def __init__(self, default, argparse_args): if "help" in self.argparse_args: self.argparse_args["help"] = self.argparse_args["help"].format(default = default) - + + _default_backup_name = datetime.now().strftime("%Y-%m-%d--%H-%M-%S")+".backup" + + def selector_options(): + return {"document_id": Config.OptionSpec(None, {"metavar": "DOCUMENT_ID", + "help": "Select a document by its DOCUMENT_ID"}), + "correspondent": Config.OptionSpec(None, {"metavar": "CORRESPONDENT_NAME", + "type": str, + "help": "Select documents by their CORRESPONDENT_NAME"}), + "document_type": Config.OptionSpec(None, {"metavar": "DOCUMENT_TYPE_NAME", + "type": str, + "help": "Select documents by their DOCUMENT_TYPE_NAME"}), + "tag": Config.OptionSpec(None, {"metavar": "TAG_NAME", + "type": str, + "help": "Select documents with tag TAG_NAME"}), + "storage_path": Config.OptionSpec(None, {"metavar": "STORAGE_PATH_NAME", + "type": str, + "help": "Select documents by their STORAGE_PATH_NAME"}), + 
"created_year": Config.OptionSpec(None, {"metavar": "YEAR", + "type": int, + "help": "Select documents created in YEAR."}), + "created_month": Config.OptionSpec(None, {"metavar": "MONTH", + "type": int, + "help": "Select documents created in MONTH."}), + "created_day": Config.OptionSpec(None, {"metavar": "DAY", + "type": int, + "help": "Select documents created in DAY."}), + "created_range": Config.OptionSpec(None, {"metavar": "DATE--DATE", + "type": str, + "help": "Select documents created in a given range (exclusive), where DATE is of the form YYYY-MM-DD. Example: To get all documents created in April of 2063, you would use '--created-range 2063-03-31--2063-05-01'. To only get documents created before or after a given date, use 'x' instead of date, e.g. 'x--2063-05-01'"}), + "created_year": Config.OptionSpec(None, {"metavar": "YEAR", + "type": int, + "help": "Select documents created in YEAR."}), + "added_month": Config.OptionSpec(None, {"metavar": "MONTH", + "type": int, + "help": "Select documents added in MONTH."}), + "added_day": Config.OptionSpec(None, {"metavar": "DAY", + "type": int, + "help": "Select documents added in DAY."}), + "added_range": Config.OptionSpec(None, {"metavar": "DATE--DATE", + "type": str, + "help": "Select documents added in a given range (exclusive), where DATE is of the form YYYY-MM-DD. Example: To get all documents added in April of 2063, you would use '--added-range 2063-03-31--2063-05-01'. To only get documents added before or after a given date, use 'x' instead of date, e.g. 'x--2063-05-01'"}), + "asn": Config.OptionSpec(None, {"metavar": "ASN", + "type": int, + "help": "Select document by its ASN"}), + "title": Config.OptionSpec(None, {"metavar": "TITLE", + "type": str, + "help": "Select document by its TITLE"}), + "all": Config.OptionSpec(False, {"action": "store_true", + "help": "Select all documents. WARNING! If you have a lot of documents, this will take a long time."}), + } - def __init__(self): - self._default_backup_name = datetime.now().strftime("%Y-%m-%d--%H-%M-%S")+".backup" - - self.options_spec = {"auth_token": Config.OptionSpec(None, {"metavar": "AUTH_TOKEN", - "type": str, - "help": "The auth token to access the REST API of Paperless-ngx. If not specified, postprocessor will try to automagically get it from Paperless-ngx's database directly."}), - "dry_run": Config.OptionSpec(False, {"action": "store_const", - "const": True, - "help": "Don't actually make any changes, just print what would happen. Forces the verbosity level to be at least INFO. (default: {default})"}), - #"dry_run": Config.OptionSpec(False, {"action": "store_true", - # "help": "Don't actually make any changes, just print what would happen. Forces the verbosity level to be at least INFO. (default: {default})"}), - "backup": Config.OptionSpec(None, {"nargs": '?', - "type": str, - "const": self._default_backup_name, - "help": "Backup file to write any changed values to. If no filename is given, one will be automatically generated based on the current date and time. If the path is a directory, the automatically generated file will be stored in that directory. (default: {default})"}), - "postprocessing_tag": Config.OptionSpec(None, {"metavar": "TAG", - "type": str, - "help": "A tag to apply if any changes are made during postprocessing. (default: {default})"}), - "verbose": Config.OptionSpec("WARNING", {"type": str, - "choices": ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], - "help": "The verbosity level for logging. 
(default: {default})"}), - "rulesets_dir": Config.OptionSpec("/usr/src/paperless-ngx-postprocessor/rulesets.d", {"metavar": "RULESETS_DIR", - "type": str, - "help": "The config directory containing the rulesets for postprocessing. (default: {default})"}), - "paperless_api_url": Config.OptionSpec("http://localhost:8000/api", {"metavar": "API_URL", - "type": str, - "help": "The full URL to access the Paperless-ngx REST API. (default: {default})"}), - "paperless_src_dir": Config.OptionSpec("/usr/src/paperless/src", {"metavar": "PAPERLESS_SRC_DIR", - "type": str, - "help": "The directory containing the source for the running instance of paperless. If this is set incorrectly, postprocessor will not be able to automagically acquire the AUTH_TOKEN. (default: {default})"}), + def general_options(): + return {"auth_token": Config.OptionSpec(None, {"metavar": "AUTH_TOKEN", + "type": str, + "help": "The auth token to access the REST API of Paperless-ngx. If not specified, postprocessor will try to automagically get it from Paperless-ngx's database directly."}), + "dry_run": Config.OptionSpec(False, {"action": "store_const", + "const": True, + "help": "Don't actually make any changes, just print what would happen. Forces the verbosity level to be at least INFO. (default: {default})"}), + "skip_validation": Config.OptionSpec(False, {"action": "store_const", + "const": True, + "help": "Don't process any validation rules. (default: {default})"}), + #"dry_run": Config.OptionSpec(False, {"action": "store_true", + # "help": "Don't actually make any changes, just print what would happen. Forces the verbosity level to be at least INFO. (default: {default})"}), + "backup": Config.OptionSpec(None, {"type": str, + "metavar": "FILENAME", + "help": "Backup file to write any changed values to. If the string DEFAULT is given, one will be automatically generated based on the current date and time. If the path is a directory, the automatically generated file will be stored in that directory. (default: YYYY-MM-DD--HH-MM-SS.backup)"}), + "postprocessing_tag": Config.OptionSpec(None, {"metavar": "TAG", + "type": str, + "help": "A tag to apply if any changes are made during postprocessing. (default: {default})"}), + "invalid_tag": Config.OptionSpec(None, {"metavar": "TAG", + "type": str, + "help": "A tag to apply if the resulting metadata doesn't satisfy any validation rules. (default: {default})"}), + "verbose": Config.OptionSpec("WARNING", {"type": str, + "choices": ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], + "help": "The verbosity level for logging. (default: {default})"}), + "rulesets_dir": Config.OptionSpec("/usr/src/paperless-ngx-postprocessor/rulesets.d", {"metavar": "RULESETS_DIR", + "type": str, + "help": "The config directory containing the rulesets for postprocessing. (default: {default})"}), + "paperless_api_url": Config.OptionSpec("http://localhost:8000/api", {"metavar": "API_URL", + "type": str, + "help": "The full URL to access the Paperless-ngx REST API. (default: {default})"}), + "paperless_src_dir": Config.OptionSpec("/usr/src/paperless/src", {"metavar": "PAPERLESS_SRC_DIR", + "type": str, + "help": "The directory containing the source for the running instance of paperless. If this is set incorrectly, postprocessor will not be able to automagically acquire the AUTH_TOKEN. 
(default: {default})"}), } + def __init__(self, options_spec, use_environment_variables = True): + #self._default_backup_name = datetime.now().strftime("%Y-%m-%d--%H-%M-%S")+".backup" + + # self.options_spec = {"auth_token": Config.OptionSpec(None, {"metavar": "AUTH_TOKEN", + # "type": str, + # "help": "The auth token to access the REST API of Paperless-ngx. If not specified, postprocessor will try to automagically get it from Paperless-ngx's database directly."}), + # "dry_run": Config.OptionSpec(False, {"action": "store_const", + # "const": True, + # "help": "Don't actually make any changes, just print what would happen. Forces the verbosity level to be at least INFO. (default: {default})"}), + # "skip_validation": Config.OptionSpec(False, {"action": "store_const", + # "const": True, + # "help": "Don't process any validation rules. (default: {default})"}), + # #"dry_run": Config.OptionSpec(False, {"action": "store_true", + # # "help": "Don't actually make any changes, just print what would happen. Forces the verbosity level to be at least INFO. (default: {default})"}), + # "backup": Config.OptionSpec(None, {"nargs": '?', + # "type": str, + # "const": self._default_backup_name, + # "help": "Backup file to write any changed values to. If no filename is given, one will be automatically generated based on the current date and time. If the path is a directory, the automatically generated file will be stored in that directory. (default: {default})"}), + # "postprocessing_tag": Config.OptionSpec(None, {"metavar": "TAG", + # "type": str, + # "help": "A tag to apply if any changes are made during postprocessing. (default: {default})"}), + # "invalid_tag": Config.OptionSpec(None, {"metavar": "TAG", + # "type": str, + # "help": "A tag to apply if the resulting metadata doesn't satisfy any validation rules. (default: {default})"}), + # "verbose": Config.OptionSpec("WARNING", {"type": str, + # "choices": ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], + # "help": "The verbosity level for logging. (default: {default})"}), + # "rulesets_dir": Config.OptionSpec("/usr/src/paperless-ngx-postprocessor/rulesets.d", {"metavar": "RULESETS_DIR", + # "type": str, + # "help": "The config directory containing the rulesets for postprocessing. (default: {default})"}), + # "paperless_api_url": Config.OptionSpec("http://localhost:8000/api", {"metavar": "API_URL", + # "type": str, + # "help": "The full URL to access the Paperless-ngx REST API. (default: {default})"}), + # "paperless_src_dir": Config.OptionSpec("/usr/src/paperless/src", {"metavar": "PAPERLESS_SRC_DIR", + # "type": str, + # "help": "The directory containing the source for the running instance of paperless. If this is set incorrectly, postprocessor will not be able to automagically acquire the AUTH_TOKEN. 
(default: {default})"}), + # } + + self.options_spec = options_spec + self._options = {} for option_name in self.options_spec.keys(): self._options[option_name] = self.options_spec[option_name].default - if os.environ.get("PNGX_POSTPROCESSOR_"+option_name.upper()) is not None: + if os.environ.get("PNGX_POSTPROCESSOR_"+option_name.upper()) is not None and use_environment_variables: self._options[option_name] = os.environ.get("PNGX_POSTPROCESSOR_"+option_name.upper()) self._fix_options() def _fix_options(self): - if isinstance(self._options["dry_run"], str): + if isinstance(self._options.get("dry_run"), str): if self._options["dry_run"].lower() in ["f", "false", "no"]: self._options["dry_run"] = False elif self._options["dry_run"].lower() in ["t", "true", "yes"]: self._options["dry_run"] = True - if isinstance(self._options["backup"], str): - if self._options["backup"].lower() in ["t", "true", "yes"]: - self._options["backup"] = self._default_backup_name - elif self._options["backup"].lower() in ["f", "false", "no"]: - self._options["backup"] = None + if isinstance(self._options.get("backup"), str): + if self._options["backup"].lower() == "default": + self._options["backup"] = Config._default_backup_name else: backup_path = Path(self._options["backup"]) if backup_path.is_dir(): - self._options["backup"] = str(backup_path / Path(self._default_backup_name)) + self._options["backup"] = str(backup_path / Path(Config._default_backup_name)) + if isinstance(self._options.get("created_range"), str): + dates = self._options.get("created_range").split("--") + if len(dates) == 2: + new_dates = [] + for date_str in dates: + try: + datetime_obj = dateutil.parser.isoparse(date_str) + new_dates.append(datetime_obj.date()) + except: + new_dates.append(None) + self._options["created_range"] = new_dates + else: + self._options["created_range"] = None + + if isinstance(self._options.get("added_range"), str): + dates = self._options.get("added_range").split("--") + if len(dates) == 2: + new_dates = [] + for date_str in dates: + try: + datetime_obj = dateutil.parser.isoparse(date_str) + new_dates.append(datetime_obj.date()) + except: + new_dates.append(None) + self._options["added_range"] = new_dates + else: + self._options["added_range"] = None def __getitem__(self, index): return self._options[index] @@ -75,7 +194,16 @@ def __setitem__(self, index, item): def __str__(self): return str(self._options) - + + def get(self, index, default=None): + return self._options.get(index, default) + + def values(self): + return self._options.values() + + def options(self): + return self._options + def update_options(self, new_options): for option_name in self.options_spec.keys(): if option_name in new_options and new_options[option_name] is not None: diff --git a/paperlessngx_postprocessor/paperless_api.py b/paperlessngx_postprocessor/paperless_api.py index ff0a149..b09243d 100644 --- a/paperlessngx_postprocessor/paperless_api.py +++ b/paperlessngx_postprocessor/paperless_api.py @@ -2,6 +2,7 @@ import logging import os import requests +from datetime import date from pathlib import Path class PaperlessAPI: @@ -22,6 +23,8 @@ def __init__(self, api_url, auth_token, paperless_src_dir, logger=None): logging.debug(f"Auth token {auth_token} acquired") self._auth_token = auth_token + self._cache = {} + self._cachable_types = ["correspondents", "document_types", "storage_paths", "tags"] def delete_document_by_id(self, document_id): item_type = "documents" @@ -47,6 +50,11 @@ def _get_item_by_id(self, item_type, item_id): return {} def 
_get_list(self, item_type, query=None):
+        # If the given item type has been cached, return it
+        if item_type in self._cache and query is None:
+            self._logger.debug(f"Returning {item_type} list from cache")
+            return self._cache[item_type]
+
        items = []
        next_url = f"{self._api_url}/{item_type}/"
        if query is not None:
@@ -61,6 +69,9 @@
            else:
                next_url = None
+        if item_type in self._cachable_types:
+            self._cache[item_type] = items
+
        return items
    def get_item_id_by_name(self, item_type, item_name):
@@ -84,6 +95,54 @@ def get_documents_by_selector_name(self, selector, name):
            query = f"{selector}s__id={selector_id}"
        return self._get_list("documents", query)
+    def get_documents_by_field_names(self, **fields):
+        allowed_fields = {"correspondent": "correspondent__name__iexact",
+                          "document_type": "document_type__name__iexact",
+                          "storage_path": "storage_path__name__iexact",
+                          "added_year": "added__year",
+                          "added_month": "added__month",
+                          "added_day": "added__day",
+                          "asn": "archive_serial_number",
+                          "title": "title__iexact",
+                          "created_year": "created__year",
+                          "created_month": "created__month",
+                          "created_day": "created__day",
+                          }
+
+        queries = []
+        for key in allowed_fields.keys():
+            if key in fields.keys() and fields[key] is not None:
+                queries.append(f"{allowed_fields[key]}={fields[key]}")
+
+        if (isinstance(fields.get("added_range"), (tuple, list)) and
+            len(fields.get("added_range")) == 2):
+            if isinstance(fields["added_range"][0], date):
+                queries.append(f"added__date__gt={fields['added_range'][0].strftime('%F')}")
+            if isinstance(fields["added_range"][1], date):
+                queries.append(f"added__date__lt={fields['added_range'][1].strftime('%F')}")
+
+        if (isinstance(fields.get("created_range"), (tuple, list)) and
+            len(fields.get("created_range")) == 2):
+            if isinstance(fields["created_range"][0], date):
+                queries.append(f"created__date__gt={fields['created_range'][0].strftime('%F')}")
+            if isinstance(fields["created_range"][1], date):
+                queries.append(f"created__date__lt={fields['created_range'][1].strftime('%F')}")
+
+
+        if isinstance(fields.get("added_date_object"), date):
+            queries.append(f"added__year={fields['added_date_object'].year}&added__month={fields['added_date_object'].month}&added__day={fields['added_date_object'].day}")
+
+        if isinstance(fields.get("created_date_object"), date):
+            queries.append(f"created__year={fields['created_date_object'].year}&created__month={fields['created_date_object'].month}&created__day={fields['created_date_object'].day}")
+
+        query = "&".join(queries)
+        self._logger.debug(f"Running query '{query}'")
+        return self._get_list("documents", query)
+
+
+    # def get_documents_from_query(self, query):
+    #     return self._get_list("documents", query)
+
    def get_all_documents(self):
        return self._get_list("documents")
@@ -104,6 +163,7 @@ def get_tag_by_id(self, tag_id):
    def get_metadata_in_filename_format(self, metadata):
        new_metadata = {}
+        new_metadata["document_id"] = metadata["id"]
        new_metadata["correspondent"] = (self.get_correspondent_by_id(metadata["correspondent"])).get("name")
        new_metadata["document_type"] = (self.get_document_type_by_id(metadata["document_type"])).get("name")
        new_metadata["storage_path"] = (self.get_storage_path_by_id(metadata["storage_path"])).get("name")
@@ -115,16 +175,21 @@
        new_metadata["created_year"] = f"{created_date.year:04d}"
        new_metadata["created_month"] = f"{created_date.month:02d}"
        new_metadata["created_day"] = f"{created_date.day:02d}"
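+        # Alongside the zero-padded string fields, also expose a YYYY-MM-DD string and a real Python date object, so Jinja templates can do date arithmetic without string parsing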
new_metadata["created_date"] = created_date.strftime("%F") # %F means YYYY-MM-DD + new_metadata["created_date_object"] = created_date new_metadata["added"] = metadata["added"] added_date = dateutil.parser.isoparse(new_metadata["added"]) new_metadata["added_year"] = f"{added_date.year:04d}" new_metadata["added_month"] = f"{added_date.month:02d}" new_metadata["added_day"] = f"{added_date.day:02d}" + new_metadata["added_date"] = added_date.strftime("%F") + new_metadata["added_date_object"] = added_date return new_metadata def get_metadata_from_filename_format(self, metadata_in_filename_format): result = {} + result["id"] = metadata_in_filename_format["document_id"] result["correspondent"] = self.get_item_id_by_name("correspondents", metadata_in_filename_format["correspondent"]) result["document_type"] = self.get_item_id_by_name("document_types", metadata_in_filename_format["document_type"]) result["storage_path"] = self.get_item_id_by_name("storage_paths", metadata_in_filename_format["storage_path"]) diff --git a/paperlessngx_postprocessor/postprocessor.py b/paperlessngx_postprocessor/postprocessor.py index af610cf..735e5d9 100644 --- a/paperlessngx_postprocessor/postprocessor.py +++ b/paperlessngx_postprocessor/postprocessor.py @@ -4,28 +4,35 @@ import logging import regex import yaml -from datetime import datetime +from datetime import date, datetime, timedelta from pathlib import Path from .paperless_api import PaperlessAPI class DocumentRuleProcessor: - def __init__(self, spec, logger = None): + def __init__(self, api, spec, logger = None): self._logger = logger if self._logger is None: logging.basicConfig(format="[%(asctime)s] [%(levelname)s] [%(module)s] %(message)s", level="CRITICAL") self._logger = logging.getLogger() + self._api = api + self.name = list(spec.keys())[0] self._match = spec[self.name].get("match") self._metadata_regex = spec[self.name].get("metadata_regex") self._metadata_postprocessing = spec[self.name].get("metadata_postprocessing") + self._validation_rule = spec[self.name].get("validation_rule") #self._title_format = spec[self.name].get("title_format") self._env = jinja2.Environment() self._env.filters["expand_two_digit_year"] = self._expand_two_digit_year self._env.filters["regex_match"] = self._jinja_filter_regex_match self._env.filters["regex_sub"] = self._jinja_filter_regex_sub + self._env.globals["last_date_object_of_month"] = self._last_date_object_of_month + self._env.globals["num_documents"] = self._num_documents + self._env.globals["date"] = date + self._env.globals["timedelta"] = timedelta def matches(self, metadata): if type(self._match) is str: @@ -65,6 +72,60 @@ def _expand_two_digit_year(self, year, prefix=None): else: return f"{year}" + def _last_date_object_of_month(self, date_object): + if isinstance(date_object, date): + return date(date_object.year, date_object.month, calendar.monthrange(date_object.year, date_object.month)[1]) + return None + + def _num_documents(self, **constraints): + # allowed_constraints = {"correspondent": "correspondent__name__iexact", + # "document_type": "document_type__name__iexact", + # "storage_path": "storage_path__name__iexact", + # "added_year": "added__year", + # "added_month": "added__month", + # "added_day": "added_day", + # "asn": "archive_serial_number", + # "title": "title__iexact", + # "created_year": "created__year", + # "created_month": "created__month", + # "created_day": "created__day", + # } + + # queries = [] + # for key in allowed_constraints.keys(): + # if key in constraints.keys(): + # 
queries.append(f"{allowed_constraints[key]}={constraints[key]}") + + # if (isinstance(constraints.get("added_range"), (tuple, list)) and + # len(constraints.get("added_range")) == 2): + # if isinstance(constraints["added_range"][0], date): + # queries.append(f"added__date__gt={constraints['added_range'][0].strftime('%F')}") + # if isinstance(constraints["added_range"][1], date): + # queries.append(f"added__date__lt={constraints['added_range'][1].strftime('%F')}") + + # if (isinstance(constraints.get("created_range"), (tuple, list)) and + # len(constraints.get("created_range")) == 2): + # if isinstance(constraints["created_range"][0], date): + # queries.append(f"created__date__gt={constraints['created_range'][0].strftime('%F')}") + # if isinstance(constraints["created_range"][1], date): + # queries.append(f"created__date__lt={constraints['created_range'][1].strftime('%F')}") + + + # if isinstance(constraints.get("added_date_object"), date): + # queries.append(f"added__year={constraints['added_date_object'].year}&added__month={constraints['added_date_object'].month}&added__day={constraints['added_date_object'].day}") + + # if isinstance(constraints.get("created_date_object"), date): + # queries.append(f"created__year={constraints['created_date_object'].year}&created__month={constraints['created_date_object'].month}&created__day={constraints['created_date_object'].day}") + + # query = "&".join(queries) + # self._logger.debug(f"Running query '{query}'") + + #items = self._api.get_documents_from_query(query) + items = self._api.get_documents_by_field_names(**constraints) + self._logger.debug(f"Found {len(items)} documents matching the query") + + return len(items) + def _jinja_filter_regex_match(self, string, pattern): '''Custom jinja filter for regex matching''' if regex.match(pattern, string): @@ -78,23 +139,39 @@ def _jinja_filter_regex_sub(self, string, pattern, repl): def _normalize_created_dates(self, new_metadata, old_metadata): result = new_metadata.copy() - #if "created_year" in metadata.keys(): try: result["created_year"] = str(int(new_metadata["created_year"])) except: result["created_year"] = old_metadata["created_year"] - #if "created_month" in metadata.keys(): result["created_month"] = self._normalize_month(new_metadata["created_month"], old_metadata["created_month"]) - #if "created_day" in metadata.keys(): result["created_day"] = self._normalize_day(new_metadata["created_day"], old_metadata["created_day"]) original_created_date = dateutil.parser.isoparse(old_metadata["created"]) - new_created_date = datetime(int(result["created_year"]), int(result["created_month"]), int(result["created_day"]), 12, tzinfo=original_created_date.tzinfo) + new_created_date = datetime(int(result["created_year"]), int(result["created_month"]), int(result["created_day"]), original_created_date.hour, tzinfo=original_created_date.tzinfo) result["created"] = new_created_date.isoformat() result["created_date"] = new_created_date.strftime("%F") # %F means YYYY-MM-DD - + result["created_date_object"] = date(int(result["created_year"]), int(result["created_month"]), int(result["created_day"])) + return result + def validate(self, metadata): + valid = True + + metadata = self._normalize_created_dates(metadata, metadata) + + # Try to apply the validation rule + if self._validation_rule is not None: + self._logger.debug(f"Validating for rule {self.name} using metadata={metadata}") + template = self._env.from_string(self._validation_rule) + template_result = template.render(**metadata).strip() + 
self._logger.debug(f"Validation template rendered to '{template_result}'") + valid = (template_result != "False") + if not valid: + self._logger.warning(f"Failed validation rule '{self._validation_rule}'") + else: + self._logger.debug(f"No validation rule found for {self.name}") + + return valid def get_new_metadata(self, metadata, content): read_only_metadata_keys = ["correspondent", @@ -104,7 +181,8 @@ def get_new_metadata(self, metadata, content): "added", "added_year", "added_month", - "added_day"] + "added_day", + "document_id"] read_only_metadata = {key: metadata[key] for key in read_only_metadata_keys if key in metadata} writable_metadata_keys = list(set(metadata.keys()) - set(read_only_metadata_keys)) writable_metadata = {key: metadata[key] for key in writable_metadata_keys if key in metadata} @@ -119,8 +197,8 @@ def get_new_metadata(self, metadata, content): writable_metadata = self._normalize_created_dates(writable_metadata, metadata) self._logger.debug(f"Regex results are {writable_metadata}") else: - self._logger.warning(f"Regex '{self._metadata_regex}' for '{self.name}' didn't match") - + self._logger.warning(f"Regex '{self._metadata_regex}' for '{self.name}' didn't match for document_id={metadata['document_id']}") + # Cycle throguh the postprocessing rules if self._metadata_postprocessing is not None: for variable_name in self._metadata_postprocessing.keys(): @@ -128,7 +206,7 @@ def get_new_metadata(self, metadata, content): old_value = writable_metadata.get(variable_name) merged_metadata = {**writable_metadata, **read_only_metadata} template = self._env.from_string(self._metadata_postprocessing[variable_name]) - writable_metadata[variable_name] = template.render(**merged_metadata) + writable_metadata[variable_name] = template.render(**merged_metadata) writable_metadata = self._normalize_created_dates(writable_metadata, metadata) self._logger.debug(f"Updating '{variable_name}' using template {self._metadata_postprocessing[variable_name]} and metadata {merged_metadata}\n: '{old_value}'->'{writable_metadata[variable_name]}'") except Exception as e: @@ -142,7 +220,7 @@ def get_new_metadata(self, metadata, content): class Postprocessor: - def __init__(self, api, rules_dir, postprocessing_tag = None, dry_run = False, logger = None): + def __init__(self, api, rules_dir, postprocessing_tag = None, invalid_tag = None, dry_run = False, skip_validation = False, logger = None): self._logger = logger if self._logger is None: logging.basicConfig(format="[%(asctime)s] [%(levelname)s] [%(module)s] %(message)s", level="CRITICAL") @@ -154,7 +232,15 @@ def __init__(self, api, rules_dir, postprocessing_tag = None, dry_run = False, l self._postprocessing_tag_id = self._api.get_item_id_by_name("tags", postprocessing_tag) else: self._postprocessing_tag_id = None + + if invalid_tag is not None: + self._invalid_tag_id = self._api.get_item_id_by_name("tags", invalid_tag) + else: + self._invalid_tag_id = None + + self._dry_run = dry_run + self._skip_validation = skip_validation self._processors = [] @@ -164,7 +250,7 @@ def __init__(self, api, rules_dir, postprocessing_tag = None, dry_run = False, l try: yaml_documents = yaml.safe_load_all(yaml_file) for yaml_document in yaml_documents: - self._processors.append(DocumentRuleProcessor(yaml_document, self._logger)) + self._processors.append(DocumentRuleProcessor(self._api, yaml_document, self._logger)) except Exception as e: self._logger.warning(f"Unable to parse yaml in {filename}: {e}") self._logger.debug(f"Loaded {len(self._processors)} rules") @@ 
-183,9 +269,16 @@ def _get_new_metadata_in_filename_format(self, metadata_in_filename_format, cont return new_metadata + def _validate(self, metadata_in_filename_format): + for processor in self._processors: + if processor.matches(metadata_in_filename_format): + if not processor.validate(metadata_in_filename_format): + return False + return True def postprocess(self, documents): backup_documents = [] + num_invalid = 0 for document in documents: metadata_in_filename_format = self._api.get_metadata_in_filename_format(document) self._logger.debug(f"metadata_in_filename_format={metadata_in_filename_format}") @@ -212,7 +305,29 @@ def postprocess(self, documents): self._logger.info(f"No changes for document_id={document['id']}") else: self._logger.info(f"No changes for document_id={document['id']}") - + + if (not self._skip_validation) and (self._invalid_tag_id is not None): + # Note that we have to refetch the document here to get the changes we just applied from postprocessing + metadata_in_filename_format = self._api.get_metadata_in_filename_format(self._api.get_document_by_id(document['id'])) + metadata = self._api.get_metadata_from_filename_format(metadata_in_filename_format) + valid = self._validate(metadata_in_filename_format) + if not valid: + num_invalid += 1 + metadata["tags"].append(self._invalid_tag_id) + self._logger.warning(f"document_id={document['id']} is invalid, adding tag {self._invalid_tag_id}") + if not self._dry_run: + self._api.patch_document(document["id"], {"tags": metadata["tags"]}) + backup_data = {"tags": metadata["tags"]} + backup_data["id"] = document["id"] + backup_documents.append(backup_data) + else: + self._logger.info(f"document_id={document['id']} is valid") + else: + self._logger.info(f"Validation was skipped since invalid_tag_id={self._invalid_tag_id} and skip_validation={self._skip_validation}") + + if num_invalid > 0: + self._logger.warning(f"Found {num_invalid}/{len(documents)} invalid documents") + return backup_documents # # if "created_year" in regex_data.keys(): diff --git a/post_consume_cid_fixer.py b/post_consume_cid_fixer.py index ec4cc8e..21dc221 100755 --- a/post_consume_cid_fixer.py +++ b/post_consume_cid_fixer.py @@ -12,7 +12,7 @@ if __name__ == "__main__": document_id = os.environ["DOCUMENT_ID"] - config = Config() + config = Config(Config.general_options()) logging.basicConfig(format="[%(asctime)s] [%(levelname)s] [%(module)s] %(message)s", level=config["verbose"]) api = PaperlessAPI(config["paperless_api_url"], diff --git a/post_consume_script.py b/post_consume_script.py index ba18028..2db4cec 100755 --- a/post_consume_script.py +++ b/post_consume_script.py @@ -15,14 +15,15 @@ if document_id is not None: subprocess.run((str(Path(directory)/"paperlessngx_postprocessor.py"), - "document_id", + "process", + "--document-id", document_id)) post_consume_script = os.environ.get("PNGX_POSTPROCESSOR_POST_CONSUME_SCRIPT") if post_consume_script is not None: logging.basicConfig(format="[%(asctime)s] [%(levelname)s] [%(module)s] %(message)s") - config = Config() + config = Config(Config.general_options()) logging.getLogger().setLevel(config["verbose"]) diff --git a/post_consume_title_change_detector.py b/post_consume_title_change_detector.py index 85d57dd..aba3f16 100755 --- a/post_consume_title_change_detector.py +++ b/post_consume_title_change_detector.py @@ -19,7 +19,7 @@ new_filename = Path(os.environ["DOCUMENT_SOURCE_PATH"]).name if old_filename != new_filename: - config = Config() + config = Config(Config.general_options()) api = 
PaperlessAPI(config["paperless_api_url"], auth_token = config["auth_token"], paperless_src_dir = config["paperless_src_dir"]) diff --git a/rulesets.d/example.yml b/rulesets.d/example.yml index ce4e3f8..22a251a 100644 --- a/rulesets.d/example.yml +++ b/rulesets.d/example.yml @@ -24,3 +24,4 @@ Parse creation date from filename: created_year: '{{ title_old | regex_sub("^(?P\d{4})-(?P\d{2})-(?P\d{2}) (?P.*)$", "\g<created_year>") }}' created_month: '{{ title_old | regex_sub("^(?P<created_year>\d{4})-(?P<created_month>\d{2})-(?P<created_day>\d{2}) (?P<title>.*)$", "\g<created_month>") }}' created_day: '{{ title_old | regex_sub("^(?P<created_year>\d{4})-(?P<created_month>\d{2})-(?P<created_day>\d{2}) (?P<title>.*)$", "\g<created_day>") }}' + validation_rule: '{{ num_documents(correspondent=correspondent, document_type=document_type, created_date_object=created_date_object) == 1 }}'
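+    # The rule above marks this document as invalid if it is not the only document with this correspondent, document type, and creation date (e.g. if the same file was consumed twice)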