Enhances documents in Mayan EDMS with metadata retrieved from the mindee OCR API

DrRSatzteil/mayan-mindee

mayan-mindee

This is an add-on for the Mayan EDMS document management system. It adds metadata to your invoices and other documents based on any API provided by mindee (https://platform.mindee.com/).

Credits go to m42e and the great mayan-automatic-metadata addon from which I used the mayan API implementation and the architectural blueprint.

About Mayan Mindee

I've been using the Mayan MAM addon by m42e for a long time now. It uses a regex-based approach which works well for known documents but obviously does not work well for new documents that you have not yet created a plugin for. In addition, maintaining the plugin configurations can be a bit cumbersome. Mayan Mindee instead relies solely on the APIs provided by mindee, which use AI to extract key information from your documents.

This brings the following advantages:

  • It can deal with any document, even those that you've never seen before
  • Only an initial configuration for every API you want to use is required (an example configuration is provided in mayanmindee/config/api.json), no separate configurations for different document issuers
  • Dates and numeric amounts are extracted very reliably compared to a regexp that gets easily confused by small OCR errors.

There is also a downside as compared to the regexp based approach:

  • Text values like issuer or recipient names of documents are delivered as "guessed" by mindee. This means that they are generally provided all in upper case, German umlauts are usually not recognized and there is a high possibility that two different documents of the same type (e.g. two invoices of the same company) will be delivered with slightly different issuer names.

I therefore still use the regexp-based approach as a first stage and, in a second stage, use the mindee API only to fill in the metadata that Mayan MAM did not add.
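
The "fill in only what is missing" rule corresponds to setting "overwrite" to false in the configuration. A minimal sketch of that rule follows, assuming metadata is held in plain dictionaries. The function name merge_metadata, the data shapes and the sample values are illustrative and are not taken from the Mayan Mindee code:

```python
def merge_metadata(existing, extracted, overwrite_flags):
    """Return the metadata to store after a mindee run.

    existing        -- values already present in Mayan (e.g. set by Mayan MAM)
    extracted       -- values returned by the mindee API
    overwrite_flags -- per-label "overwrite" setting; missing labels count as False
    """
    merged = dict(existing)
    for label, value in extracted.items():
        if value is None:
            continue  # nothing was extracted for this field
        already_set = bool(existing.get(label))
        # Write the value if overwriting is allowed or the field is still empty.
        if overwrite_flags.get(label, False) or not already_set:
            merged[label] = value
    return merged


# Illustrative values only: the issuer was set in the first stage, the amount was not.
existing = {"issuer": "Wonka Industries", "invoiceamount": ""}
extracted = {"issuer": "WONKA INDUSTRIES", "invoiceamount": "12,50€"}
flags = {"issuer": False, "invoiceamount": False}

result = merge_metadata(existing, extracted, flags)
print(result)
```

With "overwrite" set to false for both labels, the issuer from the first stage is kept and only the empty amount is filled in.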

HowTo

Requirements:

  • A running Mayan EDMS instance that is reachable from the node you run this on, and vice versa (for the webhook)
  • A Mayan user that is allowed to access the metadata configuration and documents, including downloading documents
  • An account at mindee https://platform.mindee.com/login

Add the web and worker services to the docker-compose file you already use for Mayan. See the example below:

services:
  
  mayan-mindee-worker:
    image: drrsatzteil/mayan-mindee-worker:latest
    restart: unless-stopped
    profiles:
      - mindee
    networks:
      - mayan
    environment:
      REDIS_URL: redis://:${MAYAN_REDIS_PASSWORD:-mayanredispassword}@redis:6379/4
      MAYAN_USER: ${MAYAN_MAM_USER}
      MAYAN_PASSWORD: ${MAYAN_MAM_PASS}
      MAYAN_URL: http://app:8000/api/v4/
      MINDEE_API_KEY: ${MINDEE_API_KEY}
    volumes:
      - ./mindee/api.json:/app/mayanmindee/config/api.json
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro

  mayan-mindee-web:
    image: drrsatzteil/mayan-mindee-web:latest
    restart: unless-stopped
    profiles:
      - mindee
    networks:
      - mayan
    environment:
      REDIS_URL: redis://:${MAYAN_REDIS_PASSWORD:-mayanredispassword}@redis:6379/4
    volumes:
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro

Note that the volume mount of /app/mayanmindee/config/api.json is required as the config that is provided in this repository is only an example and needs to be fine tuned for your installation.

Mayan Mindee consists of two services:

1. mayan-mindee-web

This service only triggers tasks for the worker.

The following environment variables are relevant:

  • REDIS_URL: provide a proper redis url for the task queuing

Currently the following endpoints (in brackets the corresponding mindee apis) are provided:

The payroll endpoint is an example for a custom api and cannot be directly used by you. Instead, create and train your own api over at mindee and use this endpoint as a blueprint for your implementation.

2. mayan-mindee-worker

This service receives tasks queued by the web service.

The following environment variables are relevant:

  • REDIS_URL: provide a proper redis url for the task queuing
  • MAYAN_USER: User to access Mayan EDMS
  • MAYAN_PASSWORD: Password for MAYAN_USER
  • MAYAN_URL: URL of the Mayan EDMS instance. Should be http://app:8000/api/v4/ (note that the trailing / is required) if the services are on the same docker network with default service names and Mayan EDMS version v4.X. Should point to an https URL when you are using OIDC login
  • MINDEE_API_KEY: Api key from your mindee account
  • OIDC_URL: Token URL of your OAUTH2 provider when you use OIDC login (OIDC login is only tested against authentik). When this variable is set, it is assumed that you want to use OIDC login, and the other OIDC variables must then be set as well. MAYAN_USER and MAYAN_PASSWORD are not necessary in that case and will be ignored if set
  • OIDC_USER: Username of the service account user used for login
  • OIDC_PASSWORD/OIDC_PASSWORD_FILE: App token for service account user (supports docker SECRETS)
  • OIDC_CLIENT_ID: Client ID of your mayan OAUTH2 provider
  • OIDC_CLIENT_SECRET/OIDC_CLIENT_SECRET_FILE: Client secret of your mayan OAUTH2 provider (supports docker SECRETS)
  • OIDC_SCOPE: Requested scope e.g. 'openid profile email'
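
The rules described for these variables can be sketched as follows. This is only an illustration under stated assumptions, not the worker's actual code: the function names read_secret and select_auth are hypothetical, the precedence of the _FILE variant over the plain variable is assumed, and rejecting a MAYAN_URL without a trailing / is a choice made for this sketch.

```python
import os


def read_secret(name, env=None):
    """Return the value for NAME, preferring the file named by NAME_FILE (assumed precedence)."""
    env = os.environ if env is None else env
    path = env.get(name + "_FILE")
    if path:
        # Docker secrets are mounted as files; strip the trailing newline.
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    return env.get(name)


def select_auth(env=None):
    """Pick the login mode: OIDC when OIDC_URL is set, user/password otherwise."""
    env = os.environ if env is None else env
    url = env.get("MAYAN_URL", "")
    if not url.endswith("/"):
        raise ValueError("MAYAN_URL must end with a trailing /")
    if env.get("OIDC_URL"):
        # MAYAN_USER and MAYAN_PASSWORD are ignored in this branch.
        return {
            "mode": "oidc",
            "mayan_url": url,
            "token_url": env["OIDC_URL"],
            "user": env.get("OIDC_USER"),
            "password": read_secret("OIDC_PASSWORD", env),
            "client_id": env.get("OIDC_CLIENT_ID"),
            "client_secret": read_secret("OIDC_CLIENT_SECRET", env),
            "scope": env.get("OIDC_SCOPE"),
        }
    return {
        "mode": "basic",
        "mayan_url": url,
        "user": env.get("MAYAN_USER"),
        "password": env.get("MAYAN_PASSWORD"),
    }


# Example with a plain dictionary instead of the real environment.
settings = select_auth({
    "MAYAN_URL": "http://app:8000/api/v4/",
    "MAYAN_USER": "mindee-bot",       # hypothetical user name
    "MAYAN_PASSWORD": "change-me",    # hypothetical password
})
print(settings["mode"])  # basic
```

Passing a dictionary instead of reading os.environ directly keeps the selection logic easy to try out without touching the container environment.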

Configuration

Mayan Mindee requires some configuration to be able to map the API results to the metadata in your Mayan instance. To change the configuration, copy the mayanmindee/config/api.json file to your host machine, make your changes and mount the file as /app/mayanmindee/config/api.json in your worker container.

The config file has separate sections for custom APIs (self-created) and standard APIs (provided by mindee). Postprocess actions are optional; the example below shows how they are specified. See mayanmindee/postprocess.py for the available postprocessing actions and mayanmindee/config/api.json for a full example config file.
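
To illustrate how a list of postprocess steps transforms a value, here is a minimal sketch. It is not the code from mayanmindee/postprocess.py: the function apply_postprocess is hypothetical, only the "format", "replace" and "append" actions are covered, and the behaviour of each action is an assumption derived from its name and arguments.

```python
def apply_postprocess(value, steps):
    """Apply the configured postprocess steps to a value, in order."""
    actions = {
        # Assumed semantics: format the numeric value with a Python format string.
        "format": lambda v, fmt: fmt.format(float(v)),
        # Assumed semantics: plain substring replacement.
        "replace": lambda v, old, new: str(v).replace(old, new),
        # Assumed semantics: add a suffix.
        "append": lambda v, suffix: str(v) + suffix,
    }
    for step in steps:
        value = actions[step["action"]](value, *step["args"])
    return value


# Turns an amount into a German-style string with a currency sign.
steps = [
    {"action": "format", "args": ["{:.2f}"]},
    {"action": "replace", "args": [",", ""]},
    {"action": "replace", "args": [".", ","]},
    {"action": "append", "args": ["€"]},
]
amount = apply_postprocess(1234.5, steps)
print(amount)  # 1234,50€
```

The order of the steps matters: thousands separators are removed before the decimal point is turned into a comma.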

{
    "custom": [
        {
            "documenttype": <String: Can be anything but must be unique>,
            "account": <String: Mindee account, see documentation over at mindee>,
            "endpoint": <String: Name of the endpoint as defined over at mindee>,
            "model": <Version of the used model, see model section in your api>,
            "pagelimit": <Integer: Limits the number of pages sent to mindee (see the api limits; it may also be used to save cost). All pages exceeding the limit are removed before the document is sent>,
            "storeocr": <String: Optional name of metadata item to store the complete prediction data including boxes etc.>,
            "metadata": {
                <String: Label of your 1st metadata type in mayan>: {
                    "fieldname": <String: Name of the field in the mindee api, equals the name of the field in the api. For standard apis (see below) this gets a bit more complicated, please see api documentation at mindee or debug an api response to see the exact fields>,
                    "overwrite": <Boolean: true -> Overwrite existing metadata value in mayan, false -> don't overwrite>
                },
                <String: Label of your 2nd metadata type in mayan>: {
                    "fieldname": <See above>,
                    "overwrite": <See above>,
                    "postprocess": [
                        {
                            "action": "replace",
                            "args": [",", ""]
                        },
                        {
                            "action": "replace",
                            "args": [".", ","]
                        },
                        {
                            "action": "append",
                            "args": ["€"]
                        }
                    ]
                }
            },
            "tags": {
                <String: Label of the tag in mayan>: {
                    "fieldname": <See above>,
                    "match": <String: String that should be matched (see partial ratio section in https://pypi.org/project/thefuzz/)>,
                    "confidence": <Integer: Minimum ratio for match in the range 0-100, 0: Always write tag, 100 only exact match writes tag>
                }
            }
        }
    ],
    "standard": [
        {
            "documenttype": "TypeInvoiceV4",
            "pagelimit": 10,
            "storeocr": "mindee",
            "metadata": {
                "receiptdate": {
                    "fieldname": "invoice_date.value",
                    "overwrite": false
                },
                "issuer": {
                    "fieldname": "supplier_name.value",
                    "overwrite": false
                },
                "invoicenumber": {
                    "fieldname": "invoice_number.value",
                    "overwrite": false
                },
                "invoiceamount": {
                    "fieldname": "total_amount.value",
                    "overwrite": false,
                    "postprocess": [
                        {
                            "action": "format",
                            "args": ["{:.2f}"]
                        },
                        {
                            "action": "replace",
                            "args": [",", ""]
                        },
                        {
                            "action": "replace",
                            "args": [".", ","]
                        },
                        {
                            "action": "append",
                            "args": ["€"]
                        }
                    ]
                },
                "iban": {
                    "fieldname": "supplier_payment_details[0].iban",
                    "overwrite": false
                },
                "bic": {
                    "fieldname": "supplier_payment_details[0].swift",
                    "overwrite": false,
                    "postprocess": [
                        {
                            "action": "schwifty",
                            "args": []
                        }
                    ]
                }
            },
            "tags": {
                "Augustus": {
                    "fieldname": "customer_name",
                    "match": "augustus gloop",
                    "confidence": 85
                },
                "Violet": {
                    "fieldname": "customer_name",
                    "match": "violet beauregarde",
                    "confidence": 85
                },
                "Rechnung": {
                    "fieldname": "customer_name",
                    "match": "",
                    "confidence": 0
                }
            }
        },
        {
            "documenttype": "TypeProofOfAddressV1",
            "pagelimit": 3,
            "metadata": {
                "receiptdate": {
                    "fieldname": "date.value",
                    "overwrite": false
                },
                "issuer": {
                    "fieldname": "issuer_name.value",
                    "overwrite": false
                }
            },
            "tags": {
                "Joe": {
                    "fieldname": "recipient_name.value",
                    "match": "granpa joe",
                    "confidence": 85
                },
                "Josephine": {
                    "fieldname": "recipient_name.value",
                    "match": "grandma josephine",
                    "confidence": 85
                },
                "General": {
                    "fieldname": "recipient_name.value",
                    "match": "",
                    "confidence": 0
                }
            }
        }
    ]
}
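
The "tags" sections combine a field lookup with a fuzzy match against a minimum ratio. The sketch below shows the idea on plain dictionaries. It rests on several assumptions: resolve_field, partial_ratio and matching_tags are hypothetical helpers, the prediction is modelled as nested dicts and lists, the comparison is done in lower case, and partial_ratio is a standard-library stand-in that only approximates the partial ratio of thefuzz.

```python
import re
from difflib import SequenceMatcher


def resolve_field(data, path):
    """Follow a path such as 'supplier_payment_details[0].iban' through dicts and lists."""
    current = data
    for key, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        try:
            current = current[int(index)] if index else current[key]
        except (KeyError, IndexError, TypeError):
            return None  # path does not exist in this prediction
    return current


def partial_ratio(needle, haystack):
    """Best similarity (0-100) of the shorter string against any equally long window of the longer one."""
    shorter, longer = sorted((needle, haystack), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = 0.0
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


def matching_tags(prediction, tag_config):
    """Return the tag labels whose match score reaches the configured confidence."""
    tags = []
    for label, rule in tag_config.items():
        value = resolve_field(prediction, rule["fieldname"])
        text = str(value or "").lower()
        if partial_ratio(rule["match"].lower(), text) >= rule["confidence"]:
            tags.append(label)
    return tags


# Illustrative prediction and rules; the field path is an assumption.
prediction = {"customer_name": {"value": "AUGUSTUS GLOOP"}}
tag_config = {
    "Augustus": {"fieldname": "customer_name.value", "match": "augustus gloop", "confidence": 85},
    "Violet": {"fieldname": "customer_name.value", "match": "violet beauregarde", "confidence": 85},
    "Rechnung": {"fieldname": "customer_name.value", "match": "", "confidence": 0},
}
tags = matching_tags(prediction, tag_config)
print(tags)  # ['Augustus', 'Rechnung']
```

A rule with an empty match string and confidence 0 always applies, which is how a tag can be attached to every document of a given type.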
