RESTful API for converting clinical documents and files
Docsmith is a RESTful API, built using Node.js and the Fastify web framework, that can convert files from:
- DOC to TXT
- DOCX to HTML
- DOCX to TXT
- HTML to TXT
- PDF to HTML
- PDF to TXT
- RTF to HTML (images are removed)
- RTF to TXT
- Scanned documents (as PDFs) to TXT using OCR
Docsmith was created in my spare time outside of work after identifying the need for an open-source document conversion service at Yeovil Hospital (run by Somerset NHS Foundation Trust).
Being open-source, with the ability to be self-hosted, enables a data processor (e.g. an NHS trust) to confirm that a service is not storing or logging files that contain confidential patient identifiable data (PID), which is essential for preventing potential GDPR breaches. This is something that the majority of existing closed-source document conversion services cannot offer. Docsmith was built to remedy this.
Before Docsmith, Yeovil Hospital was using expensive proprietary conversion tools that would regularly produce unreadable documents with issues such as text running off the page, paragraphs overlapping each other, and Windows-1252 to UTF-8 character encoding problems. GP surgeries in Somerset and Dorset would receive these corrupted documents through MESH and be unable to read them. This resulted in time and money wasted either posting or faxing them again, opening up the potential for further data breaches.
Docsmith enables a data processor to use a comprehensive, GDPR-compliant, open-source document conversion service. Compared with equivalents on the market today, it completes this vital task at a fraction of the cost (free!), whilst also ensuring a higher level of security and privacy for the data subjects.
These are only required if running the API outside of Docker:
- Node.js >=18.12.1
- Linux only: poppler-data >=0.4.9
- Linux only: poppler-utils >=20.12.0
- macOS only: poppler >=20.12.0
- Linux and macOS only: unrtf >=0.19.3
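The Node.js requirement above can be checked before starting the service. The following is a minimal sketch, not part of Docsmith itself: the helper names `compareVersions` and `meetsMinimum` are hypothetical, and only the minimum version number comes from the list above. It compares dotted versions numerically, since a plain string comparison would wrongly rank "18.9.0" above "18.12.1".

```javascript
// Minimum Node.js version, taken from the prerequisites list.
const MIN_NODE_VERSION = "18.12.1";

// Hypothetical helper: compares two dotted version strings part by part.
// Returns -1, 0 or 1, like a sort comparator.
function compareVersions(a, b) {
  const partsA = a.split(".").map(Number);
  const partsB = b.split(".").map(Number);
  const length = Math.max(partsA.length, partsB.length);
  for (let i = 0; i < length; i += 1) {
    const diff = (partsA[i] ?? 0) - (partsB[i] ?? 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// Hypothetical helper: true when `current` is at least `minimum`.
function meetsMinimum(current, minimum) {
  return compareVersions(current, minimum) >= 0;
}

// process.versions.node holds the running version without a leading "v".
if (!meetsMinimum(process.versions.node, MIN_NODE_VERSION)) {
  console.warn(
    `Node.js ${process.versions.node} is older than ${MIN_NODE_VERSION}`
  );
}
```

The poppler and unrtf versions would need to be checked separately through the system package manager; this sketch does not attempt that.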
Perform the following steps before deployment:
- Download and extract the latest release asset
- Navigate to the extracted directory
- Make a copy of .env.template in the root directory and rename it to .env
- Configure the application using the environment variables in .env
- Place additional trained data into the ocr_lang_data directory (optional)
Note: Set the following environment variables in .env to meet NHS England's recommendation to retain six months' worth of logs:
LOG_ROTATION_DATE_FORMAT="YYYY-MM-DD"
LOG_ROTATION_FREQUENCY="daily"
LOG_ROTATION_MAX_LOGS="180d"
- Run npm ci --ignore-scripts --omit=dev to install dependencies
- Run npm start
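For repeatable deployments, the step of copying .env.template to .env in the list above could be scripted. This is a minimal sketch under stated assumptions: the function name `createEnvFile` is hypothetical, and the only details taken from the steps above are the two file names and their location in the root directory. It refuses to overwrite an existing .env so that a configured deployment is not reset by accident.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: copies .env.template to .env inside rootDir.
// Returns true if .env was created, false if it already existed.
function createEnvFile(rootDir) {
  const template = path.join(rootDir, ".env.template");
  const target = path.join(rootDir, ".env");
  try {
    // COPYFILE_EXCL makes the copy fail if the target already exists.
    fs.copyFileSync(template, target, fs.constants.COPYFILE_EXCL);
    return true;
  } catch (err) {
    if (err.code === "EEXIST") return false;
    throw err; // e.g. the template is missing or the directory is read-only
  }
}

// Example usage from the extracted release directory:
// createEnvFile(process.cwd());
```

The values in the new .env still need to be edited by hand afterwards.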
The service should be up and running on the port set in the config. Output similar to the following should appear in stdout or in the log file specified using the LOG_ROTATION_FILENAME environment variable:
{
"level": "info",
"time": "2022-10-20T07:57:21.459Z",
"pid": 148,
"hostname": "MYCOMPUTER",
"msg": "Server listening at http://127.0.0.1:51173"
}
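A deployment script may want to confirm the startup message programmatically. The sketch below parses a log entry shaped like the one above and extracts the listening address. It assumes each entry is written as a single line of JSON (the sample above is shown pretty-printed) and that the message wording matches the sample; the function name `listeningAddress` is hypothetical.

```javascript
// Sample entry copied from the output above, collapsed onto one line.
const sampleLine =
  '{"level":"info","time":"2022-10-20T07:57:21.459Z","pid":148,' +
  '"hostname":"MYCOMPUTER","msg":"Server listening at http://127.0.0.1:51173"}';

// Hypothetical helper: returns a URL object for the listening address,
// or null if the line is not JSON or is not the startup message.
function listeningAddress(line) {
  let entry;
  try {
    entry = JSON.parse(line);
  } catch {
    return null; // not a JSON log line
  }
  const match = /^Server listening at (\S+)$/.exec(entry?.msg ?? "");
  return match ? new URL(match[1]) : null;
}

const address = listeningAddress(sampleLine);
console.log(address.port); // prints "51173"
```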
To test it, use Insomnia and import the example requests from ./test_resources/insomnia_test_requests.json.
This requires Docker to be installed.
- Run docker compose up (or docker compose up -d to run in the background)
If this cannot be deployed into production using Docker, use a process manager such as PM2.
- Run npm ci --ignore-scripts --omit=dev to install dependencies
- Run npm i -g pm2 to install pm2 globally
- Launch the application with pm2 start .pm2.config.js
- Check that the application has been deployed using pm2 list or pm2 monit
If using Microsoft Windows, use pm2-installer to install PM2 as a Windows service.
Note: PM2 will automatically restart the application if .env is modified.
API documentation can be found at /docs.
The underlying OpenAPI definitions are found at /docs/openapi.
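The OpenAPI definitions can also be read programmatically, for example to list the routes a running instance exposes. This is a minimal sketch with several assumptions: `BASE_URL` is a placeholder for wherever the service is listening, the function names are hypothetical, and the /docs/openapi route is assumed to respond with JSON. It relies on the global `fetch` available in Node.js 18.

```javascript
// Placeholder address; replace with the host and port from your own config.
const BASE_URL = "http://127.0.0.1:3000";

// Hypothetical helper: builds the URL of the OpenAPI definitions route.
function openApiUrl(baseUrl) {
  return new URL("/docs/openapi", baseUrl).href;
}

// Hypothetical helper: fetches the definitions and returns the route paths.
// Assumes the response body is a JSON OpenAPI document.
async function listPaths(baseUrl) {
  const response = await fetch(openApiUrl(baseUrl));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const definition = await response.json();
  return Object.keys(definition.paths ?? {});
}

// Example usage against a running instance:
// listPaths(BASE_URL).then((paths) => console.log(paths));
```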
Contributions are welcome, and any help is greatly appreciated!
See the contributing guide for details on how to get started. Please adhere to this project's Code of Conduct when contributing.
docsmith is licensed under the MIT license.