Start documenting design choices #574

Merged · 6 commits · Dec 23, 2024
28 changes: 28 additions & 0 deletions docs/design/decision-record/0001-multiservice-architecture.md
# Multi-service Architecture

* Status: accepted
* Deciders: Original PIXL team
* Date: 2023-11-01 (retrospectively)

## Context and Problem Statement

We want a software solution composed of services with distinct functionality. How should we structure those services?

## Decision Drivers <!-- optional -->

* Will need to allow for concurrent processing of tasks
* The team's lingua franca is Python

## Considered Options

* Multi-service architecture
* Monolith architecture

## Decision Outcome

Chosen option: "Multi-service architecture", because it allows us to:

- Break up our code into logical packages and services
- Get around a single Python process being blocked by the global interpreter lock, as each service runs in its own process
- Be deliberate about where code should go, and restrict services from using other services' code
- Fit well with UCLH's restriction to deploying services using Docker and Docker Compose, while remaining open to extending to Kubernetes should we want that
47 changes: 47 additions & 0 deletions docs/design/decision-record/0002-message-processing.md
# Message-based Processing of Images

* Status: accepted
* Deciders: Original PIXL team, most recent changes: Stef Piatek & Paul Smith
* Date: 2024-12-12

## Context and Problem Statement

We need a way to buffer the messages awaiting processing.
We expect hundreds to tens of thousands of imaging studies to be requested per project,
and want to find each study in the source systems individually.


## Decision Drivers <!-- optional -->

- Be able to process multiple research projects at the same time
- Should be persistent if services are taken down
- Allow for a secondary DICOM source to be used if study isn't found in the primary
- Limit the total number of images that are being processed for a given source system
- Studies that have already been successfully exported for a project should not get processed again

## Considered Options

* Use of a database alone to schedule runs, with the CLI driving a long-running job to process all studies in a research project
* Use of queues to buffer requests that the `imaging-api` processes, with a database to track the status of exported studies

## Decision Outcome

Chosen option: `Queue for buffering requests, database to track status of export`,
because it fulfils all requirements and lets us invest in generic, transferable technologies.
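
A minimal sketch of the chosen pattern using the `pika` client, assuming a local RabbitMQ broker; the queue names, message format, and prefetch value are illustrative rather than PIXL's actual configuration:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# One queue per imaging source; durable queues survive a broker restart,
# satisfying the "persistent if services are taken down" driver.
channel.queue_declare(queue="imaging-primary", durable=True)
channel.queue_declare(queue="imaging-secondary", durable=True)

# Cap unacknowledged deliveries per consumer, limiting how many studies
# are in flight against a given source system at once.
channel.basic_qos(prefetch_count=5)

def on_message(ch, method, properties, body):
    study_uid = body.decode()
    # ... find the study in the source system, then record its export
    # status in the database so it is not processed again ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="imaging-primary", on_message_callback=on_message)
channel.start_consuming()
```

If a study isn't found in the primary source, the consumer can republish its message to the secondary queue, matching the fallback driver above.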

## Pros and Cons of the Options <!-- optional -->

### Database alone


* Good, simple to set up and use. Single solution for buffering requests and tracking status
* Bad, a bespoke solution where we could instead use a technology that is transferable to other situations
* Bad, limiting total queries per source system would require a complex setup

### Queue with database for status


* Good, fulfils all requirements fairly easily, with a queue for the primary and another for the secondary imaging source
* Good, because we have previously invested in RabbitMQ as a technology
* Bad, extra services to manage and extra development
* Bad, because the original implementation was broken and required effort to fix, though we did learn more about the libraries we're using
44 changes: 44 additions & 0 deletions docs/design/decision-record/0003-dicom-processing.md
# DICOM server and processing

* Status: accepted
* Deciders: Original PIXL team
* Date: 2023-11-01 (retrospectively)

## Context and Problem Statement

We need a DICOM server to query for DICOM images, store them, anonymise them, and export them.


## Decision Drivers <!-- optional -->

* Will need a robust DICOM server with a track record of real-world use
* Keep original studies in a local cache to reduce load on clinical imaging systems if anonymisation or export fails
* The team's lingua franca is Python
* Per-project anonymisation profiles and custom hashing of fields will require plugins to be written for anonymisation
* UCLH infrastructure allows for running Docker, but we don't have admin accounts and cannot install software directly onto the machines.

## Considered Options

* XNAT server
* Orthanc server

## Decision Outcome

Chosen option: `Orthanc`,
because it's relatively lightweight, actively developed, and allows Python-based extensions to be written.
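
For illustration, a minimal plugin of the kind Orthanc supports, written against the `orthanc` module that Orthanc's Python plugin exposes inside its embedded interpreter; the callback body is a placeholder, not PIXL's actual processing logic:

```python
# Runs inside Orthanc's embedded Python interpreter, where the `orthanc`
# module is provided by the Python plugin (it is not installable from PyPI).
import orthanc

def on_change(change_type, level, resource_id):
    # React to server events, e.g. a study becoming stable once all of
    # its instances have arrived.
    if change_type == orthanc.ChangeType.STABLE_STUDY:
        orthanc.LogWarning(f"Study ready for processing: {resource_id}")

orthanc.RegisterOnChangeCallback(on_change)
```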

## Pros and Cons of the Options <!-- optional -->

### XNAT

* Good, ARC has a history of using this in the medical imaging subgroup.
* Good, widely used and well regarded
* Bad, heavyweight with many more features than we need; may take longer to learn and deploy
* Bad, does not allow Python-based plugins to be used for anonymisation without getting into running Docker in Docker

### Orthanc

* Good, has been battle-tested
* Good, has a DICOMWeb plugin to allow for export via that modality
* Good, allows Python-based plugins and running in Docker
* Bad, no previous usage within ARC, so there will be teething problems
68 changes: 68 additions & 0 deletions docs/design/decision-record/0004-multiple-project-configuration.md
# Multiple-project configuration

* Status: accepted
* Deciders: Milan Malfait, Peter Tsrunchev, Jeremy Stein, Stef Piatek
* Date: 2024-03-05

Technical Story: [PIXL can take multiple projects](https://github.com/SAFEHR-data/PIXL/issues/330)

## Context and Problem Statement

Each project should be able to define its own anonymisation profile and export destinations.
How can we store the secrets?

![pixl-multi-project-config.png](../diagrams/pixl-multi-project-config.png)

## Decision Drivers <!-- optional -->

* When hashing a given field for an imaging study for two different projects, each project's hash should be different.
This is to avoid inappropriate linkage of data between research projects (see the hashing sketch after this list).
* Secure storage of secrets, especially for connection to destinations and per-project hashing salts
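
To make the first driver concrete, a standard-library sketch of per-project salted hashing; PIXL's real hashing happens in its anonymisation plugins, and the field value and salts here are placeholders:

```python
import hashlib

def hash_field(value: str, project_salt: bytes) -> str:
    """Hash a DICOM field value with a per-project salt (BLAKE2 salts are <= 16 bytes)."""
    return hashlib.blake2b(value.encode(), salt=project_salt, digest_size=16).hexdigest()

# The same accession number yields unrelated digests under different salts,
# so the two projects' outputs cannot be linked.
assert hash_field("ACC123", b"salt-project-a") != hash_field("ACC123", b"salt-project-b")
```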

## Considered Options

* file-based: env files for docker or structured files for secrets
* self-hosted secret storage service
* Azure Key Vault

## Decision Outcome

Chosen option: "azure keyvault", because its low hassle and gives us good control of secret access.

### Positive Consequences <!-- optional -->

* Single place for secrets, which multiple deployments can access
* Can have secrets which expire, so even if compromised we limit the amount of secret leakage possible
* Per-project salts are stored in a separate keyvault from the export endpoints
* Only need to define the destination type, with the keyvault defining all the connection details

### Negative Consequences <!-- optional -->

* Requires other implementations to set up their own Azure storage accounts or develop new secret management
* Developers also have to update a `.env` file for running the system test
* Slight increase in cost, can be slightly offset by caching credentials
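
As an illustration of the consequences above, fetching a per-project secret with the `azure-identity` and `azure-keyvault-secrets` libraries; the vault URL and secret name are hypothetical:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The credential can be constructed once and reused, which is the
# credential-caching mitigation mentioned above.
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://example-pixl.vault.azure.net", credential=credential
)

# e.g. a per-project hashing salt, stored separately from export secrets
salt = client.get_secret("project-a-salt").value
```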

## Pros and Cons of the Options <!-- optional -->

### File-based

* Good, simple to do
* Bad, will keep expanding over time, which can be a pain to maintain
* Bad, no access control beyond Unix permissions

### Self-hosted secret storage

* Good, fine-grained access control possible
* Good, free in terms of upfront cost
* Bad, another service to maintain, with ongoing residual costs

### Azure Key Vault


* Good, fine-grained access control possible
* Bad, slight increase in cost

## Links <!-- optional -->

* Routing of studies based on projects in [ADR-0005](0005-project-based-study-routing.md)
50 changes: 50 additions & 0 deletions docs/design/decision-record/0005-project-based-study-routing.md
# Project-based study routing

* Status: accepted
* Deciders: Stef Piatek, Paul Smith
* Date: 2024-11-27

Technical Story: [PIXL can take multiple projects](https://github.com/SAFEHR-data/PIXL/issues/330)

## Context and Problem Statement

Each study sent to `orthanc-anon` needs to be de-identified using the project-specific configuration.
We need a way to pass this information along with the DICOM file.

## Considered Options

* DICOM tag: Adding a custom DICOM tag for the project name
* Custom REST API endpoint: Creating a custom REST API endpoint for `orthanc-anon` to pull the data from `orthanc-raw`

## Decision Outcome

Chosen option: "custom REST API endpoint", because the project tag updating was causing `orthanc-raw` to crash,
and we were no longer using study stability to control export.

## Pros and Cons of the Options <!-- optional -->

### DICOM tag

Add a private creator group to instances as they arrive, with a dummy value in the custom tag.
Once the study has been pulled from the DICOM source, update the tag with the filename stem of the project config.

* Good, because you can see which project a study was most recently exported for
* Bad, `orthanc-raw` started crashing when updating the DICOM tag via the Orthanc API on large studies
* Bad, because we had to update the DICOM tag without changing the instance UIDs,
so that we could check for missing instances in studies which already exist in `orthanc-raw`
* Bad, studies appeared in which instances didn't have their custom DICOM tag updated
* Bad, could have a race condition where two projects try to export the same study at once

### Custom REST API endpoint

Once the study has been pulled from the DICOM source, send a REST request to `orthanc-anon` with the study UID and project.
`orthanc-anon` then adds this job to a threadpool and returns a 200 status to the client.

* Good, keeping original instances unaltered
* Good, thread-pooling allows for faster de-identification
* Good, simpler flow
* Bad, we can't alter the queue of de-identification jobs short of taking down `orthanc-anon`
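
A minimal sketch of this receive-and-queue pattern, here using FastAPI and a `ThreadPoolExecutor`; the endpoint path, payload fields, and pool size are assumptions for illustration — the real endpoint lives inside `orthanc-anon`'s Orthanc plugin:

```python
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
executor = ThreadPoolExecutor(max_workers=4)  # pool size is illustrative

class StudyJob(BaseModel):
    study_uid: str
    project_name: str

def deidentify(study_uid: str, project_name: str) -> None:
    # ... pull the study from orthanc-raw, apply the project's
    # anonymisation profile, then hand it off for export ...
    pass

@app.post("/import-from-raw")
def import_from_raw(job: StudyJob) -> dict:
    # Submit to the pool and return immediately: the client gets a 200
    # while de-identification proceeds in the background.
    executor.submit(deidentify, job.study_uid, job.project_name)
    return {"status": "queued"}
```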

## Links <!-- optional -->

* Related to multiple project configuration [ADR-0004](0004-multiple-project-configuration.md)
45 changes: 45 additions & 0 deletions docs/design/decision-record/0006-data-export.md
# Export of parquet files and DICOM data

* Status: accepted
* Deciders: Haroon Chughtai, Jeremy Stein, Milan Malfait, Ruaridh Gollifer, Stef Piatek
* Date: 2024-02-26

## Context and Problem Statement

The pipeline needs to be able to export DICOM images and structured data files to different endpoints.

## Decision Drivers <!-- optional -->

* We expect that some projects will have more data than we can store locally, so will need a rolling export of images
* We will need to be able to export images and structured data via FTPS in an automated fashion
* We will need to be able to export images via DICOMWeb
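
For the FTPS driver, a standard-library sketch of an automated upload; host, credentials, and file paths are placeholders (in PIXL these would come from the keyvault, per ADR-0004):

```python
from ftplib import FTP_TLS
from pathlib import Path

def upload_via_ftps(host: str, user: str, password: str, local_file: Path) -> None:
    ftps = FTP_TLS(host)
    ftps.login(user, password)
    ftps.prot_p()  # encrypt the data channel as well as the control channel
    with local_file.open("rb") as handle:
        ftps.storbinary(f"STOR {local_file.name}", handle)
    ftps.quit()
```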


## Considered Options

* Shared Python library for exporting data, used in both `orthanc-anon` and the `pixl` CLI.
* `export-api` service, which can export both DICOM and structured data files.

## Decision Outcome

Chosen option: "`export-api` service", for clear separation of responsibilities.

## Pros and Cons of the Options <!-- optional -->

### Shared python library

A shared Python library would implement the export logic, imported by `orthanc-anon` for DICOM data and by the `pixl` CLI for parquet files.

* Good, one less service to maintain
* Good, export via DICOMWeb already goes through the Orthanc API
* Bad, duplication of implementation for export
* Bad, duplication of areas where secrets are used

### `export-api` service

Instead of a shared library, the export code would live in this single service.

* Good, single service that will access all secrets and orchestrate exports
* Good, allows caching of export secrets in a long-running service
* Bad, would require extra code for interacting with the service from the CLI for parquet export
19 changes: 19 additions & 0 deletions docs/design/decision-record/index.md
# Architectural Decision Log

This log lists the architectural decisions for PIXL.

Using the [Markdown Architectural Decision Records](https://adr.github.io/madr/) (MADR) format.

Regenerate the content by running `adr-log -i` from this directory.
You can install it via `npm install -g adr-log`; remove the template entry from the generated list.

<!-- adrlog -->

* [ADR-0001](0001-multiservice-architecture.md) - Multi-service Architecture
* [ADR-0002](0002-message-processing.md) - Message-based Processing of Images
* [ADR-0003](0003-dicom-processing.md) - DICOM server and processing
* [ADR-0004](0004-multiple-project-configuration.md) - Multiple-project configuration
* [ADR-0005](0005-project-based-study-routing.md) - Project-based study routing
* [ADR-0006](0006-data-export.md) - Export of parquet files and DICOM data

<!-- adrlogstop -->
72 changes: 72 additions & 0 deletions docs/design/decision-record/template.md
# [Template: short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0105](0105-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

Technical Story: [description | ticket/issue URL] <!-- optional -->

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers <!-- optional -->

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].

### Positive Consequences <!-- optional -->

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences <!-- optional -->

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options <!-- optional -->

### [option 1]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links <!-- optional -->

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->