Towards a Relational Metadata Management System
Author: Jacob Lurye
Date: 3/22/21
In this document, I'm going to address how replacing our current JSON schema-based metadata management system with an entirely SQLAlchemy/PostgreSQL based system could work.
I'll try to show that this is technically feasible by presenting sketches for possible solutions to each of these problems:
- Modeling nested metadata relationships
- Modeling multiple versions of the same assay
- Connecting files to metadata
- Generating and parsing metadata spreadsheets
- Generating documentation about the data model
- Safely migrating from the existing system
Then, I'll list some open questions I still have about all of this.
First, to look at modeling nested metadata relationships, consider a hypothetical assay where each upload's metadata contains:
- a required overview blurb
- an array of antibodies used, chosen from a list of allowed antibodies
- an array of batch-level metadata, each containing an array of records containing sample-level metadata
I don't think our assay schemas get much more complicated than this. With JSON schema, we would model the above something like
{
  "type": "object",
  "required": ["overview"],
  "properties": {
    "overview": { "type": "string" },
    "antibodies": {
      "type": "array",
      "mergeStrategy": "arrayMergeById",
      "mergeOptions": { "idRef": "/antibody" },
      "items": {
        "type": "object",
        "required": ["antibody"],
        "properties": {
          "antibody": { "type": "string", "enum": ["foo", "bar"] }
        }
      }
    },
    "batches": {
      "type": "array",
      "mergeStrategy": "arrayMergeById",
      "mergeOptions": { "idRef": "/barcode" },
      "items": {
        "type": "object",
        "required": ["barcode"],
        "properties": {
          "barcode": { "type": "string" },
          "batch_prop": { "type": "string" },
          "records": {
            "type": "array",
            "mergeStrategy": "arrayMergeById",
            "mergeOptions": { "idRef": "/cimac_id" },
            "items": {
              "type": "object",
              "required": ["cimac_id"],
              "properties": {
                "cimac_id": {
                  /* in_doc_ref to samples */
                },
                "record_prop": { "type": "integer" }
              }
            }
          }
        }
      }
    }
  }
}
Such a schema would validate an instance that looks like:
{
  "overview": "great assay, good work",
  "antibodies": [{ "antibody": "foo" }, { "antibody": "bar" }],
  "batches": [
    {
      "barcode": "bc1",
      "batch_prop": "baz",
      "records": [
        { "cimac_id": "CTTTPPS1.01", "record_prop": 1 },
        { "cimac_id": "CTTTPPS2.01", "record_prop": 2 }
      ]
    },
    {
      "barcode": "bc2",
      "batch_prop": "buz",
      "records": [
        { "cimac_id": "CTTTPPS3.01", "record_prop": 1 },
        { "cimac_id": "CTTTPPS4.01", "record_prop": 2 }
      ]
    }
  ]
}
It is also straightforward to model this example using sqlalchemy. Assume `CommonColumns` includes an autoincrementing primary key column `id` on all tables:
from sqlalchemy import Column, Enum, ForeignKey, Integer, String, UniqueConstraint

class ExampleAssayUpload(CommonColumns):
    __tablename__ = "example_assay_uploads"

    overview = Column(String, nullable=False)

    # NOTE: We could use SQLAlchemy `relationship`s to make it really easy to
    # look up records with foreign key relationships to this one, but I'll omit
    # them for readability.

class ExampleAssayAntibody(CommonColumns):
    __tablename__ = "example_assay_antibodies"

    upload_id = Column(Integer, ForeignKey("example_assay_uploads.id"), nullable=False)
    antibody = Column(Enum("foo", "bar", name="example_assay_antibody_enum"), nullable=False)

class ExampleAssayBatch(CommonColumns):
    __tablename__ = "example_assay_batches"

    upload_id = Column(Integer, ForeignKey("example_assay_uploads.id"), nullable=False)
    barcode = Column(String, nullable=False)
    batch_prop = Column(String)

    # Barcodes must be unique within uploads but not across them
    __table_args__ = (UniqueConstraint(upload_id, barcode),)

class ExampleAssayRecord(CommonColumns):
    __tablename__ = "example_assay_records"

    # Assume some samples table exists, and that we should have exactly one record
    # per cimac id globally.
    cimac_id = Column(String, ForeignKey("samples.cimac_id"), nullable=False, unique=True)
    # Reference the batch's primary key, since `barcode` alone isn't unique across
    # uploads and so can't be the target of a foreign key on its own.
    batch_id = Column(Integer, ForeignKey("example_assay_batches.id"), nullable=False)
    record_prop = Column(Integer)
The same metadata could be added to the database by running code like
# build up the transaction (assume `insert()` adds the instance to the session
# and flushes it, so autoincremented ids are available before commit)
upload = ExampleAssayUpload(overview="great assay, good work").insert()
ExampleAssayAntibody(upload_id=upload.id, antibody="foo").insert()
ExampleAssayAntibody(upload_id=upload.id, antibody="bar").insert()
batch_1 = ExampleAssayBatch(upload_id=upload.id, barcode="bc1", batch_prop="baz").insert()
ExampleAssayRecord(cimac_id="CTTTPPS1.01", batch_id=batch_1.id, record_prop=1).insert()
ExampleAssayRecord(cimac_id="CTTTPPS2.01", batch_id=batch_1.id, record_prop=2).insert()
batch_2 = ExampleAssayBatch(upload_id=upload.id, barcode="bc2", batch_prop="buz").insert()
ExampleAssayRecord(cimac_id="CTTTPPS3.01", batch_id=batch_2.id, record_prop=1).insert()
ExampleAssayRecord(cimac_id="CTTTPPS4.01", batch_id=batch_2.id, record_prop=2).insert()

# then something like this to actually insert
session.commit()
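To round out the example, here's a minimal sketch of reading the nested structure back out (assuming a SQLAlchemy `session`; `some_upload_id` is just a placeholder). The `relationship`s omitted above would make this even more concise:

# Reassemble the "nested" view of one upload by joining child tables to the parent.
upload = session.query(ExampleAssayUpload).filter_by(id=some_upload_id).one()
antibodies = session.query(ExampleAssayAntibody).filter_by(upload_id=upload.id).all()
batches_and_records = (
    session.query(ExampleAssayBatch, ExampleAssayRecord)
    .join(ExampleAssayRecord, ExampleAssayRecord.batch_id == ExampleAssayBatch.id)
    .filter(ExampleAssayBatch.upload_id == upload.id)
    .all()
)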
When we model multiple assay versions, we have to be able to:
- make version-independent queries against all metadata for this assay (e.g., list all CyTOF uploads without having to query each version of the upload separately)
- know what version of an assay a piece of metadata conforms to
- easily write clean, version-responsive code
Experience (e.g., the explosion of variety across CyTOF) shows that we might best think of different assay versions in object-oriented terms as different subtypes that inherit from the same base type, rather than as points on a semantic versioning timeline.
Conveniently, SQLAlchemy provides nice support for object-oriented inheritance-style data modeling in the form of joined table inheritance. This would allow us to model an example assay (called RNA below for no reason) with the following variants:
- all versions have a required attribute `column_1`
- Mt. Sinai used a version that also has a required `column_2`
- Stanford used a version that requires `column_1`, `column_2`, and a new `column_3`.
class RnaUpload(CommonColumns):
    __tablename__ = "rna_uploads"

    _version = Column(String, nullable=False)
    protocol_id = Column(String, ForeignKey("trials.protocol_id"), nullable=False)
    column_1 = Column(String, nullable=False)

    __mapper_args__ = {
        # this tells sqlalchemy to use the `_version` column to distinguish between subtypes
        "polymorphic_on": _version,
        # this tells sqlalchemy what `_version` value instances of this class should have when
        # inserted into the `rna_uploads` table.
        "polymorphic_identity": "base",
    }

class RnaUploadMtSinai(RnaUpload):
    __tablename__ = "rna_uploads_mt_sinai"

    id = Column(Integer, ForeignKey("rna_uploads.id"), primary_key=True)
    column_2 = Column(String, nullable=False)

    __mapper_args__ = {
        "polymorphic_identity": "mt. sinai",
    }

class RnaUploadStanford(RnaUpload):
    __tablename__ = "rna_uploads_stanford"

    id = Column(Integer, ForeignKey("rna_uploads.id"), primary_key=True)
    column_2 = Column(String, nullable=False)
    column_3 = Column(String, nullable=False)

    __mapper_args__ = {
        "polymorphic_identity": "stanford",
    }
The above makes it possible to write code like
# Read RNA uploads of all versions from the database
results = RnaUpload.list()

# Iterate through each upload, performing conditional logic based on version
for res in results:
    if isinstance(res, RnaUploadMtSinai):
        # res will have non-null attributes column_1 and column_2
        pass
    elif isinstance(res, RnaUploadStanford):
        # res will have non-null attributes column_1, column_2, and column_3
        pass
    else:
        # res will have non-null attribute column_1
        pass
Under the hood, SQLAlchemy is joining across multiple SQL tables. Writing by-hand SQL queries against a particular version requires the same sort of join between the base table and the version table:
select
r.id, r.column_1, s.column_2, s.column_3
from
rna_uploads r join rna_uploads_stanford s on r.id = s.id
For modeling files, I recognize a few requirements:
- It must be possible to execute efficient queries against all available files in the system
- Files must have some enforced connection to the upload metadata associated with them (i.e., inserts should be blocked without this info)
- Different file types have different associated metadata (e.g., an RNASeq BAM file must be connected to a sample-level RNASeq metadata record)
Note that this doesn't describe our current system: we do not strongly enforce any connection between file records and specific metadata, except at the trial-level.
The same SQLAlchemy inheritance tooling highlighted above could come in handy for fulfilling these file modeling requirements.
This could look something like:
class File(CommonColumns):
    __tablename__ = "files"

    upload_id = Column(Integer, ForeignKey("upload_jobs.id"), nullable=False)
    object_url = Column(String, nullable=False, unique=True)
    facet_group = Column(String, nullable=False)
    file_size_bytes = Column(Integer, nullable=False)
    ...  # other fields that all files should have

    # polymorphic discriminator column
    file_type = Column(String, nullable=False)

    __mapper_args__ = {
        "polymorphic_on": file_type,
        "polymorphic_identity": "base",
    }

class NpxFile(File):
    __tablename__ = "npx_files"

    id = Column(Integer, ForeignKey("files.id"), primary_key=True)
    upload_id = Column(Integer, ForeignKey("olink_uploads.id"), nullable=False)
    chip_barcode = Column(String, ForeignKey("olink_records.chip_barcode"))
    npx_manager_version = Column(String, nullable=False)

    __mapper_args__ = {
        "polymorphic_identity": "npx",
    }

class FcsFile(File):
    __tablename__ = "fcs_files"

    id = Column(Integer, ForeignKey("files.id"), primary_key=True)
    upload_id = Column(Integer, ForeignKey("cytof_uploads.id"), nullable=False)
    cimac_id = Column(String, ForeignKey("cytof_records.cimac_id"), nullable=False)

    __mapper_args__ = {
        "polymorphic_identity": "fcs",
    }
This will allow efficient (and developer-friendly) querying across all files via the `File` model, with easy file-type-specific metadata lookup for individual instances of particular file types.
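As a rough sketch (assuming a SQLAlchemy `session` and the models above), a single polymorphic query against the base model returns instances of the appropriate subclass, so file-type-specific columns are available without extra queries:

# One query against the base `files` table; SQLAlchemy uses the `file_type`
# discriminator to return NpxFile / FcsFile / plain File instances as appropriate.
# (`with_polymorphic` could be used to eagerly join in the subclass tables.)
for f in session.query(File).filter(File.file_size_bytes > 0):
    if isinstance(f, NpxFile):
        print(f.object_url, f.chip_barcode, f.npx_manager_version)
    elif isinstance(f, FcsFile):
        print(f.object_url, f.cimac_id)
    else:
        print(f.object_url)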
It's important that we're able to continue to support our existing metadata spreadsheets to whatever extent we can under this new data model implementation, for a couple reasons:
- Switching up the metadata spreadsheets is not a good thing to do to the CIMAC labs
- Maintaining compatibility will ease our migration to the new system (see the rough migration plan below)
Luckily, it should be possible to do that under the relational data model. We can compose spreadsheet configs in python, similar to our existing "template" configs but with a few notable improvements:
- use python classes to explicitly structure and document spreadsheet config components
- rely on SQLAlchemy `Column` info, pulled directly from model class attributes, for data type info and documentation
- infer db insertion order from foreign key relationships between the model instances, rather than requiring explicit specification from users
- no more prism or merge pointers
We could build template configs from python classes that look like
from typing import Any, Callable, Dict, List, Optional

from sqlalchemy import Column

class MetadataWorksheetEntry:
    """
    One field in a metadata worksheet. Provides configuration for reading a
    value from a metadata spreadsheet cell into a sqlalchemy model attribute
    (or set of related attributes).

    Args:
        column: column attribute on a SQLAlchemy model class.
        label: optional human-readable label for this field, if not inferrable from `column`.
        process_as: optional dictionary mapping column attributes to data processing functions.
        etc. etc...
    """

    def __init__(
        self,
        column: Column,
        label: Optional[str] = None,
        process_as: Optional[Dict[Column, Callable[[str], Any]]] = None,
        gcs_uri_format: Optional[str] = None,
    ):
        ...

    # plus more useful helper methods, e.g. getting a formatted GCS URI,
    # applying all process_as funcs to a value, etc.

class MetadataWorksheet:
    """
    A worksheet within a metadata spreadsheet. A worksheet has a name, a list
    of `preamble_rows`, and a dictionary mapping data column group headers to
    lists of `data_columns`.
    etc. etc...
    """

    def __init__(
        self,
        name: str,
        preamble_rows: List[MetadataWorksheetEntry] = [],
        data_columns: Dict[str, List[MetadataWorksheetEntry]] = {},
    ):
        ...

class MetadataSpreadsheet:
    """
    A metadata spreadsheet. Must have attributes `upload_type` and `worksheets` defined.
    etc. etc...
    """

    upload_type: str
    worksheets: List[MetadataWorksheet]

    # + methods for writing empty spreadsheets and reading populated spreadsheets
This structure would allow us to specify templates entirely in python. For example, the PBMC template definition could look something like this, where `Shipment`, `Sample`, and `Participant` are sqlalchemy model classes, and `WS` and `WSEntry` are short aliases for `MetadataWorksheet` and `MetadataWorksheetEntry` respectively:
class PBMCSpreadsheet(MetadataSpreadsheet):
    worksheets = [
        WS(
            name="Shipment",
            preamble_rows=[
                WSEntry(Shipment.manifest_id),
                WSEntry(Shipment.protocol_id, label="Protocol Identifier"),
                WSEntry(Shipment.assay_priority),
                WSEntry(Shipment.assay_type),
                WSEntry(Shipment.receiving_party),
                WSEntry(Shipment.courier),
                WSEntry(Shipment.tracking_number),
                WSEntry(Shipment.account_number),
                WSEntry(Shipment.shipping_condition),
                WSEntry(Shipment.date_shipped),
                WSEntry(Shipment.date_received),
                WSEntry(Shipment.quality_of_shipment),
                WSEntry(Shipment.ship_from),
                WSEntry(Shipment.ship_to),
            ],
        ),
        WS(
            name="Samples",
            data_columns={
                "IDs": [
                    WSEntry(Sample.shipping_entry_number, label="Entry (#)"),
                    WSEntry(Sample.collection_event_name),
                    WSEntry(Participant.cohort_name),
                    WSEntry(Participant.participant_id),
                    WSEntry(Sample.parent_sample_id),
                    WSEntry(Sample.processed_sample_id),
                    WSEntry(
                        Sample.cimac_id,
                        process_as={
                            Participant.cimac_participant_id: lambda c: c[:7],
                            Sample.cimac_participant_id: lambda c: c[:7],
                        },
                    ),
                ],
                "Filled by Biorepository": [
                    WSEntry(Sample.box_number),
                    WSEntry(Sample.sample_location),
                    WSEntry(Sample.type_of_sample),
                    WSEntry(Sample.sample_collection_procedure),
                    WSEntry(Sample.type_of_primary_container),
                    WSEntry(Sample.processed_sample_type),
                    WSEntry(Sample.processed_sample_volume),
                    WSEntry(Sample.processed_sample_volume_units),
                    WSEntry(Sample.processed_sample_concentration),
                    WSEntry(Sample.processed_sample_concentration_units),
                ],
                "Filled by CIMAC Lab": [
                    WSEntry(Sample.pbmc_viability),
                    WSEntry(Sample.pbmc_recovery),
                    WSEntry(Sample.pbmc_resting_period_used),
                    WSEntry(Sample.material_used),
                    WSEntry(Sample.material_remaining),
                    WSEntry(Sample.material_storage_condition),
                    WSEntry(Sample.quality_of_sample),
                    WSEntry(Sample.sample_replacement),
                    WSEntry(Sample.residual_sample_use),
                    WSEntry(Sample.comments),
                ],
            },
        ),
    ]
Since this is just python code, config components that are repeated across spreadsheets (like the "Shipment" worksheet preamble above) can be factored out and shared across configs.
Since these new python spreadsheet configs contain nearly identical information to the existing JSON configs, it should be relatively easy to adapt the existing `XlTemplateWriter` code to generate empty spreadsheets based on such python configurations.
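For a sense of scale, here's a minimal sketch of what writing an empty workbook from one of these configs might look like, using `openpyxl` and assuming the config classes expose their constructor arguments as attributes (`worksheet.name`, `worksheet.preamble_rows`, `worksheet.data_columns`, `entry.label`):

from openpyxl import Workbook

def write_empty_spreadsheet(config: MetadataSpreadsheet, path: str):
    """Write an empty .xlsx file with one tab per configured worksheet."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default sheet
    for worksheet in config.worksheets:
        ws = wb.create_sheet(title=worksheet.name)
        # Preamble rows: one label cell per row, value cell left blank
        for row, entry in enumerate(worksheet.preamble_rows, start=1):
            ws.cell(row=row, column=1, value=entry.label)
        # Data columns: a group header row followed by a row of column labels
        col = 1
        header_row = len(worksheet.preamble_rows) + 2
        for group, entries in worksheet.data_columns.items():
            ws.cell(row=header_row, column=col, value=group)
            for entry in entries:
                ws.cell(row=header_row + 1, column=col, value=entry.label)
                col += 1
    wb.save(path)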
Our metadata management system takes filled-out metadata spreadsheets as input. Currently, it uses the template configurations to build nested JSON metadata blobs from those spreadsheets (functionality that we call "prism"). Under the relational metadata model, we'll instead output a list of sqlalchemy model instances to insert into the database.
Here's a rough outline for how we can build sqlalchemy model instances from a filled-out metadata spreadsheet:
- Read raw data from spreadsheets. Given a populated spreadsheet and a config like the one in the example above, we should be able to convert each row of data in the spreadsheet (treating the preamble section as a vertical "row" of sorts) into a sqlalchemy-attribute-to-spreadsheet-value dictionary by matching up labels in the spreadsheet to the relevant `MetadataWorksheetEntry` in the config. These dictionaries will look something like:

  {
      Participant.cimac_participant_id: 'CTTTPP',
      Sample.cimac_participant_id: 'CTTTPP',  # let's say this is a foreign-key link to Participant, loaded by process_as
      Participant.cohort_name: 'Arm_A',
      Sample.cimac_id: 'CTTTPP01.01',
      Sample.collection_event_name: 'Baseline',
      ...
  }

  All data in a metadata spreadsheet can be loaded as a list of such attribute-to-value dictionaries.
- Translate raw data into SQLAlchemy model instances. Each of these dictionaries can be used to construct a list of model instances (see the sketch after this outline):

  [
      Participant(cimac_participant_id='CTTTPP', cohort_name='Arm_A', ...),
      Sample(cimac_id='CTTTPP01.01', cimac_participant_id='CTTTPP', collection_event_name='Baseline', ...),
  ]

  So, we can convert the list of attribute-to-value dictionaries representing the spreadsheet data into a list of sqlalchemy model instances.
- De-duplicate model instances. The list generated in step 2 may contain duplicates, where a "duplicate" is a model instance with the same values in unique-constrained columns (e.g., `Participant.cimac_participant_id`, `Sample.cimac_id`) as another model instance. This will happen, for example, in most manifest spreadsheets, where the sample-level data rows contain duplicate participant info across them.

  We need to de-duplicate records generated in step 2 by combining instances with the same values in unique-constrained columns. If any values in non-unique columns conflict between two duplicate records (e.g., the same participant is listed with two different cohort names), the data in the template is invalid.

  Getting all unique columns for a model class is easy:

  unique_columns = [column for column in model.__table__.columns if column.unique]

This process will leave us with a list of model instances representing the db records that a given spreadsheet intends to create.
Note: the above description omits behavior related to (multi-)artifact fields because I haven't really figured that out yet, but I'm optimistic there's a way to handle them that fits into this general outline.
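To make steps 2 and 3 more concrete, here's a minimal sketch. It assumes the attribute-to-value dictionaries are keyed by SQLAlchemy instrumented attributes (which expose their owning class via `.class_`), that attribute names match column names, and that `column.unique` is enough to identify a model's identity columns; a real implementation would also need to handle table-level unique constraints and produce friendlier validation errors:

from collections import defaultdict
from typing import Any, Dict, List

def build_instances(row: Dict[Any, Any]) -> List[Any]:
    """Step 2: group a column-to-value dict by model class and instantiate each model."""
    by_model = defaultdict(dict)
    for column_attr, value in row.items():
        model = column_attr.class_  # the model class this attribute belongs to
        by_model[model][column_attr.key] = value
    return [model(**values) for model, values in by_model.items()]

def deduplicate(instances: List[Any]) -> List[Any]:
    """Step 3: merge instances that share values in unique-constrained columns."""
    deduped: Dict[Any, Any] = {}
    for instance in instances:
        model = type(instance)
        unique_columns = [c for c in model.__table__.columns if c.unique]
        key = (model, tuple(getattr(instance, c.name) for c in unique_columns))
        if key not in deduped:
            deduped[key] = instance
            continue
        # Duplicate: check that non-unique column values don't conflict
        existing = deduped[key]
        for c in model.__table__.columns:
            new, old = getattr(instance, c.name), getattr(existing, c.name)
            if new is not None and old is not None and new != old:
                raise ValueError(
                    f"conflicting values for {model.__name__}.{c.name}: {old!r} vs {new!r}"
                )
            if old is None:
                setattr(existing, c.name, new)
    return list(deduped.values())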
Inserting the parsed metadata into the database then requires roughly two steps:
- Determine an insertion order. We can't just dump the list of model instances we extract from a spreadsheet into the database haphazardly - what if we try to insert the sample record in the example above before its participant exists? Rather, we need to find an insertion order that guarantees that if a record is about to be inserted, all other records from the spreadsheet that it might have foreign key dependencies on have already been inserted.

  We can use the foreign key relationships between sqlalchemy models to solve for an insertion order with this guarantee. By treating the model classes in a spreadsheet config as a directed acyclic graph whose nodes are model classes and whose edges are foreign-key relationships, we can topologically sort the nodes in this graph to obtain a valid model instance insertion order (see the sketch below). (Alternatively, if we find we need to be less clever and more explicit, we can just require that template configs specify a model class insertion order.)

  Note: this insertion order only guarantees that if a record in a metadata spreadsheet depends on another record in that same spreadsheet, the latter record will be inserted before the former. It doesn't guarantee that trying to insert records from a metadata spreadsheet won't lead to foreign key errors, since those records can reference metadata outside of the scope of the spreadsheet (e.g., trials' protocol ids).
- Insert the db records. Following the insertion order determined in step 1, try to insert the records one-by-one, catching any exceptions encountered, until all record inserts have been attempted. If no exceptions were encountered, the metadata was successfully ingested. If any exceptions were encountered, roll back all successful inserts and return the list of exceptions as validation errors. Open question: what PostgreSQL locking is required to make this validate-by-inserting process safe?
Pasha's comment:

- My understanding is that it should rather be an `upsert`, because some objects/records might already be present - e.g., a participant for a new sample might already have been created by some previous upload.
- `validate-by-inserting` in one transaction should be "safe" in PG, as in: atomic. Or were you talking about some other safety?
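Going back to step 1 above, here's a minimal sketch of the insertion-order computation, assuming each spreadsheet config can report the set of model classes it touches and that (for example) `Sample` has a foreign key to `Participant`. Python's standard-library `graphlib` handles the topological sort, so we wouldn't need to hand-roll graph code:

from graphlib import TopologicalSorter  # Python 3.9+
from typing import Iterable, List

def insertion_order(models: Iterable[type]) -> List[type]:
    """Topologically sort model classes so foreign-key targets come first."""
    models = set(models)
    table_to_model = {m.__table__: m for m in models}
    graph = {m: set() for m in models}
    for model in models:
        for fk in model.__table__.foreign_keys:
            target = table_to_model.get(fk.column.table)
            # Only consider dependencies on models appearing in this spreadsheet config
            if target is not None and target is not model:
                graph[model].add(target)
    return list(TopologicalSorter(graph).static_order())

# e.g. insertion_order({Shipment, Participant, Sample}) yields an order in which
# Participant comes before Sample, since samples reference participants.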
SQLAlchemy `Column`s support a `doc` attribute for storing descriptions, and we should be able to use these to document fields on different entities in the same manner as we use JSON schema `description`s:
class MyModel(CommonColumns):
    __tablename__ = "my_models"

    column_1 = Column(String, nullable=False, doc="A nice, important property on this data type")
    column_2 = Column(Integer, doc="A less crucial property on this data type")
It's possible some auto-docs library exists for SQLAlchemy, but I haven't managed to find one yet. Luckily, automatically iterating through all models in a schema and all fields on each model is straightforward, so generating documentation should be too:
# Assuming all model definitions in the schema inherit from CommonColumns...
all_models = [
    c for c in CommonColumns._decl_class_registry.values()
    if isinstance(c, type) and issubclass(c, CommonColumns)
]

for model in all_models:
    # Do something with each Column definition
    for column in model.__table__.columns:
        # access the column type
        column.type
        # access the doc field
        column.doc

    # Do something with links to other models
    for link in model.__table__.foreign_keys:
        # access the referenced "table.column" string
        link.target_fullname
Adapting the above code into a jinja template (or what-have-you) should be relatively easy.
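For instance, a small jinja2 sketch (the markdown layout here is just illustrative) could render per-model docs from the `all_models` list above:

from jinja2 import Template

DOCS_TEMPLATE = Template(
    """
{% for model in models %}
## {{ model.__tablename__ }}

| Column | Type | Description |
| --- | --- | --- |
{% for column in model.__table__.columns -%}
| {{ column.name }} | {{ column.type }} | {{ column.doc or "" }} |
{% endfor %}
{% endfor %}
"""
)

with open("data_model.md", "w") as f:
    f.write(DOCS_TEMPLATE.render(models=all_models))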
As I see it, a migration from our existing metadata system to this new one must:
- Be non-breaking from the end user's perspective. Nothing about users' experience of existing CIDC features should change as the migration is happening.
- Not block bug fixes or development of new features on top of the existing system.
- Allow us to build confidence in the new system using real data before phasing out the existing system.
The most straightforward way I've thought of to fulfill all these requirements is to build the new system into the database (then API, UI, CLI etc.) right alongside the existing system. The new relational metadata model can be introduced into the database schema in an entirely additive manner, using new tables to avoid modifying existing tables.
This means that, when a new manifest or bundle of assay data is uploaded, both the existing system and the new system can process it, inserting both JSON and relational copies of the metadata into the database respectively. By inspecting the relational metadata copy and comparing it to the metadata generated by the existing system, we can gain confidence in its integrity.
Once we're sufficiently confident that the relational data model is in a deployment-ready state, we should do these things in this order:
- Copy (without deleting!) all older, purely JSON-based metadata to the new relational model, probably via some giant ETL script. This might be a pain, but we only need to run it (correctly) once. Maybe whatever workflow arises for creating the relational-ized versions of JSON schema entities should include writing an ETL script for taking JSON instances and converting them to sqlalchemy model instances (a rough sketch of such a conversion follows this list).
- One-at-a-time, update API endpoints that rely on metadata to use metadata read from the relational system, not the JSON system. Ensure the new endpoints work as expected.
- Once all user-facing metadata is derived from the relational system, phase out the existing metadata system by deleting JSON-style metadata and database tables that are no longer used. Make a by-hand backup of the database and save it in GCS before doing so.
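As a rough sketch of what the ETL conversion in step 1 might look like for participant/sample data, assuming the existing JSON blobs nest `participants` and `samples` the way our current schema does, and that `Participant`/`Sample` models exist as in the spreadsheet examples above:

def relationalize_trial(trial_metadata: dict) -> list:
    """Convert one trial's JSON metadata blob into sqlalchemy model instances."""
    instances = []
    for partic in trial_metadata.get("participants", []):
        instances.append(
            Participant(
                cimac_participant_id=partic["cimac_participant_id"],
                cohort_name=partic.get("cohort_name"),
            )
        )
        for sample in partic.get("samples", []):
            instances.append(
                Sample(
                    cimac_id=sample["cimac_id"],
                    cimac_participant_id=partic["cimac_participant_id"],
                    collection_event_name=sample.get("collection_event_name"),
                )
            )
    return instances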
Also - I think we should do all of this development work in the `cidc-api-gae` repository, with the goal of archiving the `cidc-schemas` repository once the migration is complete. So far, the logistical cost of having data modeling and validation code in a separate location from our server code has far outweighed any organizational benefit, in my opinion.
- Should the assay/analysis upload process mimic the current multi-step flow of the existing upload process, or is there a way to simplify it under the relational model?
- How can we define "confidence" in practice when it comes to evaluating the production-readiness of a relational data model?
- Are there better alternatives to the solutions proposed above?
- Have any key system requirements been glaringly omitted? For example, updates and modifications to already-ingested metadata: how do we handle them?
- Does this migration definitely reduce the net complexity of the CIDC metadata management system?
- What sort of developer documentation is missing from the existing system that we want to ensure we have for the new system?
I think this process is also where we're seeing trouble with our existing setup. Manual editing is inappropriate, as we've seen it can lead to inconsistencies.
Our system (to the end user) transacts at a "manifest" level, i.e. a spreadsheet (shipping, assay, analysis) causes data to be added to the system. I would strongly prefer a way to recall a "spreadsheet," which would remove all derivative data. Then we could re-add a modified spreadsheet to make corrections.
I.e., when bad data is inserted, we correct it by DELETING the whole batch and re-inserting by loading a corrected spreadsheet.
Is this possible?
Pasha's comment: DELETING or "redo the spreadsheet wholly" - SQL has (and PG supports) `on delete cascade`, so when an upload_job/manifest gets deleted / re-uploaded (with the same id), the code could just delete that manifest, and all db records referring to it would be deleted in a cascading way automatically.
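A minimal sketch of declaring that cascade on a hypothetical child table (`Manifest`, `manifests`, and `ManifestRecord` are illustrative names, not existing models):

class ManifestRecord(CommonColumns):
    __tablename__ = "manifest_records"

    # If the owning manifest row is deleted, PostgreSQL deletes this row too.
    manifest_id = Column(
        Integer,
        ForeignKey("manifests.id", ondelete="CASCADE"),
        nullable=False,
    )
    ...

# Re-ingesting a corrected spreadsheet could then be:
#   session.query(Manifest).filter_by(id=bad_manifest_id).delete()
#   session.commit()
#   ...re-run the spreadsheet ingestion described above...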