Migrate all of the SQLite building code to a different repo. (#3)

This allows the SQLite-building code to be used in other (non-BioC) contexts. Now this repository is solely dedicated to building metadata indices for Bioconductor, along with hosting of the schemas for BioC metadata. We also take this opportunity to merge the scRNAseq schema into the BioC schema under the 'takane' application, given that none of the metadata fields were specific to the scRNAseq package anyway. This simplifies the setup for scRNAseq (and future packages) by consolidating everything into one SQLite file.
ArtifactDB · Feb 21, 2024 · ec66569
1 parent 5a61eca
commit ec66569
Showing 58 changed files with 326 additions and 1,789 deletions.
diff --git a/.github/workflows/fresh-build.yaml b/.github/workflows/fresh-build.yaml
@@ -21,16 +21,14 @@ jobs:
         key: modules-${{ hashFiles('**/package.json') }}
 
     - name: Install packages
-      run: npm i --include-dev
+      run: npm i
 
     - name: Perform a fresh build
-      run: |
-        mkdir output
-        node scripts/fresh.js -o output
+      run: ./fresh.sh 
 
     - name: Publishing files
       uses: softprops/action-gh-release@v1
       with:
         name: Latest build
         tag_name: latest
-        files: output/**
+        files: build/**
diff --git a/.github/workflows/run-tests.yaml b/.github/workflows/run-tests.yaml
@@ -14,10 +14,10 @@ jobs:
   test:
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
 
     - name: Set up Node
-      uses: actions/setup-node@v2
+      uses: actions/setup-node@v4
       with:
         node-version: 20
 
@@ -30,5 +30,16 @@ jobs:
     - name: Install packages
       run: npm i --include-dev
 
+    - name: Create the merged schema
+      run: ./merge.sh
+
     - name: Run tests
       run: npm run test
+
+    - name: Publish schemas
+      if: github.ref == 'refs/heads/master'
+      uses: JamesIves/github-pages-deploy-action@v4
+      with:
+        clean: false
+        branch: gh-pages
+        folder: merged
diff --git a/.github/workflows/update-indices.yaml b/.github/workflows/update-indices.yaml
@@ -3,7 +3,7 @@ on:
     - cron: '0 0 * * *'
   workflow_dispatch:
 
-name: Update indices
+name: Update builds
 
 jobs:
   build:
@@ -23,21 +23,21 @@ jobs:
         key: modules-${{ hashFiles('**/package.json') }}
 
     - name: Install packages
-      run: npm i --include-dev
+      run: npm i
 
     - name: Download SQLite files
       uses: robinraju/[email protected]
       with:
         latest: true
         fileName: "*"
-        out-file-path: "indices"
+        out-file-path: "build"
 
-    - name: Update indices
+    - name: Update builds
       id: updator
       run: |
-        OLDSTAMP=$(cat indices/modified)
-        node scripts/update.js -d indices
-        NEWSTAMP=$(cat indices/modified)
+        OLDSTAMP=$(cat build/modified)
+        ./update.sh
+        NEWSTAMP=$(cat build/modified)
         CHANGED=$(($NEWSTAMP != $OLDSTAMP))
         echo "modified=$CHANGED" >> "$GITHUB_OUTPUT"
 
@@ -47,4 +47,4 @@ jobs:
       with:
         name: Latest build
         tag_name: latest
-        files: indices/**
+        files: build/**
diff --git a/.gitignore b/.gitignore
@@ -4,4 +4,8 @@ output/
 TEST_*
 *.swp
 *.sqlite3
-
+.config/
+build/
+stringy
+merged/
+.DS_Store
diff --git a/README.md b/README.md
@@ -1,190 +1,51 @@
-# SQLite databases for gypsum metadata
+# Bioconductor metadata index
 
 [![RunTests](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/run-tests.yaml/badge.svg)](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/run-tests.yaml)
 [![Updates](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/update-indices.yaml/badge.svg)](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/update-indices.yaml)
 
-## Overview 
+## Overview
 
-This repository contains schemas and code to generate SQLite files from metadata in the [**gypsum** backend](https://github.com/ArtifactDB/gypsum-worker).
-The SQLite files can then be used in client-side searches to find interesting objects for further analysis.
-We construct new indices by fetching metadata and converting them into records on one or more tables based on a JSON schema specification;
-existing indices are updated by routinely scanning the [logs](https://github.com/ArtifactDB/gypsum-worker#parsing-logs) for new or deleted content.
+This repository compiles metadata documents from the [**gypsum** backend](https://github.com/ArtifactDB/gypsum-worker) into a SQLite index,
+using scripts from the [gypsum-metadata-index](https://github.com/ArtifactDB/gypsum-metadata-index) repository.
+Applications can download these indices to extract the metadata and/or perform a range of queries including full-text searches.
+See the [relevant documentation](https://github.com/ArtifactDB/gypsum-metadata-index/blob/master/README.md) for a description of the tables within each SQLite file.
 
-This document is intended for system administrators or the occasional developer who wants to create a new search index for their package(s).
+This document is intended for system administrators or the occasional developer who wants to create a new index for their package(s).
 Users should not have to interact with these indices directly, as this should be mediated by client packages in relevant frameworks like R/Bioconductor.
 For example, the [gypsum R client](https://github.com/ArtifactDB/gypsum-R) provides functions for obtaining the schemas and indices,
 which are then called by more user-facing packages like the [scRNAseq](https://github.com/LTLA/scRNAseq) R package.
 
-## From JSON schemas to SQLite
+## Metadata and schemas
 
-We expect that the metadata for any **gypsum** object can be represented as JSON, which allows for fairly complex metadata fields.
-The expectations for the metadata are described by [JSON schemas](https://json-schema.org) - readers can find some examples in the [schemas/](schemas/) subdirectory.
-The scripts in this repository use the JSON schemas to automtically initialize SQLite tables and to convert JSON metadata into table rows.
-Each JSON schema is used to generate its a separate SQLite file that contains one or more tables depending on the schema's complexity.
+To be eligible for inclusion in this index, uploads should include one or more JSON-formatted metadata documents that are assigned to each project-asset-version combination.
+The name of the metadata document determines the database into which it is inserted, as well as the [JSON schema](https://json-schema.org) used for metadata validation:
 
-### File contents
+- Files named `_bioconductor.json` should validate against the [Bioconductor metadata schema](schemas/bioconductor/v1.json).
+  These are compiled into the `bioconductor.sqlite3` file.
 
-Each SQLite file contains a `core` table.
-Each row corresponds to an indexed object in **gypsum**.
-The table will have at least the following fields, all of which have the `TEXT` type:
+Package developers may open a [pull request](https://github.com/ArtifactDB/gypsum-to-sqlite) on this repository to add application-specific metadata.
+This can involve either:
 
-- `_key`: the primary key, created by combining `_project`, `_version`, `_asset`, `_path`.
-- `_project`: the name of the project.
-- `_asset`: the name of the asset.
-- `_version`: the name of the version.
-- `_path`: the path to the object inside this versioned asset.
-  This is only supplied if the object is located inside a subdirectory of the asset, otherwise it is set to `null`.
-- `_object`: the object type.
+- Adding an application-specific subschema to the `schemas/bioconductor/MY_APP_HERE` subdirectory.
+  Any application-specific metadata will be automatically incorporated into the existing `bioconductor.sqlite3`.
+- Adding an entirely new schema in a `schemas/MY_APP_HERE` subdirectory.
+  This is more flexible and allows for metadata that is not compatible with the Bioconductor schema,
+  but requires additional updates to [`fresh.sh`](fresh.sh) and [`update.sh`](update.sh) to build and update the new database.
 
-The SQLite file may contain a `free_text` virtual FTS5 table where each row corresponds to an indexed **gypsum** object.
-This contains at least the `_key` column (same as above), along with one or more additional columns corresponding to free-text metadata fields.
-If no metadata fields are marked as free-text, the `free_text` table will not be present in the file.
-
-The SQLite file may contain any number of `multi_<FIELD>` tables, where `<FIELD>` is a metadata field that is a JSON array.
-This holds one-to-many mappings between a **gypsum** object and the array items, so an indexed object may have zero, one or many rows in this table.
-The table contains at least the `_key` column (same as above), with one or more additional columns depending on the type of the array items.
-If the items are booleans, integers, strings or numbers, the table contains exactly one `item` column of the corresponding type;
-if the items are objects, the table contains columns corresponding to the properties of the object.
-
-For `core` and `multi_<FIELD>` tables, we generate an index for each column.
-Clients can perform most complex queries efficiently with the relevant inner joins.
-
-### Table generation rules
-
-Each JSON schema is used to generate a SQLite table according to some simple rules:
-
-- The schema must be a top-level `"type": "object"`.
-- An `integer` property is converted to an `INTEGER` field on the `core` table.
-- A `boolean` property is converted to an `INTEGER` field on the `core` table.
-- A `number` property is converted to a `REAL` field on the `core` table. 
-- A `string` property is usually converted to a `TEXT` field.
-  However, if its `_attributes` contain `"free_text"`, it will instead be converted into a column of `free_text`.
-- An `array` property is converted to a separate `multi_<FIELD>` table, where `<FIELD>` is the name of the property.
-  - If the items are booleans, integers, strings or numbers, the type of the `item` column is determined as described above.
-    However, note that any `"free_text"` in the `_attributes` is ignored.
-  - If the items are objects, one column is generated per property in the object, following the rules described above.
-    Properties should be booleans, integers, strings or numbers; any `"free_text"` in the `_attributes` is again ignored.
-- We do not support `object` properties.
-  Schema authors should flatten their objects for table generation.
-- Properties should not start with an underscore, as these are reserved for special use.
-
-### Adding new schemas
-
-Package developers with objects stored in the **gypsum** backend may wish to create custom SQLite files to enable package-specific searches.
-This can be easily done by adding or modifying files in a few locations:
-
-- Add a new JSON schema to [`schemas/`](schemas/).
-  This file's name should have the `.json` file extension and should not contain any whitespace.
-  - It is also recommended to add some tests to [`tests/schemas`](tests/schemas) to ensure that the schema behaves as expected.
-- Modify the [`scripts/indexVersion.js`](scripts/indexVersion.js) file to gather metadata from the **gypsum** backend into a JSON object.
-  This should only involve fetching and parsing some small files from the bucket;
-  package developers should consider computing any complex metadata before or during the original upload to the bucket.
-
-Once a new schema is added, the code in this repository will automatically create and update a SQLite file corresponding to the new schema.
-
-## SQLite file generation and editing
-
-The [`scripts/`](scripts/) subdirectory contains several scripts for generating and updating the SQLite files.
-These expect to have a modestly recent version of Node.js (tested on 16.19.1) and required dependencies can be installed with the usual `npm install` process.
-
-### Creating new files
-
-The [`fresh.js`](scripts/fresh.js) script will generate one SQLite file corresponding to each JSON schema.
-This is done by listing all projects and assets in the **gypsum** backend,
-identifying the latest version of each asset,
-extracting metadata for objects in the latest version,
-and generating a SQLite file from the extracted metadata. 
-
-```shell
-# Older versions of Node.js may need --experimental-fetch.
-node scripts/fresh.js -s SCHEMAS -o OUTPUTS
-```
-
-The script has the following options:
-
-- `-s`, `--schemas`: the directory containing the JSON schema files.
-  Defaults to `./schemas`.
-- `-o`, `--outputs`: the directory in which to store the output SQLite files.
-  Each file will have the same prefix as its corresponding JSON schema.
-  Defaults to `./outputs`.
-- `-x`, `--only`: name of a project, indicating that indexing should only be performed for that project.
-  If not supplied, indexing is performed for all projects.
-  Useful for debugging specific projects.
-- `-a`, `--after`: any string such that indexing is only performed for projects with names that sort after that string.
-  If not supplied, indexing is performed for all projects.
-  Useful for debugging a set of projects.
-- `-w`, `--overwrite`: boolean that specifies whether to overwrite existing SQLite files in the output directory.
-  This can be turned off and combined with `--after` to iteratively construct the full index.
-  Defaults to `true`.
-
-In addition to creating new SQLite files, `fresh.js` will also add a `modified` file containing a Unix timestamp.
-This will be used by `update.js` (see below) to determine which logs to consider during updates.
-
-### Updating files from logs
-
-The [`update.js`](scripts/update.js) script will modify each SQLite file based on recent changes in the **gypsum** backend.
-It does so by scanning the logs in the backend, filtering for those generated after the `modified` timestamp.
-Each log may be used to perform an update to the SQLite file based on its action type (see [here](https://github.com/ArtifactDB/gypsum-worker#parsing-logs)),
-either by inserting rows corresponding to new objects or (more rarely) by deleting rows corresponding to deleted assets, versions or projects.
-The `add-version` and `delete-version` actions will only have an effect if the affected version is the latest;
-for `delete-version`, the script will insert metadata for objects in the currently-latest version into the SQLite file.
-
-```shell
-# Older versions of Node.js may need --experimental-fetch.
-node scripts/update.js -s SCHEMAS -d DIR
-```
-
-The script has the following options:
-
-- `-s`, `--schemas`: the directory containing the JSON schema files.
-  Defaults to `./schemas`.
-- `-d`, `--dir`: the directory containing the SQLite files to be modified.
-  Each file will have the same prefix as its corresponding JSON schema.
-  Defaults to the working directory.
-
-In addition to modifying the SQLite files, `update.js` will update the `modified` file to the timestamp of the last log.
-
-### Manual updates
-
-The [`manual.js`](scripts/manual.js) script will modify each SQLite file based on its arguments.
-It uses the same logic as `update.js` and is intended for testing/debugging the update code.
-Any updates to SQLite files in production should still be performed by `update.js`.
-
-```shell
-# Older versions of Node.js may need --experimental-fetch.
-node scripts/manual.js -t add-version -p PROJECT -a ASSET -v VERSION -l true
-node scripts/manual.js -t delete-version -p PROJECT -a ASSET -v VERSION -l true
-node scripts/manual.js -t delete-asset -p PROJECT -a ASSET
-node scripts/manual.js -t delete-project -p PROJECT -a ASSET
-```
-
-The script has the following options:
-
-- `-s`, `--schemas`: the directory containing the JSON schema files.
-  Defaults to `./schemas`.
-- `-d`, `--dir`: the directory containing the SQLite files to be modified.
-  Each file will have the same prefix as its corresponding JSON schema.
-  Defaults to the working directory.
-- `-t`, `--type`: the type of action to perform.
-  This is a required argument and should be one of `add-version`, `delete-version`, `delete-asset` or `delete-project`.
-- `-p`, `--project`: the name of the project.
-  This is a required argument.
-- `-a`, `--asset`: the name of the project.
-  This is a required argument for all `type` except for `delete-project`.
-- `-v`, `--version`: the name of the project.
-  This is a required argument for `add-version` and `delete-version`.
-- `-l`, `--latest`: boolean indicating whether the specified version is the latest of its asset.
-  This is a required argument for `add-version` and `delete-version`.
+In general, we recommend breaking down large JSON schemas into smaller subschemas for easier development.
+We then use [merge-json-schemas](https://github.com/ArtifactDB/merge-json-schemas) to merge subschemas into the full Bioconductor schema, which is then published on GitHub Pages.
+Developers adding entirely new schemas should update [`merge.sh`](merge.sh) to ensure these schemas are merged and published.
 
 ## Publishing SQLite files
 
 The various GitHub Actions in this repository will publish the SQLite files as release assets.
 
-- The [`fresh-build` Action](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/fresh-build.yaml) will run the `fresh.js` script to create and publish a fresh build.
+- The [`fresh-build` Action](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/fresh-build.yaml) will run the `fresh.sh` script to create and publish a fresh build.
   This is manually triggered and can be used on the rare occasions where the existing release is irrecoverably out of sync with the **gypsum** bucket.
-- The [`update-indices` Action](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/update-indices.yaml) runs the `update.js` script daily to match changes to the bucket contents.
+- The [`update-indices` Action](https://github.com/ArtifactDB/gypsum-to-sqlite/actions/workflows/update-indices.yaml) runs the `update.sh` script daily to match changes to the bucket contents.
   This will only publish a new release if any changes were performed.
   - Note that cron jobs in GitHub Actions require some semi-routine nudging to indicate that the repository is still active, otherwise the workflow is disabled.
 
 The latest versions of the SQLite files are available [here](https://github.com/ArtifactDB/gypsum-to-sqlite/releases/tag/latest).
-Clients can check the `modified` file to determine when the files were last updated (and whether local caches need to be refreshed).
+The `modified` file contains the Unix timestamp for the last update of the files;
+clients can check this file to determine whether local caches need to be refreshed.
diff --git a/fresh.sh b/fresh.sh
@@ -0,0 +1,11 @@
+#!/bin/sh
+
+set -e
+set -u
+
+rm -rf build
+mkdir build
+NODE_OPTIONS='--experimental-fetch' npx --package=gypsum-metadata-index fresh \
+    --class _bioconductor.json,bioconductor.sqlite3 \
+    --gypsum https://gypsum.artifactdb.com \
+    --dir build
diff --git a/merge.sh b/merge.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+set -e
+set -u
+
+rm -rf merged
+mkdir merged
+
+mkdir merged/bioconductor
+npx --package=merge-json-schemas merge schemas/bioconductor/v1.json > merged/bioconductor/v1.json
diff --git a/package.json b/package.json
@@ -3,19 +3,19 @@
   "name": "gypsum-to-sqlite",
   "version": "1.0.0",
   "type": "module",
-  "description": "Index for Bioconductor-related metadata in gypsum backends",
+  "description": "Index Bioconductor-related metadata from the gypsum backend",
   "author": "LTLA <[email protected]>",
   "main": "src/index.js",
   "license": "MIT",
   "scripts": {
     "test": "NODE_OPTIONS='--experimental-vm-modules --experimental-fetch' npx jest --testTimeout=100000"
   },
   "devDependencies": {
+    "ajv": "^8.11.0",
     "jest": "^29.3.1"
   },
   "dependencies": {
-    "@aws-sdk/client-s3": "^3.470.0",
-    "ajv": "^8.11.0",
-    "better-sqlite3": "^9.2.2"
+    "gypsum-metadata-index": "github:ArtifactDB/gypsum-metadata-index",
+    "merge-json-schemas": "github:ArtifactDB/merge-json-schemas"
   }
 }
diff --git a/schemas/bioconductor/takane/data_frame/v1.json b/schemas/bioconductor/takane/data_frame/v1.json
@@ -0,0 +1,30 @@
+{
+    "properties": {
+        "data_frame": {
+            "type": "object",
+
+            "title": "Data frame",
+            "description": "Information about a data frame object.",
+
+            "properties": {
+                "rows": {
+                    "type": "integer",
+                    "description": "Number of rows.",
+                    "minimum": 0
+                },
+
+                "column_names": {
+                    "type": "array",
+                    "description": "Names of the columns.",
+                    "items": {
+                        "type": "string"
+                    }
+                }
+            },
+
+            "required": [ "rows", "column_names" ],
+
+            "additionalProperties": false
+        }
+    }
+}