
Documentation update #1321

Merged
merged 12 commits into from
Apr 25, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
- Accommodate dir schema minor versions
- Fix ORCID URL checking
- Add MUSIC next-gen directory schema
- Update documentation

## v0.0.18

Expand Down
55 changes: 27 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
# ingest-validation-tools

HuBMAP data upload guidelines, and tools which check that uploads adhere to those guidelines.
HuBMAP data upload guidelines and instructions for checking that uploads adhere to those guidelines.
Assay documentation is on [Github Pages](https://hubmapconsortium.github.io/ingest-validation-tools/).

HuBMAP has three distinct metadata processes:

- **Donor** metadata is handled by Jonathan Silverstein on an adhoc basis: He works with whatever format the TMC can provide, and aligns it with controlled vocabularies.
- **Sample** metadata is handled by Brendan Honick and Bill Shirey. [The standard operating procedure is outlined here.](https://docs.google.com/document/d/1K-PvBaduhrN-aU-vzWd9gZqeGvhGF3geTwRR0ww74Jo/edit)
- **Dataset** uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. [Details below.](#upload-process-and-upload-directory-structure)
- **Sample** metadata is ingested by the [HuBMAP Data Ingest Portal](https://ingest.hubmapconsortium.org/)--see "Upload Sample Metadata" at the top of the page.
- **Dataset** uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. [Details below.](#for-data-submitters-and-curators)

## For assay type working groups:

Expand All @@ -26,7 +26,7 @@ When all the parts are finalized,

### Stability

Once approved, both the list of metadata fields (metadata schema)
Once approved, both the CEDAR Metadata Template (metadata schema)
and the list of files (directory schema) are fixed in a particular version.
The metadata for a particular assay type needs to be consistent for all datasets,
as does the set of files which comprise a dataset.
Expand All @@ -42,6 +42,12 @@ contact Phil Blood (@pdblood).

## For data submitters and curators:

### Validate TSVs

To validate your metadata TSV files, use the [HuBMAP Metadata Spreadsheet Validator](https://metadatavalidator.metadatacenter.org/). This tool is a web-based application that will categorize any errors in your spreadsheet and provide help fixing those errors. More detailed instructions about using the tool can be found in the [Spreadsheet Validator Documentation](https://metadatacenter.github.io/spreadsheet-validator-docs/).

### Validate Directory Structure

Check out the repo and install dependencies:

```
Expand All @@ -55,73 +61,66 @@ src/validate_upload.py --help

You should see [the documentation for `validate_upload.py`](script-docs/README-validate_upload.py.md).

**Note**: you need to have _git_ installed in your system.

Now run it against one of the included examples, giving the path to an upload directory:

```
src/validate_upload.py \
--local_directory examples/dataset-examples/bad-tsv-formats/upload \
--no_url_checks \
--output as_text
```
**Note**: URL checking is not supported via `validate_upload.py` at this time, and is disabled with the use of the `--no_url_checks` flag. Please ensure that any fields containing a HuBMAP ID (such as `parent-sample_id`) or an ORCID (`orcid`) are accurate.

You should now see [this (extensive) error message](examples/dataset-examples/bad-tsv-formats/README.md).
This example TSV has been constructed with a mistake in every column, just to demonstrate the available checks. More often, hopefully, your experience will look like this:

```
src/validate_upload.py \
--local_directory examples/dataset-examples/good-codex-akoya/upload
--local_directory examples/dataset-examples/good-codex-akoya-metadata-v1/upload \
--no_url_checks
```

```
No errors!
```

Documentation and metadata TSV templates for each assay type are [here](https://hubmapconsortium.github.io/ingest-validation-tools/).
Additional help for certain common error messages is available [here](README-validate-upload-help.md).

### Validating single TSVs:

If you don't have an entire upload directory at hand, you can validate individual
metadata, antibodies, contributors, or sample TSVs:

```
src/validate_tsv.py \
--schema metadata \
--path examples/dataset-examples/good-scatacseq-v1/upload/metadata.tsv
```

```
No errors!
```
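Since metadata TSVs are plain tab-separated text, you can also inspect one programmatically before running the validator. A minimal sketch using only the Python standard library; the field names and values below are made up for illustration, not a real HuBMAP schema:

```python
import csv
import io

# A tiny in-memory stand-in for a metadata TSV; real files come from
# an upload directory (e.g. metadata.tsv). Field names are illustrative.
tsv_text = "donor_id\tassay_type\nHBM123\tCODEX\n"

# DictReader yields one dict per data row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0]["assay_type"])  # → CODEX
```

This is only a quick sanity check; the validator itself applies the full per-assay schema.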

### Running plugin tests:

Additional plugin tests can also be run.
These tests confirm that the files themselves are valid, not just that the directory structures are correct.
They live in a separate repo and have their own dependencies.

Starting from ingest-validation-tools...
```
# Starting from ingest-validation-tools...
cd ..
git clone https://github.com/hubmapconsortium/ingest-validation-tests.git
cd ingest-validation-tests
pip install -r requirements.txt
```

# Back to ingest-validation-tools...
Back to ingest-validation-tools...
```
cd ../ingest-validation-tools
```

For a failing example, see [README.md](examples/plugin-tests/expected-failure/README.md):
```
src/validate_upload.py \
--local_directory examples/dataset-examples/good-codex-akoya/upload \
--local_directory examples/plugin-tests/expected-failure/upload \
--run_plugins \
--no_url_checks \
--plugin_directory ../ingest-validation-tests/src/ingest_validation_tests/
```

## For developers and contributors:

A good example of programmatic usage is `validate-upload.py`; in a nutshell:
An example of the core error-reporting functionality underlying `validate-upload.py`:

```python
upload = Upload(directory_path=path)
report = ErrorReport(upload.get_errors())
report = ErrorReport(errors=upload.get_errors(), info=upload.get_info())
print(report.as_text())
```
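The collect-then-report pattern above can be illustrated with a self-contained sketch. The classes below are hypothetical stand-ins written for this example only, not the repo's actual `Upload`/`ErrorReport` implementations:

```python
# Illustrative stand-ins (hypothetical): gather errors into a structure,
# then render them, mirroring the snippet above.
class Upload:
    def __init__(self, directory_path):
        self.directory_path = directory_path

    def get_errors(self):
        # A real implementation would walk the upload and validate TSVs;
        # here we return a canned error for demonstration.
        return {"metadata.tsv": ["column 'assay_type': value missing"]}

    def get_info(self):
        return {"directory": str(self.directory_path)}


class ErrorReport:
    def __init__(self, errors, info):
        self.errors = errors
        self.info = info

    def as_text(self):
        lines = [f"Upload: {self.info['directory']}"]
        for source, messages in self.errors.items():
            for message in messages:
                lines.append(f"{source}: {message}")
        return "\n".join(lines)


upload = Upload("examples/upload")
report = ErrorReport(errors=upload.get_errors(), info=upload.get_info())
print(report.as_text())
```

The real classes do far more (schema lookup, plugin dispatch, URL checks), but the calling pattern is the same.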

Expand Down
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
82 changes: 82 additions & 0 deletions _deprecated/_tests/test-generate-docs_deprecated.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
#!/usr/bin/env bash
set -o errexit

die() { set +v; echo "$*" 1>&2 ; sleep 1; exit 1; }

# Test field-descriptions.yaml and field-types.yaml:

ATTR_LIST='description type entity assay schema'
RERUNS=''
for ATTR in $ATTR_LIST; do
PLURAL="${ATTR}s"
[ "$PLURAL" == 'entitys' ] && PLURAL='entities'
REAL_DEST="docs/field-${PLURAL}.yaml"
TEST_DEST="docs-test/field-${PLURAL}.yaml"
echo "Checking $REAL_DEST"

REAL_CMD="src/generate_field_yaml.py --attr $ATTR > $REAL_DEST;"
TEST_CMD="src/generate_field_yaml.py --attr $ATTR > $TEST_DEST"

mkdir docs-test || echo "Already exists"
eval $TEST_CMD || die "Command failed: $TEST_CMD"
diff -r $REAL_DEST $TEST_DEST || RERUNS="$RERUNS $REAL_CMD"
rm -rf docs-test
done
[ -z "$RERUNS" ] || die "Update YAMLs: $RERUNS"

# Test Excel summary:
# This relies on the YAML created above.

FILE="field-schemas.xlsx"
echo "Checking $FILE"

mkdir docs-test
REAL_DEST="docs/$FILE"
TEST_DEST="docs-test/$FILE"
REAL_CMD="src/generate_grid.py $REAL_DEST"
TEST_CMD="src/generate_grid.py $TEST_DEST"
eval $TEST_CMD || die "Command failed: $TEST_CMD"
diff $REAL_DEST $TEST_DEST || die "Update needed: $REAL_CMD"

# Test docs:

for TYPE in $(ls -d docs/*); do
# Skip directories that are unpopulated:
TYPE=`basename $TYPE`
LOOKFOR_CURRENT_ASSAY="docs/$TYPE/current/$TYPE-metadata.tsv"
LOOKFOR_CURRENT_OTHER="docs/$TYPE/current/$TYPE.tsv"
LOOKFOR_DEPRECATED_ASSAY="docs/$TYPE/deprecated/$TYPE-metadata.tsv"
LOOKFOR_DEPRECATED_OTHER="docs/$TYPE/deprecated/$TYPE.tsv"
if [ ! -e $LOOKFOR_CURRENT_ASSAY ] && [ ! -e $LOOKFOR_CURRENT_OTHER ] && [ ! -e $LOOKFOR_DEPRECATED_ASSAY ] && [ ! -e $LOOKFOR_DEPRECATED_OTHER ]; then
echo "Skipping $TYPE. To add: 'touch $LOOKFOR_CURRENT_ASSAY' for assays, or 'touch $LOOKFOR_CURRENT_OTHER' for other."
continue
fi

echo "Testing $TYPE generation..."

REAL_DEST="docs/$TYPE"
TEST_DEST="docs-test/$TYPE"

REAL_CMD="src/generate_docs.py $TYPE $REAL_DEST"
TEST_CMD="src/generate_docs.py $TYPE $TEST_DEST"

mkdir -p $TEST_DEST || echo "$TEST_DEST already exists"
echo "Running: $TEST_CMD"
eval $TEST_CMD

if [ -e $REAL_DEST/current ] && [ -e $TEST_DEST/current ]; then
diff -r $REAL_DEST/current $TEST_DEST/current --exclude="*.tsv" --exclude="*.xlsx" \
|| die "Update needed: $REAL_CMD
Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done'
fi

if [ -e $REAL_DEST/deprecated ] && [ -e $TEST_DEST/deprecated ]; then
diff -r $REAL_DEST/deprecated $TEST_DEST/deprecated --exclude="*.tsv" --exclude="*.xlsx" \
|| die "Update needed: $REAL_CMD
Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done'
fi

rm -rf $TEST_DEST
((++GENERATE_COUNT))
done
[[ $GENERATE_COUNT -gt 0 ]] || die "No files generated"
17 changes: 9 additions & 8 deletions docs/_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -29,11 +29,12 @@ categories-order:
# Exclude from processing.
# The following items will not be processed, by default. Create a custom list
# to override the default setting.
# exclude:
# - Gemfile
# - Gemfile.lock
# - node_modules
# - vendor/bundle/
# - vendor/cache/
# - vendor/gems/
# - vendor/ruby/
exclude:
- Gemfile
- Gemfile.lock
- node_modules
- vendor/bundle/
- vendor/cache/
- vendor/gems/
- vendor/ruby/
- _deprecated/
34 changes: 16 additions & 18 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# `examples/`

Validation is run offline by the tests in the `tests` directory. Online validation can be run manually; see [tests-manual/README.md](tests-manual/README.md).

This directory contains example inputs and outputs of different components of this system.
Adding more examples is one way to test new schemas or new code...
but it's possible to have too much of a good thing:
Expand All @@ -22,36 +24,32 @@ an `input.tsv` to validate, and an `output.txt` with the error message produced.

## `dataset-examples/`

The core of `ingest-validation-tools` is Dataset upload validation.
Each subdirectory here is an end-to-end test of upload validation: Each contains
The core of `ingest-validation-tools` is dataset upload validation.
Each subdirectory here is an end-to-end test of upload validation. Each contains:

- a `upload` directory, containing one or more metadata TSVs, dataset directories, and contributors and antibodies TSVs,
- an `upload` directory, containing one or more metadata TSVs, dataset directories, and contributors and antibodies TSVs,
- a `fixtures.json` file, containing responses from the assayclassifier endpoint and Spreadsheet Validator that are used as fixture data in offline testing,
- and a `README.md` with the output when validating that directory.

Examples which are expected to produce errors are prefixed with `bad-`, those that are good, `good-`.

In `test-dataset-examples.sh`, validation is run with several commandline options which may differ
from those used by end-users. This exercises less common options,
minimizes dependence on network resources during tests (`--offline`) and formats the output (`--output as_md`)

To add a new test, create a new subdirectory with a `good-` or `bad-` name, add your `upload` subdirectory,
and an empty `README.md`. Then run `tests/test-dataset-examples.sh`: It will fail on your example,
and give you a command to run that will fix your example. Run this command, _but make sure the result makes sense!_
The software can tell you what the result of validation is, but it can't know whether that result is actually correct.

A few additional commandline options are required for CEDAR validation:
To add a new test:
- Create a new subdirectory with a `good-` or `bad-` name and add your `upload` subdirectory.
- Run the following command to create the `README.md` and `fixtures.json` files:

- globus_token: you can find your personal Globus token by logging in to a site that requires Globus authentication (e.g. https://ingest.hubmapconsortium.org/) and looking at the Authorization header for your request in the Network tab of your browser. Omit the "Bearer " prefix.
```
env PYTHONPATH=/ingest-validation-tools python -m tests-manual.update_test_data -t examples/<path_to_your_example_dir>/upload --globus_token <globus_token>
```

See `/tests-manual/README.md` for more information about testing using the CEDAR API.
Note: You can find your personal Globus token by logging in to a site that requires Globus authentication (e.g. https://ingest.hubmapconsortium.org/) and looking at the Authorization header for your request in the Network tab of your browser. Omit the "Bearer " prefix.
- Make sure the result makes sense! The software can tell you what the result of validation is, but it can't know whether that result is actually correct.
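The `fixtures.json` responses can then be replayed during offline testing. A hedged sketch of what loading a fixture might look like; the file layout and keys here are assumptions for illustration, not the repo's actual fixture schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical fixture content; real fixtures.json files are generated by
# tests-manual/update_test_data.py from live endpoint responses.
with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "fixtures.json"
    fixture_path.write_text(json.dumps(
        {"assaytype": {"metadata.tsv": {"assaytype": "CODEX"}}}
    ))

    # An offline test would read the canned response instead of
    # calling the assayclassifier endpoint over the network.
    fixtures = json.loads(fixture_path.read_text())
    print(fixtures["assaytype"]["metadata.tsv"]["assaytype"])  # → CODEX
```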

## `dataset-iec-examples/`

After upload, TSVs are split up, and directory structures are re-arranged.
These structures can still be validated, but it takes a slightly different set of options,
and those options are tested here.

## `sample-examples/`
## `plugin-tests/`

Distinct from `validate_upload.py`, `validate_samples.py` validates Sample TSVs.
These are much simpler than Dataset uploads, so we only need a single good and bad example.
Plugins are turned off by default for testing, as they require an additional repo: [ingest-validation-tests](https://github.com/hubmapconsortium/ingest-validation-tests). See [tests-manual/README.md](tests-manual/README.md) for more information about plugin tests.
31 changes: 0 additions & 31 deletions script-docs/README-validate-upload-help.md

This file was deleted.

8 changes: 5 additions & 3 deletions script-docs/README-validate_upload.py.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
```text
usage: validate_upload.py [-h] --local_directory PATH
[--optional_fields FIELD [FIELD ...]] [--offline]
[--clear_cache] [--ignore_deprecation]
[--optional_fields FIELD [FIELD ...]]
[--no_url_checks] [--clear_cache]
[--ignore_deprecation]
[--dataset_ignore_globs GLOB [GLOB ...]]
[--upload_ignore_globs GLOB [GLOB ...]]
[--encoding ENCODING]
Expand All @@ -20,7 +21,8 @@ optional arguments:
--optional_fields FIELD [FIELD ...]
The listed fields will be treated as optional. (But if
they are supplied in the TSV, they will be validated.)
--offline Skip checks that require network access.
--no_url_checks Skip URL checks (Spreadsheet Validator API checks
still run).
--clear_cache Clear cache of network check responses.
--ignore_deprecation Allow validation against deprecated versions of
metadata schemas.
Expand Down
2 changes: 2 additions & 0 deletions src/generate_docs.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,8 @@ def main():
parser.add_argument("target", type=dir_path, help="Directory to write output to")
args = parser.parse_args()

if str(args.type).startswith("_"):
return
table_schema_versions = dict_table_schema_versions()[args.type]
assert table_schema_versions, f"No versions for {args.type}"

Expand Down