
Documentation update #1321

Merged
merged 12 commits into from
Apr 25, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
- Accommodate dir schema minor versions
- Fix ORCID URL checking
- Add MUSIC next-gen directory schema
- Update documentation

## v0.0.18

Expand Down
55 changes: 27 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
# ingest-validation-tools

HuBMAP data upload guidelines, and tools which check that uploads adhere to those guidelines.
HuBMAP data upload guidelines and instructions for checking that uploads adhere to those guidelines.
Assay documentation is on [Github Pages](https://hubmapconsortium.github.io/ingest-validation-tools/).

HuBMAP has three distinct metadata processes:

- **Donor** metadata is handled by Jonathan Silverstein on an adhoc basis: He works with whatever format the TMC can provide, and aligns it with controlled vocabularies.
- **Sample** metadata is handled by Brendan Honick and Bill Shirey. [The standard operating procedure is outlined here.](https://docs.google.com/document/d/1K-PvBaduhrN-aU-vzWd9gZqeGvhGF3geTwRR0ww74Jo/edit)
- **Dataset** uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. [Details below.](#upload-process-and-upload-directory-structure)
- **Sample** metadata is ingested by the [HuBMAP Data Ingest Portal](https://ingest.hubmapconsortium.org/)--see "Upload Sample Metadata" at the top of the page.
- **Dataset** uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. [Details below.](#for-data-submitters-and-curators)

## For assay type working groups:

Expand All @@ -26,7 +26,7 @@ When all the parts are finalized,

### Stability

Once approved, both the list of metadata fields (metadata schema)
Once approved, both the CEDAR Metadata Template (metadata schema)
and the list of files (directory schema) are fixed in a particular version.
The metadata for a particular assay type needs to be consistent for all datasets,
as does the set of files which comprise a dataset.
Expand All @@ -42,6 +42,12 @@ contact Phil Blood (@pdblood).

## For data submitters and curators:

### Validate TSVs

To validate your metadata TSV files, use the [HuBMAP Metadata Spreadsheet Validator](https://metadatavalidator.metadatacenter.org/). This tool is a web-based application that will categorize any errors in your spreadsheet and provide help fixing those errors. More detailed instructions about using the tool can be found in the [Spreadsheet Validator Documentation](https://metadatacenter.github.io/spreadsheet-validator-docs/).

### Validate Directory Structure

Check out the repo and install dependencies:

```
Expand All @@ -55,73 +61,66 @@ src/validate_upload.py --help

You should see [the documentation for `validate_upload.py`](script-docs/README-validate_upload.py.md).

**Note**: you need to have _git_ installed in your system.

Now run it against one of the included examples, giving the path to an upload directory:

```
src/validate_upload.py \
--local_directory examples/dataset-examples/bad-tsv-formats/upload \
--no_url_checks \
--output as_text
```
**Note**: URL checking is not supported via `validate_upload.py` at this time, and is disabled with the use of the `--no_url_checks` flag. Please ensure that any fields containing a HuBMAP ID (such as `parent-sample_id`) or an ORCID (`orcid`) are accurate.

You should now see [this (extensive) error message](examples/dataset-examples/bad-tsv-formats/README.md).
This example TSV has been constructed with a mistake in every column, just to demonstrate the available checks. More often, hopefully, your experience will look like this:

```
src/validate_upload.py \
--local_directory examples/dataset-examples/good-codex-akoya/upload
--local_directory examples/dataset-examples/good-codex-akoya-metadata-v1/upload \
--no_url_checks
```

```
No errors!
```

Documentation and metadata TSV templates for each assay type are [here](https://hubmapconsortium.github.io/ingest-validation-tools/).
Additional help for certain common error messages is available [here](README-validate-upload-help.md).

### Validating single TSVs:

If you don't have an entire upload directory at hand, you can validate individual
metadata, antibodies, contributors, or sample TSVs:

```
src/validate_tsv.py \
--schema metadata \
--path examples/dataset-examples/good-scatacseq-v1/upload/metadata.tsv
```

```
No errors!
```
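Since metadata TSVs are plain tab-separated text, you can also inspect one programmatically before running the validator. A minimal sketch using only the Python standard library; the field names and values below are made up for illustration, not a real HuBMAP schema:

```python
import csv
import io

# A tiny in-memory stand-in for a metadata TSV; real files come from
# an upload directory (e.g. metadata.tsv). Field names are illustrative.
tsv_text = "donor_id\tassay_type\nHBM123\tCODEX\n"

# DictReader yields one dict per data row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0]["assay_type"])  # → CODEX
```

This is only a quick sanity check; the validator itself applies the full per-assay schema.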

### Running plugin tests:

Additional plugin tests can also be run.
These tests confirm that the files themselves are valid, not just that the directory structures are correct.
They live in a separate repo and have their own dependencies.

Starting from ingest-validation-tools...
```
# Starting from ingest-validation-tools...
cd ..
git clone https://github.com/hubmapconsortium/ingest-validation-tests.git
cd ingest-validation-tests
pip install -r requirements.txt
```

# Back to ingest-validation-tools...
Back to ingest-validation-tools...
```
cd ../ingest-validation-tools
```

For a failing example, see [README.md](examples/plugin-tests/expected-failure/README.md):
```
src/validate_upload.py \
--local_directory examples/dataset-examples/good-codex-akoya/upload \
--local_directory examples/plugin-tests/expected-failure/upload \
--run_plugins \
--no_url_checks \
--plugin_directory ../ingest-validation-tests/src/ingest_validation_tests/
```

## For developers and contributors:

A good example of programmatic usage is `validate-upload.py`; in a nutshell:
An example of the core error-reporting functionality underlying `validate-upload.py`:

```python
upload = Upload(directory_path=path)
report = ErrorReport(upload.get_errors())
report = ErrorReport(errors=upload.get_errors(), info=upload.get_info())
print(report.as_text())
```
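The collect-then-report pattern above can be illustrated with a self-contained sketch. The classes below are hypothetical stand-ins written for this example only, not the repo's actual `Upload`/`ErrorReport` implementations:

```python
# Illustrative stand-ins (hypothetical): gather errors into a structure,
# then render them, mirroring the snippet above.
class Upload:
    def __init__(self, directory_path):
        self.directory_path = directory_path

    def get_errors(self):
        # A real implementation would walk the upload and validate TSVs;
        # here we return a canned error for demonstration.
        return {"metadata.tsv": ["column 'assay_type': value missing"]}

    def get_info(self):
        return {"directory": str(self.directory_path)}


class ErrorReport:
    def __init__(self, errors, info):
        self.errors = errors
        self.info = info

    def as_text(self):
        lines = [f"Upload: {self.info['directory']}"]
        for source, messages in self.errors.items():
            for message in messages:
                lines.append(f"{source}: {message}")
        return "\n".join(lines)


upload = Upload("examples/upload")
report = ErrorReport(errors=upload.get_errors(), info=upload.get_info())
print(report.as_text())
```

The real classes do far more (schema lookup, plugin dispatch, URL checks), but the calling pattern is the same.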

Expand Down
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
82 changes: 82 additions & 0 deletions _deprecated/_tests/test-generate-docs_deprecated.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
#!/usr/bin/env bash
set -o errexit

die() { set +v; echo "$*" 1>&2 ; sleep 1; exit 1; }

# Test field-descriptions.yaml and field-types.yaml:

ATTR_LIST='description type entity assay schema'
RERUNS=''
for ATTR in $ATTR_LIST; do
PLURAL="${ATTR}s"
[ "$PLURAL" == 'entitys' ] && PLURAL='entities'
REAL_DEST="docs/field-${PLURAL}.yaml"
TEST_DEST="docs-test/field-${PLURAL}.yaml"
echo "Checking $REAL_DEST"

REAL_CMD="src/generate_field_yaml.py --attr $ATTR > $REAL_DEST;"
TEST_CMD="src/generate_field_yaml.py --attr $ATTR > $TEST_DEST"

mkdir docs-test || echo "Already exists"
eval $TEST_CMD || die "Command failed: $TEST_CMD"
diff -r $REAL_DEST $TEST_DEST || RERUNS="$RERUNS $REAL_CMD"
rm -rf docs-test
done
[ -z "$RERUNS" ] || die "Update YAMLs: $RERUNS"

# Test Excel summary:
# This relies on the YAML created above.

FILE="field-schemas.xlsx"
echo "Checking $FILE"

mkdir docs-test
REAL_DEST="docs/$FILE"
TEST_DEST="docs-test/$FILE"
REAL_CMD="src/generate_grid.py $REAL_DEST"
TEST_CMD="src/generate_grid.py $TEST_DEST"
eval $TEST_CMD || die "Command failed: $TEST_CMD"
diff $REAL_DEST $TEST_DEST || die "Update needed: $REAL_CMD"

# Test docs:

for TYPE in $(ls -d docs/*); do
# Skip directories that are unpopulated:
TYPE=`basename $TYPE`
LOOKFOR_CURRENT_ASSAY="docs/$TYPE/current/$TYPE-metadata.tsv"
LOOKFOR_CURRENT_OTHER="docs/$TYPE/current/$TYPE.tsv"
LOOKFOR_DEPRECATED_ASSAY="docs/$TYPE/deprecated/$TYPE-metadata.tsv"
LOOKFOR_DEPRECATED_OTHER="docs/$TYPE/deprecated/$TYPE.tsv"
if [ ! -e $LOOKFOR_CURRENT_ASSAY ] && [ ! -e $LOOKFOR_CURRENT_OTHER ] && [ ! -e $LOOKFOR_DEPRECATED_ASSAY ] && [ ! -e $LOOKFOR_DEPRECATED_OTHER ]; then
echo "Skipping $TYPE. To add: 'touch $LOOKFOR_CURRENT_ASSAY' for assays, or 'touch $LOOKFOR_CURRENT_OTHER' for other."
continue
fi

echo "Testing $TYPE generation..."

REAL_DEST="docs/$TYPE"
TEST_DEST="docs-test/$TYPE"

REAL_CMD="src/generate_docs.py $TYPE $REAL_DEST"
TEST_CMD="src/generate_docs.py $TYPE $TEST_DEST"

mkdir -p $TEST_DEST || echo "$TEST_DEST already exists"
echo "Running: $TEST_CMD"
eval $TEST_CMD

if [ -e $REAL_DEST/current ] && [ -e $TEST_DEST/current ]; then
diff -r $REAL_DEST/current $TEST_DEST/current --exclude="*.tsv" --exclude="*.xlsx" \
|| die "Update needed: $REAL_CMD
Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done'
fi

if [ -e $REAL_DEST/deprecated ] && [ -e $TEST_DEST/deprecated ]; then
diff -r $REAL_DEST/deprecated $TEST_DEST/deprecated --exclude="*.tsv" --exclude="*.xlsx" \
|| die "Update needed: $REAL_CMD
Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done'
fi

rm -rf $TEST_DEST
((++GENERATE_COUNT))
done
[[ $GENERATE_COUNT -gt 0 ]] || die "No files generated"
17 changes: 9 additions & 8 deletions docs/_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -29,11 +29,12 @@ categories-order:
# Exclude from processing.
# The following items will not be processed, by default. Create a custom list
# to override the default setting.
# exclude:
# - Gemfile
# - Gemfile.lock
# - node_modules
# - vendor/bundle/
# - vendor/cache/
# - vendor/gems/
# - vendor/ruby/
exclude:
- Gemfile
- Gemfile.lock
- node_modules
- vendor/bundle/
- vendor/cache/
- vendor/gems/
- vendor/ruby/
- _deprecated/
34 changes: 16 additions & 18 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
# `examples/`

Validation is run offline by the tests in the `tests` directory. Online validation can be run manually; see [tests-manual/README.md](tests-manual/README.md).

This directory contains example inputs and outputs of different components of this system.
Adding more examples is one way to test new schemas or new code...
but it's possible to have too much of a good thing:
Expand All @@ -22,36 +24,32 @@ an `input.tsv` to validate, and an `output.txt` with the error message produced.

## `dataset-examples/`

The core of `ingest-validation-tools` is Dataset upload validation.
Each subdirectory here is an end-to-end test of upload validation: Each contains
The core of `ingest-validation-tools` is dataset upload validation.
Each subdirectory here is an end-to-end test of upload validation. Each contains:

- a `upload` directory, containing one or more metadata TSVs, dataset directories, and contributors and antibodies TSVs,
- an `upload` directory, containing one or more metadata TSVs, dataset directories, and contributors and antibodies TSVs,
- a `fixtures.json` file, containing responses from the assayclassifier endpoint and Spreadsheet Validator that are used as fixture data in offline testing,
- and a `README.md` with the output when validating that directory.

Examples which are expected to produce errors are prefixed with `bad-`, those that are good, `good-`.

In `test-dataset-examples.sh`, validation is run with several commandline options which may differ
from those used by end-users. This exercises less common options,
minimizes dependence on network resources during tests (`--offline`) and formats the output (`--output as_md`)

To add a new test, create a new subdirectory with a `good-` or `bad-` name, add your `upload` subdirectory,
and an empty `README.md`. Then run `tests/test-dataset-examples.sh`: It will fail on your example,
and give you a command to run that will fix your example. Run this command, _but make sure the result makes sense!_
The software can tell you what the result of validation is, but it can't know whether that result is actually correct.

A few additional commandline options are required for CEDAR validation:
To add a new test:
- Create a new subdirectory with a `good-` or `bad-` name and add your `upload` subdirectory.
- Run the following command to create the `README.md` and `fixtures.json` files:

- globus_token: you can find your personal Globus token by logging in to a site that requires Globus authentication (e.g. https://ingest.hubmapconsortium.org/) and looking at the Authorization header for your request in the Network tab of your browser. Omit the "Bearer " prefix.
```
env PYTHONPATH=/ingest-validation-tools python -m tests-manual.update_test_data -t examples/<path_to_your_example_dir>/upload --globus_token <globus_token>
```

See `/tests-manual/README.md` for more information about testing using the CEDAR API.
Note: You can find your personal Globus token by logging in to a site that requires Globus authentication (e.g. https://ingest.hubmapconsortium.org/) and looking at the Authorization header for your request in the Network tab of your browser. Omit the "Bearer " prefix.
- Make sure the result makes sense! The software can tell you what the result of validation is, but it can't know whether that result is actually correct.
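The `fixtures.json` responses can then be replayed during offline testing. A hedged sketch of what loading a fixture might look like; the file layout and keys here are assumptions for illustration, not the repo's actual fixture schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical fixture content; real fixtures.json files are generated by
# tests-manual/update_test_data.py from live endpoint responses.
with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "fixtures.json"
    fixture_path.write_text(json.dumps(
        {"assaytype": {"metadata.tsv": {"assaytype": "CODEX"}}}
    ))

    # An offline test would read the canned response instead of
    # calling the assayclassifier endpoint over the network.
    fixtures = json.loads(fixture_path.read_text())
    print(fixtures["assaytype"]["metadata.tsv"]["assaytype"])  # → CODEX
```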

## `dataset-iec-examples/`

After upload, TSVs are split up, and directory structures are re-arranged.
These structures can still be validated, but it takes a slightly different set of options,
and those options are tested here.

## `sample-examples/`
## `plugin-tests/`

Distinct from `validate_upload.py`, `validate_samples.py` validates Sample TSVs.
These are much simpler than Dataset uploads, so we only need a single good and bad example.
Plugins are turned off by default for testing, as they require an additional repo: [ingest-validation-tests](https://github.com/hubmapconsortium/ingest-validation-tests). See [tests-manual/README.md](tests-manual/README.md) for more information about plugin tests.
31 changes: 0 additions & 31 deletions script-docs/README-validate-upload-help.md

This file was deleted.

8 changes: 5 additions & 3 deletions script-docs/README-validate_upload.py.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
```text
usage: validate_upload.py [-h] --local_directory PATH
[--optional_fields FIELD [FIELD ...]] [--offline]
[--clear_cache] [--ignore_deprecation]
[--optional_fields FIELD [FIELD ...]]
[--no_url_checks] [--clear_cache]
[--ignore_deprecation]
[--dataset_ignore_globs GLOB [GLOB ...]]
[--upload_ignore_globs GLOB [GLOB ...]]
[--encoding ENCODING]
Expand All @@ -20,7 +21,8 @@ optional arguments:
--optional_fields FIELD [FIELD ...]
The listed fields will be treated as optional. (But if
they are supplied in the TSV, they will be validated.)
--offline Skip checks that require network access.
--no_url_checks Skip URL checks (Spreadsheet Validator API checks
still run).
--clear_cache Clear cache of network check responses.
--ignore_deprecation Allow validation against deprecated versions of
metadata schemas.
Expand Down
2 changes: 2 additions & 0 deletions src/generate_docs.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,8 @@ def main():
parser.add_argument("target", type=dir_path, help="Directory to write output to")
args = parser.parse_args()

if str(args.type).startswith("_"):
return
table_schema_versions = dict_table_schema_versions()[args.type]
assert table_schema_versions, f"No versions for {args.type}"

Expand Down