diff --git a/CHANGELOG.md b/CHANGELOG.md index 07c495f3d..f98361dad 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,7 @@ - Accommodate dir schema minor versions - Fix ORCID URL checking - Add MUSIC next-gen directory schema +- Update documentation ## v0.0.18 diff --git a/README.md b/README.md index b271132c8..e9946d76a 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,13 @@ # ingest-validation-tools -HuBMAP data upload guidelines, and tools which check that uploads adhere to those guidelines. +HuBMAP data upload guidelines and instructions for checking that uploads adhere to those guidelines. Assay documentation is on [GitHub Pages](https://hubmapconsortium.github.io/ingest-validation-tools/). HuBMAP has three distinct metadata processes: - **Donor** metadata is handled by Jonathan Silverstein on an ad hoc basis: He works with whatever format the TMC can provide, and aligns it with controlled vocabularies. -- **Sample** metadata is handled by Brendan Honick and Bill Shirey. [The standard operating procedure is outlined here.](https://docs.google.com/document/d/1K-PvBaduhrN-aU-vzWd9gZqeGvhGF3geTwRR0ww74Jo/edit) -- **Dataset** uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. [Details below.](#upload-process-and-upload-directory-structure) +- **Sample** metadata is ingested by the [HuBMAP Data Ingest Portal](https://ingest.hubmapconsortium.org/); see "Upload Sample Metadata" at the top of the page. +- **Dataset** uploads should be validated first by the TMCs. Dataset upload validation is the focus of this repo. [Details below.](#for-data-submitters-and-curators) ## For assay type working groups: @@ -26,7 +26,7 @@ When all the parts are finalized, ### Stability -Once approved, both the list of metadata fields (metadata schema) +Once approved, both the CEDAR Metadata Template (metadata schema) and the list of files (directory schema) are fixed in a particular version. The metadata for a particular assay type needs to be consistent for all datasets, as does the set of files which comprise a dataset. @@ -42,6 +42,12 @@ contact Phil Blood (@pdblood). ## For data submitters and curators: +### Validate TSVs + +To validate your metadata TSV files, use the [HuBMAP Metadata Spreadsheet Validator](https://metadatavalidator.metadatacenter.org/). This tool is a web-based application that will categorize any errors in your spreadsheet and provide help fixing those errors. More detailed instructions about using the tool can be found in the [Spreadsheet Validator Documentation](https://metadatacenter.github.io/spreadsheet-validator-docs/). + +### Validate Directory Structure + Check out the repo and install dependencies: ``` @@ -55,22 +61,23 @@ src/validate_upload.py --help ``` You should see [the documentation for `validate_upload.py`](script-docs/README-validate_upload.py.md) -**Note**: you need to have _git_ installed in your system. - Now run it against one of the included examples, giving the path to an upload directory: ``` src/validate_upload.py \ --local_directory examples/dataset-examples/bad-tsv-formats/upload \ + --no_url_checks \ --output as_text ``` +**Note**: URL checking is not supported via `validate_upload.py` at this time, and is disabled with the `--no_url_checks` flag. Please ensure that any fields containing a HuBMAP ID (such as `parent_sample_id`) or an ORCID (`orcid`) are accurate. You should now see [this (extensive) error message](examples/dataset-examples/bad-tsv-formats/README.md). 
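For script authors, the same walkthrough can be driven from Python instead of the CLI. Below is a minimal sketch based on the `Upload`/`ErrorReport` snippet in the developer section of this README; the import paths are assumptions from this repo's module layout, and `no_url_checks=True` mirrors the `--no_url_checks` flag above.

```python
from pathlib import Path

from ingest_validation_tools.error_report import ErrorReport
from ingest_validation_tools.upload import Upload

# Validate the same example upload as above, with URL checks disabled
# (the programmatic equivalent of passing --no_url_checks).
upload = Upload(
    directory_path=Path("examples/dataset-examples/bad-tsv-formats/upload"),
    no_url_checks=True,
)
report = ErrorReport(errors=upload.get_errors(), info=upload.get_info())
print(report.as_text())
```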
The example TSV in `bad-tsv-formats` has been constructed with a mistake in every column, just to demonstrate the checks which are available. Hopefully, more often your experience will be like this: ``` src/validate_upload.py \ - --local_directory examples/dataset-examples/good-codex-akoya/upload + --local_directory examples/dataset-examples/good-codex-akoya-metadata-v1/upload \ + --no_url_checks ``` ``` No errors! ``` Documentation and metadata TSV templates for each assay type are [here](https://hubmapconsortium.github.io/ingest-validation-tools/). -Addition help for certain common error messages is available [here](README-validate-upload-help.md) - -### Validating single TSVs: - -If you don't have an entire upload directory at hand, you can validate individual -metadata, antibodies, contributors, or sample TSVs: - -``` -src/validate_tsv.py \ - --schema metadata \ - --path examples/dataset-examples/good-scatacseq-v1/upload/metadata.tsv -``` - -``` -No errors! -``` ### Running plugin tests: @@ -101,27 +92,35 @@ Additional plugin tests can also be run. These additional tests confirm that the files themselves are valid, not just that the directory structures are correct. These additional tests are in a separate repo, and have their own dependencies. +Starting from ingest-validation-tools... ``` -# Starting from ingest-validation-tools... cd .. git clone https://github.com/hubmapconsortium/ingest-validation-tests.git cd ingest-validation-tests pip install -r requirements.txt +``` -# Back to ingest-validation-tools... +Back to ingest-validation-tools... +``` cd ../ingest-validation-tools +``` + +Failing example; see [README.md](examples/plugin-tests/expected-failure/README.md) +``` src/validate_upload.py \ - --local_directory examples/dataset-examples/good-codex-akoya/upload \ + --local_directory examples/plugin-tests/expected-failure/upload \ + --run_plugins \ + --no_url_checks \ + --plugin_directory ../ingest-validation-tests/src/ingest_validation_tests/ ``` ## For developers and contributors: -A good example is of programatic usage is `validate-upload.py`; In a nutshell: +An example of the core error-reporting functionality underlying `validate_upload.py`: ```python upload = Upload(directory_path=path) -report = ErrorReport(upload.get_errors()) +report = ErrorReport(errors=upload.get_errors(), info=upload.get_info()) print(report.as_text()) ``` diff --git a/docs/field-assays.yaml b/_deprecated/_docs/field-assays_deprecated.yaml similarity index 100% rename from docs/field-assays.yaml rename to _deprecated/_docs/field-assays_deprecated.yaml diff --git a/docs/field-descriptions.yaml b/_deprecated/_docs/field-descriptions_deprecated.yaml similarity index 100% rename from docs/field-descriptions.yaml rename to _deprecated/_docs/field-descriptions_deprecated.yaml diff --git a/docs/field-entities.yaml b/_deprecated/_docs/field-entities_deprecated.yaml similarity index 100% rename from docs/field-entities.yaml rename to _deprecated/_docs/field-entities_deprecated.yaml diff --git a/docs/field-schemas.xlsx b/_deprecated/_docs/field-schemas_deprecated.xlsx similarity index 100% rename from docs/field-schemas.xlsx rename to _deprecated/_docs/field-schemas_deprecated.xlsx diff --git a/docs/field-schemas.yaml b/_deprecated/_docs/field-schemas_deprecated.yaml similarity index 100% rename from docs/field-schemas.yaml rename to _deprecated/_docs/field-schemas_deprecated.yaml diff --git a/docs/field-types.yaml b/_deprecated/_docs/field-types_deprecated.yaml similarity index 100% rename from 
docs/field-types.yaml rename to _deprecated/_docs/field-types_deprecated.yaml diff --git a/script-docs/README-factor_field.py.md b/_deprecated/_script-docs/README-factor_field_deprecated.py.md similarity index 100% rename from script-docs/README-factor_field.py.md rename to _deprecated/_script-docs/README-factor_field_deprecated.py.md diff --git a/script-docs/README-generate_field_enum_csv.py.md b/_deprecated/_script-docs/README-generate_field_enum_csv_deprecated.py.md similarity index 100% rename from script-docs/README-generate_field_enum_csv.py.md rename to _deprecated/_script-docs/README-generate_field_enum_csv_deprecated.py.md diff --git a/script-docs/README-generate_field_values_csv.py.md b/_deprecated/_script-docs/README-generate_field_values_csv_deprecated.py.md similarity index 100% rename from script-docs/README-generate_field_values_csv.py.md rename to _deprecated/_script-docs/README-generate_field_values_csv_deprecated.py.md diff --git a/script-docs/README-generate_field_yaml.py.md b/_deprecated/_script-docs/README-generate_field_yaml_deprecated.py.md similarity index 100% rename from script-docs/README-generate_field_yaml.py.md rename to _deprecated/_script-docs/README-generate_field_yaml_deprecated.py.md diff --git a/script-docs/README-generate_grid.py.md b/_deprecated/_script-docs/README-generate_grid_deprecated.py.md similarity index 100% rename from script-docs/README-generate_grid.py.md rename to _deprecated/_script-docs/README-generate_grid_deprecated.py.md diff --git a/script-docs/README-generate_schema.py.md b/_deprecated/_script-docs/README-generate_schema_deprecated.py.md similarity index 100% rename from script-docs/README-generate_schema.py.md rename to _deprecated/_script-docs/README-generate_schema_deprecated.py.md diff --git a/src/factor_field.py b/_deprecated/_src/factor_field_deprecated.py similarity index 100% rename from src/factor_field.py rename to _deprecated/_src/factor_field_deprecated.py diff --git a/src/generate_field_enum_csv.py b/_deprecated/_src/generate_field_enum_csv_deprecated.py similarity index 100% rename from src/generate_field_enum_csv.py rename to _deprecated/_src/generate_field_enum_csv_deprecated.py diff --git a/src/generate_field_values_csv.py b/_deprecated/_src/generate_field_values_csv_deprecated.py similarity index 100% rename from src/generate_field_values_csv.py rename to _deprecated/_src/generate_field_values_csv_deprecated.py diff --git a/src/generate_field_yaml.py b/_deprecated/_src/generate_field_yaml_deprecated.py similarity index 100% rename from src/generate_field_yaml.py rename to _deprecated/_src/generate_field_yaml_deprecated.py diff --git a/src/generate_grid.py b/_deprecated/_src/generate_grid_deprecated.py similarity index 100% rename from src/generate_grid.py rename to _deprecated/_src/generate_grid_deprecated.py diff --git a/src/generate_schema.py b/_deprecated/_src/generate_schema_deprecated.py similarity index 100% rename from src/generate_schema.py rename to _deprecated/_src/generate_schema_deprecated.py diff --git a/_deprecated/_tests/test-generate-docs_deprecated.sh b/_deprecated/_tests/test-generate-docs_deprecated.sh new file mode 100755 index 000000000..40fe69ee9 --- /dev/null +++ b/_deprecated/_tests/test-generate-docs_deprecated.sh @@ -0,0 +1,82 @@ +#!/usr/bin/env bash +set -o errexit + +die() { set +v; echo "$*" 1>&2 ; sleep 1; exit 1; } + +# Test field-descriptions.yaml and field-types.yaml: + +ATTR_LIST='description type entity assay schema' +RERUNS='' +for ATTR in $ATTR_LIST; do + PLURAL="${ATTR}s" + [ 
"$PLURAL" == 'entitys' ] && PLURAL='entities' + REAL_DEST="docs/field-${PLURAL}.yaml" + TEST_DEST="docs-test/field-${PLURAL}.yaml" + echo "Checking $REAL_DEST" + + REAL_CMD="src/generate_field_yaml.py --attr $ATTR > $REAL_DEST;" + TEST_CMD="src/generate_field_yaml.py --attr $ATTR > $TEST_DEST" + + mkdir docs-test || echo "Already exists" + eval $TEST_CMD || die "Command failed: $TEST_CMD" + diff -r $REAL_DEST $TEST_DEST || RERUNS="$RERUNS $REAL_CMD" + rm -rf docs-test +done +[ -z "$RERUNS" ] || die "Update YAMLs: $RERUNS" + +# Test Excel summary: +# This relies on the YAML created above. + +FILE="field-schemas.xlsx" +echo "Checking $FILE" + +mkdir docs-test +REAL_DEST="docs/$FILE" +TEST_DEST="docs-test/$FILE" +REAL_CMD="src/generate_grid.py $REAL_DEST" +TEST_CMD="src/generate_grid.py $TEST_DEST" +eval $TEST_CMD || die "Command failed: $TEST_CMD" +diff $REAL_DEST $TEST_DEST || die "Update needed: $REAL_CMD" + +# Test docs: + +for TYPE in $(ls -d docs/*); do + # Skip directories that are unpopulated: + TYPE=`basename $TYPE` + LOOKFOR_CURRENT_ASSAY="docs/$TYPE/current/$TYPE-metadata.tsv" + LOOKFOR_CURRENT_OTHER="docs/$TYPE/current/$TYPE.tsv" + LOOKFOR_DEPRECATED_ASSAY="docs/$TYPE/deprecated/$TYPE-metadata.tsv" + LOOKFOR_DEPRECATED_OTHER="docs/$TYPE/deprecated/$TYPE.tsv" + if [ ! -e $LOOKFOR_CURRENT_ASSAY ] && [ ! -e $LOOKFOR_CURRENT_OTHER ] && [ ! -e $LOOKFOR_DEPRECATED_ASSAY ] && [ ! -e $LOOKFOR_DEPRECATED_OTHER ]; then + echo "Skipping $TYPE. To add: 'touch $LOOKFOR_CURRENT_ASSAY' for assays, or 'touch $LOOKFOR_CURRENT_OTHER' for other." + continue + fi + + echo "Testing $TYPE generation..." + + REAL_DEST="docs/$TYPE" + TEST_DEST="docs-test/$TYPE" + + REAL_CMD="src/generate_docs.py $TYPE $REAL_DEST" + TEST_CMD="src/generate_docs.py $TYPE $TEST_DEST" + + mkdir -p $TEST_DEST || echo "$TEST_DEST already exists" + echo "Running: $TEST_CMD" + eval $TEST_CMD + + if [ -e $REAL_DEST/current ] && [ -e $TEST_DEST/current ]; then + diff -r $REAL_DEST/current $TEST_DEST/current --exclude="*.tsv" --exclude="*.xlsx" \ + || die "Update needed: $REAL_CMD + Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done' + fi + + if [ -e $REAL_DEST/deprecated ] && [ -e $TEST_DEST/deprecated ]; then + diff -r $REAL_DEST/deprecated $TEST_DEST/deprecated --exclude="*.tsv" --exclude="*.xlsx" \ + || die "Update needed: $REAL_CMD + Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done' + fi + + rm -rf $TEST_DEST + ((++GENERATE_COUNT)) +done +[[ $GENERATE_COUNT -gt 0 ]] || die "No files generated" diff --git a/tests/test-schemas-exist.sh b/_deprecated/_tests/test-schemas-exist_deprecated.sh similarity index 100% rename from tests/test-schemas-exist.sh rename to _deprecated/_tests/test-schemas-exist_deprecated.sh diff --git a/docs/_config.yml b/docs/_config.yml index a7593c2ba..161123477 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -29,11 +29,12 @@ categories-order: # Exclude from processing. # The following items will not be processed, by default. Create a custom list # to override the default setting. 
-# exclude: -# - Gemfile -# - Gemfile.lock -# - node_modules -# - vendor/bundle/ -# - vendor/cache/ -# - vendor/gems/ -# - vendor/ruby/ +exclude: + - Gemfile + - Gemfile.lock + - node_modules + - vendor/bundle/ + - vendor/cache/ + - vendor/gems/ + - vendor/ruby/ + - _deprecated/ diff --git a/examples/README.md b/examples/README.md index bb5432119..ad5d2ff26 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,5 +1,7 @@ # `examples/` +Validation is run offline by the tests in the `tests` directory. Online validation can be run manually; see [tests-manual/README.md](tests-manual/README.md). + This directory contains example inputs and outputs of different components of this system. Adding more examples is one way to test new schemas or new code... but it's possible to have too much of a good thing: @@ -22,28 +24,25 @@ an `input.tsv` to validate, and an `output.txt` with the error message produced. ## `dataset-examples/` -The core of `ingest-validation-tools` is Dataset upload validation. -Each subdirectory here is an end-to-end test of upload validation: Each contains +The core of `ingest-validation-tools` is dataset upload validation. +Each subdirectory here is an end-to-end test of upload validation. Each contains: -- a `upload` directory, containing one or more metadata TSVs, dataset directories, and contributors and antibodies TSVs, +- an `upload` directory, containing one or more metadata TSVs, dataset directories, and contributors and antibodies TSVs, +- a `fixtures.json` file, containing responses from the assayclassifier endpoint and Spreadsheet Validator that are used as fixture data in offline testing, - and a `README.md` with the output when validating that directory. Examples which are expected to produce errors are prefixed with `bad-`, those that are good, `good-`. -In `test-dataset-examples.sh`, validation is run with several commandline options which may differ -from those used by end-users. This exercises less common options, -minimizes dependence on network resources during tests (`--offline`) and formats the output (`--output as_md`) - -To add a new test, create a new subdirectory with a `good-` or `bad-` name, add your `upload` subdirectory, -and an empty `README.md`. Then run `tests/test-dataset-examples.sh`: It will fail on your example, -and give you a command to run that will fix your example. Run this command, _but make sure the result makes sense!_ -The software can tell you what the result of validation is, but it can't know whether that result is actually correct. - -A few additional commandline options are required for CEDAR validation: +To add a new test: +- Create a new subdirectory with a `good-` or `bad-` name and add your `upload` subdirectory. +- Run the following command to create the `README.md` and `fixtures.json` files: -- globus_token: you can find your personal Globus token by logging in to a site that requires Globus authentication (e.g. https://ingest.hubmapconsortium.org/) and looking at the Authorization header for your request in the Network tab of your browser. Omit the "Bearer " prefix. +``` +env PYTHONPATH=/ingest-validation-tools python -m tests-manual.update_test_data -t examples//upload --globus_token +``` -See `/tests-manual/README.md` for more information about testing using the CEDAR API. +Note: You can find your personal Globus token by logging in to a site that requires Globus authentication (e.g. https://ingest.hubmapconsortium.org/) and looking at the Authorization header for your request in the Network tab of your browser. 
Omit the "Bearer " prefix. +- Make sure the result makes sense! The software can tell you what the result of validation is, but it can't know whether that result is actually correct. ## `dataset-iec-examples/` @@ -51,7 +50,6 @@ After upload, TSVs are split up, and directory structures are re-arranged. These structures can still be validated, but it takes a slightly different set of options, and those options are tested here. -## `sample-examples/` +## `plugin-tests/` -Distinct from `validate_upload.py`, `validate_samples.py` validates Sample TSVs. -These are much simpler than Dataset uploads, so we only need a single good and bad example. +Plugins are turned off by default for testing, as they require an additional repo: [ingest-validation-tests](https://github.com/hubmapconsortium/ingest-validation-tests). See [tests-manual/README.md](tests-manual/README.md) for more information about plugin tests. diff --git a/script-docs/README-validate-upload-help.md b/script-docs/README-validate-upload-help.md deleted file mode 100644 index 5a9a18542..000000000 --- a/script-docs/README-validate-upload-help.md +++ /dev/null @@ -1,31 +0,0 @@ -This document lists common `validate_upload.py` errors, their interpretation, and their remedies. - -## `404` error -``` -row 22, protocols_io_doi 10.17504/protocols.io.be8mjhu7: 404 -``` - -### Description -Certain fields require entities that end in a number or letter. If the entity is a DOI, and you have dragged the contents from the cell in row 2 down to fill the rows below, the number or letter at the end of the entity will increase incrementally. DOIs will generate 404 errors because the resulting DOIs are not valid (hopefully). - -### Remedy -If every row in the document is meant to contain precisely the same entity, then the entity can be copied from one cell, which saves it on a clipboard. Highlight the cells which should contain a copy and paste the entity into those cells. - -## Is not “datetime” and format -``` -The value "12/23/20 12:00" in row 2 and column 3 ("C") is not type "datetime -and format "%Y-%m-%d %H:%M" -``` - -### Description -The metadata documentation for each assay in github describes input for each field. -In most cases, input is required. In many cases, the format is constrained. Datetime and date are two examples of fields in which input is required in a constrained format. If the user is populating these fields in excel, the datetime input must be specifically formatted as follows: -- `yyyy-mm-dd hh:mm` (e.g. where required input is datetime) -- `yyyy-mm-dd` (where required input is date - e.g. in the lot_number field for custom antibodies in the antibodies TSV). - -### Remedy -In Excel, highlight the column and in "Format Cells" select "Custom" and give -`yyyy-mm-dd hh:mm` as the format. 
-This reformatting is required every time you modify the document in Excel - - diff --git a/script-docs/README-validate_upload.py.md b/script-docs/README-validate_upload.py.md index 4aaa700ac..fa695f1a4 100644 --- a/script-docs/README-validate_upload.py.md +++ b/script-docs/README-validate_upload.py.md @@ -1,7 +1,8 @@ ```text usage: validate_upload.py [-h] --local_directory PATH - [--optional_fields FIELD [FIELD ...]] [--offline] - [--clear_cache] [--ignore_deprecation] + [--optional_fields FIELD [FIELD ...]] + [--no_url_checks] [--clear_cache] + [--ignore_deprecation] [--dataset_ignore_globs GLOB [GLOB ...]] [--upload_ignore_globs GLOB [GLOB ...]] [--encoding ENCODING] @@ -20,7 +21,8 @@ optional arguments: --optional_fields FIELD [FIELD ...] The listed fields will be treated as optional. (But if they are supplied in the TSV, they will be validated.) - --offline Skip checks that require network access. + --no_url_checks Skip URL checks (Spreadsheet Validator API checks + still run). --clear_cache Clear cache of network check responses. --ignore_deprecation Allow validation against deprecated versions of metadata schemas. diff --git a/src/generate_docs.py b/src/generate_docs.py index 598d02d23..ba1c01312 100755 --- a/src/generate_docs.py +++ b/src/generate_docs.py @@ -33,6 +33,8 @@ def main(): parser.add_argument("target", type=dir_path, help="Directory to write output to") args = parser.parse_args() + if str(args.type).startswith("_"): + return table_schema_versions = dict_table_schema_versions()[args.type] assert table_schema_versions, f"No versions for {args.type}" diff --git a/src/ingest_validation_tools/schema_loader.py b/src/ingest_validation_tools/schema_loader.py index 424deafb7..f6f5b5859 100644 --- a/src/ingest_validation_tools/schema_loader.py +++ b/src/ingest_validation_tools/schema_loader.py @@ -196,7 +196,7 @@ def _get_schema_filename(schema_name: str, version: str) -> str: def get_table_schema( schema_version: SchemaVersion, optional_fields: List[str] = [], - offline: bool = False, + no_url_checks: bool = False, keep_headers: bool = False, ) -> dict: try: @@ -222,7 +222,7 @@ def get_table_schema( if schema_version.metadata_type == "assays": _add_level_1_description(schema_field) _validate_level_1_enum(schema_field) - _add_constraints(schema_field, optional_fields, offline=offline, names=names) + _add_constraints(schema_field, optional_fields, no_url_checks=no_url_checks, names=names) if schema_version.metadata_type == "assays": _validate_field(schema_field) @@ -354,7 +354,7 @@ def _validate_level_1_enum(field: dict) -> None: def _add_constraints( - field: dict, optional_fields: List[str], offline=None, names: List[str] = [] + field: dict, optional_fields: List[str], no_url_checks=None, names: List[str] = [] ) -> None: """ Modifies field in-place, adding implicit constraints @@ -464,8 +464,8 @@ def _add_constraints( if field["name"] in optional_fields: field["constraints"]["required"] = False - # Remove network checks if offline: - if offline: + # Remove network checks if no_url_checks: + if no_url_checks: c_c = "custom_constraints" if c_c in field and "url" in field[c_c]: del field[c_c]["url"] diff --git a/src/ingest_validation_tools/upload.py b/src/ingest_validation_tools/upload.py index d3650007d..5a8bad68f 100644 --- a/src/ingest_validation_tools/upload.py +++ b/src/ingest_validation_tools/upload.py @@ -52,7 +52,7 @@ def __init__( upload_ignore_globs: list = [], plugin_directory: Union[Path, None] = None, encoding: str = "utf-8", - offline: bool = False, + no_url_checks: 
bool = False, ignore_deprecation: bool = False, extra_parameters: Union[dict, None] = None, globus_token: str = "", @@ -66,7 +66,7 @@ def __init__( self.upload_ignore_globs = upload_ignore_globs self.plugin_directory = plugin_directory self.encoding = encoding - self.offline = offline + self.no_url_checks = no_url_checks self.add_notes = add_notes self.ignore_deprecation = ignore_deprecation self.errors = {} @@ -85,7 +85,6 @@ def __init__( self.encoding, self.app_context["ingest_url"], self.directory_path, - offline=self.offline, ) for path in (tsv_paths if tsv_paths else directory_path.glob(f"*{TSV_SUFFIX}")) } @@ -308,7 +307,7 @@ def _validate( schema = get_table_schema( schema_version, self.optional_fields, - self.offline, + self.no_url_checks, ) except Exception as e: return {f"{tsv_path} (as {schema_version.table_schema})": e} @@ -320,19 +319,9 @@ def _validate( if local_errors: local_validated[f"{tsv_path} (as {schema_version.table_schema})"] = local_errors else: - """ - Passing offline=True will skip all API/URL validation; - GitHub actions therefore do not test via the CEDAR - Spreadsheet Validator API, so tests must be run - manually (see tests-manual/README.md) - """ - if self.offline: - logging.info(f"{tsv_path}: Offline validation selected, cannot reach API.") - return errors - else: - api_errors = self.online_checks(tsv_path, schema_version, report_type) - if api_errors: - api_validated[f"{tsv_path}"] = api_errors + api_errors = self.online_checks(tsv_path, schema_version, report_type) + if api_errors: + api_validated[f"{tsv_path}"] = api_errors if local_validated: errors["Local Validation Errors"] = local_validated if api_validated: @@ -535,6 +524,8 @@ def _url_checks( """ errors: Dict = {} + if self.no_url_checks: + return errors # assay -> parent_sample_id # sample -> sample_id # organ -> organ_id @@ -741,7 +732,6 @@ def _check_other_path(self, metadata_path: Path, other_path_value: str, path_typ self.encoding, self.app_context["ingest_url"], self.directory_path, - offline=self.offline, ) except Exception as e: errors[f"{metadata_path}, column '{path_type}_path', value '{other_path_value}'"] = [e] diff --git a/src/ingest_validation_tools/validation_utils.py b/src/ingest_validation_tools/validation_utils.py index 44127ce6c..fcae08255 100644 --- a/src/ingest_validation_tools/validation_utils.py +++ b/src/ingest_validation_tools/validation_utils.py @@ -35,7 +35,6 @@ def get_schema_version( encoding: str, ingest_url: str = "", directory_path: Optional[Path] = None, - offline: bool = False, ) -> SchemaVersion: try: rows = read_rows(path, encoding) @@ -52,8 +51,6 @@ def get_schema_version( ) return sv message = [] - if offline: - message.append("Running in offline mode, cannot reach assayclassifier.") if not (rows[0].get("dataset_type") or rows[0].get("assay_type")): message.append(f"No assay_type or dataset_type in {path}.") if "channel_id" in rows[0]: @@ -270,7 +267,7 @@ def get_tsv_errors( tsv_path: Union[str, Path], schema_name: str, optional_fields: List[str] = [], - offline: bool = False, + no_url_checks: bool = False, ignore_deprecation: bool = False, report_type: ReportType = ReportType.STR, globus_token: str = "", @@ -336,7 +333,7 @@ def get_tsv_errors( tsv_paths=[Path(tsv_path)], optional_fields=optional_fields, globus_token=globus_token, - offline=offline, + no_url_checks=no_url_checks, ignore_deprecation=ignore_deprecation, app_context=app_context, ) diff --git a/src/validate_upload.py b/src/validate_upload.py index 84307e220..746484a2e 100755 --- 
a/src/validate_upload.py +++ b/src/validate_upload.py @@ -65,9 +65,9 @@ def make_parser(): "(But if they are supplied in the TSV, they will be validated.)", ) parser.add_argument( - "--offline", + "--no_url_checks", action="store_true", - help="Skip checks that require network access.", + help="Skip URL checks (Spreadsheet Validator API checks still run).", ) parser.add_argument( "--clear_cache", @@ -162,7 +162,7 @@ def main(): upload_args = { "add_notes": args.add_notes, "encoding": args.encoding, - "offline": args.offline, + "no_url_checks": args.no_url_checks, "globus_token": args.globus_token, "optional_fields": args.optional_fields, "ignore_deprecation": args.ignore_deprecation, diff --git a/tests-manual/README.md b/tests-manual/README.md index 329f95e11..ab7fa4a81 100644 --- a/tests-manual/README.md +++ b/tests-manual/README.md @@ -7,7 +7,7 @@ Automated testing (e.g. via GitHub action or by running `./test.sh`) does not hi Run the following from the top-level directory: ``` -./tests-manual/test-dataset-examples-online.sh +./tests-manual/test-dataset-examples-online.sh ``` This test mechanism calls validate_upload.py and does not update files. It is good for reliable manual online testing. diff --git a/tests/test-generate-docs.sh b/tests/test-generate-docs.sh index 40fe69ee9..296441c37 100755 --- a/tests/test-generate-docs.sh +++ b/tests/test-generate-docs.sh @@ -3,41 +3,6 @@ set -o errexit die() { set +v; echo "$*" 1>&2 ; sleep 1; exit 1; } -# Test field-descriptions.yaml and field-types.yaml: - -ATTR_LIST='description type entity assay schema' -RERUNS='' -for ATTR in $ATTR_LIST; do - PLURAL="${ATTR}s" - [ "$PLURAL" == 'entitys' ] && PLURAL='entities' - REAL_DEST="docs/field-${PLURAL}.yaml" - TEST_DEST="docs-test/field-${PLURAL}.yaml" - echo "Checking $REAL_DEST" - - REAL_CMD="src/generate_field_yaml.py --attr $ATTR > $REAL_DEST;" - TEST_CMD="src/generate_field_yaml.py --attr $ATTR > $TEST_DEST" - - mkdir docs-test || echo "Already exists" - eval $TEST_CMD || die "Command failed: $TEST_CMD" - diff -r $REAL_DEST $TEST_DEST || RERUNS="$RERUNS $REAL_CMD" - rm -rf docs-test -done -[ -z "$RERUNS" ] || die "Update YAMLs: $RERUNS" - -# Test Excel summary: -# This relies on the YAML created above. - -FILE="field-schemas.xlsx" -echo "Checking $FILE" - -mkdir docs-test -REAL_DEST="docs/$FILE" -TEST_DEST="docs-test/$FILE" -REAL_CMD="src/generate_grid.py $REAL_DEST" -TEST_CMD="src/generate_grid.py $TEST_DEST" -eval $TEST_CMD || die "Command failed: $TEST_CMD" -diff $REAL_DEST $TEST_DEST || die "Update needed: $REAL_CMD" - # Test docs: for TYPE in $(ls -d docs/*); do @@ -67,7 +32,7 @@ for TYPE in $(ls -d docs/*); do if [ -e $REAL_DEST/current ] && [ -e $TEST_DEST/current ]; then diff -r $REAL_DEST/current $TEST_DEST/current --exclude="*.tsv" --exclude="*.xlsx" \ || die "Update needed: $REAL_CMD - Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`; src/generate_docs.py $D docs/$D; done' + Or:" 'for D in `ls -d docs/*/`; do D=`basename $D`|| continue; src/generate_docs.py $D docs/$D; echo $D; done' fi if [ -e $REAL_DEST/deprecated ] && [ -e $TEST_DEST/deprecated ]; then
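A closing note on the rename from `offline` to `no_url_checks` traced through this diff: downstream callers of `get_tsv_errors` pass the new keyword the same way `Upload` does. Below is a minimal sketch under the signature shown in the `validation_utils.py` hunk above; the TSV path and schema name are hypothetical, and the token value is a placeholder.

```python
from pathlib import Path

from ingest_validation_tools.validation_utils import get_tsv_errors

# Validate a single TSV with URL checks skipped. Per the new --no_url_checks
# help text, Spreadsheet Validator API checks still run, so a token is
# passed for the network calls that remain.
errors = get_tsv_errors(
    tsv_path=Path("my-metadata.tsv"),  # hypothetical input file
    schema_name="metadata",            # hypothetical schema name
    no_url_checks=True,
    globus_token="<globus-token>",     # placeholder
)
print(errors if errors else "No errors!")
```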