Jpuerto/nihdev 454 ivt adjust dir schema validation for shared uploads #1290

Merged
Changes from 9 commits

Commits (31)
ac74197
General: Add pseudo-code
Jan 23, 2024
54e457f
General: Update CHANGELOG
Jan 23, 2024
23af64a
General: Initial approach for shared directory support
Jan 24, 2024
a798d88
General: non-global != non_global
Jan 24, 2024
88d8fe0
General: Comments for changes
Jan 25, 2024
02755f2
General: Add example for shared uploads
Jan 25, 2024
ece10b5
Merge branch 'main' of https://github.com/hubmapconsortium/ingest-val…
Feb 1, 2024
c2d8820
General: Resolve changelog conflict with main
Feb 1, 2024
da58c61
General: Ignore multi_ref if it is a shared upload
Feb 2, 2024
7391f0f
General: Change how plugin_validation functions for shared uploads
Feb 2, 2024
42d0332
General: Fix typing issue
Feb 2, 2024
06e4745
General: Remove TODO - handled deeper than where it was placed.
Feb 2, 2024
f218c9e
General: Update upload.py to check for non_global_file existence
Feb 5, 2024
110f6a4
General: Fix upload.py
Feb 5, 2024
50feec5
General: Update docs/tests
Feb 5, 2024
0303cbf
General: Change error messaging to be more clear about where the erro…
Feb 5, 2024
cd09efb
General: Change error messaging to be more clear about where the erro…
Feb 5, 2024
72d84ce
General: Update syntax to pass formatting checks
Feb 5, 2024
bc1a311
Plugins: Update paths passed to be global and non_global
Feb 5, 2024
b6aa499
Upload: Change shared_directories to just be a bool, we don't use the…
Feb 6, 2024
454d224
Directory Validation: Fix error message for paths
Feb 6, 2024
5d8acab
General: Line length change
Feb 6, 2024
61d3e99
Merge branch 'main' of https://github.com/hubmapconsortium/ingest-val…
Feb 6, 2024
672687a
Directory Validation: Fix import
Feb 6, 2024
c38a076
General: Formatting changes
Feb 6, 2024
b2ea0ef
Merge branch 'main' of https://github.com/hubmapconsortium/ingest-val…
Feb 27, 2024
b0d12cb
General: Remove unused functions
Feb 27, 2024
820cec4
General: Reformatting
Feb 27, 2024
f73e03e
Merge branch 'main' of https://github.com/hubmapconsortium/ingest-val…
Mar 25, 2024
97b824f
General: Formatting changes
Mar 25, 2024
5a8fcb6
Tests: Update tests
Mar 25, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

## v0.0.18 (in progress)
- Update PhenoCycler directory schema
- Directory validation changes for "shared" uploads

## v0.0.17

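For readers new to the feature: a "shared" upload keeps files in top-level `global` and `non_global` directories instead of one `data_path` per dataset, and the metadata row points at `.` while listing its dataset-specific files in `non_global_files` (see the example metadata.tsv further down in this diff). Below is a minimal, illustrative sketch of such a layout built with the standard library; the individual file names are hypothetical, and only the `global`/`non_global` directory names and the geojson path come from this PR's fixtures.

```python
import tempfile
from pathlib import Path

# Illustrative only: build a minimal "shared" upload layout so the new
# detection logic has something to find. File names other than the
# global/non_global directory names are hypothetical.
upload_root = Path(tempfile.mkdtemp()) / "upload"
(upload_root / "global").mkdir(parents=True)
(upload_root / "non_global" / "lab_processed" / "images").mkdir(parents=True)
(upload_root / "global" / "experiment.qptiff").touch()  # hypothetical shared raw data
(upload_root / "non_global" / "lab_processed" / "images" / "1-tissue-boundary.geojson").touch()
(upload_root / "metadata.tsv").touch()  # row would use data_path "." and non_global_files

# The PR detects a shared upload by these two directory names at the root.
shared = {p.name for p in upload_root.glob("*") if p.is_dir()}
print(shared == {"global", "non_global"})  # True
```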
@@ -0,0 +1 @@
{"PhenoCycler": {"args": ["HBM894.RZFJ.749", "", "https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1", "PhenoCycler", "DNA", "Yes", "Andor", "Axio Observer 7", "1", "day", "", "", "./contributors.tsv", ".", "./antibodies.tsv", "11", "hour", "1", "1", "1", "1", "", "NAKATPASE", "DAPI", "./lab_processed/images/1-tissue-boundary.geojson", "62af6829-743d-423e-a701-204710e56beb"], "response": {"assaytype": "phenocycler", "contains-pii": false, "dataset-type": "PhenoCycler", "description": "PhenoCycler", "dir-schema": "phenocycler-v2", "primary": true, "vitessce-hints": []}}}
@@ -0,0 +1,11 @@
No errors!
Time: WILL_CHANGE
Git version: WILL_CHANGE
Directory: examples/dataset-iec-examples/good-cedar-phenocycler-shared/upload
TSVs:
metadata.tsv:
Schema: phenocycler-v2
Metadata schema version: '2'
Directory schema versions: phenocycler-v2


@@ -0,0 +1,11 @@
No errors!
Time: WILL_CHANGE
Git version: WILL_CHANGE
Directory: examples/dataset-iec-examples/good-cedar-phenocycler-shared/upload
TSVs:
metadata.tsv:
Schema: phenocycler-v2
Metadata schema version: '2'
Directory schema versions: phenocycler-v2


@@ -0,0 +1,2 @@
channel_id hgnc_symbol antibody_rrid uniprot_accession_number lot_number dilution_factor antibody_concentration_value antibody_concentration_unit conjugated_cat_number conjugated_tag metadata_schema_id
1 ID2B AB_10002075 Q9NNX6 GR3238979-1 312f7be0-9aec-4cae-b942-a8864c0aa1ce
@@ -0,0 +1,2 @@
first_name last_name middle_name_or_initial display_name affiliation orcid email is_contact is_principal_investigator is_operator metadata_schema_id
Juan Puerto Juan Puerto PSC 0000-0002-8928-741X Yes No No 94dae6f8-0756-4ab0-a47b-138e446a9501
@@ -0,0 +1,2 @@
parent_sample_id lab_id preparation_protocol_doi dataset_type analyte_class is_targeted acquisition_instrument_vendor acquisition_instrument_model source_storage_duration_value source_storage_duration_unit time_since_acquisition_instrument_calibration_value time_since_acquisition_instrument_calibration_unit contributors_path data_path antibodies_path total_run_time_value total_run_time_unit number_of_antibodies number_of_channels number_of_biomarker_imaging_rounds number_of_total_imaging_rounds slide_id cell_boundary_marker_or_stain nuclear_marker_or_stain non_global_files metadata_schema_id
HBM894.RZFJ.749 https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1 PhenoCycler DNA Yes Andor Axio Observer 7 1 day ./contributors.tsv . ./antibodies.tsv 11 hour 1 1 1 1 NAKATPASE DAPI ./lab_processed/images/1-tissue-boundary.geojson 62af6829-743d-423e-a701-204710e56beb
41 changes: 22 additions & 19 deletions src/ingest_validation_tools/directory_validator.py
@@ -12,7 +12,7 @@ def __init__(self, errors):


def validate_directory(
path: Path, schema_files: List[Dict], dataset_ignore_globs: List[str] = []
paths: list[Path], schema_files: List[Dict], dataset_ignore_globs: List[str] = []
) -> None:
"""
Given a directory path, and a directory schema,
@@ -22,27 +22,30 @@ def validate_directory(
required_patterns, allowed_patterns = _get_required_allowed(schema_files)
dependencies = _get_dependencies(schema_files)
except KeyError as e:
raise DirectoryValidationErrors(f"Error finding patterns for {path}: {e}")
raise DirectoryValidationErrors(f"Error finding patterns for {','.join(paths)}: {e}")
required_missing_errors: List[str] = []
not_allowed_errors: List[str] = []
if not path.exists():
raise FileNotFoundError(0, "No such file or directory", str(path))
actual_paths = []
for triple in os.walk(path):
(dir_path, _, file_names) = triple
# [1:] removes leading '/', if any.
prefix = dir_path.replace(str(path), "")[1:]
if os.name == "nt":
# Convert MS backslashes to forward slashes.
prefix = prefix.replace("\\", "/")
# If this is not the root of the path and is a leaf directory
if not file_names and prefix:
actual_paths += [f"{prefix}/"]
# Otherwise this should be a branch directory
else:
actual_paths += (
[f"{prefix}/{name}" for name in file_names] if prefix else file_names
)

for path in paths:
if not path.exists():
raise FileNotFoundError(0, "No such file or directory", str(path))

for triple in os.walk(path):
(dir_path, _, file_names) = triple
# [1:] removes leading '/', if any.
prefix = dir_path.replace(str(path), "")[1:]
if os.name == "nt":
# Convert MS backslashes to forward slashes.
prefix = prefix.replace("\\", "/")
# If this is not the root of the path and is a leaf directory
if not file_names and prefix:
actual_paths += [f"{prefix}/"]
# Otherwise this should be a branch directory
else:
actual_paths += (
[f"{prefix}/{name}" for name in file_names] if prefix else file_names
)

"""TODO: message_munger adds periods at the end of these messages
which is very confusing for regex! Also human readability of required_patterns
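The heart of this change: `validate_directory` now accepts a list of roots and merges their relative file listings before matching them against the schema's required/allowed patterns. A self-contained sketch of that merge, with illustrative names (not the library function itself):

```python
import os
from pathlib import Path
from typing import List


def collect_relative_paths(roots: List[Path]) -> List[str]:
    """Walk each root and return schema-relative paths, mirroring the merge above."""
    actual_paths: List[str] = []
    for root in roots:
        if not root.exists():
            raise FileNotFoundError(0, "No such file or directory", str(root))
        for dir_path, _, file_names in os.walk(root):
            prefix = dir_path.replace(str(root), "")[1:]  # strip leading separator
            if os.name == "nt":
                prefix = prefix.replace("\\", "/")  # convert MS backslashes
            if not file_names and prefix:
                actual_paths.append(f"{prefix}/")  # empty leaf directory
            else:
                actual_paths += [f"{prefix}/{n}" for n in file_names] if prefix else list(file_names)
    return actual_paths


# For a shared upload the caller passes both shared roots, e.g.:
# collect_relative_paths([upload_root / "global", upload_root / "non_global"])
```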
32 changes: 25 additions & 7 deletions src/ingest_validation_tools/upload.py
@@ -56,6 +56,7 @@ def __init__(
extra_parameters: Union[dict, None] = None,
globus_token: str = "",
run_plugins: bool = False,
shared_directories: Union[set, None] = None,
):
self.directory_path = directory_path
self.optional_fields = optional_fields
@@ -92,6 +93,12 @@ def __init__(

self._check_multi_assay()

self.shared_directories = {
x
for x in self.directory_path.glob("*global")
if x.is_dir() and x.name in ["global", "non_global"]
}

except PreflightError as e:
self.errors["Preflight"] = e

@@ -303,17 +310,22 @@ def _get_multi_assay_dir_errors(
return errors

def _multi_assay_dir_check(self, schema: SchemaVersion, data_path: str) -> Dict:
data_dir_path = Path(data_path)
errors = {}
abs_data_path = self.directory_path / data_path

if not schema.dir_schema:
raise Exception(
f"No directory schema found for data_path {abs_data_path} in {schema.path}!"
f"No directory schema found for data_path {self.directory_path / data_path} "
f"in {schema.path}!"
)

ref_errors = get_data_dir_errors(
schema.dir_schema,
abs_data_path,
root_path=self.directory_path,
data_dir_path=data_dir_path,
dataset_ignore_globs=self.dataset_ignore_globs,
)

if ref_errors:
errors[f"{schema.path}, column 'data_path', value {data_path}"] = ref_errors
return errors
@@ -377,6 +389,7 @@ def _validate(
def _get_reference_errors(self) -> dict:
errors: Dict[str, Any] = {}
no_ref_errors = self.__get_no_ref_errors()
# TODO: Ignore multi_ref_errors if its a shared upload
multi_ref_errors = self.__get_multi_ref_errors()
if no_ref_errors:
errors["No References"] = no_ref_errors
@@ -678,17 +691,22 @@ def _get_ref_errors(
def _check_data_path(
self, schema_version: SchemaVersion, metadata_path: Path, path_value: str
):
data_path = Path(path_value)
errors = {}
data_path = self.directory_path / path_value

if not schema_version.dir_schema:
raise Exception(
f"No directory schema found for data_path {data_path} in {metadata_path}!"
f"No directory schema found for data_path "
f"{self.directory_path / data_path} in {metadata_path}!"
)

ref_errors = get_data_dir_errors(
schema_version.dir_schema,
data_path,
root_path=self.directory_path,
data_dir_path=data_path,
dataset_ignore_globs=self.dataset_ignore_globs,
)

if ref_errors:
errors[
f"{str(metadata_path)}, column 'data_path', value '{path_value}'"
@@ -764,7 +782,7 @@ def __get_multi_ref_errors(self) -> dict:
]
for path, references in data_references.items():
if path not in multi_references:
if len(references) > 1:
if len(references) > 1 and not self.shared_directories:
errors[path] = references
return errors

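One behavioural note from the `__get_multi_ref_errors` hunk above: in a shared upload, several metadata rows legitimately reference the same path (typically `.`), so multi-reference errors are suppressed whenever shared directories were detected. A small standalone sketch of that gate (names are illustrative, and the real check uses the truthiness of the detected directory set rather than a bool):

```python
from typing import Dict, List


def multi_ref_errors(
    data_references: Dict[str, List[str]],
    multi_assay_refs: Dict[str, List[str]],
    is_shared_upload: bool,
) -> Dict[str, List[str]]:
    """Flag paths referenced by more than one row, unless this is a shared upload."""
    errors = {}
    for path, references in data_references.items():
        if path in multi_assay_refs:
            continue  # multi-assay references are handled elsewhere
        if len(references) > 1 and not is_shared_upload:
            errors[path] = references
    return errors


# Two rows pointing at "." is an error for a normal upload, but not for a shared one.
refs = {".": ["metadata.tsv row 1", "metadata.tsv row 2"]}
assert multi_ref_errors(refs, {}, is_shared_upload=False) == refs
assert multi_ref_errors(refs, {}, is_shared_upload=True) == {}
```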
26 changes: 22 additions & 4 deletions src/ingest_validation_tools/validation_utils.py
@@ -183,12 +183,28 @@ def read_rows(path: Path, encoding: str) -> List:

def get_data_dir_errors(
dir_schema: str,
data_path: Path,
root_path: Path,
data_dir_path: Path,
dataset_ignore_globs: List[str] = [],
) -> Optional[dict]:
"""
Validate a single data_path.
"""
expected_shared_directories = {'global', 'non_global'}
# Default to the most common case: a single data_path
data_paths = [root_path / data_dir_path]

# Check to see whether the shared upload directories exist
shared_directories = {x for x in root_path.glob('*')
if x.is_dir() and x.name in expected_shared_directories}

# Check whether the names of those paths match the set of expected
# shared directories above
if {x.name for x in shared_directories} == expected_shared_directories:
# If both exist, validate the shared directories instead
data_paths = list(shared_directories)
# Otherwise, fall back to the predefined data_path

schema = get_directory_schema(dir_schema=dir_schema)

if schema is None:
@@ -203,22 +219,24 @@
else None
)

printable_data_paths = [x.as_posix() for x in data_paths]

try:
validate_directory(
data_path, schema["files"], dataset_ignore_globs=dataset_ignore_globs
data_paths, schema["files"], dataset_ignore_globs=dataset_ignore_globs
)
except DirectoryValidationErrors as e:
# If there are DirectoryValidationErrors and the schema is deprecated/draft...
# schema deprecation/draft status is more important.
if schema_warning:
return schema_warning
errors = {}
errors[f"{data_path} (as {dir_schema})"] = e.errors
errors[f"{','.join(printable_data_paths)} (as {dir_schema})"] = e.errors
return errors
except OSError as e:
# If there are OSErrors and the schema is deprecated/draft...
# the OSErrors are more important.
return {f"{data_path} (as {dir_schema})": {e.strerror: e.filename}}
return {f"{','.join(printable_data_paths)} (as {dir_schema})": {e.strerror: e.filename}}
if schema_warning:
return schema_warning

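With the new signature, callers pass the upload root and the row's `data_path` separately, as the upload.py hunks above now do. A hedged usage sketch: the import path follows the repository layout in this diff, the example directory and `phenocycler-v2` schema come from the fixtures above, the ignore glob is an assumption, and running it requires the package and its directory schemas to be available.

```python
from pathlib import Path

from ingest_validation_tools.validation_utils import get_data_dir_errors

upload_root = Path("examples/dataset-iec-examples/good-cedar-phenocycler-shared/upload")

# For a shared upload, data_dir_path is effectively ignored and the
# global/ and non_global/ directories are validated instead.
errors = get_data_dir_errors(
    "phenocycler-v2",
    root_path=upload_root,
    data_dir_path=Path("."),
    dataset_ignore_globs=[".*"],  # assumption: skip hidden files
)
print(errors or "No errors!")
```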