Add validators for bboxes annotation files #32
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main      #32       +/-   ##
==========================================
+ Coverage    0.00%   96.42%   +96.42%
==========================================
  Files           1        3        +2
  Lines           5      112      +107
==========================================
+ Hits            0      108      +108
+ Misses          5        4        -1

☔ View full report in Codecov by Sentry.
Excellent work as usual @sfmig!
I see nothing fundamentally wrong with your approach here; it seems sensible to me.
The functionality is also sufficiently covered by tests imo.
I've left a few specific comments/questions/suggestions, but nothing major, so I will pre-emptively approve this PR, and let you decide on what to do about each suggestion.
f"Error decoding JSON data from file: {filepath}. " | ||
"The data being deserialized is not a valid JSON. " |
I personally find the original error message (the one you override) very cryptic, so imo it's worth doing the override.
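For anyone reading along, this is the kind of pattern I mean (a rough sketch, not the exact code in this PR; the function name and the wrapping in ValueError are just illustrative):

import json
from pathlib import Path


def _load_json(filepath: Path) -> dict:
    """Load a JSON file, re-raising decode errors with a clearer message."""
    try:
        with open(filepath) as file:
            return json.load(file)
    except json.JSONDecodeError as decode_err:
        # Replace the (rather cryptic) default message with one that
        # points at the offending file
        raise ValueError(
            f"Error decoding JSON data from file: {filepath}. "
            "The data being deserialized is not a valid JSON. "
        ) from decode_err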
except Exception as error:
    raise error
I think you can simply remove these lines, because there is no additional logic added to catching the generic `Exception`. This is equivalent to letting Python handle it, I think.
-except Exception as error:
-    raise error
from pathlib import Path

import jsonschema
import jsonschema.exceptions
It's purely a matter of taste, but I would do `from jsonschema.exceptions import SchemaError, ValidationError`.
But I can also see the merits of how you've done it, so feel free to ignore.
Or if you take my next suggestion, no need to import these at all.
try:
    jsonschema.validate(instance=data, schema=schema)
except jsonschema.exceptions.ValidationError as val_err:
    raise val_err
except jsonschema.exceptions.SchemaError as schema_err:
    raise schema_err
Since you are just re-raising the same exceptions you are catching, is there any point to this try-except block?
-try:
-    jsonschema.validate(instance=data, schema=schema)
-except jsonschema.exceptions.ValidationError as val_err:
-    raise val_err
-except jsonschema.exceptions.SchemaError as schema_err:
-    raise schema_err
+jsonschema.validate(instance=data, schema=schema)
My point being, there is no point in doing error handling here if you are not going to change the error type, add a message, or "absorb" the error instead of raising it. In this particular case here, I would completely do away with it, as per my suggestion.
"a key may not be found correctly if the schema keywords " | ||
"(such as 'properties', 'type' or 'items') are not spelt " | ||
"correctly." |
Is this a common occurrence? I wonder if we can do without the second sentence.
The first sentence clearly tells you what the problem is, and checking for spelling errors in the schema would be a common-sense place to start debugging.
But it doesn't hurt to be extra helpful, I guess?
)


def _extract_properties_keys(schema: dict, parent_key="") -> list:
This function is quite complex to read and understand, but I think that's just a reflection of the complexity of the task you are doing.
I tried rewriting it in a different way. My version ended up shorter but arguably harder to understand (it includes recursive calls to the same function). Anyway, I will leave my version below in case you are interested, but I don't think it's worth changing.
Note that the existing tests still pass with my version, but I haven't verified 100% that it behaves the same way as yours in all cases.
My attempt
def _extract_properties_keys(schema: dict) -> list:
    """Extract keys from all "properties" dictionaries in a JSON schema.

    Traverses a JSON schema and collects all property keys, including nested
    ones. Returns them as a sorted list of strings with full paths
    (e.g. 'parent/child').
    """

    def _collect_keys(current_schema: dict, prefix: str = "") -> list:
        result: list[str] = []
        # Skip if not an object schema
        if (
            not isinstance(current_schema, dict)
            or "type" not in current_schema
        ):
            return result
        # Handle properties
        if "properties" in current_schema:
            for key, value in current_schema["properties"].items():
                full_key = f"{prefix}/{key}" if prefix else key
                result.append(full_key)
                # Recurse into nested properties
                result.extend(_collect_keys(value, full_key))
        # Handle additionalProperties
        if "additionalProperties" in current_schema:
            props = current_schema["additionalProperties"]
            result.extend(_collect_keys(props, prefix))
        # Handle array items
        if "items" in current_schema:
            result.extend(_collect_keys(current_schema["items"], prefix))
        return result

    return sorted(_collect_keys(schema))
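For reference, here is the kind of output I get with my version above for a toy schema (made up for illustration; I'd expect your implementation to return the same):

schema = {
    "type": "object",
    "properties": {
        "images": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "file_name": {"type": "string"},
                },
            },
        },
    },
}

print(_extract_properties_keys(schema))
# ['images', 'images/file_name', 'images/id']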
def _get_default_VIA_schema() -> dict:
    """Get the VIA schema as a dictionary."""
    via_schema_path = Path(__file__).parent / "schemas" / "VIA_schema.json"
    with open(via_schema_path) as file:
        via_schema_dict = json.load(file)
    return via_schema_dict


def _get_default_COCO_schema() -> dict:
    """Get the COCO schema as a dictionary."""
    coco_schema_path = Path(__file__).parent / "schemas" / "COCO_schema.json"
    with open(coco_schema_path) as file:
        coco_schema_dict = json.load(file)
    return coco_schema_dict
Not important, but you could also merge these two, and then call `_get_default_schema("VIA")` / `_get_default_schema("COCO")`.
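Something along these lines (a rough sketch; `_get_default_schema` is just my suggested name):

def _get_default_schema(schema_name: str) -> dict:
    """Get the default VIA or COCO schema as a dictionary."""
    # schema_name is expected to be "VIA" or "COCO", matching the
    # filenames under the schemas/ directory
    schema_path = (
        Path(__file__).parent / "schemas" / f"{schema_name}_schema.json"
    )
    with open(schema_path) as file:
        return json.load(file)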
# init=False makes the attribute to be unconditionally initialized
# with the specified default
I'm not sure what that comment means. Does it mean I cannot override the schema by passing a different dict?
In any case, this comment should be moved to the first occurrence of `init=False` in this file.
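For what it's worth, this is how I read init=False, illustrated with a plain dataclass (a hypothetical, stripped-down stand-in for the actual class; attrs has the same semantics if that's what you are using):

from dataclasses import dataclass, field


def _get_default_VIA_schema() -> dict:
    """Stub standing in for the real schema loader in this PR."""
    return {"type": "object"}


@dataclass
class ValidVIA:
    """Hypothetical, stripped-down version of the validator class."""

    path: str
    # init=False: `schema` is not a constructor argument; it is always
    # initialised with the factory default below
    schema: dict = field(init=False, default_factory=_get_default_VIA_schema)


ValidVIA("annotations.json")  # fine, schema is the default VIA schema
# ValidVIA("annotations.json", schema={})  # TypeError: unexpected keyword argument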
The section `_via_image_id_list` contains an ordered list of image keys using a unique key: `FILENAME-FILESIZE`, the position in the list defines the image ID.

The section `_via_attributes` region attributes and file attributes, to display in VIA's UI and to classify the data.
feels like a verb is missing here, "contains"?
    invalid_input_file: str,
    validator: type[ValidVIA | ValidCOCO],
    expected_exception: pytest.raises,
    log_message: str,
What you are actually checking should be called "error_message", not "log_message", right? There is no logging set up yet, and you are not checking the logs, just the exception info.
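i.e. what gets asserted is along these lines (a sketch with made-up names and an assumed call signature):

import pytest


def test_invalid_file_raises(
    invalid_input_file, validator, expected_exception, error_message
):
    """The validator should raise an informative exception for bad files."""
    with pytest.raises(expected_exception) as excinfo:
        validator(invalid_input_file)  # assumed call signature
    # We check the exception text, not log records
    assert error_message in str(excinfo.value)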
Rebase after #27
Description
What is this PR
Why is this PR needed?
To validate input data files with bounding boxes annotations
What does this PR do?
- Adds an `annotations/validator` module to define classes for valid JSON files with bounding boxes annotations.
- Adds an `annotations/json_schemas` module with the default schemas for COCO and VIA, and a `utils` module with helper functions.
- Adds tests that check the required keys are defined in the shipped schemas (`test_required_keys_in_COCO_schema`, `test_required_keys_in_VIA_schema`). This is because the validation against a schema only checks the type of the keys in the schema that also appear in the file. So, for example, if for any reason the schemas that we ship become empty dictionaries, the validation of a file against those schemas will still pass (see the toy example below).
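For illustration (a toy example, not part of the changes):

import jsonschema

# An empty schema imposes no constraints, so any instance validates
jsonschema.validate(instance={"not": "an annotations file"}, schema={})

# Only keys present in the schema are checked; extra keys in the file pass
schema = {"type": "object", "properties": {"images": {"type": "array"}}}
jsonschema.validate(instance={"annotations": 42}, schema=schema)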
Question
I am tempted to do away with the "check file is JSON" check (via `_check_file_is_json`). The only reason I have it now is that the error message we overwrite is slightly clearer, but to be honest I am not sure it is worth it. Any other benefit to doing this that I may be missing?
References
#18
How has this PR been tested?
Tests pass locally and in CI.
Is this a breaking change?
No.
Does this PR require an update to the documentation?
Not for now.
Checklist: