Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validators for bboxes annotation files #32

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

sfmig
Copy link
Contributor

@sfmig sfmig commented Jan 21, 2025

Rebase after #27

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
To validate input data files with bounding boxes annotations

What does this PR do?

  • Adds an annotations/validator module to define classes for valid JSON files with bounding boxes annotations
    • We support two formats: COCO-JSON and VIA-JSON
    • We use JSON schemas (via the jsonschema package) to check the types of the fields defined in a JSON file.
    • I added an annotations/json_schemas module with the default schemas for COCO and VIA and a utils module with helper functions.
  • Adds tests
    • I include a test to check the schemas that we ship include the minimum required data (test_required_keys_in_COCO_schema, test_required_keys_in_VIA_schema). This is because the validation against a schema only checks the type for the keys in the schema that appear also in the file. So for example, if for any reason the schemas that we ship become empty dictionaries, the validation of a file against those schema will still pass.

Question
I am tempted to do away with the "check file is JSON" check (via _check_file_is_json) - the only reason I have it now is because the error message we overwrite is slightly more clear, but to be honest I am not sure it is worth it.

Any other benefit to doing this that I may be missing?

References

#18

How has this PR been tested?

Tests pass locally and in CI.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Not for now.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • [ n/a ] The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

Copy link

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 98.13084% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.42%. Comparing base (bd43585) to head (272d617).

Files with missing lines Patch % Lines
ethology/annotations/json_schemas/utils.py 96.72% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            main      #32       +/-   ##
==========================================
+ Coverage   0.00%   96.42%   +96.42%     
==========================================
  Files          1        3        +2     
  Lines          5      112      +107     
==========================================
+ Hits           0      108      +108     
+ Misses         5        4        -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sfmig sfmig changed the title Add validators for bboxes files Add validators for bboxes annotation files Jan 21, 2025
@sfmig sfmig requested a review from niksirbi January 21, 2025 19:13
Copy link
Member

@niksirbi niksirbi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work as usual @sfmig!

I see nothing fundamentally wrong with your approach here, seems sensible to me.
The functionality is also sufficiently covered by tests imo.

I've left a few specific comments/questions/suggestions, but nothing major, so I will pre-emptively approve this PR, and let you decide on what to do about each suggestion.

Comment on lines +34 to +35
f"Error decoding JSON data from file: {filepath}. "
"The data being deserialized is not a valid JSON. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find the original error message (the one you override) very cryptic, so imo it's worth doing the override.

Comment on lines +37 to +38
except Exception as error:
raise error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simply remove these lines, because there is no additional logic added to catching the generic Exception. This is equivalent to letting Python handle it, I think.

Suggested change
except Exception as error:
raise error

from pathlib import Path

import jsonschema
import jsonschema.exceptions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's purely a matter of taste, but I would do from jsonschema.exceptions import SchemaError, ValidationError
But I can also see the merits of how you've done it, so feel free to ignore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or if you take my next suggestion, no need to import these at all.

Comment on lines +54 to +59
try:
jsonschema.validate(instance=data, schema=schema)
except jsonschema.exceptions.ValidationError as val_err:
raise val_err
except jsonschema.exceptions.SchemaError as schema_err:
raise schema_err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are just re-raising the same exceptions you are catching, is there any point to this try-expect block?

Suggested change
try:
jsonschema.validate(instance=data, schema=schema)
except jsonschema.exceptions.ValidationError as val_err:
raise val_err
except jsonschema.exceptions.SchemaError as schema_err:
raise schema_err
jsonschema.validate(instance=data, schema=schema)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point being, there is no point in doing error handling here if you are not going to change the error type, add a message, or "absorb" the error instead of raising it. In this particular case here, I would completely do away with it, as per my suggestion.

Comment on lines +79 to +81
"a key may not be found correctly if the schema keywords "
"(such as 'properties', 'type' or 'items') are not spelt "
"correctly."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a common occurrence? I wonder if we can do without the second sentence.
The first sentence clearly tells you what the problem is and checking for spelling errors in the schema would be a common-sene place to start debugging.
But it doesn't hurt to be extra helpful I guess?

)


def _extract_properties_keys(schema: dict, parent_key="") -> list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is quite complex to read and understand, but I think that's just a reflection of the complexity of the task you are doing.

I tried rewriting it in a different way. My version ended up shorter but arguably harder to understand (it includes recursive calls to the same function). Anyway, I will leave my version below in case you are interested, but I don't think it's worth changing.

Note that existing tests still pass with my version, but haven't verified 100% that it behaves the same way as yours in all cases.

My attempt
def _extract_properties_keys(schema: dict) -> list:
    """Extract keys from all "properties" dictionaries in a JSON schema.

    Traverses a JSON schema and collects all property keys, including nested
    ones. Returns them as a sorted list of strings with full paths
    (e.g. 'parent/child').
    """

    def _collect_keys(current_schema: dict, prefix: str = "") -> list:
        result: list[str] = []

        # Skip if not an object schema
        if (
            not isinstance(current_schema, dict)
            or "type" not in current_schema
        ):
            return result

        # Handle properties
        if "properties" in current_schema:
            for key, value in current_schema["properties"].items():
                full_key = f"{prefix}/{key}" if prefix else key
                result.append(full_key)
                # Recurse into nested properties
                result.extend(_collect_keys(value, full_key))

        # Handle additionalProperties
        if "additionalProperties" in current_schema:
            props = current_schema["additionalProperties"]
            result.extend(_collect_keys(props, prefix))

        # Handle array items
        if "items" in current_schema:
            result.extend(_collect_keys(current_schema["items"], prefix))

        return result

    return sorted(_collect_keys(schema))

Comment on lines +10 to +23
def _get_default_VIA_schema() -> dict:
"""Get the VIA schema as a dictionary."""
via_schema_path = Path(__file__).parent / "schemas" / "VIA_schema.json"
with open(via_schema_path) as file:
via_schema_dict = json.load(file)
return via_schema_dict


def _get_default_COCO_schema() -> dict:
"""Get the COCO schema as a dictionary."""
coco_schema_path = Path(__file__).parent / "schemas" / "COCO_schema.json"
with open(coco_schema_path) as file:
coco_schema_dict = json.load(file)
return coco_schema_dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important, but you could also merge these two, and then call _get_default_schema("VIA") / _get_default_schema("COCO")

Comment on lines +142 to +143
# init=False makes the attribute to be unconditionally initialized
# with the specified default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what that comment means. Does it mean I cannot overwrite the schema passing a different dict?
In any case, this comment should be moved to the first occurrence of init=False in this file.


The section `_via_image_id_list` contains an ordered list of image keys using a unique key: `FILENAME-FILESIZE`, the position in the list defines the image ID.

The section `_via_attributes` region attributes and file attributes, to display in VIA's UI and to classify the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like a verb is missing here, "contains"?

invalid_input_file: str,
validator: type[ValidVIA | ValidCOCO],
expected_exception: pytest.raises,
log_message: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what you are actualling checking should be called "error_message" not "log_message", write. There is no logging setup yet, and you are not checking tha logs, just the exception info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants