Add validators for bboxes annotation files #32

Open

sfmig wants to merge 14 commits into main from smg/annotations-validators

Contributor

sfmig commented Jan 21, 2025 •


Rebase after #27

Description

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?
To validate input data files with bounding boxes annotations

What does this PR do?

Adds an annotations/validators module to define classes for valid JSON files with bounding box annotations
- We support two formats: COCO-JSON and VIA-JSON
- We use JSON schemas (via the jsonschema package) to check the types of the fields defined in a JSON file.
- I added an annotations/json_schemas module with the default schemas for COCO and VIA and a utils module with helper functions.
Adds tests
- I include tests to check that the schemas we ship include the minimum required data (test_required_keys_in_COCO_schema, test_required_keys_in_VIA_schema). This is because validation against a schema only checks the types of the keys that are defined in the schema and also appear in the file. So, for example, if for any reason the schemas we ship become empty dictionaries, validating a file against those schemas will still pass.
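
To make the last point concrete, here is a minimal sketch of how `jsonschema.validate` behaves with a permissive schema. The schema and the `is_valid` helper are made up for illustration and are not part of this PR; the sketch assumes the `jsonschema` package is installed.

```python
import jsonschema
from jsonschema.exceptions import ValidationError

# Hypothetical schema: "width" must be an integer *if it is present*.
# There is no "required" keyword, so a missing key is not an error.
schema = {
    "type": "object",
    "properties": {"width": {"type": "integer"}},
}


def is_valid(instance: dict, schema: dict) -> bool:
    """Return True if `instance` passes validation against `schema`."""
    try:
        jsonschema.validate(instance=instance, schema=schema)
    except ValidationError:
        return False
    return True


# A key with the wrong type is caught.
wrong_type_passes = is_valid({"width": "ten"}, schema)

# A missing key is not caught, because the schema does not require it.
missing_key_passes = is_valid({}, schema)

# An empty schema places no constraints, so even the wrong type passes.
empty_schema_passes = is_valid({"width": "ten"}, {})
```

This is why a separate test on the shipped schemas themselves is useful: the validation step alone cannot detect a schema that has lost its keys.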

Question
I am tempted to do away with the "check file is JSON" check (via _check_file_is_json). The only reason I have it now is that the error message we overwrite is slightly clearer, but to be honest I am not sure it is worth it.

Any other benefit to doing this that I may be missing?

References

How has this PR been tested?

Tests pass locally and in CI.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Not for now.

Checklist:

The code has been tested locally
Tests have been added to cover all new functionality
[ n/a ] The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

sfmig added 7 commits

January 17, 2025 16:06


          Set up pooch registry fixture for tests

cb8962a


          Add annotation specific fixtures

9aaa6ab


          Delete placeholder test

e1e889d


          Recover placeholder for CI to pass

49fc060


          Add json schemas

9bdcffd


          Add validators for VIA and COCO files

289e6d1


          Update MANIFEST

20b72f2

codecov bot commented Jan 21, 2025 •


Codecov Report

Attention: Patch coverage is 98.13084% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.42%. Comparing base (bd43585) to head (272d617).

Files with missing lines	Patch %	Lines
ethology/annotations/json_schemas/utils.py	96.72%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##            main      #32       +/-   ##
==========================================
+ Coverage   0.00%   96.42%   +96.42%     
==========================================
  Files          1        3        +2     
  Lines          5      112      +107     
==========================================
+ Hits           0      108      +108     
+ Misses         5        4        -1


sfmig added 7 commits

January 21, 2025 17:56


          Add tests for supported validators

599fe87


          Combine validators tests

02e714c


          Simplify JSON check

91addb1


          Delete placeholder

f448c57


          Rename schemas

e01b993


          Small edits caps

534d32b


          Update docstrings

272d617

sfmig changed the title ~~Add validators for bboxes files~~ Add validators for bboxes annotation files

sfmig requested a review from niksirbi

January 21, 2025 19:13

sfmig mentioned this pull request

Read bounding boxes data as a dataframe #31

Draft

7 tasks

niksirbi approved these changes


Member

niksirbi left a comment

Excellent work as usual @sfmig!

I see nothing fundamentally wrong with your approach here, seems sensible to me.
The functionality is also sufficiently covered by tests imo.

I've left a few specific comments/questions/suggestions, but nothing major, so I will pre-emptively approve this PR, and let you decide on what to do about each suggestion.

ethology/annotations/json_schemas/utils.py

Comment on lines +34 to +35

		f"Error decoding JSON data from file: {filepath}. "
		"The data being deserialized is not a valid JSON. "

Member

niksirbi Jan 22, 2025

I personally find the original error message (the one you override) very cryptic, so imo it's worth doing the override.
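
For reference, a minimal sketch of what such an override can look like. The function name `check_file_is_json` and the choice of `ValueError` are assumptions for illustration; the excerpt above only shows the message text, not the exception type the PR raises.

```python
import json
import tempfile
from pathlib import Path


def check_file_is_json(filepath: Path) -> None:
    """Raise a ValueError with a readable message if the file is not valid JSON."""
    try:
        with open(filepath) as file:
            json.load(file)
    except json.JSONDecodeError as decode_error:
        # Chain the original error so the parser's position info is not lost.
        raise ValueError(
            f"Error decoding JSON data from file: {filepath}. "
            "The data being deserialized is not a valid JSON."
        ) from decode_error


# Usage: a file with a trailing comma is not valid JSON.
with tempfile.TemporaryDirectory() as tmp_dir:
    bad_file = Path(tmp_dir) / "bad.json"
    bad_file.write_text('{"images": [1, 2,]}')
    try:
        check_file_is_json(bad_file)
    except ValueError as error:
        message = str(error)
        original = error.__cause__
```

Using `raise ... from ...` keeps the original `json.JSONDecodeError` available as `__cause__`, so the friendlier message does not hide the line and column reported by the parser.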

ethology/annotations/json_schemas/utils.py

Comment on lines +37 to +38

		except Exception as error:
		    raise error

Member

niksirbi Jan 22, 2025

I think you can simply remove these lines, because there is no additional logic added to catching the generic Exception. This is equivalent to letting Python handle it, I think.

Suggested change

      
-               except Exception as error:
-                   raise error

ethology/annotations/json_schemas/utils.py

+              from pathlib import Path
+              import jsonschema
+              import jsonschema.exceptions

Member

niksirbi Jan 22, 2025

It's purely a matter of taste, but I would do from jsonschema.exceptions import SchemaError, ValidationError
But I can also see the merits of how you've done it, so feel free to ignore.

Member

niksirbi Jan 23, 2025

Or if you take my next suggestion, no need to import these at all.

ethology/annotations/json_schemas/utils.py

Comment on lines +54 to +59

+                      try:
+                          jsonschema.validate(instance=data, schema=schema)
+                      except jsonschema.exceptions.ValidationError as val_err:
+                          raise val_err
+                      except jsonschema.exceptions.SchemaError as schema_err:
+                          raise schema_err

Member

niksirbi Jan 22, 2025

Since you are just re-raising the same exceptions you are catching, is there any point to this try-except block?

Suggested change

      
-                      try:
-                          jsonschema.validate(instance=data, schema=schema)
-                      except jsonschema.exceptions.ValidationError as val_err:
-                          raise val_err
-                      except jsonschema.exceptions.SchemaError as schema_err:
-                          raise schema_err
+                      jsonschema.validate(instance=data, schema=schema)

Member

niksirbi Jan 23, 2025

My point being, there is no point in doing error handling here if you are not going to change the error type, add a message, or "absorb" the error instead of raising it. In this particular case here, I would completely do away with it, as per my suggestion.
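
A small sketch of the equivalence described here. All names (`validate`, `with_reraise`, `without_handling`, `describe_failure`) are hypothetical stand-ins, and `KeyError` is used only so the example needs nothing beyond the standard library.

```python
def validate(data: dict) -> None:
    """Hypothetical stand-in for a validation call that may fail."""
    if "images" not in data:
        raise KeyError("images")


def with_reraise(data: dict) -> None:
    # Catching and immediately re-raising adds no behaviour...
    try:
        validate(data)
    except KeyError as error:
        raise error


def without_handling(data: dict) -> None:
    # ...so from the caller's point of view this is equivalent.
    validate(data)


def describe_failure(func) -> tuple:
    """Return the exception type and message that `func({})` produces."""
    try:
        func({})
    except Exception as error:
        return type(error), str(error)
    raise AssertionError("expected an exception")


same_outcome = describe_failure(with_reraise) == describe_failure(without_handling)
```

Both variants surface the same exception type and message to the caller. The only visible difference is an extra frame in the traceback for the re-raising version.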

ethology/annotations/json_schemas/utils.py

Comment on lines +79 to +81

+                          "a key may not be found correctly if the schema keywords "
+                          "(such as 'properties', 'type' or 'items') are not spelt "
+                          "correctly."

Member

niksirbi Jan 22, 2025

Is this a common occurrence? I wonder if we can do without the second sentence.
The first sentence clearly tells you what the problem is and checking for spelling errors in the schema would be a common-sense place to start debugging.
But it doesn't hurt to be extra helpful I guess?

ethology/annotations/json_schemas/utils.py

		)


		def _extract_properties_keys(schema: dict, parent_key="") -> list:

Member

niksirbi Jan 22, 2025

This function is quite complex to read and understand, but I think that's just a reflection of the complexity of the task you are doing.

I tried rewriting it in a different way. My version ended up shorter but arguably harder to understand (it includes recursive calls to the same function). Anyway, I will leave my version below in case you are interested, but I don't think it's worth changing.

Note that existing tests still pass with my version, but haven't verified 100% that it behaves the same way as yours in all cases.

My attempt

def _extract_properties_keys(schema: dict) -> list:
    """Extract keys from all "properties" dictionaries in a JSON schema.

    Traverses a JSON schema and collects all property keys, including nested
    ones. Returns them as a sorted list of strings with full paths
    (e.g. 'parent/child').
    """

    def _collect_keys(current_schema: dict, prefix: str = "") -> list:
        result: list[str] = []

        # Skip if not an object schema
        if (
            not isinstance(current_schema, dict)
            or "type" not in current_schema
        ):
            return result

        # Handle properties
        if "properties" in current_schema:
            for key, value in current_schema["properties"].items():
                full_key = f"{prefix}/{key}" if prefix else key
                result.append(full_key)
                # Recurse into nested properties
                result.extend(_collect_keys(value, full_key))

        # Handle additionalProperties
        if "additionalProperties" in current_schema:
            props = current_schema["additionalProperties"]
            result.extend(_collect_keys(props, prefix))

        # Handle array items
        if "items" in current_schema:
            result.extend(_collect_keys(current_schema["items"], prefix))

        return result

    return sorted(_collect_keys(schema))

ethology/annotations/json_schemas/utils.py

Comment on lines +10 to +23

+              def _get_default_VIA_schema() -> dict:
+                  """Get the VIA schema as a dictionary."""
+                  via_schema_path = Path(__file__).parent / "schemas" / "VIA_schema.json"
+                  with open(via_schema_path) as file:
+                      via_schema_dict = json.load(file)
+                  return via_schema_dict
+              def _get_default_COCO_schema() -> dict:
+                  """Get the COCO schema as a dictionary."""
+                  coco_schema_path = Path(__file__).parent / "schemas" / "COCO_schema.json"
+                  with open(coco_schema_path) as file:
+                      coco_schema_dict = json.load(file)
+                  return coco_schema_dict

Member

niksirbi Jan 23, 2025

Not important, but you could also merge these two, and then call _get_default_schema("VIA") / _get_default_schema("COCO")
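
A possible shape for the merged helper, as a sketch only. The `schemas_dir` parameter and the explicit check on supported names are assumptions added so the example can run against a temporary directory; the file naming follows the `VIA_schema.json` / `COCO_schema.json` pattern in the excerpt above.

```python
import json
import tempfile
from pathlib import Path


def get_default_schema(schema_name: str, schemas_dir: Path) -> dict:
    """Load `<schema_name>_schema.json` from `schemas_dir` as a dictionary."""
    supported = ("VIA", "COCO")
    if schema_name not in supported:
        raise ValueError(
            f"Unsupported schema '{schema_name}'; expected one of {supported}."
        )
    schema_path = schemas_dir / f"{schema_name}_schema.json"
    with open(schema_path) as file:
        return json.load(file)


# Usage with a temporary directory standing in for the packaged "schemas" folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    schemas_dir = Path(tmp_dir)
    (schemas_dir / "VIA_schema.json").write_text('{"type": "object"}')
    via_schema = get_default_schema("VIA", schemas_dir)
```

In the package itself the directory would presumably default to `Path(__file__).parent / "schemas"`, as in the two original functions.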

ethology/annotations/validators.py

Comment on lines +142 to +143

		# init=False makes the attribute to be unconditionally initialized
		# with the specified default

Member

niksirbi Jan 23, 2025

I'm not sure what that comment means. Does it mean I cannot overwrite the schema passing a different dict?
In any case, this comment should be moved to the first occurrence of init=False in this file.
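
On the question of what `init=False` implies, here is a sketch using the standard library's `dataclasses`. The excerpt does not show which library the PR's classes use (it may be a different one, such as attrs), and `ValidFile` is a hypothetical class, so this only illustrates the `dataclasses` behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ValidFile:
    """Hypothetical validator-like class, not the PR's actual class."""

    path: str
    # init=False: `schema` is not an __init__ parameter and always starts
    # from the default below.
    schema: dict = field(init=False, default_factory=lambda: {"type": "object"})


valid_file = ValidFile(path="annotations.json")
default_schema = dict(valid_file.schema)

# Passing the attribute to the constructor is rejected...
try:
    ValidFile(path="annotations.json", schema={})
    accepted_at_init = True
except TypeError:
    accepted_at_init = False

# ...but it can still be reassigned after construction (the class is not frozen).
valid_file.schema = {"type": "array"}
```

So with `dataclasses`, a different dict cannot be passed at construction time, but the attribute is not read-only afterwards unless the class is frozen.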

ethology/annotations/json_schemas/schemas/README.md


		The section `_via_image_id_list` contains an ordered list of image keys using a unique key: `FILENAME-FILESIZE`, the position in the list defines the image ID.

		The section `_via_attributes` region attributes and file attributes, to display in VIA's UI and to classify the data.

Member

niksirbi Jan 23, 2025

feels like a verb is missing here, "contains"?

tests/test_unit/test_annotations/test_validators.py

+                  invalid_input_file: str,
+                  validator: type[ValidVIA | ValidCOCO],
+                  expected_exception: pytest.raises,
+                  log_message: str,

Member

niksirbi Jan 23, 2025

what you are actually checking should be called "error_message", not "log_message", right? There is no logging set up yet, and you are not checking the logs, just the exception info.
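
A sketch of a check that asserts on the exception message rather than on logs, with the parameter named accordingly. `validate_extension` and `check_raises` are hypothetical and much simpler than the PR's parametrised test; the sketch assumes `pytest` is installed.

```python
import pytest


def validate_extension(filename: str) -> None:
    """Hypothetical validator that rejects non-JSON filenames."""
    if not filename.endswith(".json"):
        raise ValueError(f"Expected a .json file, got: {filename}")


def check_raises(
    filename: str,
    expected_exception: type[Exception],
    error_message: str,
) -> None:
    """Assert on the exception type and its message, not on any logs."""
    with pytest.raises(expected_exception) as excinfo:
        validate_extension(filename)
    assert error_message in str(excinfo.value)


check_raises("annotations.csv", ValueError, "Expected a .json file")
```

Here `excinfo.value` is the raised exception instance, so the comparison is against the exception's own message.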
