Add validators for bboxes annotation files #32

Open

wants to merge 14 commits into `main`
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -19,6 +19,7 @@ repos:
        args: [--fix=lf]
      - id: name-tests-test
        args: ["--pytest-test-first"]
+       exclude: ^tests/fixtures
      - id: requirements-txt-fixer
      - id: trailing-whitespace
  - repo: https://github.com/pre-commit/pygrep-hooks
4 changes: 4 additions & 0 deletions MANIFEST.in
@@ -6,3 +6,7 @@ recursive-exclude * __pycache__
recursive-exclude * *.py[co]
recursive-exclude docs *
recursive-exclude tests *
+
+# Include json schemas
+recursive-include ethology/annotations/json_schemas/schemas *.json
+recursive-include ethology/annotations/json_schemas/schemas *.md
Empty file.
78 changes: 78 additions & 0 deletions ethology/annotations/json_schemas/schemas/COCO_schema.json
@@ -0,0 +1,78 @@
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "info": {
            "type": "object"
        },
        "licenses": {
            "type": "array"
        },
        "images": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file_name": {
                        "type": "string"
                    },
                    "id": {
                        "type": "integer"
                    },
                    "width": {
                        "type": "integer"
                    },
                    "height": {
                        "type": "integer"
                    }
                }
            }
        },
        "annotations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {
                        "type": "integer"
                    },
                    "image_id": {
                        "type": "integer"
                    },
                    "bbox": {
                        "type": "array",
                        "items": {
                            "type": "integer"
                        }
                    },
                    "category_id": {
                        "type": "integer"
                    },
                    "area": {
                        "type": "number"
                    },
                    "iscrowd": {
                        "type": "integer"
                    }
                }
            }
        },
        "categories": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {
                        "type": "integer"
                    },
                    "name": {
                        "type": "string"
                    },
                    "supercategory": {
                        "type": "string"
                    }
                }
            }
        }
    }
}
32 changes: 32 additions & 0 deletions ethology/annotations/json_schemas/schemas/README.md
@@ -0,0 +1,32 @@
## JSON schemas for manual annotation files

We use JSON schemas to validate the types of the values in a supported annotation file.

Note that the schema validation only checks the type of a key's value if that key is present; it does not check that the keys themselves are present.

If the meta-schema (under `$schema`) is not provided, the `jsonschema` validator uses the latest released draft of the JSON schema specification.
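
For example, with a minimal made-up schema, validation passes when a key is absent and fails only when a present key has the wrong type (a sketch using the `jsonschema` package; the sample data is invented):

```python
import jsonschema

schema = {"type": "object", "properties": {"images": {"type": "array"}}}

jsonschema.validate(instance={}, schema=schema)  # passes: "images" is absent
jsonschema.validate(instance={"images": []}, schema=schema)  # passes
jsonschema.validate(instance={"images": 1}, schema=schema)  # ValidationError
```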

## VIA schema

The VIA schema corresponds to the format exported by VGG Image Annotator 2.x.y (VIA) for object detection annotations.

Each image under `_via_img_metadata` is indexed using a unique key, `FILENAME-FILESIZE`. We use `additionalProperties` to allow for any key name (see https://stackoverflow.com/a/69811612/24834957).

The section `_via_image_id_list` contains an ordered list of the image keys (`FILENAME-FILESIZE`); the position of a key in the list defines the image ID.

The section `_via_attributes` contains region attributes and file attributes, to display in VIA's UI and to classify the data.

Review comment (Member): feels like a verb is missing here, "contains"?


The section `_via_data_format_version` contains the version of the VIA tool used.


## COCO schema

The COCO schema follows the COCO dataset format for object detection (see https://cocodataset.org/#format-data).

Box coordinates are measured from the top left corner of the image, and are 0-indexed.
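
For example, a box whose top-left corner is at pixel (10, 20), with a width of 30 pixels and a height of 40, is encoded as `"bbox": [10, 20, 30, 40]`, following COCO's `[x, y, width, height]` convention.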

### References
- https://github.com/python-jsonschema/jsonschema
- https://json-schema.org/understanding-json-schema/
- https://cocodataset.org/#format-data
- https://gitlab.com/vgg/via/-/blob/master/via-2.x.y/CodeDoc.md?ref_type=heads#description-of-via-project-json-file
- https://python-jsonschema.readthedocs.io/en/stable/api/#jsonschema.validate
88 changes: 88 additions & 0 deletions ethology/annotations/json_schemas/schemas/VIA_schema.json
@@ -0,0 +1,88 @@
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
Review comment (Member): Any particular reason for using this specific schema version? I guess it's what VIA and COCO use (if they bother to specify)?

"type": "object",
"properties": {
"_via_settings": {
"type": "object",
"properties": {
"ui": {
"type": "object"
},
"core": {
"type": "object"
},
"project": {
"type": "object"
}
}
},
"_via_img_metadata": {
"type": "object",
"additionalProperties": {
"type": "object",
"properties": {
"filename": {
"type": "string"
},
"size": {
"type": "integer"
},
"regions": {
"type": "array",
"items": {
"type": "object",
"properties": {
"shape_attributes": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"x": {
"type": "integer"
},
"y": {
"type": "integer"
},
"width": {
"type": "integer"
},
"height": {
"type": "integer"
}
}
},
"region_attributes": {
"type": "object"
}
}
}
},
"file_attributes": {
"type": "object"
}
}
}
},
"_via_image_id_list": {
"type": "array",
"items": {
"type": "string"
}
},
"_via_attributes": {
"type": "object",
"properties": {
"region": {
"type": "object"
},
"file": {
"type": "object"
}
}
},
"_via_data_format_version": {
"type": "string"
}
}
}
159 changes: 159 additions & 0 deletions ethology/annotations/json_schemas/utils.py
@@ -0,0 +1,159 @@
"""Utility functions for JSON schema files."""

import json
from pathlib import Path

import jsonschema
import jsonschema.exceptions
Review comment (Member): It's purely a matter of taste, but I would do `from jsonschema.exceptions import SchemaError, ValidationError`. But I can also see the merits of how you've done it, so feel free to ignore.

Review comment (Member): Or if you take my next suggestion, no need to import these at all.



def _get_default_VIA_schema() -> dict:
    """Get the VIA schema as a dictionary."""
    via_schema_path = Path(__file__).parent / "schemas" / "VIA_schema.json"
    with open(via_schema_path) as file:
        via_schema_dict = json.load(file)
    return via_schema_dict


def _get_default_COCO_schema() -> dict:
    """Get the COCO schema as a dictionary."""
    coco_schema_path = Path(__file__).parent / "schemas" / "COCO_schema.json"
    with open(coco_schema_path) as file:
        coco_schema_dict = json.load(file)
    return coco_schema_dict
Review comment (Member, on lines +10 to +23): Not important, but you could also merge these two, and then call `_get_default_schema("VIA")` / `_get_default_schema("COCO")`.
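
A minimal sketch of the merged helper the reviewer suggests (hypothetical `_get_default_schema`, not part of this PR):

```python
# Hypothetical merged version of the two getters above (reviewer's suggestion).
def _get_default_schema(schema_name: str) -> dict:
    """Get the requested default schema ("VIA" or "COCO") as a dictionary."""
    schema_path = (
        Path(__file__).parent / "schemas" / f"{schema_name}_schema.json"
    )
    with open(schema_path) as file:
        return json.load(file)
```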



def _check_file_is_json(filepath: Path):
    """Check the input file can be read as a JSON."""
    try:
        with open(filepath) as file:
            json.load(file)
    except json.JSONDecodeError as decode_error:
        # We override the error message for clarity
        raise ValueError(
            f"Error decoding JSON data from file: {filepath}. "
            "The data being deserialized is not a valid JSON. "
Review comment (Member, on lines +34 to +35): I personally find the original error message (the one you override) very cryptic, so imo it's worth doing the override.

        ) from decode_error
    except Exception as error:
        raise error
Review comment (Member, on lines +37 to +38): I think you can simply remove these lines, because there is no additional logic added to catching the generic Exception. This is equivalent to letting Python handle it, I think.

Suggested change:
-    except Exception as error:
-        raise error
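
Applying the suggestion, the function would reduce to (a sketch, not part of this PR):

```python
def _check_file_is_json(filepath: Path):
    """Check the input file can be read as a JSON."""
    try:
        with open(filepath) as file:
            json.load(file)
    except json.JSONDecodeError as decode_error:
        # We override the error message for clarity
        raise ValueError(
            f"Error decoding JSON data from file: {filepath}. "
            "The data being deserialized is not a valid JSON. "
        ) from decode_error
```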



def _check_file_matches_schema(filepath: Path, schema: dict | None):
    """Check the input JSON file matches the given schema.

    The schema validation only checks the type for each specified
    key if the key exists. It does not check that the keys in the
    schema are present in the JSON file.
    """
    # Read json file
    with open(filepath) as file:
        data = json.load(file)

    # Check against schema if provided
    if schema:
        try:
            jsonschema.validate(instance=data, schema=schema)
        except jsonschema.exceptions.ValidationError as val_err:
            raise val_err
        except jsonschema.exceptions.SchemaError as schema_err:
            raise schema_err

Codecov / codecov/patch check warning on ethology/annotations/json_schemas/utils.py#L58-L59: Added lines #L58 - L59 were not covered by tests.
Review comment (Member, on lines +54 to +59): Since you are just re-raising the same exceptions you are catching, is there any point to this try-except block?

Suggested change:
-        try:
-            jsonschema.validate(instance=data, schema=schema)
-        except jsonschema.exceptions.ValidationError as val_err:
-            raise val_err
-        except jsonschema.exceptions.SchemaError as schema_err:
-            raise schema_err
+        jsonschema.validate(instance=data, schema=schema)

Review comment (Member): My point being, there is no point in doing error handling here if you are not going to change the error type, add a message, or "absorb" the error instead of raising it. In this particular case here, I would completely do away with it, as per my suggestion.
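
Applying the suggestion, the whole function would reduce to (a sketch, not part of this PR):

```python
def _check_file_matches_schema(filepath: Path, schema: dict | None):
    """Check the input JSON file matches the given schema."""
    with open(filepath) as file:
        data = json.load(file)

    if schema:
        jsonschema.validate(instance=data, schema=schema)
```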



def _check_required_properties_keys(
    required_properties_keys: list, schema: dict
):
    """Check the input schema includes the required "properties" keys."""
    # Get keys of "properties" dictionaries in schema
    properties_keys_in_schema = _extract_properties_keys(schema)

    # Get list of "properties" keys that are required but not in schema
    missing_keys = set(required_properties_keys) - set(
        properties_keys_in_schema
    )

    # Raise error if there are missing keys in the schema
    if missing_keys:
        raise ValueError(
            f"Required key(s) {sorted(missing_keys)} not found "
            "in schema. Note that "
            "a key may not be found correctly if the schema keywords "
            "(such as 'properties', 'type' or 'items') are not spelt "
            "correctly."
Review comment (Member, on lines +79 to +81): Is this a common occurrence? I wonder if we can do without the second sentence. The first sentence clearly tells you what the problem is, and checking for spelling errors in the schema would be a common-sense place to start debugging. But it doesn't hurt to be extra helpful, I guess?

        )


def _check_required_keys_in_dict(
    list_required_keys: list[str],
    data: dict,
    additional_message: str = "",
):
    """Check if the required keys are present in the input dictionary."""
    missing_keys = set(list_required_keys) - set(data.keys())
    if missing_keys:
        raise ValueError(
            f"Required key(s) {sorted(missing_keys)} not "
            f"found{additional_message}."
        )
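
For illustration, a made-up call with one required key missing (hypothetical arguments, not from this PR):

```python
_check_required_keys_in_dict(
    ["images", "annotations"],
    {"images": []},
    additional_message=" in COCO file",
)
# Raises:
# ValueError: Required key(s) ['annotations'] not found in COCO file.
```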


def _extract_properties_keys(schema: dict, parent_key="") -> list:
Review comment (Member): This function is quite complex to read and understand, but I think that's just a reflection of the complexity of the task you are doing.

I tried rewriting it in a different way. My version ended up shorter but arguably harder to understand (it includes recursive calls to the same function). Anyway, I will leave my version below in case you are interested, but I don't think it's worth changing.

Note that existing tests still pass with my version, but haven't verified 100% that it behaves the same way as yours in all cases.

My attempt
def _extract_properties_keys(schema: dict) -> list:
    """Extract keys from all "properties" dictionaries in a JSON schema.

    Traverses a JSON schema and collects all property keys, including nested
    ones. Returns them as a sorted list of strings with full paths
    (e.g. 'parent/child').
    """

    def _collect_keys(current_schema: dict, prefix: str = "") -> list:
        result: list[str] = []

        # Skip if not an object schema
        if (
            not isinstance(current_schema, dict)
            or "type" not in current_schema
        ):
            return result

        # Handle properties
        if "properties" in current_schema:
            for key, value in current_schema["properties"].items():
                full_key = f"{prefix}/{key}" if prefix else key
                result.append(full_key)
                # Recurse into nested properties
                result.extend(_collect_keys(value, full_key))

        # Handle additionalProperties
        if "additionalProperties" in current_schema:
            props = current_schema["additionalProperties"]
            result.extend(_collect_keys(props, prefix))

        # Handle array items
        if "items" in current_schema:
            result.extend(_collect_keys(current_schema["items"], prefix))

        return result

    return sorted(_collect_keys(schema))

"""Recursively extract the keys of all "properties" subdictionaries.

Recursively extract the keys of all subdictionaries in the input
dictionary that are values to a "properties" key. The input dictionary
represents a JSON schema dictionary
(see https://json-schema.org/understanding-json-schema/about).

The "properties" key always appears as part of a dictionary with at least
another key, that is "type" or "item".
"""
# The "property keys" are either "properties" or "additionalProperties"
# as they are the keys with the relevant data
property_keys = ["properties", "additionalProperties"]

def _contains_properties_key(input: dict):
"""Return True if the input dictionary contains a property key."""
return any(x in input for x in property_keys)

def _get_properties_subdict(input: dict):
"""Get the subdictionary under the property key."""
return input[next(k for k in property_keys if k in input)]

keys_of_properties_dicts = []
if "type" in schema:
if _contains_properties_key(schema):
# Get the subdictionary under the properties key
properties_subdict = _get_properties_subdict(schema)

# Check if there is a nested "properties" dict inside the current
# one. If so, go down one level.
if _contains_properties_key(properties_subdict):
properties_subdict = _get_properties_subdict(
properties_subdict
)

# Add keys of deepest "properties dict" to list
keys_of_properties_dicts.extend(
[
f"{parent_key}/{ky}" if parent_key else ky
for ky in properties_subdict
]
)

# Inspect non-properties dictionaries under this properties subdict
for ky, val in properties_subdict.items():
full_key = f"{parent_key}/{ky}" if parent_key else ky
keys_of_properties_dicts.extend(
_extract_properties_keys(val, full_key)
)

elif "items" in schema:
# Analyse the dictionary under the "items" key
properties_subdict = schema["items"]
keys_of_properties_dicts.extend(
_extract_properties_keys(
properties_subdict, parent_key=parent_key
)
)

return sorted(keys_of_properties_dicts)
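
For illustration, on a toy schema (made up for this example) the function returns slash-separated key paths:

```python
toy_schema = {
    "type": "object",
    "properties": {
        "images": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"id": {"type": "integer"}},
            },
        },
    },
}

print(_extract_properties_keys(toy_schema))
# ['images', 'images/id']
```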