Add class annotations and/or other metadata properties to labels #60

Open
DragaDoncila opened this issue Oct 31, 2020 · 17 comments

@DragaDoncila

Currently the labels spec supports the declaration of a label-value and its associated color.

Label values commonly carry other associated information, the most obvious being the class name. napari also supports display of label properties, so this would be a nice additional feature for the reader plugin.

I think the critical requirements for these properties should be:

  • Supporting an easy mapping between a given property and the label-value/s it is associated with
  • Enforcing as few rules as possible on what kinds of properties can be accepted
  • Supporting an arbitrary number of properties

There are three ways I can see the spec supporting these additional properties:

  1. An arbitrary number of lists, each corresponding to a property and holding at most n entries for a label image containing n label values. The index in the list corresponds to the integer label-value, e.g.
    "image-label": {
        "version": "0.1",
        "colors": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ]
            },
            {
                "label-value": 2,
                "rgba": [
                    0,
                    40,
                    200,
                    255
                ]
            },
            {
                "label-value": 3,
                "rgba": [
                    148,
                    50,
                    165,
                    255
                ]
            }
        ],
        "properties": [
            {
                "class": [
                    "Urban",
                    "Water",
                    "Agriculture"
                ],
                "area_m2":
                [
                    "400",
                    "1532",
                    "590"
                ]
            }
        ]
    }

I think this is the least explicit option, and less intuitive than the next two approaches.
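
To make the index-based mapping concrete, here is a minimal Python sketch of how a reader might unpack this layout. It assumes that list position i belongs to label-value i + 1, which is what the example suggests (three entries for labels 1 to 3) but is not stated explicitly in the proposal. The helper name `properties_by_label` is made up for illustration.

```python
def properties_by_label(image_label, first_label=1):
    """Turn Option 1's per-property lists into one dict per label-value.

    Assumption: position i in each list belongs to label-value
    first_label + i. The proposal does not spell out the offset.
    """
    by_label = {}
    for group in image_label.get("properties", []):
        for name, values in group.items():
            for i, value in enumerate(values):
                by_label.setdefault(first_label + i, {})[name] = value
    return by_label


image_label = {
    "version": "0.1",
    "properties": [
        {
            "class": ["Urban", "Water", "Agriculture"],
            "area_m2": ["400", "1532", "590"],
        }
    ],
}

label_props = properties_by_label(image_label)
print(label_props[2])  # {'class': 'Water', 'area_m2': '1532'}
```

The implicit offset is the part a reader has to guess, which is one reason this option feels less explicit than the others.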

  2. Declare another group similar to colors, where each label-value has its own associated properties:
{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "0"
                },
                {
                    "path": "1"
                },
                {
                    "path": "2"
                },
                {
                    "path": "3"
                }
            ],
            "version": "0.1"
        }
    ],
    "image-label": {
        "version": "0.1",
        "colors": [
               ...
        ],
        "properties": [
            {
                "label-value": 1,
                "class": "Urban",
                "area_m2": "400"

            },
            {
                "label-value": 2,
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "class": "Agriculture",
                "area_m2": "590"

            }
        ]
    }
}

This is explicit, but has the disadvantage of duplicating the label-value definitions.
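
As a rough sketch of what the duplication means for a reader, the two lists can be indexed by label-value independently and joined on demand. The helper name `index_by_label` is hypothetical, and the data below is a shortened copy of the example (label 3 is deliberately left without a color to show that the two lists need not cover the same labels).

```python
def index_by_label(entries):
    """Map label-value -> the rest of the entry (label-value key removed)."""
    indexed = {}
    for entry in entries:
        rest = {k: v for k, v in entry.items() if k != "label-value"}
        indexed[entry["label-value"]] = rest
    return indexed


image_label = {
    "version": "0.1",
    "colors": [
        {"label-value": 1, "rgba": [255, 100, 100, 255]},
        {"label-value": 2, "rgba": [0, 40, 200, 255]},
    ],
    "properties": [
        {"label-value": 1, "class": "Urban", "area_m2": "400"},
        {"label-value": 2, "class": "Water", "area_m2": "1532"},
        {"label-value": 3, "class": "Agriculture", "area_m2": "590"},
    ],
}

colors = index_by_label(image_label["colors"])
props = index_by_label(image_label["properties"])

# A label may appear in only one of the two lists, so join on the union.
for label in sorted(set(colors) | set(props)):
    print(label, colors.get(label, {}).get("rgba"), props.get(label, {}))
```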

  3. Make color another property, e.g.
        "properties": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ],
                "class": "Urban",
                "area_m2": "400"

            },
            {
                "label-value": 2,
                "rgba": [
                    0,
                    40,
                    200,
                    255
                ],
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "rgba": [
                    148,
                    50,
                    165,
                    255
                ],
                "class": "Agriculture",
                "area_m2": "590"

            }
        ]

This doesn't duplicate label-values, and has the benefit of keeping all properties associated with a particular label-value in one spot.

On the implementation side, I think the differences in parsing the properties are negligible.

I'd love to hear what other people think are appropriate ways to represent the properties in the label metadata, or what they think the best option is.

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/3

@joshmoore
Member

Hi @DragaDoncila. Sorry for the slow response. Took some time to get caught up after the call. 😉

Having this conversation kicked off is great! And I certainly like what you're proposing with 3, but even though the v0.1 proposal hasn't really been officially released, there are a number of repositories that are already implementing it.

A few options I can imagine are:

  • We move this discussion to the v0.1 discussion and update it. Option 3 becomes effectively a "breaking" change even though not technically.
  • We get v0.1 out, your proposal becomes v0.2 and we then deal with the upgrade process (which is a great thing to work through)
  • We add option 2 as a non-breaking change (neither technically nor effectively breaking), and then, when there is a breaking change, we introduce option 3 or something like it.

I should add that I think another, similar breaking change may come when tabular data is supported, in which case we may move some of this metadata into arrays to deal with very large numbers of labels.

@manics
Member

manics commented Nov 3, 2020

Option 3 looks the cleanest, but a big disadvantage is that future additions to the spec may use property names that clash with user-defined ones, unless there is some way to indicate reserved names. In this respect Option 2 seems better despite the duplication of label-value, as all user-defined properties can go under properties without worrying about future conflicts.
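
A small Python sketch of the clash, under stated assumptions: `RESERVED_KEYS` is a hypothetical set of keys the spec might own under Option 3, and `visible` is an invented name standing in for some future spec addition. Neither comes from the spec.

```python
# Hypothetical: keys the spec itself would own under Option 3.
RESERVED_KEYS = {"label-value", "rgba"}


def split_entry(entry, reserved=RESERVED_KEYS):
    """Separate spec-owned keys from user-defined ones in a flat entry."""
    spec = {k: v for k, v in entry.items() if k in reserved}
    user = {k: v for k, v in entry.items() if k not in reserved}
    return spec, user


entry = {
    "label-value": 1,
    "rgba": [255, 100, 100, 255],
    "class": "Urban",
    "visible": "sometimes",  # user-defined today
}

spec_now, user_now = split_entry(entry)

# Suppose a later spec revision starts reserving "visible" (invented name).
spec_later, user_later = split_entry(entry, RESERVED_KEYS | {"visible"})

print(sorted(user_now))    # ['class', 'visible']
print(sorted(user_later))  # ['class']
```

The same file is interpreted differently before and after the spec change, and the user's value silently becomes spec data. With a separate properties list, the user keys never share a namespace with spec keys.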

@manics
Member

manics commented Nov 3, 2020

Option 4 could be a variant of 3 where the user properties are under a dedicated subkey (I can't think of a good name so I've called it extra-properties in the example):

        "properties": [
            {
                "label-value": 1,
                "rgba": [
                    255,
                    100,
                    100,
                    255
                ],
                "extra-properties": {
                  "class": "Urban",
                  "area_m2": "400",
                  "other": [1, 2, 3, 4]
                }
            },

@will-moore
Member

I think I prefer Option 2!
I don't see a big problem with duplicating the label-value key, and this option also makes it clearer which attributes are spec-defined (e.g. colors) and which are custom properties. There are no naming conflicts, and less nesting than in Option 4.

@DragaDoncila
Author

Hi everyone,

Sorry for the late response - I've been finishing my honours thesis over the last few days so it's been packed.

Thanks for all the input! Having read through the suggestions here, I think @manics' concern about clashes with future reserved names is the biggest disadvantage of Option 3. The extra-properties or user-properties subkey would definitely solve this issue but seems less elegant.

Despite initially thinking Option 3 was the way to go, I now agree with @will-moore that Option 2 seems preferable, as it fully separates spec properties and user-defined properties.

@joshmoore how does that mesh with your longer term view of tabular metadata?

@manics
Member

manics commented Nov 9, 2020

I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key, instead we could say it's an array of JSON values. These could be flat key-value dictionaries or arrays if the intention is to convert them to a table, but nested dictionaries could also be allowed.
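
A minimal sketch of the "convert them to a table" case, assuming the properties are flat dictionaries that may omit keys. The helper name `rows_to_columns` is hypothetical. Nested values are not flattened here; they simply become cell values.

```python
def rows_to_columns(rows, missing=None):
    """Pivot a list of dicts into a dict of equal-length columns."""
    # Collect column names in first-seen order across all rows.
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # Fill gaps with `missing` so every column has one cell per row.
    return {name: [row.get(name, missing) for row in rows] for name in names}


properties = [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water"},
    {"label-value": 3, "class": "Agriculture", "other": {"nested": [1, 2]}},
]

table = rows_to_columns(properties)
print(table["area_m2"])  # ['400', None, None]
```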

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/9

@joshmoore
Member

I don't think we necessarily need to constrain ourselves to the equivalent of tabular metadata for the properties key,

I was thinking about the reverse. I can see having the JSON keys be "deeper", but what does one do when one wants to add gigabytes of tabular data? It's not required to solve that now, but it will come up eventually.

For what it's worth, https://www.w3.org/TR/csv2json/ has some examples. Looks like the method there is a top-level object per row.

All that being said, I can definitely still see option 2 as a first non-breaking change that we iterate on.

cc: @manzt

@tischi

tischi commented Nov 10, 2020

Hello, we were also thinking about image regions where objects overlap, see discussion here: https://forum.image.sc/t/multi-scale-image-labels-v0-1/43483/7

I am not sure, but maybe this could be tackled by something like:

"properties": [
            {
                "label-value": 1,
                "associated-label-values": [3],
                "class": "Urban",
                "area_m2": "400"
            },
            {
                "label-value": 2,
                "associated-label-values": [3],
                "class": "Water",
                "area_m2": "1532"
            },
            {
                "label-value": 3,
                "child-labels": [2, 1]
            }
]

This would mean that label 3 is a region where labels 1 and 2 overlap.
It also means that the image region that semantically corresponds to label 1 is actually bigger, namely the union of the regions covered by label 1 and 3.

The "associated-label-values" is redundant with the "child-labels" and maybe should be removed.
I added it here because, in practice, it could be good to see at one glance that label 1 alone does not fully cover "object 1" but only when combined with the image region covered by label 3.
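
To illustrate the redundancy, here is a small Python sketch that derives, from "child-labels" alone, which label values together cover each object. The function name `covering_labels` is made up, and the semantics are the ones described above (a label listing child-labels is the overlap region of those children).

```python
properties = [
    {"label-value": 1, "class": "Urban", "area_m2": "400"},
    {"label-value": 2, "class": "Water", "area_m2": "1532"},
    {"label-value": 3, "child-labels": [2, 1]},
]


def covering_labels(properties):
    """For each label-value, collect every label-value whose pixels belong to it."""
    # Every label covers at least its own pixels.
    covering = {p["label-value"]: {p["label-value"]} for p in properties}
    for p in properties:
        # An overlap region also belongs to each of the objects it lists.
        for child in p.get("child-labels", []):
            covering.setdefault(child, {child}).add(p["label-value"])
    return covering


covering = covering_labels(properties)
print(sorted(covering[1]))  # [1, 3]
print(sorted(covering[2]))  # [2, 3]
```

So "object 1" is the union of the pixels labelled 1 and 3, which is the information "associated-label-values" would otherwise store explicitly.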

@constantinpape, do you maybe have comments or suggestions?

@constantinpape
Contributor

@tischi yes, I think this could be a good solution for overlapping labels.

I think this opens up a few more questions that are maybe also relevant for the overall discussion of the label properties:

  • If a field is present in one properties entry, does it need to be in all the others? E.g. do we need class in all elements of the property list?
  • If it needs to be present in all of them, then in this case class is not so trivial, because it would be a composite of Water and Urban.

@tischi

tischi commented Nov 10, 2020

If a field is present in one properties entry, does it need to be in all the others? E.g. do we need class in all elements of the property list?

I would say that if we go for the above list-based approach, we should not require a field to be present for all labels. If the storage layout were more table-based, then, I guess, yes, we would have to.

I think the above list-based approach is nice, as it provides a lot of flexibility in terms of different labels having more or less information attached to them.

The disadvantage that I see compared with a table-based approach is that it will require more storage space and could thus be quite slow to download and parse in order to, e.g., build a table from it.

Thus for use cases with millions of labels I am a bit worried about performance.
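
A rough way to get a feel for the storage overhead is to serialize the same synthetic properties both ways and compare sizes. This is only a sketch: the data is made up, the column layout is just one possible table-like JSON encoding, and it says nothing about parse time or about storing the table in binary arrays instead of JSON.

```python
import json


def make_rows(n):
    """Synthetic list-based properties for n labels (made-up values)."""
    return [
        {"label-value": i, "class": "Urban", "area_m2": str(100 + i)}
        for i in range(1, n + 1)
    ]


def to_columns(rows):
    """Column layout: each key is stored once, values in parallel arrays."""
    return {key: [row[key] for row in rows] for key in rows[0]}


rows = make_rows(100_000)
rows_json = json.dumps(rows)
columns_json = json.dumps(to_columns(rows))

print(f"list of objects: {len(rows_json):>12,} characters")
print(f"column arrays:   {len(columns_json):>12,} characters")
```

The list-of-objects form repeats every key name once per label, which is where most of the extra size comes from.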

@manics
Member

manics commented Nov 10, 2020

I think we'll want both options: JSON-style nested dictionaries for arbitrary properties and support for tabular data. In the short term, JSON dictionaries are relatively easy to add to the spec, so it makes sense to start there.

@joshmoore
Member

joshmoore commented Nov 12, 2020

Whew. Ok. So it sounds like we have some points for future discussion, but generally a consensus that we could start building, no? @DragaDoncila, have you already started on a branch anywhere? If not but were looking to start, do you think you have everything you need for a first pass?

@DragaDoncila
Author

@joshmoore I've started a branch, which already has Option 1 implemented. From what I read here, Option 2 is the consensus to start with, before we move on to adding support for tabular data. I think I have everything I need for a first pass, so I'll put up a WIP PR by Monday afternoon, if that timeline is okay.

@joshmoore
Member

Sounds amazing. Thanks, @DragaDoncila !

@imagesc-bot

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/ome-zarr-mask-label-metadata-class/103630/2
