Plans for converting the data #2

Open
constantinpape opened this issue Nov 18, 2020 · 14 comments

Comments

@constantinpape
Contributor

@tischi, I exchanged a couple of emails with @joshmoore today and, as far as I understand, the current plan is the following:
We don't ship the data to Josh and instead convert and upload it locally.

I have a converter script and I am pretty sure it does the right thing, but I have a couple of other questions:

  • Should we put the new data in a separate bucket? I can ask Josep to create one.
  • Do we keep the same folder structure as for the other mobie projects?
  • I would suggest not adding the full-res raw data, but only the 100nm version. Which data do we add apart from that?

P.S. I made a new issue because #1 got a bit crowded.

@tischi
Owner

tischi commented Nov 18, 2020

> Should we put the new data in a separate bucket? I can ask Josep to create one.

Yes, why not. Let's call it i2k-2020

> Do we keep the same folder structure as for the other mobie projects?

From my point of view we don't need any folder structure because there will be only three files (see the very first post here: #1).
But I don't know whether, in @joshmoore's current vision of the ome.zarr file format, they should be in the same zarr container because they are part of one dataset. @joshmoore would need to say.

> I would suggest not to add the full res raw data, but only the 100nm version.

Yes! Excellent suggestion!

> Which data do we add apart from that?

As said above: in terms of files see the very first post here: #1

I am not sure about the table. I don't think @joshmoore has anything ready yet to store the table in zarr format?!

And ❤️ for helping!

@constantinpape
Contributor Author

> From my point of view we don't need any folder structure because there will be only three files (see the very first post here: #1).
> But I don't know whether, in @joshmoore's current vision of the ome.zarr file format, they should be in the same zarr container because they are part of one dataset. @joshmoore would need to say.

Ok, in that case I would just add a single root zarr file with three multiscale datasets:

platy.zarr/
  em-raw/
    ...
  em-segmentation-cells/
    ...
  prospr-myosin/
    ...
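
A minimal sketch of how that skeleton could be laid out on disk, using only the standard library and no array data. It assumes the zarr v2 convention that a group is a directory with a `.zgroup` file, and the OME-Zarr v0.1 "multiscales" attribute in each image group. The number of scale levels, the `s0`, `s1`, ... level names, and the helper names `write_json` and `create_layout` are assumptions made for illustration, not part of the converter script.

```python
import json
import tempfile
from pathlib import Path

# Volume names are taken from the layout above.
VOLUMES = ["em-raw", "em-segmentation-cells", "prospr-myosin"]


def write_json(path, obj):
    """Write a small JSON metadata file."""
    path.write_text(json.dumps(obj, indent=4))


def create_layout(root, volumes=VOLUMES, n_scales=4):
    """Create the group skeleton (metadata only) for one root zarr.

    n_scales and the s0, s1, ... level names are assumptions.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # A zarr v2 group is a directory that holds a .zgroup file.
    write_json(root / ".zgroup", {"zarr_format": 2})
    for name in volumes:
        group = root / name
        group.mkdir(exist_ok=True)
        write_json(group / ".zgroup", {"zarr_format": 2})
        # The multiscale metadata goes into the .zattrs of each image group.
        multiscales = [{
            "version": "0.1",
            "name": name,
            "datasets": [{"path": f"s{i}"} for i in range(n_scales)],
        }]
        write_json(group / ".zattrs", {"multiscales": multiscales})
    return root


root = create_layout(Path(tempfile.mkdtemp()) / "platy.zarr")
print(sorted(p.name for p in root.iterdir()))
```

The actual chunk data for each scale level would still have to be written by the converter.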

> I am not sure about the table. I don't think @joshmoore has anything ready yet to store the table in zarr format?!

We could just store it as a 2d dataset with column names in the header, but I think there is indeed no NGFF format for tables yet.

Anyway, I will start with the volumetric data and let you know once I have something. I will probably start with just the myosin volume, so @joshmoore can check it once it is on the bucket; after we have made sure the format is correct, we add the larger files.

@tischi
Owner

tischi commented Nov 19, 2020

Related to this: #3

If we want to use the MoBIE infrastructure, the most straightforward approach would be to have an images.json file somewhere (like this one) pointing to three bdv.xml files (like this one) with <ImageLoader format="bdv.n5.zarr.s3">. If we do this, we may "only" have to get this done (plus some hopefully small add-ons in MoBIE) to have a working example to iterate on further.

@constantinpape
Contributor Author

> pointing to three bdv.xml files (like this one) with <ImageLoader format="bdv.n5.zarr.s3">

If we do this, there are a few questions about the file layout: we cannot simply use what I suggested here, because bdv assumes fixed paths inside the dataset (setup0/timepoint0, ...).

I see three options:

  • we go back to having one root zarr per volume and each has a single setup and timepoint
  • we use a single root zarr and single xml and store the different volumes as setups
  • we change the bdv.n5.zarr.s3 format so that we allow specifying a custom pathInFile to support a single root zarr
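
To make the difference between the three options concrete, here is a small sketch of how the path to one scale level inside a root zarr would be resolved in each case. The function `array_path`, its parameters, and the `s{level}` suffix are assumptions for illustration; only the `setup0/timepoint0` prefix and the `pathInFile` idea come from the discussion above.

```python
def array_path(setup=0, timepoint=0, level=0, path_in_file=None):
    """Return the path of one scale level inside a root zarr.

    The s{level} naming of scale levels is an assumption.
    """
    if path_in_file is not None:
        # Option 3: a custom prefix replaces the fixed setup/timepoint part.
        return f"{path_in_file.strip('/')}/s{level}"
    # Options 1 and 2: the fixed layout that bdv assumes.
    return f"setup{setup}/timepoint{timepoint}/s{level}"


# Option 1: one root zarr per volume, each with a single setup and timepoint.
print(array_path())
# Option 2: one root zarr, the volumes stored as different setups.
print(array_path(setup=2))
# Option 3: one root zarr, the volumes addressed by name.
print(array_path(path_in_file="prospr-myosin", level=1))
```

Options 1 and 2 leave the reader untouched and only differ in how many root zarrs exist; option 3 keeps the named layout but needs a change on the reader side.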

@joshmoore
Collaborator

> But I don't know whether, in @joshmoore's current vision of the ome.zarr file format, they should be in the same zarr container because they are part of one dataset. @joshmoore would need to say.

I don't think so.

> I don't think @joshmoore has anything ready yet to store the table in zarr format?!

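There is some work now on an initial format:

* [ome/omero-cli-zarr#50](https://github.com/ome/omero-cli-zarr/pull/50)
* [ome/ome-zarr-py#61](https://github.com/ome/ome-zarr-py/pull/61)
* [ome/ome-zarr-py#63](https://github.com/ome/ome-zarr-py/pull/63)

which briefly looks like this: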

/opt/data/6001240.zarr $ cat labels/0/.zattrs
{
    "image-label": {
        "properties": [
            {
                "label-value": 1,
                "class": "foo"
            },
            {
                "label-value": 2,
                "class": "bar"
            }
        ],
        "colors": [
            {
                "label-value": 1,
                "rgba": [
                    128,
                    128,
                    128,
                    128
                ]
            },

> Ok, in that case I would just add a single root zarr file with three multiscale datasets:

Also ok.

@constantinpape
Contributor Author

> > But I don't know whether, in @joshmoore's current vision of the ome.zarr file format, they should be in the same zarr container because they are part of one dataset. @joshmoore would need to say.
>
> I don't think so.

Ok, let's discuss the layout tomorrow in the meeting.

> There is some work now on an initial format:
>
> * [ome/omero-cli-zarr#50](https://github.com/ome/omero-cli-zarr/pull/50)
> * [ome/ome-zarr-py#61](https://github.com/ome/ome-zarr-py/pull/61)
> * [ome/ome-zarr-py#63](https://github.com/ome/ome-zarr-py/pull/63)

This will produce large jsons in our case :). But we can give it a try; and in the future we can hopefully switch to storing the table as a zarr array.

@tischi
Owner

tischi commented Nov 19, 2020

> But we can give it a try; and in the future we can hopefully switch to storing the table as a zarr array.

For testing, you could just write one feature value, like size.
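
A rough sketch of what that could look like: turning table rows into the "properties" list of the draft `image-label` metadata while keeping only a single feature column. The key names `image-label`, `properties`, and `label-value` follow the snippet posted earlier in this thread; the function `table_to_properties`, the column names, and the example rows are made up for illustration, and the draft format itself may still change.

```python
import json


def table_to_properties(rows, id_column="label_id", columns=("size",)):
    """Convert table rows into draft image-label metadata.

    Only the columns listed in `columns` are kept, which limits the
    size of the resulting JSON.
    """
    props = []
    for row in rows:
        entry = {"label-value": int(row[id_column])}
        for col in columns:
            entry[col] = row[col]
        props.append(entry)
    return {"image-label": {"properties": props}}


# Made-up rows; a real table would have one row per segmented object.
rows = [
    {"label_id": 1, "size": 1200, "other": 0.5},
    {"label_id": 2, "size": 860, "other": 0.1},
]
attrs = table_to_properties(rows)
print(json.dumps(attrs, indent=4))
```

Each label adds one JSON object, so the file grows linearly with the number of segmented objects, which is why restricting the columns helps for a first test.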

@tischi
Owner

tischi commented Nov 19, 2020

Personally, if I wanted to get something working within the one week left until i2k, I would do the following:

  1. Store data like this on EMBL S3:
       images.json
       a.xml
       b.xml
       c.xml
       a.zarr
       b.zarr
       c.zarr
  2. Copy all the code from https://github.com/joshmoore/n5-zarr/tree/s3zarr into a branch of MoBIE
  3. Work within the MoBIE branch until we can read the images into BDV
  4. Take it from there, e.g. factor out the s3zarr stuff into its own repo again, discuss metadata, and so on.

@joshmoore
Collaborator

> This will produce large jsons in our case :)

Yup. Definitely aware. I had tried the zarr array solution but ran into saalfeldlab/n5#73 (comment). We also discussed possible integration with Parquet etc. on the community call last night. Open to thoughts.

@constantinpape
Contributor Author

@tischi your plan sounds good. I can def. set up 1. :). Will try to do as much as possible there before the meeting tomorrow and then we can finalize the plan before i2k.

@constantinpape
Contributor Author

@joshmoore I uploaded one multiscale dataset to our new bucket.

Could you please check that you can access it?
Here are the details:

ServiceEndpoint: https://s3.embl.de
BucketName: i2k-2020
PathInBucket: platy.ome.zarr   (this is the zarr root)
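
For a quick check without an S3 client, the three values above can be combined into plain HTTPS URLs. This sketch assumes the bucket allows anonymous reads and that the endpoint accepts path-style addressing (`endpoint/bucket/key`); both are assumptions, as are the helper names `object_url` and `read_json`.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ENDPOINT = "https://s3.embl.de"
BUCKET = "i2k-2020"
ROOT = "platy.ome.zarr"


def object_url(key, endpoint=ENDPOINT, bucket=BUCKET, root=ROOT):
    """Build a path-style URL for one object below the zarr root."""
    return f"{endpoint}/{bucket}/{root}/{quote(key)}"


def read_json(key):
    """Fetch and parse one JSON metadata file (needs network access)."""
    with urlopen(object_url(key)) as response:
        return json.load(response)


# Only builds the URL; call read_json("prospr-myosin/s0/.zarray") to fetch it.
print(object_url("prospr-myosin/s0/.zarray"))
```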

If you can access it, can you check if the dataset at prospr-myosin is compatible with the zarr multiscale format?

Thanks!

@joshmoore
Collaborator

Hi @constantinpape,

The .zattrs that's in ...ome.zarr/ will need to be in the prospr-myosin/ directory

aws --no-sign-request --endpoint-url=https://s3.embl.de s3 ls --recursive s3://i2k-2020/platy.ome.zarr/ | grep /.z
2020-11-19 14:49:21        400 platy.ome.zarr/.zattrs
2020-11-19 14:49:21         24 platy.ome.zarr/.zgroup
2020-11-19 14:49:21        327 platy.ome.zarr/prospr-myosin/s0/.zarray
2020-11-19 14:49:21        327 platy.ome.zarr/prospr-myosin/s1/.zarray
2020-11-19 14:49:21        327 platy.ome.zarr/prospr-myosin/s2/.zarray
2020-11-19 14:49:21        321 platy.ome.zarr/prospr-myosin/s3/.zarray
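
The same check can be scripted. The sketch below takes the object keys from the listing above and reports every group that directly contains scale-level arrays but has no `.zattrs` of its own. The function name `groups_missing_zattrs` and the rule "the parent directory of the array directories is the image group" are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Object keys copied from the listing above.
KEYS = [
    "platy.ome.zarr/.zattrs",
    "platy.ome.zarr/.zgroup",
    "platy.ome.zarr/prospr-myosin/s0/.zarray",
    "platy.ome.zarr/prospr-myosin/s1/.zarray",
    "platy.ome.zarr/prospr-myosin/s2/.zarray",
    "platy.ome.zarr/prospr-myosin/s3/.zarray",
]


def groups_missing_zattrs(keys):
    """Return image groups that hold arrays but lack their own .zattrs."""
    keys = set(keys)
    # The directory two levels above a .zarray file is treated as the image group.
    image_groups = {
        str(PurePosixPath(key).parent.parent)
        for key in keys
        if key.endswith("/.zarray")
    }
    return sorted(g for g in image_groups if f"{g}/.zattrs" not in keys)


print(groups_missing_zattrs(KEYS))
```

Once `platy.ome.zarr/prospr-myosin/.zattrs` exists, the function returns an empty list for this listing.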

@constantinpape
Contributor Author

constantinpape commented Nov 20, 2020

> The .zattrs that's in ...ome.zarr/ will need to be in the prospr-myosin/ directory

Thanks for checking!
I fixed it in the code.

@constantinpape
Contributor Author

I added the data according to what we discussed, see #4
