
Incorrect manifest for Nuclei_localization.py #7

Open
pgarrison opened this issue Sep 6, 2024 · 13 comments

@pgarrison
Collaborator

When I run Nuclei_localization.py I get the following error.

bioio_base.exceptions.UnsupportedFileFormatError: BioImage does not support the image: 'https://allencell.s3.amazonaws.com/aics/emt_timelapse_dataset/data/3500005827_9_collagenIV_segmentation_probability.ome.zarr'. You may need to install an extra format dependency. See our list of known plugins in the bioio README here: https://github.com/bioio-devs/bioio for a list of known plugins. You can also call the 'bioio.plugins.dump_plugins()' method to report information about currently installed plugins or the 'bioio.plugin_feasibility_report(image)' method to check if a specific image can be handled by the available plugins.
@pgarrison pgarrison self-assigned this Sep 6, 2024
@pgarrison
Collaborator Author

Strange, we already have bioio-ome-zarr as a dependency. This may be user error.

@pgarrison
Collaborator Author

Related bioio-devs/bioio-ome-zarr#28

@pgarrison pgarrison changed the title Missing dependency with Nuclei_localization.py Missing data for Nuclei_localization.py Sep 9, 2024
@pgarrison
Collaborator Author

pgarrison commented Sep 9, 2024

Okay, I went down the wrong path: it's not a dependency issue, it's that there is no image at s3://allencell/aics/emt_timelapse_dataset/data/3500005827_9_collagenIV_segmentation_probability.ome.zarr

Investigation questions

  • Is this segmentation intended to be uploaded? Yes, it's in the manifest
  • Is this the correct name for the segmentation? Yes, at least according to the manifest
  • Were there ever any files uploaded there? No, this command returns no previous versions: AWS_PROFILE=open_data_bucket aws s3api list-object-versions --bucket allencell --prefix aics/emt_timelapse_dataset/data/3500005827_9_collagenIV_segmentation_probability.ome.zarr
  • Were these files uploaded to staging? No (nor were they uploaded and deleted)
  • Are the files just misnamed? Sort of. We did upload one segmentation that wasn't expected by the manifest: 3500005827_20_collagenIV_segmentation_probability.ome.zarr. The data uploaded is correct and the manifest is wrong.
  • If the segmentation is just misnamed, which name is correct? 3500005827_20
  • Do we have the files on the Vast? Yes
  • If not, can we reproduce the segmentation?
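The "were there ever any files uploaded there" bullet above can be scripted. A minimal sketch (the helper is hypothetical, not part of the project's tooling) that takes the JSON printed by the `aws s3api list-object-versions` command quoted above and decides whether any object ever existed under the prefix, by inspecting the `Versions` and `DeleteMarkers` arrays in the response:

```python
import json

def prefix_ever_existed(list_object_versions_json: str) -> bool:
    """Given the JSON output of
    `aws s3api list-object-versions --bucket ... --prefix ...`,
    return True if any object (current, previous, or deleted) ever lived
    under the prefix. Hypothetical helper for this investigation."""
    report = json.loads(list_object_versions_json)
    # Both object versions and delete markers are evidence that something
    # was uploaded at some point.
    return bool(report.get("Versions") or report.get("DeleteMarkers"))

# An empty result (what we saw for 3500005827_9) contains neither key:
print(prefix_ever_existed("{}"))  # False
```

For 3500005827_9 this returns False on the empty response, confirming the files were never uploaded and then deleted.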

Resolution tasks

  • Correct the published data
  • Run Nuclei_localization.py to see if any other files it needs are missing.

Post-mortem questions

  • Why did the error happen? See comment below.
  • Can we validate that there are no other missing (or misnamed) files?
  • What can we do to catch these issues sooner?

@vianamp
Collaborator

vianamp commented Sep 10, 2024

@smishra3 we need your eye on this.

@smishra3

smishra3 commented Sep 10, 2024

I double checked and there is no basement membrane segmentation for this id 3500005827_9.

3500005827_9 has the FMS id 08d16b7278e24a5c8cdf8c3f723f4859. Goutham has never generated segmentations for this FMS id. I also checked the parent directory (\allen\aics\assay-dev\computational\data\EMT_deliverable_processing\Collagen_segmentation_segmentations) and it's not there either.

I saw it's in the manifest. Checking now what's the issue.

@vianamp
Collaborator

vianamp commented Sep 10, 2024

@niveditasa do you know if this movie was ever analyzed?

@smishra3

I found the bug.
3500005827_9 does not exist but 3500005827_20 exists. The current manifest provides the collagenIV segmentation path s3://allencell/aics/emt_timelapse_dataset/data/3500005827_9_collagenIV_segmentation_probability.ome.zarr, but that entry actually represents s3://allencell/aics/emt_timelapse_dataset/data/3500005827_20_collagenIV_segmentation_probability.ome.zarr.

At present, s3://allencell/aics/emt_timelapse_dataset/data/3500005827_20_collagenIV_segmentation_probability.ome.zarr exists but is not included in the manifest, while s3://allencell/aics/emt_timelapse_dataset/data/3500005827_9_collagenIV_segmentation_probability.ome.zarr is included in the manifest but does not exist.
I guess it was introduced during the final naming change to the barcode_scene format.

@smishra3

@pgarrison can you rerun your test on this updated manifest?
imaging_and_segmentation_data_v1_09102024.csv

Only the rows containing 3500005827_9 and 3500005827_20 have been changed to fix the bug.
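One way to confirm that only those two rows changed is to diff the old and new manifests row by row. A minimal sketch, assuming each manifest has one row per movie id; the file paths and the `movie_id` column name are hypothetical, not the real manifest's column layout:

```python
import csv

def changed_rows(old_path: str, new_path: str, key: str):
    """Return the key values of rows that differ between two manifest CSVs.
    Assumes both files share the same columns and one row per key value."""
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}
    old, new = load(old_path), load(new_path)
    # A key present on either side with differing (or missing) content changed.
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

If the fix is as described, `changed_rows("imaging_and_segmentation_data_v1.csv", "imaging_and_segmentation_data_v1_09102024.csv", key="movie_id")` should report exactly the 3500005827_9 and 3500005827_20 rows and nothing else.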

@pgarrison
Collaborator Author

@smishra3 Are you saying that the segmentations uploaded are correct and it's just the manifest that is off? So we are not expected to have a basement membrane segmentation for 3500005827_9, and we are supposed to have one for 3500005827_20?

@smishra3

@pgarrison Exactly. Collagen IV segmentation for 3500005827_9 was never done, and is not expected to be done, because 3500005827_9 is a 2D colony. In the paper and in current work and analysis, no collagenIV segmentation is done for a 2D colony (all membrane segmentations are for 3D colonies).

It's a naming issue, and I'm not sure how it got overlooked. When I tried the web viewer links (for the collagenIV and the combined views), those links were also dead. I was under the impression that all links had been tested.

With the naming fixed, you can see the web viewer links are now working.

@pgarrison
Collaborator Author

The following root cause diagnosis is summarized from an in-person discussion with @smishra3.

Root cause

There are 3 tightly related errors in the manifest:

  • The 3500005827_9 row has values in the collagen IV segmentation columns
  • The 3500005827_20 row does not have values in the collagen IV segmentation columns
  • The 3500005827_9 row has web volume viewer links with the data for 3500005827_20.

The two affected movies, 3500005827_9 and 3500005827_20, are adjacent rows in the manifest. We don't have precise information about how the manifest was edited, but this strongly suggests a simple transcription error: data entered into the wrong row.
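Since the file names embed the barcode_scene id, one cheap guard against this class of wrong-row typo is a per-row consistency check: every path-like cell in a row should contain that row's own id. A sketch under that assumption; the column names here are hypothetical, not the real manifest's:

```python
def inconsistent_cells(row, id_column="movie_id"):
    """Yield (column, value) pairs whose path does not embed the row's own
    barcode_scene id. Hypothetical column names, illustrative only."""
    wanted = row[id_column] + "_"  # match "3500005827_9_", not just "3500005827_9"
    for column, value in row.items():
        if column != id_column and ".ome.zarr" in str(value) and wanted not in str(value):
            yield column, value

# The bad row from this issue: the _9 row carrying the _20 segmentation path.
bad_row = {
    "movie_id": "3500005827_9",
    "collagenIV_segmentation": "s3://allencell/aics/emt_timelapse_dataset/data/"
        "3500005827_20_collagenIV_segmentation_probability.ome.zarr",
}
print([column for column, _ in inconsistent_cells(bad_row)])  # ['collagenIV_segmentation']
```

Run over the whole manifest, this would have flagged the 3500005827_9 row before publication.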

How our data validation steps failed to catch it

  1. Two people independently validated that the count of *_collagenIV_segmentation_probability.ome.zarr segmentations was consistent with the manifest (49). This issue left the number of segmentations unchanged because the manifest identified 3500005827_9 instead of 3500005827_20.
  2. All of the web volume viewer links in the manifest were validated by opening in the web browser. In the row for 3500005827_9, the manifest link to the web volume viewer uses the 3500005827_20 data, so the link was correct.
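The count check in step 1 passes as long as the number of rows matches the number of objects. A set difference is a stronger check: a swapped name shows up on both sides, so it cannot hide behind an unchanged count. A sketch, assuming hypothetical inputs (paths pulled from the manifest CSV and keys from a bucket listing):

```python
def manifest_vs_bucket(manifest_paths, bucket_keys):
    """Compare paths named in the manifest against objects actually in the
    bucket. Inputs are hypothetical iterables of path strings."""
    manifest, bucket = set(manifest_paths), set(bucket_keys)
    return {
        # Rows pointing at objects that were never uploaded (the _9 case).
        "in_manifest_but_missing_from_bucket": sorted(manifest - bucket),
        # Uploaded objects no row points at (the _20 case).
        "in_bucket_but_missing_from_manifest": sorted(bucket - manifest),
    }

report = manifest_vs_bucket(
    ["3500005827_9_collagenIV_segmentation_probability.ome.zarr"],
    ["3500005827_20_collagenIV_segmentation_probability.ome.zarr"],
)
# Both result lists are non-empty even though the counts match (1 vs 1).
```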

@pgarrison
Collaborator Author

@smishra3 @mfs4rd I successfully ran Nuclei_localization.py with the updated CSV (from Suraj's comment above) and produced 49 *_localized_nuclei.csv files. So this resolves the error from the original post. I think I'll go ahead with updating the published CSV. At the same time, is there anything else we can do to validate that it is correct?

@pgarrison pgarrison changed the title Missing data for Nuclei_localization.py Incorrect manifest for Nuclei_localization.py Sep 12, 2024
@mfs4rd
Collaborator

mfs4rd commented Sep 18, 2024

@pgarrison The localization code is identical to what was used for the published data/results, aside from the changes needed for compatibility with S3 storage (downloading the files to a local directory), and @smishra3 @antoineborensztejn and I verified that the meshes on S3 are identical to the originals. I don't think there is anything else that needs to be validated at the moment.
