Enable tiling non-PANDA WSI datasets #621

dccastro · 2021-12-14T14:31:09Z

This PR implements the following major changes in the tiling/preprocessing pipeline:

Create a mask-free LoadROId transform that auto-segments the foreground, defaulting to an Otsu threshold when no threshold is specified.
Create more generic tiling scripts (create_tiles_dataset.py and azure_tiles_creation.py).
Update the working PANDA tiling scripts and back them up as create_panda_tiles_dataset.py and azure_panda_tiles_creation.py for backward compatibility.
Replace OpenSlide backend with cuCIM for loading WSI files. cuCIM only works on Linux.
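
For readers unfamiliar with the approach, the foreground auto-segmentation mentioned in the first item can be sketched roughly as below. This is a minimal NumPy illustration under stated assumptions, not the code in this PR: the function names `otsu_threshold` and `segment_foreground`, the channels-first layout, and the use of the channel mean as luminance are all hypothetical choices made for the example.

```python
from typing import Optional, Tuple

import numpy as np


def otsu_threshold(values: np.ndarray, num_bins: int = 256) -> float:
    """Return the histogram split that maximises between-class variance (Otsu's method)."""
    hist, bin_edges = np.histogram(values.ravel(), bins=num_bins)
    hist = hist.astype(float)
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2

    # Candidate split i puts bins 0..i in class 0 and the remaining bins in class 1.
    weight0 = np.cumsum(hist)[:-1]
    weight1 = hist.sum() - weight0
    sum0 = np.cumsum(hist * centers)[:-1]
    sum1 = (hist * centers).sum() - sum0

    valid = (weight0 > 0) & (weight1 > 0)  # skip splits that leave one class empty
    between = np.zeros_like(weight0)
    mean_diff = sum0[valid] / weight0[valid] - sum1[valid] / weight1[valid]
    between[valid] = weight0[valid] * weight1[valid] * mean_diff ** 2

    # The threshold is the upper edge of the last bin assigned to class 0.
    return float(bin_edges[np.argmax(between) + 1])


def segment_foreground(slide: np.ndarray,
                       threshold: Optional[float] = None) -> Tuple[np.ndarray, float]:
    """Return a boolean foreground mask and the threshold used.

    Assumes `slide` is a (channels, height, width) array of a brightfield image,
    where tissue is darker than the bright background.
    """
    luminance = slide.mean(axis=0)  # assumption: luminance is the plain channel mean
    if threshold is None:
        threshold = otsu_threshold(luminance)
    return luminance < threshold, threshold
```

Passing an explicit `threshold` skips the Otsu estimate entirely, which mirrors the "default if unspecified" behaviour described above.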

Additionally, I've refactored our dataset classes:

Create SlideKey and TileKey schemas for indexing the respective batch dictionaries instead of hardcoded strings. Note that TileKey is not yet used in TilesDataset and DeepMIL; this will be addressed in a separate follow-up PR.
Create base SlidesDataset, now inherited by the simplified PandaDataset and TcgaPradDataset.
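
To illustrate the idea of key schemas replacing hardcoded strings, here is a minimal sketch. The member names and values are assumptions chosen for the example and may not match the actual `SlideKey` definition in this PR; `make_sample` is a hypothetical helper.

```python
from enum import Enum
from typing import Any, Dict


class SlideKey(str, Enum):
    # Illustrative members only; the real schema may define different keys.
    SLIDE_ID = "slide_id"
    IMAGE_PATH = "image_path"
    LABEL = "label"


def make_sample(slide_id: str, image_path: str, label: int) -> Dict[SlideKey, Any]:
    """Build a sample dictionary indexed by schema members rather than raw strings."""
    return {
        SlideKey.SLIDE_ID: slide_id,
        SlideKey.IMAGE_PATH: image_path,
        SlideKey.LABEL: label,
    }


sample = make_sample("slide_0", "/data/slide_0.tiff", 1)
label = sample[SlideKey.LABEL]  # a typo in the key name now fails as an AttributeError
```

The benefit is mostly in tooling: a misspelled key is caught when the attribute is looked up, and IDEs can autocomplete the available keys.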

Other:

Add tests for slide loading, luminance, foreground segmentation, and bounding box computation. Most of these run with a real .tiff file from the PANDA dataset, added via git-lfs.

dccastro · 2021-12-16T10:09:55Z

environment.yml

@@ -20,6 +20,7 @@ dependencies:
      - azureml-tensorboard==1.36.0
      - conda-merge==0.1.5
      - cryptography==3.3.2
+      - cucim==21.10.1; platform_system=="Linux"


This environment marker prevents the Windows builds from failing, as cuCIM only works on Linux.
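
Code that also needs to run on Windows could guard the import in the same spirit. The following is only a sketch: `try_import_cucim` is a hypothetical helper and is not part of this PR.

```python
import importlib
import platform
from types import ModuleType
from typing import Optional


def try_import_cucim() -> Optional[ModuleType]:
    """Return the cucim module when it is usable, otherwise None."""
    if platform.system() != "Linux":
        # Mirrors the platform_system=="Linux" marker in environment.yml.
        return None
    try:
        return importlib.import_module("cucim")
    except ImportError:
        # Linux, but the package is not installed in this environment.
        return None
```

Callers can then skip or fall back when the function returns `None`, instead of failing at import time.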

.gitattributes

InnerEye/ML/Histopathology/datasets/tcga_prad_dataset.py

vale-salvatelli · 2021-12-16T10:19:30Z

InnerEye/ML/Histopathology/preprocessing/create_tiles_dataset.py

-    main(panda_dir="/tmp/datasets/PANDA",
-         root_output_dir="/datadrive",
-         level=1,
+    from InnerEye.ML.Histopathology.datasets.tcga_prad_dataset import TcgaPradDataset


If TcgaPrad is removed, this block should be removed too. Is it a problem that we don't actually have a single dataset implementation that is compatible with this script?

Following your separate suggestion, I've decided to keep TCGA-PRAD as an example, and added a clarifying comment here.

vale-salvatelli · 2021-12-16T12:34:51Z

Tests/ML/histopathology/datasets/test_tcga_prad_dataset.py

-
-    image_path = sample[dataset.IMAGE_COLUMN]
-    assert isinstance(image_path, str)
-    assert os.path.isfile(image_path)


So as not to leave things completely untested, do you think we could have a SlidesDataset test? Obviously we can't test the length or the number of positives, but we can test that the dataset contains the expected keys and that the content of the dict has the expected types. Looking at the dataset definition, if the path is an existing path and we pass a dataset.csv, we can run these tests without needing to mount any real data. What do you think?

Now added a test_slides_dataset.csv and some basic tests in test_slides_dataset.py.
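
For context, the kind of check discussed here can be sketched with the standard library alone. Everything in this example is hypothetical: `MockSlidesDataset`, the CSV column names, and the helper functions are stand-ins for illustration and do not reproduce the contents of test_slides_dataset.py.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List


class MockSlidesDataset:
    """Hypothetical stand-in for a slides dataset backed by a CSV index."""

    def __init__(self, root: Path, csv_name: str = "dataset.csv") -> None:
        self.root = root
        with open(root / csv_name, newline="") as csv_file:
            self.rows: List[Dict[str, str]] = list(csv.DictReader(csv_file))

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        row = self.rows[index]
        return {
            "slide_id": row["slide_id"],
            "image_path": str(self.root / row["image"]),
            "label": int(row["label"]),
        }


def check_sample(sample: Dict[str, Any]) -> None:
    """Check the keys and value types of one sample; no image data is read."""
    assert set(sample) == {"slide_id", "image_path", "label"}
    assert isinstance(sample["slide_id"], str)
    assert isinstance(sample["image_path"], str)
    assert isinstance(sample["label"], int)


def run_mock_dataset_checks() -> int:
    """Write a tiny CSV to a temporary directory and validate every sample."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        root = Path(tmp_dir)
        (root / "dataset.csv").write_text(
            "slide_id,image,label\nslide_0,slide_0.tiff,0\nslide_1,slide_1.tiff,1\n"
        )
        dataset = MockSlidesDataset(root)
        for index in range(len(dataset)):
            check_sample(dataset[index])
        return len(dataset)
```

Because only the CSV index is read, a test of this shape needs no mounted data, which is the point raised in the review comment.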

dccastro added 2 commits December 14, 2021 14:20

Add basic dataset and environment changes

2507b05

Add loading/preproc utils

861c1e7

dccastro changed the title ~~Enable tiling non-PANDA WSI datasets~~ [WIP] Enable tiling non-PANDA WSI datasets Dec 14, 2021

dccastro added 6 commits December 14, 2021 15:07

Back-up PANDA tiling scripts

b8e7f52

Refactor and generalise tiling scripts

cd7af1b

Remove Azure scripts

163169a

Add test WSI file

ad14227

Add preprocessing tests

7c071b9

Update changelog

d1fbfef

dccastro changed the title ~~[WIP] Enable tiling non-PANDA WSI datasets~~ Enable tiling non-PANDA WSI datasets Dec 14, 2021

dccastro marked this pull request as ready for review December 14, 2021 19:11

dccastro requested review from vale-salvatelli and mebristo December 14, 2021 19:12

dccastro and others added 3 commits December 15, 2021 13:16

Add Linux condition for cuCIM in environment.yml

9ffda4a

Merge remote-tracking branch 'origin/main' into dacoelh/tiling

6df3fc6

Merge branch 'main' into dacoelh/tiling

7f4cbd9

dccastro commented Dec 16, 2021

vale-salvatelli previously approved these changes Dec 16, 2021

dccastro added 2 commits December 16, 2021 11:31

Use PANDA instead of TCGA-PRAD in test

fa5ec34

Leave TcgaPradDataset as an example

f599e61

dccastro dismissed vale-salvatelli’s stale review via f599e61 December 16, 2021 12:13

vale-salvatelli reviewed Dec 16, 2021

dccastro added 2 commits December 16, 2021 14:52

Fix skipped InnerEye dataset tests

abe7b4a

Create and test mock slides dataset

a551a9b

vale-salvatelli previously approved these changes Dec 16, 2021

maxilse previously approved these changes Dec 16, 2021

Remove Tests/ML/datasets from pytest discovery

a56a599

dccastro dismissed stale reviews from maxilse and vale-salvatelli via a56a599 December 16, 2021 15:48

vale-salvatelli approved these changes Dec 16, 2021

mebristo approved these changes Dec 16, 2021

dccastro merged commit 6a4d334 into main Dec 16, 2021

dccastro deleted the dacoelh/tiling branch December 16, 2021 16:11
