From 907c8468943ad3a054e6bfd86e643a547c222400 Mon Sep 17 00:00:00 2001
From: Caleb Robinson <calebrob6@gmail.com>
Date: Wed, 21 Feb 2024 23:43:23 +0000
Subject: [PATCH] Added custom segmentation trainer tutorial

---
 .../custom_segmentation_trainer.ipynb         | 433 ++++++++++++++++++
 1 file changed, 433 insertions(+)
 create mode 100644 docs/tutorials/custom_segmentation_trainer.ipynb

diff --git a/docs/tutorials/custom_segmentation_trainer.ipynb b/docs/tutorials/custom_segmentation_trainer.ipynb
new file mode 100644
index 00000000000..2ba123c6caa
--- /dev/null
+++ b/docs/tutorials/custom_segmentation_trainer.ipynb
@@ -0,0 +1,433 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copyright (c) Microsoft Corporation. All rights reserved.\n",
+    "\n",
+    "Licensed under the MIT License."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Custom Trainers\n",
+    "\n",
+    "In this tutorial, we demonstrate how to extend a TorchGeo [\"trainer class\"](https://torchgeo.readthedocs.io/en/latest/api/trainers.html). TorchGeo provides several trainer classes: pre-made PyTorch Lightning modules that make it easy to train models for tasks such as semantic segmentation, classification, and change detection using TorchGeo's [prebuilt DataModules](https://torchgeo.readthedocs.io/en/latest/api/datamodules.html). While the trainers aim to provide sensible defaults and customization options for common tasks, they cannot cover every situation (e.g. researchers will likely want to implement and use their own architectures, loss functions, optimizers, etc. in the training routine). If you run into such a situation, you can simply extend the trainer class you are interested in and write custom logic to override the default functionality.\n",
+    "\n",
+    "This tutorial shows how to do exactly this to customize a learning rate schedule, logging, and model checkpointing for a semantic segmentation task using the [LandCoverAI](https://landcover.ai.linuxpolska.com/) dataset.\n",
+    "\n",
+    "It's recommended to run this notebook on Google Colab if you don't have your own GPU. Click the \"Open in Colab\" button above to get started."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "As always, we install TorchGeo."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install torchgeo"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Imports\n",
+    "\n",
+    "Next, we import TorchGeo and any other libraries we need."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torchgeo.trainers import SemanticSegmentationTask\n",
+    "from torchgeo.datamodules import LandCoverAIDataModule\n",
+    "from torchmetrics import MetricCollection\n",
+    "from torchmetrics.classification import Accuracy, FBetaScore, Precision, Recall, JaccardIndex\n",
+    "\n",
+    "import lightning.pytorch as pl\n",
+    "from lightning.pytorch.callbacks import ModelCheckpoint\n",
+    "import torch\n",
+    "\n",
+    "from torch.optim.lr_scheduler import CosineAnnealingLR\n",
+    "from torch.optim import AdamW\n",
+    "\n",
+    "# Suppress the pesky warning raised by kornia:\n",
+    "# UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.\n",
+    "import warnings\n",
+    "warnings.filterwarnings(\"ignore\", category=UserWarning, module=\"torch.nn.functional\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Custom SemanticSegmentationTask\n",
+    "\n",
+    "Now, we create a `CustomSemanticSegmentationTask` class that inherits from `SemanticSegmentationTask` and overrides a few methods:\n",
+    "- `__init__`: We add two new parameters `tmax` and `eta_min` to control the learning rate scheduler\n",
+    "- `configure_optimizers`: We use the `CosineAnnealingLR` learning rate scheduler instead of the default `ReduceLROnPlateau`\n",
+    "- `configure_metrics`: We add a \"MeanIoU\" metric (what we will use to evaluate the model's performance) and a variety of other classification metrics\n",
+    "- `configure_callbacks`: We demonstrate how to stack `ModelCheckpoint` callbacks to save the best checkpoint as well as periodic checkpoints\n",
+    "- `on_train_epoch_start`: We log the learning rate at the start of each epoch so we can easily see how it decays over a training run\n",
+    "\n",
+    "Overall, these demonstrate how to customize the training routine to investigate specific research questions (e.g. the effect of the scheduler on test performance)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class CustomSemanticSegmentationTask(SemanticSegmentationTask):\n",
+    "    def __init__(self, *args, tmax=50, eta_min=1e-6, **kwargs) -> None:\n",
+    "        super().__init__(*args, **kwargs)  # forward all other arguments to the parent class\n",
+    "\n",
+    "    def configure_optimizers(self) -> \"lightning.pytorch.utilities.types.OptimizerLRSchedulerConfig\":\n",
+    "        \"\"\"Initialize the optimizer and learning rate scheduler.\n",
+    "\n",
+    "        Returns:\n",
+    "            Optimizer and learning rate scheduler.\n",
+    "        \"\"\"\n",
+    "        tmax: int = self.hparams.get(\"tmax\", 50)\n",
+    "        eta_min: float = self.hparams.get(\"eta_min\", 1e-6)\n",
+    "        optimizer = AdamW(self.parameters(), lr=self.hparams[\"lr\"])\n",
+    "        scheduler = CosineAnnealingLR(optimizer, T_max=tmax, eta_min=eta_min)\n",
+    "        return {\n",
+    "            \"optimizer\": optimizer,\n",
+    "            \"lr_scheduler\": {\"scheduler\": scheduler, \"monitor\": self.monitor},\n",
+    "        }\n",
+    "\n",
+    "    def configure_metrics(self) -> None:\n",
+    "        \"\"\"Initialize the performance metrics.\"\"\"\n",
+    "        num_classes: int = self.hparams[\"num_classes\"]\n",
+    "\n",
+    "        self.train_metrics = MetricCollection(\n",
+    "            {\n",
+    "                \"OverallAccuracy\": Accuracy(\n",
+    "                    task=\"multiclass\",\n",
+    "                    num_classes=num_classes,\n",
+    "                    average=\"micro\",\n",
+    "                ),\n",
+    "                \"OverallPrecision\": Precision(\n",
+    "                    task=\"multiclass\",\n",
+    "                    num_classes=num_classes,\n",
+    "                    average=\"micro\",\n",
+    "                ),\n",
+    "                \"OverallRecall\": Recall(\n",
+    "                    task=\"multiclass\",\n",
+    "                    num_classes=num_classes,\n",
+    "                    average=\"micro\",\n",
+    "                ),\n",
+    "                \"OverallF1Score\": FBetaScore(\n",
+    "                    task=\"multiclass\",\n",
+    "                    num_classes=num_classes,\n",
+    "                    beta=1.0,\n",
+    "                    average=\"micro\",\n",
+    "                ),\n",
+    "                \"MeanIoU\": JaccardIndex(\n",
+    "                    num_classes=num_classes,\n",
+    "                    task=\"multiclass\",\n",
+    "                    average=\"macro\",\n",
+    "                )\n",
+    "            },\n",
+    "            prefix=\"train_\",\n",
+    "        )\n",
+    "        self.val_metrics = self.train_metrics.clone(prefix=\"val_\")\n",
+    "        self.test_metrics = self.train_metrics.clone(prefix=\"test_\")\n",
+    "\n",
+    "    def configure_callbacks(self):\n",
+    "        \"\"\"Initialize callbacks for saving the best and latest models.\n",
+    "\n",
+    "        Returns:\n",
+    "            List of callbacks to apply.\n",
+    "        \"\"\"\n",
+    "        return [\n",
+    "            ModelCheckpoint(every_n_epochs=50, save_top_k=-1),\n",
+    "            ModelCheckpoint(monitor=self.monitor, mode=self.mode, save_top_k=5),\n",
+    "        ]\n",
+    "\n",
+    "    def on_train_epoch_start(self) -> None:\n",
+    "        \"\"\"Log the learning rate at the start of each training epoch.\"\"\"\n",
+    "        lr = self.optimizers().param_groups[0]['lr']\n",
+    "        self.logger.experiment.add_scalar(\"lr\", lr, self.current_epoch)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Train model\n",
+    "\n",
+    "The remainder of the tutorial is straightforward and follows the typical [PyTorch Lightning](https://lightning.ai/) training routine. We instantiate a `DataModule` for the LandCover.AI dataset, instantiate a `CustomSemanticSegmentationTask` with a U-Net and a ResNet-50 backbone, then train the model using a Lightning trainer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dm = LandCoverAIDataModule(\n",
+    "    root=\"/home/calebrobinson/ssdshared/torchgeo-datasets/LandCoverAI\",\n",
+    "    batch_size=64,\n",
+    "    num_workers=8\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "task = CustomSemanticSegmentationTask(\n",
+    "    model=\"unet\",\n",
+    "    backbone=\"resnet50\",\n",
+    "    weights=True,\n",
+    "    in_channels=3,\n",
+    "    num_classes=6,\n",
+    "    loss=\"ce\",\n",
+    "    lr=1e-3,\n",
+    "    tmax=50,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "GPU available: True (cuda), used: True\n",
+      "TPU available: False, using: 0 TPU cores\n",
+      "IPU available: False, using: 0 IPUs\n",
+      "HPU available: False, using: 0 HPUs\n"
+     ]
+    }
+   ],
+   "source": [
+    "accelerator = \"gpu\" if torch.cuda.is_available() else \"cpu\"\n",
+    "\n",
+    "trainer = pl.Trainer(\n",
+    "    accelerator=accelerator,\n",
+    "    min_epochs=150,\n",
+    "    max_epochs=300,\n",
+    "    log_every_n_steps=50,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "The following callbacks returned in `LightningModule.configure_callbacks` will override existing callbacks passed to Trainer: ModelCheckpoint\n",
+      "You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
+      "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]\n",
+      "\n",
+      "  | Name          | Type             | Params\n",
+      "---------------------------------------------------\n",
+      "0 | criterion     | CrossEntropyLoss | 0     \n",
+      "1 | train_metrics | MetricCollection | 0     \n",
+      "2 | val_metrics   | MetricCollection | 0     \n",
+      "3 | test_metrics  | MetricCollection | 0     \n",
+      "4 | model         | Unet             | 32.5 M\n",
+      "---------------------------------------------------\n",
+      "32.5 M    Trainable params\n",
+      "0         Non-trainable params\n",
+      "32.5 M    Total params\n",
+      "130.087   Total estimated model params size (MB)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "c2e2b60406f3494fb9f07b209df8efd7",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Sanity Checking: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "8d87c64a736345658f82e9a7df2c2e68",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Training: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "8129de8d64014ad8ad346dfe01fb6c7b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validation: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "45378a9aa005434692fd51bbd27be29e",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validation: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "e66b6a0aa24549e2a9d0fdc4205e9f52",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validation: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "86493125104a46f1848abe9bd7c90c81",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validation: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "06afc2963f6c43d293b957a58c444423",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validation: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "1357a1bef38444d6b7086e9455c9af9b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Validation: |          | 0/? [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "trainer.fit(task, dm)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Test model\n",
+    "\n",
+    "Finally, we evaluate the model on the test set and report the resulting metrics."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# If you are starting from a checkpoint, run this cell\n",
+    "task = CustomSemanticSegmentationTask.load_from_checkpoint(\"path/to/checkpoint.ckpt\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trainer.test(task, dm)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "geo",
+   "language": "python",
+   "name": "geo"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}