From 41e1392f167b42c791c69153223661e5d3548772 Mon Sep 17 00:00:00 2001
From: Zain Sohail <zainsohail04@gmail.com>
Date: Wed, 12 Jun 2024 16:21:07 +0200
Subject: [PATCH] add tutorial 8

---
 .github/workflows/documentation.yml |   4 +-
 tutorial/8_jittering_tutorial.ipynb | 384 ++++++++++++++++++++++++++++
 2 files changed, 386 insertions(+), 2 deletions(-)
 create mode 100644 tutorial/8_jittering_tutorial.ipynb

diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
index 818cc4cd..bc4c2e2d 100644
--- a/.github/workflows/documentation.yml
+++ b/.github/workflows/documentation.yml
@@ -76,8 +76,8 @@ jobs:
       - name: Move cached executed notebooks
         run: python docs/move_changed_notebooks.py
 
-      # - name: Download datasets
-      #   run: poetry run python docs/download_datasets.py
+      - name: Download datasets
+        run: poetry run python docs/download_datasets.py
 
       # - name: Build Flash parquet files
       #   run: |
diff --git a/tutorial/8_jittering_tutorial.ipynb b/tutorial/8_jittering_tutorial.ipynb
new file mode 100644
index 00000000..ceffd302
--- /dev/null
+++ b/tutorial/8_jittering_tutorial.ipynb
@@ -0,0 +1,384 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "8ad4167a-e4e7-498d-909a-c04da9f177ed",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Correct use of Jittering\n",
+    "This tutorial discusses the background and correct use of the jittering/dithering method (test) implemented in the package."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb045e17-fa89-4c11-9d51-7f06e80d96d5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext autoreload\n",
+    "%autoreload 2\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "import sed\n",
+    "from sed.dataset import dataset\n",
+    "\n",
+    "%matplotlib widget"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "42a6afaa-17dd-4637-ba75-a28c4ead1adf",
+   "metadata": {},
+   "source": [
+    "## Load Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "34f46d54",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset.get(\"WSe2\") # Put in Path to a storage of at least 20 Gbyte free space.\n",
+    "data_path = dataset.dir # This is the path to the data\n",
+    "scandir, _ = dataset.subdirs # scandir contains the data, _ contains the calibration files"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f1f82054",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create sed processor using the config file:\n",
+    "sp = sed.SedProcessor(folder=scandir, config=\"../sed/config/mpes_example_config.yaml\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fb8c50c6",
+   "metadata": {},
+   "source": [
+    "After loading, the dataframe contains the four columns `X`, `Y`, `t`, `ADC`, which have all integer values. They originate from a time-to-digital converter, and correspond to digital \"bins\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50d0a3b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sp.dataframe.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "121cb571",
+   "metadata": {},
+   "source": [
+    "Let's bin these data along the `t` dimension within a small range:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5675401",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [150]\n",
+    "ranges = [[66000, 67000]]\n",
+    "res01 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res01.plot()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "de1d411c",
+   "metadata": {},
+   "source": [
+    "We notice some oscillation ontop of the data. These are re-binning artefacts, originating from a non-integer number of machine-bins per bin, as we can verify by binning with a different number of steps:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21d89f2c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [100]\n",
+    "ranges = [[66000, 67000]]\n",
+    "res02 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res01.plot(label=\"6.66/bin\")\n",
+    "res02.plot(label=\"10/bin\")\n",
+    "plt.legend()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "31855799",
+   "metadata": {},
+   "source": [
+    "If we have a very detailed look, with step-sizes smaller than one, we see the digital nature of the original data behind this issue:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e95053bf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [200]\n",
+    "ranges = [[66600, 66605]]\n",
+    "res11 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res11.plot()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2d82124",
+   "metadata": {},
+   "source": [
+    "To mitigate this problem, we can add some randomness to the data, and re-distribute events into the gaps in-between bins. This is also termed `dithering` and e.g. known from image manipulation. The important factor is to add the right amount and right type of random distribution, to end up at a quasi-continous uniform distribution, but not lose information.\n",
+    "\n",
+    "We can use the add_jitter function for this. We can pass it the colums to add jitter to, and the amplitude of a uniform jitter. Importantly, this step should be taken in the very beginning as first step before any dataframe operations are added.\n",
+    "\n",
+    "Let's try with a value of 0.2 for the amplitude:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f728aa81",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_backup = sp.dataframe\n",
+    "sp.add_jitter(cols=[\"t\"], amps=[0.2])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae6a2f93",
+   "metadata": {},
+   "source": [
+    "We see that the `t` column is no longer integer-valued:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e60e6c35",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sp.dataframe.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10e28a62",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [200]\n",
+    "ranges = [[66600, 66605]]\n",
+    "res12 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res11.plot(label=\"not jittered\")\n",
+    "res12.plot(label=\"amplitude 0.2\")\n",
+    "plt.legend()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dad0f31f",
+   "metadata": {},
+   "source": [
+    "This is clearly not enough jitter to close the gaps. The ideal (and default) amplitude is 0.5, which exactly fills the gaps:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "84879f75",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sp.dataframe = df_backup\n",
+    "sp.add_jitter(cols=[\"t\"], amps=[0.5])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2fcb6361",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [200]\n",
+    "ranges = [[66600, 66605]]\n",
+    "res13 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res11.plot(label=\"not jittered\")\n",
+    "res12.plot(label=\"amplitude 0.2\")\n",
+    "res13.plot(label=\"amplitude 0.5\")\n",
+    "plt.legend()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "282994b8",
+   "metadata": {},
+   "source": [
+    "This jittering fills the gaps, and produces a continous uniform distribution. Let's check again the longer-range binning that gave us the oscillations initially:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "801a1e3e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [150]\n",
+    "ranges = [[66000, 67000]]\n",
+    "res03 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res01.plot(label=\"not jittered\")\n",
+    "res03.plot(label=\"jittered\")\n",
+    "plt.legend()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc64928b",
+   "metadata": {},
+   "source": [
+    "Now, the artefacts are absent, and similarly will they be in any dataframe columns derived from a column jittered in such a way. Note that this only applies to data present in digital (i.e. machine-binned) format, and not to data that are intrinsically continous. \n",
+    "\n",
+    "Also note that too large or not well-aligned jittering amplitudes will\n",
+    "- deteriorate your resolution along the jittered axis\n",
+    "- will not solve the problem entirely:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "26cbbfce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sp.dataframe = df_backup\n",
+    "sp.add_jitter(cols=[\"t\"], amps=[0.7])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "748509e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [200]\n",
+    "ranges = [[66600, 66605]]\n",
+    "res14 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res13.plot(label=\"Amplitude 0.5\")\n",
+    "res14.plot(label=\"Amplitude 0.7\")\n",
+    "plt.legend()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "307557a9",
+   "metadata": {},
+   "source": [
+    "If the step-size of digitization is different from 1, the corresponding stepssize (half the distance between digitized values) can be adjusted as shown above.\n",
+    "\n",
+    "Also, alternatively also normally distributed noise can be added, which is less sensitive to the exact right amplitude, but will lead to mixing of neighboring voxels, and thus loss of resolution. Also, normally distributed noise is substantially more computation-intensive to generate. It can nevertheless be helpful in situations where e.g. the stepsize is non-uniform."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5603f47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sp.dataframe = df_backup\n",
+    "sp.add_jitter(cols=[\"t\"], amps=[0.7], jitter_type=\"normal\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b64fd59b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "axes = ['t']\n",
+    "bins = [200]\n",
+    "ranges = [[66600, 66605]]\n",
+    "res15 = sp.compute(bins=bins, axes=axes, ranges=ranges, df_partitions=20)\n",
+    "plt.figure()\n",
+    "res14.plot(label=\"Uniform, Amplitude 0.7\")\n",
+    "res15.plot(label=\"Normal, Amplitude 0.7\")\n",
+    "plt.legend()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "48ebf393",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "interpreter": {
+   "hash": "728003ee06929e5fa5ff815d1b96bf487266025e4b7440930c6bf4536d02d243"
+  },
+  "kernelspec": {
+   "display_name": "python3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}