
Commit: PR feedback
bkmartinjr committed Sep 27, 2024
1 parent a8c16c6 commit ee453d4
Showing 3 changed files with 7 additions and 7 deletions.
4 changes: 2 additions & 2 deletions notebooks/tutorial_lightning.ipynb
@@ -12,9 +12,9 @@
"\n",
"**Prerequesites**\n",
"\n",
"Install `tiledbsoma_ml` and `scikit-learn`, for example:\n",
"Install `tiledbsoma_ml`, `pytorch-lightning` and `scikit-learn`, for example:\n",
"\n",
"> pip install tiledbsoma_ml scikit-learn\n"
"> pip install tiledbsoma_ml pytorch-lightning scikit-learn\n"
]
},
{
2 changes: 1 addition & 1 deletion notebooks/tutorial_multiworker.ipynb
@@ -7,7 +7,7 @@
"# Multi-process training\n",
"\n",
"Multi-process usage of `tiledbsoma_ml.ExperimentAxisQueryIterDataset` includes both:\n",
"* using the `torch.utils.data.DataLoader` with 1 or more worker (ie., with an argument of `n_workers=1` or greater)\n",
"* using the `torch.utils.data.DataLoader` with 1 or more workers (i.e., with an argument of `n_workers=1` or greater)\n",
"* using a multi-process training configuration, such as `DistributedDataParallel`\n",
"\n",
"In these configurations, `ExperimentAxisQueryIterDataset` will automatically partition data across workers. However, when using `shuffle=True`, there are several things to keep in mind:\n",
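As a rough sketch of the first configuration described in this hunk (a `torch.utils.data.DataLoader` with one or more workers), usage might look like the following. The experiment URI, label column, and exact constructor signature are illustrative assumptions, not taken from the notebook, and may vary across `tiledbsoma_ml` versions:

```python
import tiledbsoma
import tiledbsoma_ml
from torch.utils.data import DataLoader

# Hypothetical URI and label column -- adjust for your own experiment.
experiment = tiledbsoma.Experiment.open("path/to/experiment")
ds = tiledbsoma_ml.ExperimentAxisQueryIterDataset(
    experiment,
    obs_column_names=["cell_type"],  # assumed label column
    batch_size=128,
    shuffle=True,  # prefer the dataset's shuffle over the DataLoader's
)

# batch_size=None: the dataset already yields batches, so the DataLoader
# must not re-batch them. num_workers > 0 starts worker processes, and
# the dataset partitions its data across the workers automatically.
loader = DataLoader(ds, num_workers=2, batch_size=None)

for X, obs in loader:
    print(X.shape, list(obs.columns))  # NumPy ndarray, Pandas DataFrame
    break
```

Passing `batch_size=None` to the `DataLoader` disables its automatic batching, which would otherwise wrap the dataset's already-batched output in an extra dimension.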
8 changes: 4 additions & 4 deletions notebooks/tutorial_pytorch.ipynb
@@ -76,21 +76,21 @@
"source": [
"### `ExperimentAxisQueryIterDataPipe` class explained\n",
"\n",
"This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving Census data from a single SOMA `Experiment` and returning it to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily from the Census in batches, avoiding having to load the entire training dataset into memory at once.\n",
"This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving data from a single SOMA `Experiment` and returning to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily and in batches, avoiding the need to load the entire training dataset into memory at once.\n",
"\n",
"### `ExperimentAxisQueryIterDataPipe` parameters explained\n",
"\n",
"The constructor only requires a single parameter, `experiment`, which is a `soma.Experiment` containing the data of the organism to be used for training.\n",
"\n",
"To retrieve a subset of the Experiment's data, along either the `obs` or `var` axes, you may specify query filters via the `obs_query` and `var_query` parameters, which are both `soma.AxisQuery` objects.\n",
"\n",
"The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` array.\n",
"The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` (or `var_column_names`) array.\n",
"\n",
"The `batch_size` allows you to specify the number of obs rows (cells) to be returned by each return PyTorch tensor. You may exclude this parameter if you want single rows (`batch_size=1`).\n",
"The `batch_size` parameter allows you to specify the number of `obs` rows (i.e., cells) to be returned by each return PyTorch tensor. You may exclude this parameter if you want single rows (`batch_size=1`).\n",
"\n",
"The `shuffle` flag allows you to randomize the ordering of the training data for each training epoch. Note:\n",
"* You should use this flag instead of the `DataLoader` `shuffle` flag, primarily for performance reasons.\n",
"* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". Due to the layout of Census data, a given \"window\" of Census data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."
"* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is an alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". This problematic for atlas-style datasets such as Census, where a given \"window\" of Census data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."
]
},
{
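Putting the parameters described in this hunk together, construction and lazy iteration might look like the following minimal sketch. The URI, `value_filter` expression, and label column are illustrative assumptions, and the parameter names follow the prose above rather than a pinned API, so they may differ slightly by `tiledbsoma_ml` version:

```python
import tiledbsoma
import tiledbsoma_ml

# Hypothetical URI; open the SOMA Experiment containing the training data.
experiment = tiledbsoma.Experiment.open("path/to/experiment")

ds = tiledbsoma_ml.ExperimentAxisQueryIterDataPipe(
    experiment,
    obs_query=tiledbsoma.AxisQuery(value_filter='tissue == "lung"'),  # subset the obs axis
    obs_column_names=["cell_type"],  # prediction label(s) to return in obs
    batch_size=64,                   # obs rows (cells) per yielded tensor
    shuffle=True,                    # globally re-shuffle each epoch
)

# Batches are fetched lazily; the full training set is never materialized.
X, obs = next(iter(ds))
print(X.shape)     # (64, n_vars) NumPy ndarray of expression values
print(obs.head())  # Pandas DataFrame with the requested obs columns
```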
