
Commit: PR feedback
bkmartinjr committed Sep 27, 2024
1 parent a8c16c6 commit ee453d4
Showing 3 changed files with 7 additions and 7 deletions.
4 changes: 2 additions & 2 deletions notebooks/tutorial_lightning.ipynb
@@ -12,9 +12,9 @@
"\n",
"**Prerequesites**\n",
"\n",
"Install `tiledbsoma_ml` and `scikit-learn`, for example:\n",
"Install `tiledbsoma_ml`, `pytorch-lightning` and `scikit-learn`, for example:\n",
"\n",
"> pip install tiledbsoma_ml scikit-learn\n"
"> pip install tiledbsoma_ml pytorch-lightning scikit-learn\n"
]
},
{
2 changes: 1 addition & 1 deletion notebooks/tutorial_multiworker.ipynb
@@ -7,7 +7,7 @@
"# Multi-process training\n",
"\n",
"Multi-process usage of `tiledbsoma_ml.ExperimentAxisQueryIterDataset` includes both:\n",
"* using the `torch.utils.data.DataLoader` with 1 or more worker (ie., with an argument of `n_workers=1` or greater)\n",
"* using the `torch.utils.data.DataLoader` with 1 or more workers (i.e., with an argument of `n_workers=1` or greater)\n",
"* using a multi-process training configuration, such as `DistributedDataParallel`\n",
"\n",
"In these configurations, `ExperimentAxisQueryIterDataset` will automatically partition data across workers. However, when using `shuffle=True`, there are several things to keep in mind:\n",
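As a rough sketch of the first configuration described in this hunk (a `torch.utils.data.DataLoader` with one or more workers), usage might look like the following. The experiment URI, label column, and exact constructor signature are illustrative assumptions, not taken from the notebook, and may vary across `tiledbsoma_ml` versions:

```python
import tiledbsoma
import tiledbsoma_ml
from torch.utils.data import DataLoader

# Hypothetical URI and label column -- adjust for your own experiment.
experiment = tiledbsoma.Experiment.open("path/to/experiment")
ds = tiledbsoma_ml.ExperimentAxisQueryIterDataset(
    experiment,
    obs_column_names=["cell_type"],  # assumed label column
    batch_size=128,
    shuffle=True,  # prefer the dataset's shuffle over the DataLoader's
)

# batch_size=None: the dataset already yields batches, so the DataLoader
# must not re-batch them. num_workers > 0 starts worker processes, and
# the dataset partitions its data across the workers automatically.
loader = DataLoader(ds, num_workers=2, batch_size=None)

for X, obs in loader:
    print(X.shape, list(obs.columns))  # NumPy ndarray, Pandas DataFrame
    break
```

Passing `batch_size=None` to the `DataLoader` disables its automatic batching, which would otherwise wrap the dataset's already-batched output in an extra dimension.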
8 changes: 4 additions & 4 deletions notebooks/tutorial_pytorch.ipynb
@@ -76,21 +76,21 @@
"source": [
"### `ExperimentAxisQueryIterDataPipe` class explained\n",
"\n",
"This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving Census data from a single SOMA `Experiment` and returning it to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily from the Census in batches, avoiding having to load the entire training dataset into memory at once.\n",
"This class provides an implementation of PyTorch's `torchdata` [IterDataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source. The `ExperimentAxisQueryIterDataPipe` class encapsulates the details of querying and retrieving data from a single SOMA `Experiment` and returning to the caller a NumPy `ndarray` and a Pandas `DataFrame`. Most importantly, it retrieves the data lazily and in batches, avoiding the need to load the entire training dataset into memory at once.\n",
"\n",
"### `ExperimentAxisQueryIterDataPipe` parameters explained\n",
"\n",
"The constructor only requires a single parameter, `experiment`, which is a `soma.Experiment` containing the data of the organism to be used for training.\n",
"\n",
"To retrieve a subset of the Experiment's data, along either the `obs` or `var` axes, you may specify query filters via the `obs_query` and `var_query` parameters, which are both `soma.AxisQuery` objects.\n",
"\n",
"The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` array.\n",
"The values for the prediction label(s) that you intend to use for training are specified via the `obs_column_names` (or `var_column_names`) array.\n",
"\n",
"The `batch_size` allows you to specify the number of obs rows (cells) to be returned by each return PyTorch tensor. You may exclude this parameter if you want single rows (`batch_size=1`).\n",
"The `batch_size` parameter allows you to specify the number of `obs` rows (i.e., cells) to be returned by each return PyTorch tensor. You may exclude this parameter if you want single rows (`batch_size=1`).\n",
"\n",
"The `shuffle` flag allows you to randomize the ordering of the training data for each training epoch. Note:\n",
"* You should use this flag instead of the `DataLoader` `shuffle` flag, primarily for performance reasons.\n",
"* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". Due to the layout of Census data, a given \"window\" of Census data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."
"* PyTorch's TorchData library provides a [Shuffler](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html) `DataPipe`, which is an alternate mechanism one can use to perform shuffling of an `IterableDataset`. However, the `Shuffler` will not \"globally\" randomize the training data, as it only \"locally\" randomizes the ordering of the training data within fixed-size \"windows\". This problematic for atlas-style datasets such as Census, where a given \"window\" of Census data may be highly homogeneous in terms of its `obs` axis attribute values, and so this shuffling strategy may not provide sufficient randomization for certain types of models."
]
},
{
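Putting the parameters described in this hunk together, construction and lazy iteration might look like the following minimal sketch. The URI, `value_filter` expression, and label column are illustrative assumptions, and the parameter names follow the prose above rather than a pinned API, so they may differ slightly by `tiledbsoma_ml` version:

```python
import tiledbsoma
import tiledbsoma_ml

# Hypothetical URI; open the SOMA Experiment containing the training data.
experiment = tiledbsoma.Experiment.open("path/to/experiment")

ds = tiledbsoma_ml.ExperimentAxisQueryIterDataPipe(
    experiment,
    obs_query=tiledbsoma.AxisQuery(value_filter='tissue == "lung"'),  # subset the obs axis
    obs_column_names=["cell_type"],  # prediction label(s) to return in obs
    batch_size=64,                   # obs rows (cells) per yielded tensor
    shuffle=True,                    # globally re-shuffle each epoch
)

# Batches are fetched lazily; the full training set is never materialized.
X, obs = next(iter(ds))
print(X.shape)     # (64, n_vars) NumPy ndarray of expression values
print(obs.head())  # Pandas DataFrame with the requested obs columns
```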
