-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CLI function for SMOTETomek #463
base: master
Are you sure you want to change the base?
Conversation
I don't think we can support passing a dictionary well over the CLI. However, I would think that the float input is needed. Could add an additional CLI parameter for that and decide that either the float or literal overrides the other when given (similar as base raster overrides manual raster settings in some other CLI functions). @msmiyels , could you help with reviewing this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote some actionable suggestions, but either I am missing something obvious or I don't fully understand how this method is applied to raster data. @msmiyels , in case you have time, could you comment about the output data size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter length check might need to be changed to this form, at least it was used in another file accepting DataFrames too:
x_size = X.index.size if isinstance(X, pd.DataFrame) else X.shape[0]
if x_size != y.size:
raise NonMatchingParameterLengthsException(f"X and y must have the length {x_size} != {y.size}.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short notice @nmaarnio: I'll have a look into this by Monday or even Friday, depends on diff. things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nmaarnio , I tried to get into this, however, not sure if I do understand the complete context.
In my view, it is sufficient to only use X.shape[0]
for both dataframes and numpy arrays. Behind the scenes, pandas 🐼 uses numpy.
If you have a numpy raster stack which originates from a list of 10 rasters, you will have the number of rasters on index[1]
and the number of rows (pixels) on index[0]
.
The underlying question here is: Where does the pd.DataFrame
comes from ❓ Maybe the following example (illustrating the complete workflow idea) helps:
Example:
# Imports
import rasterio
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek
# Load layers
input_layers = []
for file in file_list:
# Load
layer_raster = rasterio.open(file)
layer_array = layer_raster.read(1)
layer_array = np.where(layer_array == layer_raster.nodata, np.nan, layer_array)
# Make it ML compatible (x*y rows, one column)
layer_array = layer_array.reshape(-1, 1)
# Stacking
layer_stack = np.column_stack(input_layers)
# layer_stack.shape[0]: number of pixels
# layer_stack.shape[1]: number of layers
# Labels
labels_raster = rasterio.open(file)
labels_array = labels.read(1)
labels_array = np.where(labels_array == labels_raster.nodata, np.nan, labels_array)
# ML-compatibility (all rows in one column)
labels_array = labels_array.reshape(-1, 1)
# Append labels (could be done in one run with the loop above, but better for readability)
raster_stack = np.column_stack([layer_stack, labels_array])
# Remove any rows containing NaN values
raster_stack[
~np.isnan(raster_stack).any(axis=1)
]
You can easily convert this one to a pd.DataFrame
with pd.DataFrame(raster_stack)
. The first index (0) will be the number of rows and the second (1) the number of layers + one label column.
To further propagate this stuff into the sampling
, split it again into X
and y
# Split
X, y = raster_stack[:, :-1], raster_stack[:, -1]
# Example for sampling
X_smote_tomek, y_smote_tomek = SMOTETomek(sampling_strategy=0.5).fit_resample(X, y)
To my knowledge, all of the sampling packages and helpers expect X
and y
to be in shape rows, columns
, with rows = spatial_x * spatial_y
and columns = number of evidence layers
(+ attached labels, if so)
The goal before putting any data into the sampler is to have the right data (cleaned) in the right format (shape). I'm not aware of how the data got prepared before, but if you used .reshape(1, -1)
, the stacking methods might be changed to np.vstack
and additionally transposed
(otherwise rows and columns would be in the wrong order and the indizies are upside down as well).
Another, general command
I would not recommend trying to save any of the sampled data as rasters. Any sampling will modify the data and increase/decrease the amount of data used for training by creating some kind of virtual samples (or by dropping), but without a spatial relation.
Thus:
- the number of rows will change, and the
spatial_x
andspatial_y
will not be the same (other dimensions) - you cannot generally overly the sampled outputs with the original data
Instead, propagate the resulting arrays under the hood and directly feed it [any supervised method] with these - no intermediate saving, since the spatial contstraints cannot be kept. If you want to save those, save it as .csv
or .feather
(binary and way, way faster and smaller) and then load the file again as input.
Did that help to clarify some points?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @msmiyels and thanks for the comments! I'll get back to this later more in depth, but your last point was what we were looking for most. We were not sure if there is some accepted logic to "squeeze in " the extra synthetic data produced by this tool, but now I understood that there is not. I gather that we should not offer this as a separate tool for the user, but rather parameterize the supervised model training tools that could run this under the hood (?)
"""Resample feature data using SMOTETomek. | ||
|
||
Parameter sampling_strategy_float will override sampling_strategy_literal if given. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small nitpick, if multiline docstring, the """ should be on a blank line
"""Resample feature data using SMOTETomek. | |
Parameter sampling_strategy_float will override sampling_strategy_literal if given. | |
""" | |
""" | |
Resample feature data using SMOTETomek. | |
Parameter sampling_strategy_float will override sampling_strategy_literal if given. | |
""" |
sampling_strategy_literal: Annotated[SMOTETomekSamplingStrategy, typer.Option()] = SMOTETomekSamplingStrategy.auto, | ||
sampling_strategy_float: Optional[float] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to come up with more descriptive names for these, I think. For users it will be confusing what is a "sampling strategy float" and they might not know what "literal" means in this context.
def balance_data_cli( | ||
input_rasters: INPUT_FILES_ARGUMENT, | ||
input_labels: INPUT_FILE_OPTION, | ||
output_raster: OUTPUT_FILE_OPTION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can return the stacked rasters as one. We have to return individual rasters that correspond to the input rasters, like in the unify_rasters
CLI function. However, as I was digging into this, I learned that the balancing process changes the data size and spatial integrity of the data (idk if this is a concern). I think we need to think about this more and consult someone who knows about this method in this context.
SMOTETomek expects labels to be discrete.
Also edited SMOTETomek function slightly: added link to imblearn documentation in the docstring and changed parameter type of 'sampling_strategy' from
Union[float, str, dict]
toUnion[float, Literal, dict]
to make sure that beartype will roar if the given argument is a str and not one of the correct ones.About the cli function: Should we enable user providing sampling_strategy in the cli function as any of the three dtypes? I'm not sure if it is possible to pass a dict in cli... @nmaarnio