Add CLI function for SMOTETomek #463

msorvoja · 2024-11-25T07:15:39Z

SMOTETomek expects labels to be discrete.

Also edited SMOTETomek function slightly: added link to imblearn documentation in the docstring and changed parameter type of 'sampling_strategy' from Union[float, str, dict] to Union[float, Literal, dict] to make sure that beartype will roar if the given argument is a str and not one of the correct ones.

About the cli function: Should we enable user providing sampling_strategy in the cli function as any of the three dtypes? I'm not sure if it is possible to pass a dict in cli... @nmaarnio

nmaarnio · 2024-11-26T09:47:47Z

I don't think we can support passing a dictionary well over the CLI. However, I would think that the float input is needed. Could add an additional CLI parameter for that and decide that either the float or literal overrides the other when given (similar as base raster overrides manual raster settings in some other CLI functions).

@msmiyels , could you help with reviewing this?

nmaarnio

I wrote some actionable suggestions, but either I am missing something obvious or I don't fully understand how this method is applied to raster data. @msmiyels , in case you have time, could you comment about the output data size?

nmaarnio · 2024-12-03T09:57:44Z

eis_toolkit/training_data_tools/class_balancing.py

The parameter length check might need to be changed to this form, at least it was used in another file accepting DataFrames too:

x_size = X.index.size if isinstance(X, pd.DataFrame) else X.shape[0]
if x_size != y.size:
raise NonMatchingParameterLengthsException(f"X and y must have the length {x_size} != {y.size}.")

Short notice @nmaarnio: I'll have a look into this by Monday or even Friday, depends on diff. things.

Hi @nmaarnio , I tried to get into this, however, not sure if I do understand the complete context.

In my view, it is sufficient to only use X.shape[0] for both dataframes and numpy arrays. Behind the scenes, pandas 🐼 uses numpy.

If you have a numpy raster stack which originates from a list of 10 rasters, you will have the number of rasters on index[1] and the number of rows (pixels) on index[0].

The underlying question here is: Where does the pd.DataFrame comes from ❓ Maybe the following example (illustrating the complete workflow idea) helps:

Example:

# Imports import rasterio import numpy as np import pandas as pd from imblearn.combine import SMOTETomek # Load layers input_layers = [] for file in file_list: # Load layer_raster = rasterio.open(file) layer_array = layer_raster.read(1) layer_array = np.where(layer_array == layer_raster.nodata, np.nan, layer_array) # Make it ML compatible (x*y rows, one column) layer_array = layer_array.reshape(-1, 1) # Stacking layer_stack = np.column_stack(input_layers) # layer_stack.shape[0]: number of pixels # layer_stack.shape[1]: number of layers # Labels labels_raster = rasterio.open(file) labels_array = labels.read(1) labels_array = np.where(labels_array == labels_raster.nodata, np.nan, labels_array) # ML-compatibility (all rows in one column) labels_array = labels_array.reshape(-1, 1) # Append labels (could be done in one run with the loop above, but better for readability) raster_stack = np.column_stack([layer_stack, labels_array]) # Remove any rows containing NaN values raster_stack[ ~np.isnan(raster_stack).any(axis=1) ]

You can easily convert this one to a pd.DataFrame with pd.DataFrame(raster_stack). The first index (0) will be the number of rows and the second (1) the number of layers + one label column.

To further propagate this stuff into the sampling, split it again into X and y

# Split X, y = raster_stack[:, :-1], raster_stack[:, -1] # Example for sampling X_smote_tomek, y_smote_tomek = SMOTETomek(sampling_strategy=0.5).fit_resample(X, y)

To my knowledge, all of the sampling packages and helpers expect X and y to be in shape rows, columns, with rows = spatial_x * spatial_y and columns = number of evidence layers (+ attached labels, if so)

The goal before putting any data into the sampler is to have the right data (cleaned) in the right format (shape). I'm not aware of how the data got prepared before, but if you used .reshape(1, -1), the stacking methods might be changed to np.vstack and additionally transposed (otherwise rows and columns would be in the wrong order and the indizies are upside down as well).

Another, general command
I would not recommend trying to save any of the sampled data as rasters. Any sampling will modify the data and increase/decrease the amount of data used for training by creating some kind of virtual samples (or by dropping), but without a spatial relation.

Thus:

the number of rows will change, and the spatial_x and spatial_y will not be the same (other dimensions)

you cannot generally overly the sampled outputs with the original data

Instead, propagate the resulting arrays under the hood and directly feed it [any supervised method] with these - no intermediate saving, since the spatial contstraints cannot be kept. If you want to save those, save it as .csv or .feather (binary and way, way faster and smaller) and then load the file again as input.

Did that help to clarify some points?

Hi @msmiyels and thanks for the comments! I'll get back to this later more in depth, but your last point was what we were looking for most. We were not sure if there is some accepted logic to "squeeze in " the extra synthetic data produced by this tool, but now I understood that there is not. I gather that we should not offer this as a separate tool for the user, but rather parameterize the supervised model training tools that could run this under the hood (?)

eis_toolkit/training_data_tools/class_balancing.py

nmaarnio · 2024-12-03T10:11:00Z

eis_toolkit/cli.py

+    """Resample feature data using SMOTETomek.
+
+    Parameter sampling_strategy_float will override sampling_strategy_literal if given.
+    """


A small nitpick, if multiline docstring, the """ should be on a blank line

Suggested change

"""Resample feature data using SMOTETomek.

Parameter sampling_strategy_float will override sampling_strategy_literal if given.

"""

"""

Resample feature data using SMOTETomek.

Parameter sampling_strategy_float will override sampling_strategy_literal if given.

"""

eis_toolkit/training_data_tools/class_balancing.py

nmaarnio · 2024-12-03T10:25:47Z

eis_toolkit/cli.py

+    sampling_strategy_literal: Annotated[SMOTETomekSamplingStrategy, typer.Option()] = SMOTETomekSamplingStrategy.auto,
+    sampling_strategy_float: Optional[float] = None,


We need to come up with more descriptive names for these, I think. For users it will be confusing what is a "sampling strategy float" and they might not know what "literal" means in this context.

nmaarnio · 2024-12-03T10:58:12Z

eis_toolkit/cli.py

+def balance_data_cli(
+    input_rasters: INPUT_FILES_ARGUMENT,
+    input_labels: INPUT_FILE_OPTION,
+    output_raster: OUTPUT_FILE_OPTION,


I don't think we can return the stacked rasters as one. We have to return individual rasters that correspond to the input rasters, like in the unify_rasters CLI function. However, as I was digging into this, I learned that the balancing process changes the data size and spatial integrity of the data (idk if this is a concern). I think we need to think about this more and consult someone who knows about this method in this context.

fix(SMOTETomek): improve documentation and parameter typing

bb2ff19

msorvoja requested a review from nmaarnio November 25, 2024 07:15

msorvoja linked an issue Nov 25, 2024 that may be closed by this pull request

Improve class_balancing documentation #462

Open

msorvoja removed the request for review from nmaarnio November 25, 2024 08:22

feat(SMOTETomek): add cli function

e751e94

msorvoja requested a review from nmaarnio November 26, 2024 05:18

msorvoja changed the title ~~fix(SMOTETomek): improve documentation and parameter typing~~ Add CLI function for SMOTETomek Nov 26, 2024

fix(balance_data_cli): add sampling strategy class

e6dc779

msorvoja mentioned this pull request Nov 26, 2024

Add processing algorithm for balance data (SMOTETomek) GispoCoding/eis_qgis_plugin#284

Draft

Override sampling strategy literal parameter with float if given

b8c4993

nmaarnio requested changes Dec 3, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CLI function for SMOTETomek #463

Add CLI function for SMOTETomek #463

msorvoja commented Nov 25, 2024 •

edited

Loading

nmaarnio commented Nov 26, 2024

nmaarnio left a comment

nmaarnio Dec 3, 2024

msmiyels Dec 4, 2024

msmiyels Dec 10, 2024 •

edited

Loading

nmaarnio Dec 10, 2024

nmaarnio Dec 3, 2024

nmaarnio Dec 3, 2024

nmaarnio Dec 3, 2024

		sampling_strategy_literal: Annotated[SMOTETomekSamplingStrategy, typer.Option()] = SMOTETomekSamplingStrategy.auto,
		sampling_strategy_float: Optional[float] = None,

Add CLI function for SMOTETomek #463

Are you sure you want to change the base?

Add CLI function for SMOTETomek #463

Conversation

msorvoja commented Nov 25, 2024 • edited Loading

nmaarnio commented Nov 26, 2024

nmaarnio left a comment

Choose a reason for hiding this comment

nmaarnio Dec 3, 2024

Choose a reason for hiding this comment

msmiyels Dec 4, 2024

Choose a reason for hiding this comment

msmiyels Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

nmaarnio Dec 10, 2024

Choose a reason for hiding this comment

nmaarnio Dec 3, 2024

Choose a reason for hiding this comment

nmaarnio Dec 3, 2024

Choose a reason for hiding this comment

nmaarnio Dec 3, 2024

Choose a reason for hiding this comment

msorvoja commented Nov 25, 2024 •

edited

Loading

msmiyels Dec 10, 2024 •

edited

Loading