Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CLI function for SMOTETomek #463

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

msorvoja
Copy link
Collaborator

@msorvoja msorvoja commented Nov 25, 2024

SMOTETomek expects labels to be discrete.

Also edited SMOTETomek function slightly: added link to imblearn documentation in the docstring and changed parameter type of 'sampling_strategy' from Union[float, str, dict] to Union[float, Literal, dict] to make sure that beartype will roar if the given argument is a str and not one of the correct ones.

About the cli function: Should we enable user providing sampling_strategy in the cli function as any of the three dtypes? I'm not sure if it is possible to pass a dict in cli... @nmaarnio

@msorvoja msorvoja requested a review from nmaarnio November 25, 2024 07:15
@msorvoja msorvoja linked an issue Nov 25, 2024 that may be closed by this pull request
@msorvoja msorvoja removed the request for review from nmaarnio November 25, 2024 08:22
@msorvoja msorvoja requested a review from nmaarnio November 26, 2024 05:18
@msorvoja msorvoja changed the title fix(SMOTETomek): improve documentation and parameter typing Add CLI function for SMOTETomek Nov 26, 2024
@nmaarnio
Copy link
Collaborator

I don't think we can support passing a dictionary well over the CLI. However, I would think that the float input is needed. Could add an additional CLI parameter for that and decide that either the float or literal overrides the other when given (similar as base raster overrides manual raster settings in some other CLI functions).

@msmiyels , could you help with reviewing this?

Copy link
Collaborator

@nmaarnio nmaarnio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote some actionable suggestions, but either I am missing something obvious or I don't fully understand how this method is applied to raster data. @msmiyels , in case you have time, could you comment about the output data size?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter length check might need to be changed to this form, at least it was used in another file accepting DataFrames too:

x_size = X.index.size if isinstance(X, pd.DataFrame) else X.shape[0]
if x_size != y.size:
raise NonMatchingParameterLengthsException(f"X and y must have the length {x_size} != {y.size}.")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short notice @nmaarnio: I'll have a look into this by Monday or even Friday, depends on diff. things.

Copy link
Collaborator

@msmiyels msmiyels Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nmaarnio , I tried to get into this, however, not sure if I do understand the complete context.

In my view, it is sufficient to only use X.shape[0] for both dataframes and numpy arrays. Behind the scenes, pandas 🐼 uses numpy.

If you have a numpy raster stack which originates from a list of 10 rasters, you will have the number of rasters on index[1] and the number of rows (pixels) on index[0].

The underlying question here is: Where does the pd.DataFrame comes from ❓ Maybe the following example (illustrating the complete workflow idea) helps:

Example:

# Imports
import rasterio
import numpy as np
import pandas as pd

from imblearn.combine import SMOTETomek

# Load layers
input_layers = []
for file in file_list:
    # Load
    layer_raster = rasterio.open(file)
    layer_array = layer_raster.read(1)
    layer_array = np.where(layer_array == layer_raster.nodata, np.nan, layer_array)

    # Make it ML compatible (x*y rows, one column)
    layer_array = layer_array.reshape(-1, 1)

# Stacking
layer_stack = np.column_stack(input_layers)       

# layer_stack.shape[0]: number of pixels
# layer_stack.shape[1]: number of layers

# Labels
labels_raster = rasterio.open(file)
labels_array = labels.read(1)
labels_array = np.where(labels_array == labels_raster.nodata, np.nan, labels_array)

# ML-compatibility (all rows in one column)
labels_array = labels_array.reshape(-1, 1)

# Append labels (could be done in one run with the loop above, but better for readability)
raster_stack = np.column_stack([layer_stack, labels_array])

# Remove any rows containing NaN values
raster_stack[
        ~np.isnan(raster_stack).any(axis=1)
    ]

You can easily convert this one to a pd.DataFrame with pd.DataFrame(raster_stack). The first index (0) will be the number of rows and the second (1) the number of layers + one label column.

To further propagate this stuff into the sampling, split it again into X and y

# Split
X, y = raster_stack[:, :-1], raster_stack[:, -1]

# Example for sampling
X_smote_tomek, y_smote_tomek = SMOTETomek(sampling_strategy=0.5).fit_resample(X, y)

To my knowledge, all of the sampling packages and helpers expect X and y to be in shape rows, columns, with rows = spatial_x * spatial_y and columns = number of evidence layers (+ attached labels, if so)

The goal before putting any data into the sampler is to have the right data (cleaned) in the right format (shape). I'm not aware of how the data got prepared before, but if you used .reshape(1, -1), the stacking methods might be changed to np.vstack and additionally transposed (otherwise rows and columns would be in the wrong order and the indizies are upside down as well).

Another, general command
I would not recommend trying to save any of the sampled data as rasters. Any sampling will modify the data and increase/decrease the amount of data used for training by creating some kind of virtual samples (or by dropping), but without a spatial relation.

Thus:

  • the number of rows will change, and the spatial_x and spatial_y will not be the same (other dimensions)
  • you cannot generally overly the sampled outputs with the original data

Instead, propagate the resulting arrays under the hood and directly feed it [any supervised method] with these - no intermediate saving, since the spatial contstraints cannot be kept. If you want to save those, save it as .csv or .feather (binary and way, way faster and smaller) and then load the file again as input.

Did that help to clarify some points?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @msmiyels and thanks for the comments! I'll get back to this later more in depth, but your last point was what we were looking for most. We were not sure if there is some accepted logic to "squeeze in " the extra synthetic data produced by this tool, but now I understood that there is not. I gather that we should not offer this as a separate tool for the user, but rather parameterize the supervised model training tools that could run this under the hood (?)

Comment on lines +3053 to +3056
"""Resample feature data using SMOTETomek.

Parameter sampling_strategy_float will override sampling_strategy_literal if given.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small nitpick, if multiline docstring, the """ should be on a blank line

Suggested change
"""Resample feature data using SMOTETomek.
Parameter sampling_strategy_float will override sampling_strategy_literal if given.
"""
"""
Resample feature data using SMOTETomek.
Parameter sampling_strategy_float will override sampling_strategy_literal if given.
"""

Comment on lines +3049 to +3050
sampling_strategy_literal: Annotated[SMOTETomekSamplingStrategy, typer.Option()] = SMOTETomekSamplingStrategy.auto,
sampling_strategy_float: Optional[float] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to come up with more descriptive names for these, I think. For users it will be confusing what is a "sampling strategy float" and they might not know what "literal" means in this context.

def balance_data_cli(
input_rasters: INPUT_FILES_ARGUMENT,
input_labels: INPUT_FILE_OPTION,
output_raster: OUTPUT_FILE_OPTION,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can return the stacked rasters as one. We have to return individual rasters that correspond to the input rasters, like in the unify_rasters CLI function. However, as I was digging into this, I learned that the balancing process changes the data size and spatial integrity of the data (idk if this is a concern). I think we need to think about this more and consult someone who knows about this method in this context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve class_balancing documentation
3 participants