Support for customizing missing/invalid value handling across all custom Transformer classes (similar to what's already available in ExpressionTransformer) #438

Open
philip-bingham opened this issue Dec 20, 2024 · 5 comments

Comments

@philip-bingham

philip-bingham commented Dec 20, 2024

I'm trying to take advantage of the datetime functionality presented here https://openscoring.io/blog/2020/03/08/sklearn_date_datetime_pmml/ which works great for datetime fields that are always populated.

For each sample in my data I have the datetime the sample was created, then a historic datetime for an event related to this sample that may or may not have happened. I would like to calculate a feature that is the difference between these timestamps if both are present, but null if the historic event hasn't happened.

I'm currently using this mapper config:

def duration_transformer():
    return ExpressionTransformer("(X[0] - X[1])/(60*60*24)", dtype=float)

memory = Memory()

mapper = DataFrameMapper([
    # Memorize the sample creation timestamp, then derive the hour of day from it
    (["datetime"], [DateTimeDomain(), make_memorizer_union(memory, names=["memorized_datetime"]), SecondsSinceMidnightTransformer(), Alias(make_hour_of_day_transformer(), "HourOfDay", prefit=False)], {"alias": "hour_of_day"}),
    # Recall the creation timestamp next to the (possibly missing) historic event timestamp
    (["historic_event"], [DateTimeDomain(), make_recaller_union(memory, names=["memorized_datetime"]), SecondsSinceYearTransformer(year=1900), Alias(duration_transformer(), "days_since_historic_event", prefit=False)], {"alias": "days_since_historic_event"}),
], input_df=False, df_out=True)

When I attempt to fit_transform, I get an error because the SecondsSinceYearTransformer is receiving some NaT values, and the DurationTransformer class attempts to cast whatever value it gets to int, which fails:

IntCastingNaNError: ['historic_event']: Cannot convert non-finite values (NA or inf) to integer
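
The failure can be reproduced outside the pipeline. The following is a minimal sketch in plain pandas, not the SkLearn2PMML implementation; the sample timestamps and variable names are made up for illustration. It shows that subtracting an epoch from a column containing NaT yields NaN seconds, and that casting those to int is what raises the error.

```python
import pandas

# Hypothetical data: the second sample has no historic event (NaT)
historic = pandas.Series(pandas.to_datetime(["2024-12-01 10:00:00", None]))
epoch = pandas.Timestamp("1900-01-01")

# Seconds since the epoch; the result is float64 and NaT becomes NaN
seconds = (historic - epoch).dt.total_seconds()

try:
    seconds.astype(int)
    cast_failed = False
except ValueError as error:
    # pandas.errors.IntCastingNaNError is a subclass of ValueError
    cast_failed = True
    print(type(error).__name__, error)
```

Keeping the values as float avoids the cast entirely, at the cost of no longer producing integer seconds.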

Is there a functional reason why the SecondsSinceYearTransformer doesn't have missing/invalid treatment options like other transformers? Ideally I'd be able to tell it to just pass through missing values and return a null that LGBM is capable of handling, although I assume I'd then have to update my duration_transformer() to understand what to do with null values.

@vruusmann
Member

I would like to calculate a feature that is the difference between these timestamps if both are present, but null if the historic event hasn't happened.
ExpressionTransformer("(X[0] - X[1])/(60*60*24)", dtype=float)

You can make this requirement transparent by using an in-line if-else expression:

transformer = ExpressionTransformer("(X[0] - X[1])/(60*60*24) if pandas.notnull(X[1]) else None", dtype=float)

This will eliminate all doubt about the first part of the computation.

Is there a functional reason why the SecondsSinceYearTransformer doesn't have missing/invalid treatment options like other transformers?

Explicit missing/invalid value treatment support is currently available only at the first step of a Scikit-Learn pipeline. The SkLearn2PMML package calls these special transformers "decorators", and they are located in the sklearn2pmml.decoration package (cf. the "ordinary" transformers, which are located in the sklearn2pmml.preprocessing package).

Now, in principle, it is possible to make some "ordinary" transformers also support invalid/missing value treatment, if the underlying PMML element (that gets generated) has <Expression>@mapMissingTo and/or <Expression>@defaultValue attributes.

For example, the ExpressionTransformer class generates Apply elements, which do provide such attributes:
https://github.com/jpmml/sklearn2pmml/blob/0.112.0/sklearn2pmml/preprocessing/__init__.py#L230

So, to begin answering your question - can you perhaps move the non-valid value treatment commands to the ExpressionTransformer step?

All DurationTransformer subclasses appear to be generating Apply elements as well, which means that it's possible to introduce DurationTransformer@mapMissingTo, DurationTransformer@defaultValue, etc. attributes if necessary.

But I wouldn't want to upgrade only the DurationTransformer class in isolation. This functional enhancement should be applied to all SkLearn2PMML custom transformer classes at once. That seems like quite a lot of work, so I cannot give any estimate of when it might happen.

@vruusmann vruusmann changed the title from "NaT/Missing value handling for datetime preprocessing functions" to "Support for customizing missing/invalid value handling across all custom Transformer classes (similar to what's already available in ExpressionTransformer)" Dec 21, 2024
@vruusmann
Member

Seems closely related to #436

@vruusmann
Member

can you perhaps move the non-valid value treatment commands to the ExpressionTransformer step?

The business logic of DurationTransformer subclasses could be extracted into a utility function, which would then be callable from within Python expressions:

transformer = ExpressionTransformer("sklearn2pmml.preprocessing.seconds_since_year(X[0]) if pandas.notnull(X[0]) else None")
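
To make the idea concrete, here is a rough sketch of what such a utility function might look like in plain Python. Note that seconds_since_year is hypothetical: per the comment above, the function would first have to be extracted, and its name, signature, and default year are assumptions made for this illustration.

```python
import pandas

def seconds_since_year(value, year=1900):
    # Hypothetical helper, not an existing sklearn2pmml function:
    # whole seconds elapsed between January 1st of `year` and `value`
    epoch = pandas.Timestamp(year=year, month=1, day=1)
    return int((pandas.Timestamp(value) - epoch).total_seconds())

def guarded_seconds_since_year(value):
    # Mirrors the "... if pandas.notnull(X[0]) else None" guard in the expression
    return seconds_since_year(value) if pandas.notnull(value) else None

print(guarded_seconds_since_year("1900-01-02"))
print(guarded_seconds_since_year(pandas.NaT))
```

Because the guard runs before the int cast, a missing timestamp never reaches the conversion that raises the error.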

@vruusmann
Member

All DurationTransformer subclasses appear to be generating Apply elements as well, which means that it's possible to introduce DurationTransformer@mapMissingTo, DurationTransformer@defaultValue, etc. attributes if necessary.

@philip-bingham You can take the PMML document generated by SkLearn2PMML, and post-process using your own Python helper tool, which adds those attributes as appropriate.
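
A post-processing helper along these lines could be sketched with the standard library alone. Everything below is illustrative: the function name set_apply_attribute, the sample PMML fragment, the PMML 4.4 namespace, and the mapMissingTo value of -1 are assumptions, and the markup SkLearn2PMML actually generates for a given pipeline may be structured differently.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # assumption: the document uses PMML 4.4

def set_apply_attribute(pmml_text, derived_field_name, attr_name, attr_value):
    """Set an attribute on the top-level Apply element(s) of one DerivedField."""
    ET.register_namespace("", PMML_NS)  # keep the default namespace on output
    root = ET.fromstring(pmml_text)
    count = 0
    for field in root.iter("{%s}DerivedField" % PMML_NS):
        if field.get("name") != derived_field_name:
            continue
        # Only direct children, so nested Apply elements are left untouched
        for apply_element in field.findall("{%s}Apply" % PMML_NS):
            apply_element.set(attr_name, attr_value)
            count += 1
    return ET.tostring(root, encoding="unicode"), count

# Hypothetical fragment standing in for a generated PMML document
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <TransformationDictionary>
    <DerivedField name="days_since_historic_event" optype="continuous" dataType="double">
      <Apply function="/">
        <FieldRef field="seconds_diff"/>
        <Constant dataType="integer">86400</Constant>
      </Apply>
    </DerivedField>
  </TransformationDictionary>
</PMML>"""

patched, n_patched = set_apply_attribute(
    SAMPLE, "days_since_historic_event", "mapMissingTo", "-1")
print(n_patched)
print(patched)
```

Which attribute and value are appropriate depends on how the downstream model should treat a missing duration, so the -1 here is only a placeholder.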

@philip-bingham
Author

philip-bingham commented Dec 24, 2024

Thanks for looking into this, @vruusmann. From the above comments, it doesn't seem that there's a way to achieve this without changes to the package?

This approach:
transformer = ExpressionTransformer("(X[0] - X[1])/(60*60*24) if pandas.notnull(X[1]) else None", dtype=float)

doesn't work, because X[0] and X[1] are the results of the SecondsSinceYearTransformer, which is where the error is thrown, so we never even reach this transformer.

This looks promising:

transformer = ExpressionTransformer("sklearn2pmml.preprocessing.seconds_since_year(X[0]) if pandas.notnull(X[0]) else None")

but it would require some new functions, right? Also, the expression evaluator has a predefined list of modules whose functions it can use:
def to_expr_func(expr, modules = ["math", "re", "pcre", "pcre2", "numpy", "pandas", "scipy"]):

So would sklearn2pmml need to be added to this list for this to work? I will play around with this in my local branch.

I also tried modifying the transformer in my local branch to convert to float instead of int so that nulls are allowed and propagate:

def _float(X):
	if numpy.isscalar(X):
		return float(X)
	else:
		return cast(X, float)

def transform(self, X):
	def to_float_duration(X):
		duration = self._to_duration(pandas.to_timedelta(X - self.epoch))
		return _float(duration)

	return dt_transform(X, to_float_duration)

This allows me to get a PMML file out; however, when I try to evaluate it on the same dataframe with jpmml_evaluator, I get an error about the pandas datetime dtype:
JavaError: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pandas._libs.tslibs.timestamps._unpickle_timestamp)

The fitted DateTimeDomain() is expecting this dtype
[screenshot of the dtype reported by the fitted DateTimeDomain()]

So I guess I need to go back and cast to a supported dtype before fitting, but then I think I'm going to run into issues with this part of the operation:
self._to_duration(pandas.to_timedelta(X - self.epoch))

because I think non-pandas datetime dtypes don't support subtraction with nulls.
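
As a point of comparison, NumPy's own datetime64 dtype does have a NaT value that propagates through subtraction. The sketch below uses made-up timestamps and plain NumPy only; whether DateTimeDomain and jpmml_evaluator accept this dtype is not established in this thread, so treat it as an experiment to try rather than a confirmed fix.

```python
import numpy

epoch = numpy.datetime64("1900-01-01T00:00:00", "s")
# Hypothetical data: the second sample has no historic event
historic = numpy.array(["2024-12-01T10:00:00", "NaT"], dtype="datetime64[s]")

delta = historic - epoch                      # timedelta64[s]; NaT stays NaT
seconds = delta / numpy.timedelta64(1, "s")   # float64; NaT becomes nan
days = seconds / (60 * 60 * 24)

print(delta.dtype, seconds.dtype)
print(days)
```

Dividing by a one-second timedelta64 is what converts the duration to float, which sidesteps the int cast that fails on missing values.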
