Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using decorators to perform schema validations on DataFrames #141

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

edmondop
Copy link

@edmondop edmondop commented Oct 13, 2023

Addresses #140

@jeffbrennan
Copy link
Collaborator

Hey @edmondop, I'll take a look at this!

@jeffbrennan
Copy link
Collaborator

@edmondop, @MrPowers
These commits look good - the decorators are a really nice and visible way to add validations to existing functions. One thing I'm thinking about is the overlap in naming, like how validate_returned_schema() references validate_schema() and ensure_columns_present() references validate_presence_of_columns()

To an end user, I'm not sure if this difference would be obvious. Would it be possible to make the existing functions work as optional decorators? Let me know what y'all think

@edmondop
Copy link
Author

@edmondop, @MrPowers These commits look good - the decorators are a really nice and visible way to add validations to existing functions. One thing I'm thinking about is the overlap in naming, like how validate_returned_schema() references validate_schema() and ensure_columns_present() references validate_presence_of_columns()

To an end user, I'm not sure if this difference would be obvious. Would it be possible to make the existing functions work as optional decorators? Let me know what y'all think

Hi @jeffbrennan I tried a couple of strategy but that doesn't work well. The problem seems to be that Python has a limitation around overloading, that is not fixed in the standard library.

If you define let's say validate_schema as both a function that takes three parameters, and as a decorator that takes two parameters and return a wrapper of a function, you need a language that support overloading correctly, and Python doesn't. I would need to introduce an additional dependency to move forward if needed (multi-dispatch)

@MrPowers
Copy link
Collaborator

So this pull request adds a few functions to the public interface: validate_returned_schema, ensure_columns_present, ensure_columns_absent. These functions are exposed as decorators and can only be used for functions that return DataFrames.

I am actually curious what would happen with the validate_returned_schema if the decorator is added to a function that returns a StructType.

quinn already has the following functions: validate_presence_of_columns, validate_schema, and validate_absence_of_columns.

From what I understand, we can't use the existing names because Python doesn't support overloading.

It seems like validate_presence_of_columns is more flexible because it can be invoked multiple times in a function. ensure_columns_present can only be invoked once.

From a naming perspective, I definitely don't think we should have "presence_of_columns" and "columns_present".

I am on the fence about this one. I really like the decorator syntax. If we didn't already have something then this would be a no-brainer. I am concerned that introducing another way to accomplish the same task might confuse users. I am interested in thoughts from the community.

@edmondop
Copy link
Author

So this pull request adds a few functions to the public interface: validate_returned_schema, ensure_columns_present, ensure_columns_absent. These functions are exposed as decorators and can only be used for functions that return DataFrames.

I am actually curious what would happen with the validate_returned_schema if the decorator is added to a function that returns a StructType.

quinn already has the following functions: validate_presence_of_columns, validate_schema, and validate_absence_of_columns.

From what I understand, we can't use the existing names because Python doesn't support overloading.

It seems like validate_presence_of_columns is more flexible because it can be invoked multiple times in a function. ensure_columns_present can only be invoked once.

From a naming perspective, I definitely don't think we should have "presence_of_columns" and "columns_present".

I am on the fence about this one. I really like the decorator syntax. If we didn't already have something then this would be a no-brainer. I am concerned that introducing another way to accomplish the same task might confuse users. I am interested in thoughts from the community.

There are workarounds, but they all come with trade-offs I believe. In particular, I think if we use https://pypi.org/project/multimethod/, we can achieve the desired behavior of having a single consistent API . I just didn't feel like adding a new dependency without asking was appropriate. I tried with singledispatch and ended up throwing away all the code and be very frustrated

Other ideas:

  • separate extension package( quinn.decorators?)
  • trying to work with existing Python feature but introducing a breaking change. For example, I think if we use only keyword named arguments, it's possible to write a signature that takes a single kwargs dict, and try to make it behave dynamically depending on how many entries are in the dictionary (2 for the decorator in validate_schema)

I think the use case here that I have in mind is that someone could desire that in a complex Spark pipeline, being all lazy, there are explicit points where the schema is checked. Almost as an instrumentation system for a spark job

@MrPowers
Copy link
Collaborator

@edmondop - I am really happy that you opened this pull request and put this on my radar. Decorators could be really nice for Python programmers. I created a separate issue where we can brainstorm this more generically. I'm interested in brainstorming the ideal decorator interface for PySpark programmers, seeing how it helps for common pipelines, and exploring any potential downsides of the approach. I think we should detach this "ideal end state" brainstorming exercise from this "dealing with legacy code and all the semver/deprecation/public interface messiness considerations. Let's brainstorm the ideal end state in the other issue!

@kunaljubce kunaljubce mentioned this pull request Mar 29, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants