Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unix_timestamp on GPU for subset of formats #1113

Merged
merged 18 commits into from
Nov 18, 2020

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Nov 13, 2020

Signed-off-by: Andy Grove [email protected]

This PR makes improvements to unix_timestamp so that a small number of date/timestamp formats can now be supported on GPU without requiring that incompatibleOps=true because they are now compatible with Spark (with the exception of #1117).

Specific changes in this PR:

  • unix_timestamp and to_unix_timestamp no longer return bad data for invalid inputs when using one of the compatible formats because it first validates the contents of a column using is_timestamp and will either throw an exception or convert invalid inputs to null, depending on the value specified for legacyTimeParserPolicy.
  • A new configuration option spark.rapids.sql.incompatibleDateFormats.enabled has been added.
  • unix_timestamp now supports the special dates epoch, now, today, yesterday, and tomorrow.
  • unix_timestamp will fall back to CPU if legacyTimeParserPolicy is LEGACY, or if the supplied format is not supported on GPU and spark.rapids.sql.incompatibleDateFormats.enabled=false.
  • As a side effect of these changes, to_date is now supported on GPU because it gets translated to an expression that is a combination of CAST and unix_timestamp, using one of the compatible formats.

This PR closes #1098, #1094, and largely addresses #50. I filed a new issue #1111 for the remainder of the work to support legacyTimeParserPolicy.

@andygrove andygrove self-assigned this Nov 13, 2020
@andygrove
Copy link
Contributor Author

build

Signed-off-by: Andy Grove <[email protected]>
@sameerz sameerz added the feature request New feature or request label Nov 15, 2020
@sameerz sameerz added this to the Nov 9 - Nov 20 milestone Nov 15, 2020
@andygrove andygrove changed the title [WIP] Support unix_timestamp on GPU for subset of formats Support unix_timestamp on GPU for subset of formats Nov 16, 2020
Signed-off-by: Andy Grove <[email protected]>
@andygrove
Copy link
Contributor Author

Thanks for the review @revans2. I have now addressed the remaining issues.

revans2
revans2 previously approved these changes Nov 16, 2020
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great

@revans2
Copy link
Collaborator

revans2 commented Nov 16, 2020

build

@andygrove
Copy link
Contributor Author

The PySpark tests failed. I will either need to modify the config to enable incompatible formats or add more formats to the list of compatible formats.

@revans2
Copy link
Collaborator

revans2 commented Nov 17, 2020

The PySpark tests failed. I will either need to modify the config to enable incompatible formats or add more formats to the list of compatible formats.

There are only two formats that are being tested,

'yyyy/MM' and 'yyyy/MM/dd'

One of the tests is marked as xfail because we allow invalid dates to be inserted.

The tests are already marked with incompat because to_unix_timestamp is not compatible so I am fine with either fix you suggested. I personally would like to see more formats being marked as compatible. Perhaps a disallow list in the parser would be better so we can support everything except for the one variable width one we know of.

Signed-off-by: Andy Grove <[email protected]>
revans2
revans2 previously approved these changes Nov 17, 2020
Signed-off-by: Andy Grove <[email protected]>
@pytest.mark.parametrize('data_gen,date_form', str_date_and_format_gen, ids=idfn)
def test_string_to_unix_timestamp(data_gen, date_form):
print("date: " + date_form)
conf = {"spark.rapids.sql.improvedTimeOps.enabled": "true"}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test fails if I don't add this config and I don't understand why yet. I am debugging this still.

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

andygrove commented Nov 17, 2020

@revans2 This is ready for review now. I had misunderstood the correct behavior for legacyTimeParserPolicy=EXCEPTION. We should not throw exceptions for values that we fail to parse for a given format unless those inputs would have been valid in Spark versions prior to 3.0.0. This is not true of any of the formats that we currently support on GPU.

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

It looks like the behavior has changed in Spark 3.1.0 and unix_timestamp gets translated to cast(gettimestamp(c0#20108, yyyy-MM-dd, Some(UTC)) as date) which we do not currently support on GPU, which is why the tests are failing against 3.1.0 currently.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine, but if we know that there are failures in 3.1.0 then we should at least file a follow-on issue for them and either skip the tests for spark 3.1.0+ or mark them as xfail

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit fd6b613 into NVIDIA:branch-0.3 Nov 18, 2020
@andygrove andygrove deleted the unix_timestamp branch November 18, 2020 16:17
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
* Support unix_timestamp on GPU for subset of formats

Signed-off-by: Andy Grove <[email protected]>

* close scalar value

Signed-off-by: Andy Grove <[email protected]>

* compatible formats will now run on GPU without requiring incompatibleOps to be set

Signed-off-by: Andy Grove <[email protected]>

* code cleanup and address more review comments

Signed-off-by: Andy Grove <[email protected]>

* add specific config option for enabling incompatible date formats on GPU

* update documentation

Signed-off-by: Andy Grove <[email protected]>

* improve docs

Signed-off-by: Andy Grove <[email protected]>

* use constants for special dates

Signed-off-by: Andy Grove <[email protected]>

* Add support for more date formats and remove incompat from to_unix_timestamp

Signed-off-by: Andy Grove <[email protected]>

* remove debug print

Signed-off-by: Andy Grove <[email protected]>

* Revert unnecessary change

Signed-off-by: Andy Grove <[email protected]>

* Make ToUnixTimestamp consistent with UnixTimestamp

Signed-off-by: Andy Grove <[email protected]>

* refactor to remove duplicate code

Signed-off-by: Andy Grove <[email protected]>

* fix resource leaks and fix regressions in python tests

Signed-off-by: Andy Grove <[email protected]>

* scalstyle

Signed-off-by: Andy Grove <[email protected]>

* update docs

Signed-off-by: Andy Grove <[email protected]>

* fix error in handling of legacyTimeParserPolicy=EXCEPTION

Signed-off-by: Andy Grove <[email protected]>

* fix test failures against Spark 3.1.0

Signed-off-by: Andy Grove <[email protected]>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Support unix_timestamp on GPU for subset of formats

Signed-off-by: Andy Grove <[email protected]>

* close scalar value

Signed-off-by: Andy Grove <[email protected]>

* compatible formats will now run on GPU without requiring incompatibleOps to be set

Signed-off-by: Andy Grove <[email protected]>

* code cleanup and address more review comments

Signed-off-by: Andy Grove <[email protected]>

* add specific config option for enabling incompatible date formats on GPU

* update documentation

Signed-off-by: Andy Grove <[email protected]>

* improve docs

Signed-off-by: Andy Grove <[email protected]>

* use constants for special dates

Signed-off-by: Andy Grove <[email protected]>

* Add support for more date formats and remove incompat from to_unix_timestamp

Signed-off-by: Andy Grove <[email protected]>

* remove debug print

Signed-off-by: Andy Grove <[email protected]>

* Revert unnecessary change

Signed-off-by: Andy Grove <[email protected]>

* Make ToUnixTimestamp consistent with UnixTimestamp

Signed-off-by: Andy Grove <[email protected]>

* refactor to remove duplicate code

Signed-off-by: Andy Grove <[email protected]>

* fix resource leaks and fix regressions in python tests

Signed-off-by: Andy Grove <[email protected]>

* scalstyle

Signed-off-by: Andy Grove <[email protected]>

* update docs

Signed-off-by: Andy Grove <[email protected]>

* fix error in handling of legacyTimeParserPolicy=EXCEPTION

Signed-off-by: Andy Grove <[email protected]>

* fix test failures against Spark 3.1.0

Signed-off-by: Andy Grove <[email protected]>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Support unix_timestamp on GPU for subset of formats

Signed-off-by: Andy Grove <[email protected]>

* close scalar value

Signed-off-by: Andy Grove <[email protected]>

* compatible formats will now run on GPU without requiring incompatibleOps to be set

Signed-off-by: Andy Grove <[email protected]>

* code cleanup and address more review comments

Signed-off-by: Andy Grove <[email protected]>

* add specific config option for enabling incompatible date formats on GPU

* update documentation

Signed-off-by: Andy Grove <[email protected]>

* improve docs

Signed-off-by: Andy Grove <[email protected]>

* use constants for special dates

Signed-off-by: Andy Grove <[email protected]>

* Add support for more date formats and remove incompat from to_unix_timestamp

Signed-off-by: Andy Grove <[email protected]>

* remove debug print

Signed-off-by: Andy Grove <[email protected]>

* Revert unnecessary change

Signed-off-by: Andy Grove <[email protected]>

* Make ToUnixTimestamp consistent with UnixTimestamp

Signed-off-by: Andy Grove <[email protected]>

* refactor to remove duplicate code

Signed-off-by: Andy Grove <[email protected]>

* fix resource leaks and fix regressions in python tests

Signed-off-by: Andy Grove <[email protected]>

* scalstyle

Signed-off-by: Andy Grove <[email protected]>

* update docs

Signed-off-by: Andy Grove <[email protected]>

* fix error in handling of legacyTimeParserPolicy=EXCEPTION

Signed-off-by: Andy Grove <[email protected]>

* fix test failures against Spark 3.1.0

Signed-off-by: Andy Grove <[email protected]>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] unix_timestamp on GPU returns invalid data for bad input
3 participants