[FEA] Add support for org.apache.spark.sql.execution.SampleExec #3419
Comments
@viadea Do we need to match Spark's random number sampling 100%? cuDF has a sample API that supports sampling with and without replacement, but it has a number of issues that would make it hard to use in this case and would also keep it from matching Spark 100%. If we need to be 100% accurate, then the only way to do this is likely to use the existing Spark implementations of
@viadea @revans2 The results vary from run to run in the Spark CPU environment because the sample is truly random, so I think it is unnecessary to match 100%. What do you think? The results are the same if the seed parameter is specified, like this: df.sample(0.1, 1), where the second parameter is the seed.
The sample API takes a few different parameters: a fraction, a with-replacement flag, and a seed. If you do not provide a seed, then a random seed is used, so you end up with different results each time. This is always true for the TABLESAMPLE SQL function, because there is no way to provide a seed. If you provide a seed, are the results different each time? From what I can see, unless the shuffle ordering is different, the result should be the same. Ideally we would want to match 100% if we can guarantee that the input also matches.
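To make the seed and ordering point concrete, here is a minimal sketch in plain Python. It is not Spark's or cuDF's sampling code; `bernoulli_sample` is a hypothetical stand-in that keeps each row with probability `fraction`, and the only assumption it illustrates is that a seeded generator produces the same sequence of draws every time.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Hypothetical stand-in for sampling without replacement:
    keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # seed=None means a fresh random seed per call
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))

# Same seed, same input order: the sample is identical on every run.
first = bernoulli_sample(rows, 0.1, seed=1)
second = bernoulli_sample(rows, 0.1, seed=1)
same_seed_matches = first == second

# Same seed, different input order: the same draws now line up with
# different rows, so the set of selected rows generally changes.
reordered = bernoulli_sample(list(reversed(rows)), 0.1, seed=1)
reordered_matches = sorted(reordered) == first

# No seed: each call picks its own seed, so results differ between runs.
unseeded = bernoulli_sample(rows, 0.1)

print("same seed, same order  ->", same_seed_matches)
print("same seed, reordered   ->", reordered_matches)
```

The takeaway for this issue is that a fixed seed only pins down the sequence of random draws. Matching the CPU result also requires the rows to arrive in the same order, which is why the input ordering matters for any 100% match.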
Confirmed this using 20211029 snapshot 21.12 jars:
Is your feature request related to a problem? Please describe.
I wish the RAPIDS Accelerator for Apache Spark would support org.apache.spark.sql.execution.SampleExec.
Describe the solution you'd like
The API below should work on the GPU:
Currently it falls back to the CPU with the driver log below:
Describe alternatives you've considered
Additional context