
Support legacy behavior of parameterless count #1958

Merged
merged 6 commits into NVIDIA:branch-0.5 on Mar 23, 2021

Conversation

razajafri
Collaborator

This PR returns a Long column with a single row containing the value 0.

closes #1737

Signed-off-by: Raza Jafri [email protected]

@razajafri
Collaborator Author

@abellina can you take a look since you originally worked on aggregates?

sameerz added the "task" label (Work required that improves the product but is not user facing) on Mar 18, 2021
sameerz added this to the Mar 15 - March 26 milestone on Mar 18, 2021
@razajafri
Collaborator Author

build

Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Collaborator Author

I have added tests. I assumed count was already covered, but obviously the parameterless count wasn't; I should've known better.

@abellina have I answered your questions?
@revans2 PTAL

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@abellina are you OK with this PR?

@abellina
Collaborator

@razajafri I just thought of an edge case: two count() aggregations back to back. Does your code handle this?

```scala
scala> spark.sql("select count(),count() from foo").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias('count(), None), unresolvedalias('count(), None)]
+- 'UnresolvedRelation [foo], [], false

== Analyzed Logical Plan ==
count(): bigint, count(): bigint
Aggregate [count() AS count()#255L, count() AS count()#256L]
+- SubqueryAlias foo
   +- SerializeFromObject [input[0, int, false] AS value#2]
      +- ExternalRDD [obj#1]

== Optimized Logical Plan ==
Aggregate [0 AS count()#255L, 0 AS count()#256L]
+- SerializeFromObject
   +- ExternalRDD [obj#1]

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[], output=[count()#255L, count()#256L])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#983]
      +- HashAggregate(keys=[], functions=[], output=[])
         +- SerializeFromObject
            +- Scan[obj#1]

scala> spark.sql("select count(),count() from foo").collect
res32: Array[org.apache.spark.sql.Row] = Array([0,0])
```

Signed-off-by: Raza Jafri <[email protected]>
@abellina
Collaborator

build

@razajafri
Collaborator Author

@abellina PTAL

@abellina
Collaborator

@razajafri thanks for adding tests. I am still not clear on how this works.

My theory is that it works because there is a projection at the end that just sets 0s (i.e. the scalar you are generating is not used). A quick test would be to use something other than 0 for your scalar, to see if that makes a difference.

If the above is true, I don't think it changes the impl. It would just be good to fully understand how this is propagating to the result.

@razajafri
Collaborator Author

> My theory is that it works because there is a projection at the end that just sets 0s (i.e. the scalar you are generating is not used). A quick test would be to use something other than 0 for your scalar, to see if that makes a difference.

You are right, that's how it's working. I did a quick test using a scalar value of 2, and it had no effect on the result.
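The point above can be modeled with a small sketch (plain Python, not the actual plugin code; `gpu_parameterless_count` and its parameters are hypothetical names for illustration): whatever scalar the plugin generates for its single-row Long column, the optimized plan (`Aggregate [0 AS count()#255L, 0 AS count()#256L]`) projects the literal 0 for every `count()` output column, so the generated value never reaches the result.

```python
def gpu_parameterless_count(scalar_value, num_count_cols):
    """Toy model of the legacy parameterless count() path.

    The plugin side builds a single-row Long column from `scalar_value`,
    but Spark's optimized plan replaces every count() with the literal 0,
    so the final projection discards the generated column entirely.
    """
    plugin_column = [scalar_value]  # single-row "Long column" from the plugin
    _ = plugin_column               # never consulted by the final projection
    # Final projection: one row, literal 0 for each count() output column.
    return [tuple(0 for _ in range(num_count_cols))]

# The quick test described above: a scalar of 2 still yields all zeros.
print(gpu_parameterless_count(2, 2))  # [(0, 0)]
```

This also covers the back-to-back `count(),count()` edge case: both output columns come from the same literal-0 projection, independent of the plugin's scalar.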

abellina previously approved these changes Mar 23, 2021

@abellina left a comment

@razajafri thanks for the changes and for the testing. I think this makes sense and I can't think of a simpler way at this point.

Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Collaborator Author

build

razajafri merged commit 097dc97 into NVIDIA:branch-0.5 on Mar 23, 2021
razajafri deleted the parameterless_count branch on March 23, 2021 at 22:40
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Return a Long col with 0 if agg is empty

Signed-off-by: Raza Jafri <[email protected]>

* addressed review comments

Signed-off-by: Raza Jafri <[email protected]>

* improved tests

Signed-off-by: Raza Jafri <[email protected]>

* added two counts

Signed-off-by: Raza Jafri <[email protected]>

* added comment

Signed-off-by: Raza Jafri <[email protected]>

Co-authored-by: Raza Jafri <[email protected]>
Successfully merging this pull request may close these issues:
Spark 3.1 now supports the legacy behavior of parameterless count