Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support date cast and date_sub as AST expressions #7030

Open
viadea opened this issue Nov 9, 2022 · 4 comments
Open

[FEA] Support date cast and date_sub as AST expressions #7030

viadea opened this issue Nov 9, 2022 · 4 comments
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request

Comments

@viadea
Copy link
Collaborator

viadea commented Nov 9, 2022

I wish we can support Left Semi + BNLJ.

eg:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000),
    Row(Row("Bob ","Middle","Green"),"2","M",2000),
    Row(Row("Cathy ","","Green"),"3","F",3000)
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
val df2 = df.withColumn("dt",current_date().as("dt")).withColumn("ts",current_timestamp().as("ts"))

df2.printSchema
df2.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
val newdf2 = spark.read.parquet("/tmp/testparquet")
newdf2.createOrReplaceTempView("df2")


val query="""
select count(*) as cnt 
from df2 a left semi join 
df2 b on date(a.ts) > b.dt - 7  
"""

spark.sql(query).collect

Not-supported-messages:

        !Exec <BroadcastNestedLoopJoinExec> cannot run on GPU because not all expressions can be replaced
          @Expression <GreaterThan> (cast(ts#127 as date) > date_sub(dt#253, 7)) could run on GPU
            !Expression <Cast> cast(ts#127 as date) cannot run on GPU because AST is required and this expression does not support AST
              @Expression <AttributeReference> ts#127 could run on GPU
            !Expression <DateSub> date_sub(dt#253, 7) cannot run on GPU because AST is required and this expression does not support AST
              @Expression <AttributeReference> dt#253 could run on GPU
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Nov 9, 2022
@jlowe jlowe changed the title [FEA] Support Left Semi + BNLJ [FEA] Support Left Semi + BNLJ with conditional date cast and date_sub expressions Nov 9, 2022
@jlowe
Copy link
Contributor

jlowe commented Nov 9, 2022

This request is more about asking cast-as-date and date_sub to be supported as AST expressions than the join itself. If those were supported by AST then the join would have been on the GPU.

@sameerz sameerz added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed ? - Needs Triage Need team to review and classify labels Nov 16, 2022
@NVnavkumar
Copy link
Collaborator

Note if this issue is resolved, then the integration test used in #7037 will need to be updated as it currently relies on this running on the CPU (an unfortunate limitation of the testing scenario involved).

@kuhushukla kuhushukla changed the title [FEA] Support Left Semi + BNLJ with conditional date cast and date_sub expressions [FEA] Support date cast and date_sub as AST expressions Dec 18, 2023
@revans2
Copy link
Collaborator

revans2 commented Jul 9, 2024

I think we need to be clear about a lot of what is being asked here. Casting a timestamp to a date is complex, especially if a timezone is involved.

case (TimestampType, DateType) if options.timeZoneId.isDefined =>
val zoneId = DateTimeUtils.getZoneId(options.timeZoneId.get)
withResource(GpuTimeZoneDB.fromUtcTimestampToTimestamp(input.asInstanceOf[ColumnVector],
zoneId.normalized())) {
shifted => shifted.castTo(GpuColumnVector.getNonNestedRapidsType(toDataType))
}

But even without a timezone it is not likely to ever be an AST operation. Date differences, however, should be a subtraction operation.

So putting that in AST should be super simple to do, if we can get CUDF to ignore the logical type and just look at the physical type of the data.

That said I think the simplest thing for us to do to be able to support complex operations is to finish #9635

It added in some basic support for rewriting joins so part of it is on the AST and other parts are not. But it only did it for a BroadcastNestedLoopJoin. We really should just extend that out to all join types and do some performance testing to see how well it performs.

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Jul 10, 2024
@viadea
Copy link
Collaborator Author

viadea commented Jul 17, 2024

Note: I tested the original query in this issue and it works now on GPU.
I will keep this issue open as discussed.

In the meantime, I opened 2 related FEAs based on a customer's fallback messages:
#11213
#11214

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants