
[VL] Missing Timestamp column stats of parquet files written by native writer. #8673

Open
j7nhai opened this issue Feb 6, 2025 · 4 comments
Labels
bug Something isn't working triage

Comments

@j7nhai (Contributor) commented Feb 6, 2025

Backend

VL (Velox)

Bug description

We expected column statistics to be collected for every column, but found that the statistics for the timestamp column are missing.

With spark.gluten.sql.native.writer.enabled=true set, we write a Parquet file:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}
import java.sql.Timestamp

// Two rows with a single timestamp column.
val data = Seq(
  Row(Timestamp.valueOf("2022-01-01 00:00:00")),
  Row(Timestamp.valueOf("2022-01-02 00:00:00"))
)

val schema = StructType(List(
  StructField("timestamp", TimestampType, nullable = true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.show()
// Write a single Parquet file; with the config above, the Gluten native writer is used.
df.coalesce(1).write.mode("overwrite").parquet("/data/test_parquet")

The write produces a single Parquet file:

part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet

However, the statistics for the timestamp column were not collected. Reading the file's metadata confirms this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._


val conf = new Configuration()
val path = new Path("/data/test_parquet/part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet")
val file = HadoopInputFile.fromPath(path, conf)
val reader = ParquetFileReader.open(file)

// Print the column statistics of the first row group.
val group = reader.getRowGroups.asScala.head
val columns = group.getColumns.asScala
columns.foreach(c => println(c.getStatistics))

The output is:

no stats for this column

Spark version

None

Spark configurations

spark.gluten.sql.native.writer.enabled=true
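As a temporary workaround until the native writer is fixed (an untested sketch; only the config key itself comes from this report), disabling Gluten's native writer should make Spark fall back to its built-in Parquet writer, which collects timestamp column statistics:

```shell
# Sketch of a workaround: disable the Gluten native writer so Spark's
# built-in Parquet writer (which collects column stats) is used instead.
spark-shell --conf spark.gluten.sql.native.writer.enabled=false
```

The same `--conf` flag can be passed to `spark-submit`, or set per-session via `spark.conf.set(...)`.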

System information

No response

Relevant logs

@j7nhai j7nhai added bug Something isn't working triage labels Feb 6, 2025
@Yohahaha (Contributor) commented Feb 7, 2025

> part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet

Judging by the file name, it seems this file was not written by Gluten.

@j7nhai (Contributor, Author) commented Feb 7, 2025

> part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet
>
> Judging by the file name, it seems this file was not written by Gluten.

This was written by Gluten's Velox native writer, via org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects#createOutputWriter.

@Yohahaha (Contributor) commented Feb 7, 2025

> This was written by Gluten's Velox native writer, via org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects#createOutputWriter.

Got it. Could you specify the Spark version in the issue description?

@j7nhai (Contributor, Author) commented Feb 7, 2025

> Got it. Could you specify the Spark version in the issue description?

The Spark version is 3.3.1, but I think this is a general problem caused by a bug in the native Parquet writer.
