
[VL] Missing Timestamp column stats of parquet files written by native writer. #8673

Open
j7nhai opened this issue Feb 6, 2025 · 4 comments
Labels
bug Something isn't working triage

Comments

@j7nhai (Contributor) commented Feb 6, 2025

Backend

VL (Velox)

Bug description

We expected column statistics to be collected for every column, but found that the statistics for the timestamp column are missing.

With spark.gluten.sql.native.writer.enabled=true set, we write a Parquet file:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}
import java.sql.Timestamp

// Two rows with a single timestamp column.
val data = Seq(
  Row(Timestamp.valueOf("2022-01-01 00:00:00")),
  Row(Timestamp.valueOf("2022-01-02 00:00:00"))
)

val schema = StructType(List(
  StructField("timestamp", TimestampType, nullable = true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.show()
// Write a single Parquet file; with the config above, the Gluten native writer is used.
df.coalesce(1).write.mode("overwrite").parquet("/data/test_parquet")

The write produces a single Parquet file:

part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet

However, the statistics for the timestamp column were not collected. Reading the file's metadata confirms this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._


val conf = new Configuration()
val path = new Path("/data/test_parquet/part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet")
val file = HadoopInputFile.fromPath(path, conf)
val reader = ParquetFileReader.open(file)

// Print the column statistics of the first row group.
val group = reader.getRowGroups.asScala.head
val columns = group.getColumns.asScala
columns.foreach(c => println(c.getStatistics))

The output is:

no stats for this column

Spark version

None

Spark configurations

spark.gluten.sql.native.writer.enabled=true
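As a temporary workaround until the native writer is fixed (an untested sketch; only the config key itself comes from this report), disabling Gluten's native writer should make Spark fall back to its built-in Parquet writer, which collects timestamp column statistics:

```shell
# Sketch of a workaround: disable the Gluten native writer so Spark's
# built-in Parquet writer (which collects column stats) is used instead.
spark-shell --conf spark.gluten.sql.native.writer.enabled=false
```

The same `--conf` flag can be passed to `spark-submit`, or set per-session via `spark.conf.set(...)`.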

System information

No response

Relevant logs

@j7nhai j7nhai added bug Something isn't working triage labels Feb 6, 2025
@Yohahaha (Contributor) commented Feb 7, 2025

> part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet

Judging by the file name, it seems this file was not written by Gluten.

@j7nhai (Contributor, Author) commented Feb 7, 2025

> part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet
>
> Judging by the file name, it seems this file was not written by Gluten.

This was written by Gluten's Velox native writer, via org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects#createOutputWriter.

@Yohahaha (Contributor) commented Feb 7, 2025

> This was written by Gluten's Velox native writer, via org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects#createOutputWriter.

Got it. Could you specify the Spark version in the issue description?

@j7nhai (Contributor, Author) commented Feb 7, 2025

> Got it. Could you specify the Spark version in the issue description?

The Spark version is 3.3.1, but I think this is a general problem caused by a bug in the native Parquet writer.
