Backend
VL (Velox)
Bug description
Expected to collect column stats for every column, but we found that the stats for the timestamp column are missing.

With spark.gluten.sql.native.writer.enabled=true set, we write a Parquet file:
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StructType, StructField, TimestampType}
import java.sql.Timestamp

val data = Seq(
  Row(Timestamp.valueOf("2022-01-01 00:00:00")),
  Row(Timestamp.valueOf("2022-01-02 00:00:00"))
)
val schema = List(
  StructField("timestamp", TimestampType, true)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
df.show()
df.coalesce(1).write.mode("overwrite").parquet("/data/test_parquet")
Then we find a Parquet file:

part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet

However, the timestamp statistics were not collected. Reading the file footer back confirms this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val conf = new Configuration()
val path = new Path("/data/test_parquet/part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet")
val file = HadoopInputFile.fromPath(path, conf)
val reader = ParquetFileReader.open(file)
val group = reader.getRowGroups.asScala.head
val columns = group.getColumns.asScala
columns.foreach(c => println(c.getStatistics))
The output is:

no stats for this column

Spark version
None

Spark configurations
spark.gluten.sql.native.writer.enabled=true

System information
No response

Relevant logs
No response
part-00000-3d1e02c1-9fea-4203-a423-019dfc0d9f6d-c000.snappy.parquet

Seems this file name was not written by gluten.
This is written by Gluten's Velox native writer, from org.apache.spark.sql.execution.datasources.velox.VeloxFormatWriterInjects#createOutputWriter.
Got it. Could you specify the Spark version in the issue description?
The Spark version is 3.3.1, but I think it is a common problem caused by a bug in the native Parquet writer.
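Until the native writer collects these statistics, a possible mitigation (my assumption; not confirmed anywhere in this thread) is to fall back to Spark's built-in Parquet writer for affected jobs by flipping the flag from the reproduction:

```properties
# Hypothetical workaround: disable Gluten's native writer so that
# vanilla Spark writes the Parquet file instead.
spark.gluten.sql.native.writer.enabled=false
```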