Support arbitrary column names #957
Add a column name mapping mode in Delta, which allows Delta to use different names in the table schema and in the underlying Parquet files. This is a building block for issues delta-io#957 and delta-io#958. New unit tests. Closes delta-io#962 GitOrigin-RevId: 7a64a33fd60781c17236bff6168044e702c8413a
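The mapping mode described in the commit above is controlled through Delta table properties. A minimal sketch of enabling it on an existing table (based on the `delta.columnMapping.mode` property and the reader/writer protocol versions that column mapping requires; exact values may differ by Delta release):

```sql
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
)
```

With `name` mode enabled, the table schema keeps the user-facing column names while the underlying Parquet files use generated physical names.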
Hello, @liwensun! I was doing some tests in Databricks Community regarding support for arbitrary characters in column names. I found that I could create regular Parquet files with column names containing special characters such as spaces and parentheses. Is this intended behavior, or an accidental new feature not yet announced? Repro steps (tested in Databricks Runtime 11.1):

```python
df = spark.sql("""
SELECT
  1 AS Id,
  'Alice' AS Name,
  'Something' AS `Lots of symbols: !@#$%&*()`
""")

location = "/FileStore/parquet-with-special-chars/"

(df.write
    .format("parquet")
    .mode("overwrite")
    .save(location))

display(spark.read.format("parquet").load(location))
```
@D-to-the-K this is because Spark removed the check from the Parquet side: apache/spark#35229
Thank you, @zsxwing! I completely missed that. I hope the same limitation is soon removed from parquet-mr as well, so that tools such as Azure Data Factory benefit too: https://docs.microsoft.com/en-us/azure/data-factory/connector-troubleshoot-parquet#no-enum-constant
Overview
This is the issue to track interest, feature requests, and progress being made on support for arbitrary column names in Delta Lake, which is a part of implementing “column renaming and dropping” as outlined on the Delta OSS 2022 H1 roadmap here.
Delta tables use Parquet as the underlying file format. Right now, a Delta column must be stored in the underlying Parquet files using the same name. Thus, users can't name a Delta column using characters disallowed by Parquet. This limitation could cause inconvenience for Delta users who want to directly ingest data that contains columns with special characters, e.g., columns with spaces are common in CSV. The end goal of this issue is to lift the Delta column naming restrictions inherited from Parquet.
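To make the goal concrete: the idea is to decouple the logical (user-facing) column name from the physical name written to Parquet. The sketch below illustrates that indirection in plain Python; it is not Delta's actual implementation (Delta stores the physical name in per-column schema metadata), and the character set and naming scheme here are simplified assumptions:

```python
import uuid

# Characters that Parquet historically rejected in column names
# (illustrative subset, not the authoritative list).
PARQUET_INVALID_CHARS = set(' ,;{}()\n\t=')


def needs_mapping(logical_name: str) -> bool:
    """Return True if the logical name cannot be used directly in Parquet."""
    return any(ch in PARQUET_INVALID_CHARS for ch in logical_name)


def assign_physical_names(schema):
    """Map each logical column name to a Parquet-safe physical name.

    Names that are already valid pass through unchanged; invalid ones
    get a generated placeholder (Delta itself uses UUID-based names).
    """
    mapping = {}
    for logical in schema:
        if needs_mapping(logical):
            mapping[logical] = f"col-{uuid.uuid4()}"
        else:
            mapping[logical] = logical
    return mapping


mapping = assign_physical_names(["Id", "Name", "Lots of symbols: !@#$%&*()"])
# "Id" and "Name" pass through unchanged; the third column gets a safe alias.
```

Readers and writers would then translate between the two namespaces at the Parquet boundary, so user-facing operations only ever see the logical names.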
Requirements
Users can name Delta columns using characters disallowed by Parquet, without needing to know what column names the underlying Parquet files use. When a Delta column name contains such characters, all existing Delta operations and API behaviors should remain unaffected.
Please see the detailed requirements here.
Design Sketch
Please see the detailed design sketch here.