
Support arbitrary column names #957

Closed
liwensun opened this issue Feb 23, 2022 · 3 comments
Labels
enhancement New feature or request
Milestone

Comments

liwensun commented Feb 23, 2022

Overview

This issue tracks interest, feature requests, and progress on support for arbitrary column names in Delta Lake, which is part of the “column renaming and dropping” work outlined on the Delta OSS 2022 H1 roadmap here.

Delta tables use Parquet as the underlying file format. Right now, a Delta column must be stored in the underlying Parquet files under the same name, so users can't name a Delta column using characters disallowed by Parquet. This limitation is inconvenient for Delta users who want to directly ingest data whose columns contain special characters, e.g., columns with spaces are common in CSV. The end goal of this issue is to lift the Delta column naming restrictions inherited from Parquet.
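To make the restriction concrete, here is a rough Python re-creation of the kind of field-name check that Parquet-based writers (such as Spark's Parquet writer before apache/spark#35229) enforced. The exact character set is an assumption drawn from the error message those writers produced (" ,;{}()\n\t="), not a verbatim copy of the production code:

```python
# Characters historically rejected in Parquet field names by Spark's
# Parquet writer (assumed set; see the writer's error message).
PARQUET_INVALID_CHARS = set(' ,;{}()\n\t=')

def is_valid_parquet_field_name(name: str) -> bool:
    """Return True if the name contains none of the disallowed characters."""
    return not any(ch in PARQUET_INVALID_CHARS for ch in name)

print(is_valid_parquet_field_name("user_id"))                     # True
print(is_valid_parquet_field_name("Lots of symbols: !@#$%&*()"))  # False
```

Any name failing such a check could not be written to Parquet directly, which is why Delta needs a layer of indirection between logical and physical column names.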

Requirements

Users can name Delta columns using characters disallowed by Parquet, without having to be concerned with the column names the underlying Parquet files use. When a Delta column name contains such characters, all existing Delta operations and API behaviors should remain unaffected.

Please see the detailed requirements here.

Design Sketch

Please see the detailed design sketch here.

@liwensun liwensun added the enhancement New feature or request label Feb 23, 2022
@liwensun liwensun changed the title Support Arbitrary Column Names Support arbitrary column names Feb 23, 2022
scottsand-db pushed a commit that referenced this issue Mar 3, 2022
Add a column name mapping mode in Delta, which allows Delta to use different names in the table schema and in the underlying Parquet files. This is a building block for issues #957 and #958.

New unit tests.

Closes #962

GitOrigin-RevId: 7a64a33fd60781c17236bff6168044e702c8413a
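The commit above introduces a column name mapping mode, which decouples the user-facing (logical) column name from the name stored in Parquet. A minimal sketch of that idea in plain Python follows; the "col-&lt;uuid&gt;" physical-name convention and the helper names here are illustrative assumptions, not Delta's actual implementation:

```python
import uuid

def add_physical_names(logical_names):
    """Assign each logical column a stable, Parquet-safe physical name.
    The 'col-<uuid>' pattern mirrors (but is not guaranteed to match)
    what Delta generates in name mapping mode."""
    return {logical: f"col-{uuid.uuid4()}" for logical in logical_names}

def rename_to_logical(parquet_row, mapping):
    """On read, map a row keyed by physical names back to logical names."""
    physical_to_logical = {p: l for l, p in mapping.items()}
    return {physical_to_logical[k]: v for k, v in parquet_row.items()}

# The logical name may contain characters Parquet disallows;
# only the generated physical name is ever written to the files.
mapping = add_physical_names(["Id", "Lots of symbols: !@#$%&*()"])
row_on_disk = {mapping["Id"]: 1,
               mapping["Lots of symbols: !@#$%&*()"]: "Something"}
print(rename_to_logical(row_on_disk, mapping))
```

Because the mapping lives in table metadata rather than in the files, renaming a column becomes a metadata-only change, which is the building block the commit refers to.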
@vkorukanti vkorukanti added this to the 1.2 milestone Apr 8, 2022
D-to-the-K commented

Hello, @liwensun!

I was doing some tests in Databricks Community regarding support for arbitrary characters in column names. I found that I could create regular Parquet files with column names containing special characters such as spaces and parentheses. Is this intended behavior, or an accidental, not-yet-announced feature?

[screenshot: display() output showing the column `Lots of symbols: !@#$%&*()` read back from Parquet]

Repro steps (tested in Databricks Runtime 11.1):

# Build a DataFrame with a column whose name contains special characters
df = spark.sql("""
    SELECT
        1 AS Id,
        'Alice' AS Name,
        'Something' AS `Lots of symbols: !@#$%&*()`
""")

location = "/FileStore/parquet-with-special-chars/"

# Write it out as plain Parquet (not Delta)
(df.write
    .format("parquet")
    .mode("overwrite")
    .save(location))

# Read it back; the special-character column survives the round trip
display(spark.read.format("parquet").load(location))


zsxwing commented Aug 24, 2022

@D-to-the-K this is because Spark removed the check on the Parquet side: apache/spark#35229

D-to-the-K commented

Thank you, @zsxwing! I completely missed that.

I hope parquet-mr is changed soon to remove the same limitation from tools such as Azure Data Factory as well.

https://docs.microsoft.com/en-us/azure/data-factory/connector-troubleshoot-parquet#no-enum-constant

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48
