
Support arbitrary column names #957

Closed
liwensun opened this issue Feb 23, 2022 · 3 comments
Labels
enhancement New feature or request
Milestone

Comments

liwensun commented Feb 23, 2022

Overview

This issue tracks interest, feature requests, and progress on support for arbitrary column names in Delta Lake, which is part of the “column renaming and dropping” work outlined on the Delta OSS 2022 H1 roadmap here.

Delta tables use Parquet as the underlying file format. Right now, a Delta column must be stored in the underlying Parquet files under the same name, so users can't name a Delta column using characters disallowed by Parquet. This limitation is inconvenient for Delta users who want to directly ingest data whose columns contain special characters, e.g., columns with spaces are common in CSV. The end goal of this issue is to lift the Delta column naming restrictions inherited from Parquet.
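To make the restriction concrete, here is a rough Python re-creation of the kind of field-name check that Parquet-based writers (such as Spark's Parquet writer before apache/spark#35229) enforced. The exact character set is an assumption drawn from the error message those writers produced (" ,;{}()\n\t="), not a verbatim copy of the production code:

```python
# Characters historically rejected in Parquet field names by Spark's
# Parquet writer (assumed set; see the writer's error message).
PARQUET_INVALID_CHARS = set(' ,;{}()\n\t=')

def is_valid_parquet_field_name(name: str) -> bool:
    """Return True if the name contains none of the disallowed characters."""
    return not any(ch in PARQUET_INVALID_CHARS for ch in name)

print(is_valid_parquet_field_name("user_id"))                     # True
print(is_valid_parquet_field_name("Lots of symbols: !@#$%&*()"))  # False
```

Any name failing such a check could not be written to Parquet directly, which is why Delta needs a layer of indirection between logical and physical column names.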

Requirements

Users can name Delta columns using characters disallowed by Parquet, without having to be concerned with the column names the underlying Parquet files use. When a Delta column name contains such characters, all existing Delta operations and API behaviors should remain unaffected.

Please see the detailed requirements here.

Design Sketch

Please see the detailed design sketch here.

@liwensun liwensun added the enhancement New feature or request label Feb 23, 2022
@liwensun liwensun changed the title Support Arbitrary Column Names Support arbitrary column names Feb 23, 2022
scottsand-db pushed a commit that referenced this issue Mar 3, 2022
Add a column name mapping mode in Delta, which allows Delta to use different names in the table schema and in the underlying Parquet files. This is a building block for issues #957 and #958.

New unit tests.

Closes #962

GitOrigin-RevId: 7a64a33fd60781c17236bff6168044e702c8413a
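The commit above introduces a column name mapping mode, which decouples the user-facing (logical) column name from the name stored in Parquet. A minimal sketch of that idea in plain Python follows; the "col-&lt;uuid&gt;" physical-name convention and the helper names here are illustrative assumptions, not Delta's actual implementation:

```python
import uuid

def add_physical_names(logical_names):
    """Assign each logical column a stable, Parquet-safe physical name.
    The 'col-<uuid>' pattern mirrors (but is not guaranteed to match)
    what Delta generates in name mapping mode."""
    return {logical: f"col-{uuid.uuid4()}" for logical in logical_names}

def rename_to_logical(parquet_row, mapping):
    """On read, map a row keyed by physical names back to logical names."""
    physical_to_logical = {p: l for l, p in mapping.items()}
    return {physical_to_logical[k]: v for k, v in parquet_row.items()}

# The logical name may contain characters Parquet disallows;
# only the generated physical name is ever written to the files.
mapping = add_physical_names(["Id", "Lots of symbols: !@#$%&*()"])
row_on_disk = {mapping["Id"]: 1,
               mapping["Lots of symbols: !@#$%&*()"]: "Something"}
print(rename_to_logical(row_on_disk, mapping))
```

Because the mapping lives in table metadata rather than in the files, renaming a column becomes a metadata-only change, which is the building block the commit refers to.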
@vkorukanti vkorukanti added this to the 1.2 milestone Apr 8, 2022
D-to-the-K commented

Hello, @liwensun!

I was doing some tests in Databricks Community regarding support for arbitrary characters in column names. I found that I could create regular Parquet files with column names containing special characters such as spaces and parentheses. Is this intended behavior, or an accidental, not-yet-announced feature?

[screenshot: display() output showing the column `Lots of symbols: !@#$%&*()` read back from Parquet]

Repro steps (tested in Databricks Runtime 11.1):

# Build a DataFrame with a column whose name contains special characters
df = spark.sql("""
    SELECT
        1 AS Id,
        'Alice' AS Name,
        'Something' AS `Lots of symbols: !@#$%&*()`
""")

location = "/FileStore/parquet-with-special-chars/"

# Write it out as plain Parquet (not Delta)
(df.write
    .format("parquet")
    .mode("overwrite")
    .save(location))

# Read it back; the special-character column survives the round trip
display(spark.read.format("parquet").load(location))


zsxwing commented Aug 24, 2022

@D-to-the-K this is because Spark removed the check on the Parquet side: apache/spark#35229

D-to-the-K commented

Thank you, @zsxwing! I completely missed that.

I hope parquet-mr is changed soon to remove the same limitation from tools such as Azure Data Factory as well.

https://docs.microsoft.com/en-us/azure/data-factory/connector-troubleshoot-parquet#no-enum-constant

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageTypeParser.java#L48
