Add support for GENERATED ALWAYS AS IDENTITY in DeltaTableBuilder #1072

Closed
norbitek opened this issue Apr 15, 2022 · 14 comments · Fixed by #3404
Labels
enhancement New feature or request

Comments

@norbitek

The latest version of Databricks added support for identity columns in Delta tables.
It is possible to define GENERATED ALWAYS AS IDENTITY in column specification.
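
For reference, a minimal sketch of what such a column specification looks like in Databricks SQL, assembled here as a plain Python string. The helper names `identity_column_ddl` and `create_table_ddl` are hypothetical and exist only for this illustration; the table and column names are borrowed from the example in this issue.

```python
def identity_column_ddl(name, start=1, step=1):
    """Return a BIGINT column definition using the SQL identity clause."""
    # Databricks documents that the increment step cannot be 0.
    if step == 0:
        raise ValueError("INCREMENT BY must be non-zero")
    return (
        f"{name} BIGINT GENERATED ALWAYS AS IDENTITY "
        f"(START WITH {start} INCREMENT BY {step})"
    )


def create_table_ddl(table, column_defs):
    """Assemble a CREATE TABLE ... USING DELTA statement from column definitions."""
    body = ",\n  ".join(column_defs)
    return f"CREATE TABLE {table} (\n  {body}\n) USING DELTA"


# Hypothetical usage: build the statement for a small table with an identity key.
ddl = create_table_ddl(
    "default.people10m",
    [
        identity_column_ddl("id", start=10, step=10),
        "firstName STRING",
        "lastName STRING",
    ],
)
print(ddl)
```

The resulting string could then be passed to `spark.sql(ddl)`, assuming the runtime accepts the identity clause.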

It would be nice to do the same using DeltaTableBuilder, for example:

from delta.tables import DeltaTable
from pyspark.sql.types import DateType

# Proposed usage: passing IDENTITY(...) to generatedAlwaysAs is the feature
# requested in this issue; `spark` is assumed to be an existing SparkSession.
(
    DeltaTable.create(spark)
    .tableName("default.people10m")
    .addColumn("id", "BIGINT", generatedAlwaysAs="IDENTITY(START WITH 10 INCREMENT BY 10)")
    .addColumn("firstName", "STRING")
    .addColumn("middleName", "STRING")
    .addColumn("lastName", "STRING", comment="surname")
    .addColumn("gender", "STRING")
    .addColumn("birthDate", "TIMESTAMP")
    .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)")
    .addColumn("ssn", "STRING")
    .addColumn("salary", "INT")
    .partitionedBy("gender")
    .execute()
)

@norbitek added the enhancement label on Apr 15, 2022
@allisonport-db
Collaborator

Hi @norbitek, thanks for opening this issue. This is definitely in the plan for Delta Lake, but we're currently prioritizing other features on the roadmap (#920), like OPTIMIZE ZORDER and CDF.

@keen85

keen85 commented Aug 12, 2022

@norbitek, it's on the roadmap for 2022 H2 🥳
#1307

@wedesoft

I tried to add a generated column using SQL. So I understand it is not yet supported in PySpark?


@zsxwing
Member

zsxwing commented Sep 30, 2022

@wedesoft Spark doesn't support it yet. SQL syntax support for generated columns is tracked by #1100.

@jasperp97

Is this still on the roadmap?

@thebaz73

Any news on the status of this issue?

@shahkalpan07

Any update on the release date?

@bart-samwel
Collaborator

This is definitely still on the roadmap! However, at the moment all the focus is on completing Deletion Vectors, which is in high demand. We will only get to this item after that work is complete.

@keen85

keen85 commented Feb 7, 2024

Since Delta Lake 3.1.0 (with deletion vectors) is out now, would you consider working on it for 3.2, @bart-samwel? 😇

@bart-samwel
Collaborator

@keen85

> Since Delta Lake 3.1.0 (with deletion vectors) is out now, would you consider working on it for 3.2

Thank you for the reminder! It is near the top of our list now. I can't make any hard guarantees, but I'm hopeful that we'll get to this pretty soon.

@norbitek
Author

norbitek commented Feb 8, 2024

@bart-samwel
What is the reason that features in the Standalone version are implemented with such a long delay?
Does it mean that for every new feature (for example, liquid clustering) we will wait about 2 years?

@bart-samwel
Collaborator

bart-samwel commented Feb 8, 2024

@norbitek

> What is the reason that features in the Standalone version are implemented with such a long delay?

Just to make sure there's no confusion here: Delta Standalone is different from the Spark connector for Delta Lake. Standalone is a library that can be used to implement connectors for non-Spark systems, and it is not really getting new features anymore -- its design is not really suitable for supporting many of the new features easily. All of the new efforts are going into Delta Kernel, which is the new library for building connectors. It makes it a lot easier to keep up with new features, and we intend to keep it up to date.

Identity columns is a feature where we have unfortunately dropped the ball even for support in the Spark connector. It's the exception though, not the rule!

> Does it mean that for every new feature (for example, liquid clustering) we will wait about 2 years?

Certainly not! Like I said, identity columns is an exception. Liquid clustering is actually released in Delta Lake 3.1 which came out last week! https://github.com/delta-io/delta/releases

@SYOGESH045

Hi, my company currently doesn't use Spark SQL anywhere, and I want to use the DeltaTableBuilder API here. Is this resolved? If not, when will we get this update?

Many thanks,
Yogesh S

@tdas
Contributor

tdas commented May 30, 2024

@SYOGESH045 The next release of Delta is going to be Delta 3.3. The identity column support seems to be in progress - #3044. So Delta 3.3 should have it. If I have to hazard a guess, Delta 3.3 should be released in 2-3 months.
