Setup project #4
Conversation
I like this robust project setup code and Poetry. I'm cool with this approach if folks want to go with the OOP / inheritance style of coding. What's the difference between […]? Seems like this code can be easily extended to generate the […]. Great work.
When I coded the […], my goal was mainly to get feedback on the design rather than some sort of complete implementation, but I am happy to move forward if there is some agreement. Can we maybe brainstorm a list of features to include in this PR before it becomes too large?
@edmondo1984 - It'd be great if this PR could contain the reference tables so that @wjones127 could clone the repo and start writing some tests in delta-rs. We can change the underlying generation of the reference tables, but just having them would be a great way to let the connectors start writing some tests, so we can keep on iterating. Sounds good?
This PR introduces a separation between generated and external tables, and therefore between writers (a metadata writer and a generated-table writer). It also adds a new click command that can be used to generate metadata for external tables.
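For context, a click command for generating external-table metadata might look roughly like the sketch below. The command name, option names, and the metadata fields are hypothetical illustrations, not the actual code in this PR.

import json
from pathlib import Path

import click


@click.command(name='generate-external-metadata')  # hypothetical command name
@click.option('--table-path', required=True, help='Path to the external Delta table.')
@click.option('--output', default='out/tables', help='Directory where the metadata JSON is written.')
def generate_external_metadata(table_path: str, output: str) -> None:
    """Write a table-metadata.json file describing an external (non-generated) table."""
    metadata = {
        'table_path': table_path,
        'table_type': 'external',
    }
    out_dir = Path(output)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / 'table-metadata.json').write_text(json.dumps(metadata, indent=2))


if __name__ == '__main__':
    generate_external_metadata()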
This is shaping up pretty well. I pointed out some things I think we still need to resolve.
out/schemas/schema.json (Outdated)
@@ -0,0 +1,57 @@
{
If I'm reading this right, this is a JSON schema? But it's missing the declaration?
Are there concrete JSON entries in the reference tables? I don't see them materialized.
I see the one for the external table, but not the generated tables.
The JSON schema is for the base class, which contains the common fields (metadata) but not the data (the data is an attribute of the subclass).
Do you think also exporting the data is useful? In that case, I agree with you that the two JSON schemas should be exported. However, table-metadata.json at the moment does not contain the RowCollection, as you can see here: https://github.com/delta-incubator/dat/pull/4/files#diff-520ed0b66472d6cf73ddc9dc60b382eb63f07f40841c28aae6c6513b6c21473cR38; the field is excluded from serialization.
Regarding the schema, it is compliant with JSON Schema Core, JSON Schema Validation and OpenAPI.
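As a minimal sketch of what "excluded from serialization" could look like, assuming the table models are pydantic v1 BaseModel subclasses (the class and field names here are illustrative, not the actual ones in this PR):

from typing import Any, List, Optional

from pydantic import BaseModel


class TableMetadata(BaseModel):
    # Common metadata fields shared by generated and external tables.
    table_name: str
    table_description: str
    partition_keys: List[str] = []


class GeneratedTable(TableMetadata):
    # Hypothetical data attribute of the subclass; kept out of table-metadata.json.
    row_collection: Optional[Any] = None


table = GeneratedTable(table_name='t', table_description='demo', row_collection=[('a', 1)])
# Exclude the data field when serializing the metadata.
print(table.json(exclude={'row_collection'}, indent=2))
# The JSON Schema for the base class can be exported separately.
print(TableMetadata.schema_json(indent=2))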
Do you think also exporting the data is useful?

Yes, I was expecting to see the JSON files in the referenceTable1 and referenceTable2 folders, but I only see the delta folder and the table_content.parquet file in each of those locations. My understanding is the tests should be reading the Delta table, then validating that against (1) the content in table_content.parquet and (2) the metadata in the exported JSON for each table.
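A rough sketch of that test flow in Python, assuming the delta-rs Python bindings (the deltalake package) and an illustrative directory layout; this is not taken from the PR itself:

import json
from pathlib import Path

import pyarrow.parquet as pq
from deltalake import DeltaTable

table_root = Path('out/reader_tests/referenceTable1')  # illustrative layout

# 1. Read the Delta table under test.
actual = DeltaTable(str(table_root / 'delta')).to_pyarrow_table()

# 2. Compare against the expected content stored as plain Parquet.
expected = pq.read_table(table_root / 'table_content.parquet')
assert actual.sort_by('letter').equals(expected.sort_by('letter'))

# 3. Compare against the exported metadata JSON.
metadata = json.loads((table_root / 'table-metadata.json').read_text())
assert metadata['partition_keys'] == ['letter']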
I was expecting the same as you; I actually just realized it's a bug. I messed up a little with .gitignore and the CI pipeline, but it should be fixed now. Can you check and mark the conversation resolved?
It should be fixed now. @wjones127, can you check?
Yes, that's fixed.
table_description='My first table',
column_names=['letter', 'number', 'a_float'],
partition_keys=['letter'],
reader_protocol_version=2,
FYI we cannot yet read reader_protocol_version=2. I'd prefer we start with tables in reader v1 and writer v2.
What does the latest version of Spark default to, btw?
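For reference, the protocol versions of a Delta table can be pinned or upgraded through table properties. A hedged sketch using Spark SQL, assuming a `spark` SparkSession with Delta configured and an illustrative table path (note that Delta only allows upgrading the protocol, never downgrading it):

# Assumes `spark` is a SparkSession with Delta Lake on the classpath.
spark.sql("""
    ALTER TABLE delta.`/tmp/referenceTable1/delta`
    SET TBLPROPERTIES (
      'delta.minReaderVersion' = '1',
      'delta.minWriterVersion' = '2'
    )
""")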
I am not a maintainer of Spark, but I would be surprised if the Delta jar were bundled within the standard Spark distribution; that would be constraining (if you want to use Spark x, you need to use Delta y).
I would expect that when you run Spark, you need to specify some configuration settings that tell Spark that Delta is enabled, and place Delta on the classpath of the application. I did the following with spark-xml in a project:
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def spark_session(request):
    # build_sparkxml_jar_location() and mlflow_helper are project-specific helpers:
    # the first returns the local path of the spark-xml jar, the second holds the
    # MLflow tracking URI used by the test suite.
    session_builder = SparkSession.builder.appName(
        'My Project Automated Testing'
    ).config(
        'spark.jars',
        build_sparkxml_jar_location()
    ).config(
        'spark.executorEnv.MLFLOW_TRACKING_URI',
        mlflow_helper.mlflow_test_tracking_uri
    )
    spark = session_builder.getOrCreate()
    request.addfinalizer(spark.stop)
    return spark
Databricks team, can you confirm?
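For Delta specifically, the open-source delta-spark package documents a similar pattern. A minimal sketch (not from this PR; the app name is illustrative) of a test fixture that puts Delta on the classpath:

import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def delta_spark_session(request):
    builder = (
        SparkSession.builder.appName('dat-reference-table-generation')
        .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
        .config(
            'spark.sql.catalog.spark_catalog',
            'org.apache.spark.sql.delta.catalog.DeltaCatalog',
        )
    )
    # configure_spark_with_delta_pip adds the delta-core jars matching the
    # installed delta-spark version to spark.jars.packages.
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    request.addfinalizer(spark.stop)
    return spark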
Ah, sorry for the confusion. When I used Spark more, I was using the Databricks version, which does bundle Delta Lake.
I just want to know which protocol versions are being written out by default by the latest Delta jar right now, so we know which are most prevalent in the real world.
It is documented as part of the Databricks Runtime. For example, if you click here: https://docs.databricks.com/release-notes/runtime/11.1.html
you can see in the "System environment" section: Delta Lake: 2.0.0.
That is the library version, not the protocol version.
RowCollection(
    write_mode='overwrite',
    data=[
        ('a', 1, 1.1),
We'll need to think about how to specify the schema. There are some complexities that can be represented in Python, such as Decimals with a certain precision and scale. However, Python AFAIK has no distinction between int16, int32, and int64 types, which all exist in Delta Lake.
Also, IIRC schemas can evolve with an overwrite, so we'll want to represent that in a reference table.
It's a good point. A simple solution could be to add a field to the reference table model to define the schema, and have this schema defined in terms of PySpark types. However, if there is no 1-to-1 mapping between PySpark types and Delta Lake types, it will not be a complete solution. What do you think?
Yeah, for the generated tables it does sound like having it written in PySpark types makes sense, since the schema should be able to be passed to spark.createDataFrame() to get it to interpret the types correctly. 👍
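A minimal sketch of what defining the schema in PySpark types could look like, so integer widths and decimal precision/scale are explicit. It assumes a `spark` SparkSession like the fixture above; the letter/number/a_float columns come from the example table, while the decimal column is purely illustrative:

from decimal import Decimal

from pyspark.sql.types import (
    DecimalType,
    FloatType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

schema = StructType([
    StructField('letter', StringType()),
    StructField('number', IntegerType()),          # int32; ShortType()/LongType() give int16/int64
    StructField('a_float', FloatType()),
    StructField('a_decimal', DecimalType(10, 2)),  # precision and scale made explicit
])

df = spark.createDataFrame(
    [('a', 1, 1.1, Decimal('1.10'))],
    schema=schema,
)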
@@ -0,0 +1,6 @@
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
Here the reader version is 1, but in the Python file it was labelled as 2.
Good point. Should we even place the protocol version in the reference table model if it is in the log anyway? If we do, I should implement some checks that they match.
What do you think? Should the protocol version be part of the table model?
Yes, it should be part of the table model. Two reasons:
- Test implementors may wish to filter the tables based on the protocol version, only testing the versions they support. (Though preferably they should verify that they error on ones they don't support.)
- We should test that readers correctly detect when a table upgrades its protocol when enabling certain features.
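If the protocol version stays in the table model, a consistency check against the Delta log could be as simple as the sketch below; the file layout, function name, and model field names are illustrative, and it assumes the protocol action is in the first commit (true for newly created tables):

import json
from pathlib import Path


def read_protocol_from_log(delta_table_path: str) -> dict:
    """Return the protocol action from the first Delta log entry."""
    first_commit = Path(delta_table_path) / '_delta_log' / f'{0:020d}.json'
    for line in first_commit.read_text().splitlines():
        action = json.loads(line)
        if 'protocol' in action:
            return action['protocol']
    raise ValueError(f'No protocol action found in {first_commit}')


protocol = read_protocol_from_log('out/reader_tests/referenceTable1/delta')
assert protocol['minReaderVersion'] == 1  # should match reader_protocol_version in the model
assert protocol['minWriterVersion'] == 2  # should match writer_protocol_version in the model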
@edmondo1984 can you give a description to the PR? It's hard to review such an extensive PR without a summary of all the changes.
Also, there does not seem to be a README for developers to understand how to develop on this.
"partition_keys": [ | ||
"letter" | ||
], | ||
"reader_protocol_version": 2, |
This protocol version does not match what is in the delta log.
out/schemas/schema.json (Outdated)
@@ -0,0 +1,57 @@
{
Yes, that's fixed.
Looked through again. Looks like the schema part is still a WIP?