Setup project #4

Merged: 9 commits into delta-incubator:master from setup-project on Oct 4, 2022
Conversation

edmondop (Contributor)

No description provided.

@edmondop force-pushed the setup-project branch 2 times, most recently from f9d0b02 to c678086 on August 17, 2022 at 22:39
@edmondop marked this pull request as ready for review on August 17, 2022 at 22:49
@MrPowers (Contributor)

I like this robust project setup code and Poetry.

I'm cool with this approach if folks want to go with the OOP / inheritance style of coding.

What's the difference between StaticReferenceTable and ReferenceTable? Do you envision multiple classes inheriting from ReferenceTable?

Seems like this code can be easily extended to generate the table_metadata.json and table_content.parquet files. Will we also be able to easily extend this code for the writer tests? The writer tests will need some additional assets (e.g. a data file for performing an upsert). I'm cool with this object model as long as it's sufficiently extensible to cover all the types of tests we'll need.

Great work.

@edmondop (Contributor, Author)

edmondop commented Aug 18, 2022

> I like this robust project setup code and Poetry.
>
> I'm cool with this approach if folks want to go with the OOP / inheritance style of coding.
>
> What's the difference between StaticReferenceTable and ReferenceTable? Do you envision multiple classes inheriting from ReferenceTable?
>
> Seems like this code can be easily extended to generate the table_metadata.json and table_content.parquet files. Will we also be able to easily extend this code for the writer tests? The writer tests will need some additional assets (e.g. a data file for performing an upsert). I'm cool with this object model as long as it's sufficiently extensible to cover all the types of tests we'll need.
>
> Great work.

When I coded the ReferenceTable and added an attribute containing a static list of data, I felt I was somewhat "forcing" the design, since other tables might be created differently, for example by reading from disk or by generating random data.

My goal was mainly to get feedback on the design rather than to produce a complete implementation, but I am happy to move forward if there is some agreement. Can we maybe brainstorm a list of features to include in this PR before it becomes too large?
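For illustration, the shape under discussion is roughly the following. Only the class names ReferenceTable and StaticReferenceTable come from the PR; the attribute and method names here are invented:

from abc import ABC, abstractmethod
from typing import List, Tuple


class ReferenceTable(ABC):
    """Base class: common metadata, but no data."""

    def __init__(self, table_name: str, partition_keys: List[str]):
        self.table_name = table_name
        self.partition_keys = partition_keys

    @abstractmethod
    def rows(self) -> List[Tuple]:
        """Concrete subclasses decide where the rows come from."""


class StaticReferenceTable(ReferenceTable):
    """Rows held as a static in-memory list, as in this PR."""

    def __init__(self, table_name: str, partition_keys: List[str], data: List[Tuple]):
        super().__init__(table_name, partition_keys)
        self._data = data

    def rows(self) -> List[Tuple]:
        return self._data


# Other subclasses could read rows from disk or generate random data instead.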

@MrPowers (Contributor)

@edmondo1984 - It'd be great if this PR could contain the reference tables so that @wjones127 could clone the repo and start writing some tests in delta-rs. We can change how the reference tables are generated later, but just having them would be a great way to let the connectors start writing tests, so we can keep iterating. Sound good?

@edmondop force-pushed the setup-project branch 2 times, most recently from 2d0ed80 to d433ab9 on August 29, 2022 at 17:43
This PR introduces a separation between generated and external tables, and therefore between writers (a metadata writer and a generated-table writer). It also adds a new click command that can be used to generate metadata for external tables.
@wjones127 (Collaborator) left a comment


This is shaping up pretty well. I pointed out some things I think we still need to resolve.

@@ -0,0 +1,57 @@
{
Collaborator

If I'm reading this right, this is a JSON schema? But it's missing the declaration?

Are there concrete JSON entries in the reference tables? I don't see them materialized.
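For reference, the declaration in question is presumably the standard "$schema" keyword at the top of the document; a minimal example (the title is illustrative):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "TableMetadata",
  "type": "object"
}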

Collaborator

I see the one for the external table, but not the generated tables.

Contributor (Author)

The JSON schema is for the base class, which contains the common fields (metadata) but not the data (the data is an attribute of the subclass).

Do you think exporting the data is also useful? In that case I agree with you that both JSON schemas should be exported. However, table-metadata.json at the moment does not contain the "RowCollection"; as you can see here: https://github.com/delta-incubator/dat/pull/4/files#diff-520ed0b66472d6cf73ddc9dc60b382eb63f07f40841c28aae6c6513b6c21473cR38 the field is excluded from serialization.

Regarding the schema, it is compliant with JSON Schema Core, JSON Schema Validation and OpenAPI.
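For illustration, that exclusion mechanism looks roughly like the following; a sketch assuming a pydantic model, with illustrative class and field names:

from typing import List, Tuple

from pydantic import BaseModel, Field


class TableMetadata(BaseModel):
    table_name: str
    partition_keys: List[str]
    reader_protocol_version: int
    # exclude=True keeps the data out of .json()/.dict() exports,
    # so it never lands in table-metadata.json
    row_collection: List[Tuple] = Field(default_factory=list, exclude=True)


print(TableMetadata(table_name='t', partition_keys=[], reader_protocol_version=1).json())
# {"table_name": "t", "partition_keys": [], "reader_protocol_version": 1}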

Collaborator

> Do you think exporting the data is also useful?

Yes, I was expecting to see the JSON files in the referenceTable1 and referenceTable2 folders, but I only see the delta folder and the table_content.parquet file in each of those locations. My understanding is the tests should be reading the delta table, then validating that against (1) the content in table_content.parquet and (2) the metadata in the exported JSON for each table.
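For illustration, that validation flow could look roughly like the following in a connector's test suite; a sketch using the delta-rs Python bindings and the file layout described above:

import json

import pandas as pd
from deltalake import DeltaTable  # delta-rs Python bindings


def check_reference_table(root: str) -> None:
    actual = DeltaTable(f'{root}/delta').to_pyarrow_table().to_pandas()

    # (1) validate the content against the parquet reference copy
    expected = pd.read_parquet(f'{root}/table_content.parquet')
    key = sorted(actual.columns)
    pd.testing.assert_frame_equal(
        actual.sort_values(key).reset_index(drop=True)[key],
        expected.sort_values(key).reset_index(drop=True)[key],
    )

    # (2) validate the metadata against the exported JSON
    with open(f'{root}/table-metadata.json') as f:
        metadata = json.load(f)
    assert set(metadata['column_names']) == set(actual.columns)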

@edmondop (Contributor, Author), Sep 2, 2022

I was expecting the same as you; I just realized it's a bug. It should be fixed now; can you check and mark the conversation resolved? I messed up the .gitignore files and the CI pipeline a little, but that should be fixed now as well.

Contributor (Author)

It should be fixed, @wjones127. Can you check?

Collaborator

Yes, that's fixed.

table_description='My first table',
column_names=['letter', 'number', 'a_float'],
partition_keys=['letter'],
reader_protocol_version=2,
Collaborator

FYI we cannot yet read reader_protocol_version=2. I'd prefer we start with tables in reader v1 and writer v2.

What does the latest version of Spark default to, btw?

Contributor (Author)

I am not a maintainer of Spark, but I would be surprised if the Delta jar were bundled within the standard Spark distribution; that would be constraining (if you want to use Spark version x, you would need Delta version y).

I would expect that when you run Spark, you need to specify some configuration settings that tell Spark that Delta is enabled, and place the Delta jar on the application's classpath. I did the following with spark-xml in a project:

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def spark_session(request):
    # build_sparkxml_jar_location and mlflow_helper are project-specific helpers
    session_builder = SparkSession.builder.appName(
        'My Project Automated Testing'
    ).config(
        'spark.jars',
        build_sparkxml_jar_location()
    ).config(
        'spark.executorEnv.MLFLOW_TRACKING_URI',
        mlflow_helper.mlflow_test_tracking_uri
    )
    spark = session_builder.getOrCreate()
    request.addfinalizer(spark.stop)
    return spark
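For Delta specifically, I would expect something along these lines (a sketch; the delta-core coordinates must match the Spark/Scala build):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('delta-enabled-session')
    # pull the Delta jar onto the classpath
    .config('spark.jars.packages', 'io.delta:delta-core_2.12:2.0.0')
    # enable Delta's SQL extension and catalog
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog',
            'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)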

Databricks team, can you confirm?

Collaborator

Ah, sorry for the confusion. When I used Spark more, I was using the Databricks version, which does bundle Delta Lake.

I just want to know what protocol versions are being written out by default in the latest Delta jar right now, so we know which are most prevalent in the real world.

Contributor (Author)

It is documented as part of the Databricks runtime. For example, if you click here: https://docs.databricks.com/release-notes/runtime/11.1.html you can see "Delta Lake: 2.0.0" in the "System environment" section.

Collaborator

That is the library version, not the protocol version.

RowCollection(
write_mode='overwrite',
data=[
('a', 1, 1.1),
Collaborator

We'll need to think about how to specify the schema. There are some complexities that can be represented in Python, such as Decimals with certain precision and scale. However, Python AFAIK has no distinction between int16, int32, and int64 types, which all exist in Delta Lake.

Also IIRC schemas can evolve with an overwrite, so we'll want to represent that in a reference table.

Contributor (Author)

It's a good point. A simple solution could be to add a field to the reference table model to define the schema, with the schema defined in terms of PySpark types. However, if there is no 1-to-1 mapping between PySpark types and Delta Lake types, it will not be a complete solution. What do you think?

Collaborator

Yeah, for the generated tables it does sound like having it written in PySpark makes sense, since the schema should be able to be passed to spark.createDataFrame() to get it to interpret the types correctly. 👍
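For illustration, such a schema field could look like the following; a sketch with illustrative column names, where an explicit StructType pins down the integer widths and decimal precision that bare Python values cannot express, and can be passed straight to spark.createDataFrame():

from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    DecimalType, IntegerType, LongType, ShortType, StringType,
    StructField, StructType,
)

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('letter', StringType()),
    StructField('a_short', ShortType()),          # int16
    StructField('number', IntegerType()),         # int32
    StructField('a_long', LongType()),            # int64
    StructField('a_decimal', DecimalType(10, 2)), # fixed precision and scale
])

df = spark.createDataFrame([('a', 1, 2, 3, Decimal('1.10'))], schema=schema)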

@@ -0,0 +1,6 @@
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
Collaborator

Here the reader version is 1, but in the Python file it was labelled as 2.

Contributor (Author)

Good point. Should we even place the protocol version in the reference table model if it is in the log anyway? If we do, I should implement a check that they match.
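Something like the following, perhaps; a sketch where the function name is illustrative and the log layout follows the protocol action shown above:

import json
from pathlib import Path


def assert_protocol_matches(table_root: str, expected_reader_version: int) -> None:
    # the protocol action is recorded in the first delta log entry
    log_file = Path(table_root) / '_delta_log' / ('0' * 20 + '.json')
    for line in log_file.read_text().splitlines():
        action = json.loads(line)
        if 'protocol' in action:
            assert action['protocol']['minReaderVersion'] == expected_reader_version
            return
    raise AssertionError('no protocol action found in the first log entry')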

Contributor (Author)

What do you think? Should the protocol version be part of the table model?

@wjones127 (Collaborator), Sep 2, 2022

Yes, it should be part of the table model. Two reasons:

  1. Test implementors may wish to filter the tables based on the protocol version, only testing the versions they support (though preferably they should verify that they error on the ones they don't support); see the sketch after this list.
  2. We should test that readers correctly detect when a table upgrades its protocol when enabling certain features.
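For example, a connector's test harness could filter on that field; a minimal sketch, where the RefTable stand-in and the sample table list are hypothetical:

import pytest


class RefTable:
    # minimal stand-in for the reference table model under discussion
    def __init__(self, name, reader_protocol_version):
        self.name = name
        self.reader_protocol_version = reader_protocol_version


TABLES = [RefTable('my_table_v1', 1), RefTable('my_table_v2', 2)]
SUPPORTED_READER_VERSION = 1


@pytest.mark.parametrize('table', TABLES, ids=lambda t: t.name)
def test_read_reference_table(table):
    if table.reader_protocol_version > SUPPORTED_READER_VERSION:
        # preferably, also verify the reader errors cleanly on unsupported versions
        pytest.skip(f'reader protocol v{table.reader_protocol_version} unsupported')
    # ... actual read-and-compare logic goes here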

@tdas commented Sep 6, 2022

@edmondo1984 can you give a description to the PR? It's hard to review such an extensive PR without a summary of all the changes.

@tdas commented Sep 6, 2022

Also, there does not seem to be a README for developers to understand how to develop on this.

"partition_keys": [
"letter"
],
"reader_protocol_version": 2,
Collaborator

This protocol version does not match what is in the delta log.


@wjones127 (Collaborator) left a comment

Looked through again. Looks like the schema part is still a WIP?

@MrPowers merged commit 617a073 into delta-incubator:master on Oct 4, 2022