Setup project #4

Merged: 9 commits into delta-incubator:master from setup-project on Oct 4, 2022
Conversation

edmondop (Contributor)

No description provided.

@edmondop force-pushed the setup-project branch 2 times, most recently from f9d0b02 to c678086 on August 17, 2022 at 22:39
@edmondop marked this pull request as ready for review on August 17, 2022 at 22:49
@MrPowers (Contributor)

I like this robust project setup code and Poetry.

I'm cool with this approach if folks want to go with the OOP / inheritance style of coding.

What's the difference between StaticReferenceTable and ReferenceTable? Do you envision multiple classes inheriting from ReferenceTable?

Seems like this code can be easily extended to generate the table_metadata.json and table_content.parquet files. Will we also be able to easily extend this code for the writer tests? The writer tests will need some additional assets (e.g. a data file for performing an upsert). I'm cool with this object model as long as it's sufficiently extensible to cover all the types of tests we'll need.

Great work.

@edmondop (Contributor, Author)

edmondop commented Aug 18, 2022

> I like this robust project setup code and Poetry.
>
> I'm cool with this approach if folks want to go with the OOP / inheritance style of coding.
>
> What's the difference between StaticReferenceTable and ReferenceTable? Do you envision multiple classes inheriting from ReferenceTable?
>
> Seems like this code can be easily extended to generate the table_metadata.json and table_content.parquet files. Will we also be able to easily extend this code for the writer tests? The writer tests will need some additional assets (e.g. a data file for performing an upsert). I'm cool with this object model as long as it's sufficiently extensible to cover all the types of tests we'll need.
>
> Great work.

When I coded the ReferenceTable and added an attribute containing a static list of data, I felt I was somewhat "forcing" the design, since other tables might be created differently, for example by reading from disk or by generating random data.

My goal was mainly to get feedback on the design rather than to produce a complete implementation, but I am happy to move forward if there is some agreement. Can we maybe brainstorm a list of features to include in this PR before it becomes too large?
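For illustration, the shape under discussion is roughly the following. Only the class names ReferenceTable and StaticReferenceTable come from the PR; the attribute and method names here are invented:

from abc import ABC, abstractmethod
from typing import List, Tuple


class ReferenceTable(ABC):
    """Base class: common metadata, but no data."""

    def __init__(self, table_name: str, partition_keys: List[str]):
        self.table_name = table_name
        self.partition_keys = partition_keys

    @abstractmethod
    def rows(self) -> List[Tuple]:
        """Concrete subclasses decide where the rows come from."""


class StaticReferenceTable(ReferenceTable):
    """Rows held as a static in-memory list, as in this PR."""

    def __init__(self, table_name: str, partition_keys: List[str], data: List[Tuple]):
        super().__init__(table_name, partition_keys)
        self._data = data

    def rows(self) -> List[Tuple]:
        return self._data


# Other subclasses could read rows from disk or generate random data instead.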

@MrPowers (Contributor)

@edmondo1984 - It'd be great if this PR could contain the reference tables so that @wjones127 could clone the repo and start writing some tests in delta-rs. We can change how the reference tables are generated later, but just having them would be a great way to let the connectors start writing tests, so we can keep iterating. Sound good?

@edmondop force-pushed the setup-project branch 2 times, most recently from 2d0ed80 to d433ab9 on August 29, 2022 at 17:43
This PR introduces a separation between generated and external tables, and therefore between writers (a metadata writer and a generated-table writer). It also adds a new click command that can be used to generate metadata for external tables.
@wjones127 (Collaborator) left a comment


This is shaping up pretty well. I pointed out some things I think we still need to resolve.

@@ -0,0 +1,57 @@
{
Collaborator

If I'm reading this right, this is a JSON schema? But it's missing the declaration?

Are there concrete JSON entries in the reference tables? I don't see them materialized.
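For reference, the declaration in question is presumably the standard "$schema" keyword at the top of the document; a minimal example (the title is illustrative):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "TableMetadata",
  "type": "object"
}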

Collaborator

I see the one for the external table, but not the generated tables.

Contributor (Author)

The JSON schema is for the base class, which contains the common fields (metadata) but not the data (the data is an attribute of the subclass).

Do you think exporting the data is also useful? In that case I agree with you that both JSON schemas should be exported. However, table-metadata.json at the moment does not contain the "RowCollection"; as you can see here: https://github.com/delta-incubator/dat/pull/4/files#diff-520ed0b66472d6cf73ddc9dc60b382eb63f07f40841c28aae6c6513b6c21473cR38 the field is excluded from serialization.

Regarding the schema, it is compliant with JSON Schema Core, JSON Schema Validation and OpenAPI.
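For illustration, that exclusion mechanism looks roughly like the following; a sketch assuming a pydantic model, with illustrative class and field names:

from typing import List, Tuple

from pydantic import BaseModel, Field


class TableMetadata(BaseModel):
    table_name: str
    partition_keys: List[str]
    reader_protocol_version: int
    # exclude=True keeps the data out of .json()/.dict() exports,
    # so it never lands in table-metadata.json
    row_collection: List[Tuple] = Field(default_factory=list, exclude=True)


print(TableMetadata(table_name='t', partition_keys=[], reader_protocol_version=1).json())
# {"table_name": "t", "partition_keys": [], "reader_protocol_version": 1}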

Collaborator

> Do you think exporting the data is also useful?

Yes, I was expecting to see the JSON files in the referenceTable1 and referenceTable2 folders, but I only see the delta folder and the table_content.parquet file in each of those locations. My understanding is the tests should be reading the delta table, then validating that against (1) the content in table_content.parquet and (2) the metadata in the exported JSON for each table.
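For illustration, that validation flow could look roughly like the following in a connector's test suite; a sketch using the delta-rs Python bindings and the file layout described above:

import json

import pandas as pd
from deltalake import DeltaTable  # delta-rs Python bindings


def check_reference_table(root: str) -> None:
    actual = DeltaTable(f'{root}/delta').to_pyarrow_table().to_pandas()

    # (1) validate the content against the parquet reference copy
    expected = pd.read_parquet(f'{root}/table_content.parquet')
    key = sorted(actual.columns)
    pd.testing.assert_frame_equal(
        actual.sort_values(key).reset_index(drop=True)[key],
        expected.sort_values(key).reset_index(drop=True)[key],
    )

    # (2) validate the metadata against the exported JSON
    with open(f'{root}/table-metadata.json') as f:
        metadata = json.load(f)
    assert set(metadata['column_names']) == set(actual.columns)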

@edmondop (Contributor, Author), Sep 2, 2022

I was expecting the same as you; I just realized it's a bug. It should be fixed now; can you check and mark the conversation resolved? I messed up the .gitignore files and the CI pipeline a little, but that should be fixed now as well.

Contributor (Author)

It should be fixed, @wjones127. Can you check?

Collaborator

Yes, that's fixed.

table_description='My first table',
column_names=['letter', 'number', 'a_float'],
partition_keys=['letter'],
reader_protocol_version=2,
Collaborator

FYI we cannot yet read reader_protocol_version=2. I'd prefer we start with tables in reader v1 and writer v2.

What does the latest version of Spark default to, btw?

Contributor (Author)

I am not a maintainer of Spark, but I would be surprised if the Delta jar were bundled within the standard Spark distribution; that would be constraining (if you want to use Spark version x, you would need Delta version y).

I would expect that when you run Spark, you need to specify some configuration settings that tell Spark that Delta is enabled, and place the Delta jar on the application's classpath. I did the following with spark-xml in a project:

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def spark_session(request):
    # build_sparkxml_jar_location and mlflow_helper are project-specific helpers
    session_builder = SparkSession.builder.appName(
        'My Project Automated Testing'
    ).config(
        'spark.jars',
        build_sparkxml_jar_location()
    ).config(
        'spark.executorEnv.MLFLOW_TRACKING_URI',
        mlflow_helper.mlflow_test_tracking_uri
    )
    spark = session_builder.getOrCreate()
    request.addfinalizer(spark.stop)
    return spark
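For Delta specifically, I would expect something along these lines (a sketch; the delta-core coordinates must match the Spark/Scala build):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('delta-enabled-session')
    # pull the Delta jar onto the classpath
    .config('spark.jars.packages', 'io.delta:delta-core_2.12:2.0.0')
    # enable Delta's SQL extension and catalog
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog',
            'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)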

Databricks team, can you confirm?

Collaborator

Ah, sorry for the confusion. When I used Spark more, I was using the Databricks version, which does bundle Delta Lake.

I just want to know what protocol versions are being written out by default in the latest Delta jar right now, so we know which are most prevalent in the real world.

Contributor (Author)

It is documented as part of the Databricks runtime. For example, if you click here: https://docs.databricks.com/release-notes/runtime/11.1.html you can see "Delta Lake: 2.0.0" in the "System environment" section.

Collaborator

That is the library version, not the protocol version.

RowCollection(
write_mode='overwrite',
data=[
('a', 1, 1.1),
Collaborator

We'll need to think about how to specify the schema. There are some complexities that can be represented in Python, such as Decimals with certain precision and scale. However, Python AFAIK has no distinction between int16, int32, and int64 types, which all exist in Delta Lake.

Also IIRC schemas can evolve with an overwrite, so we'll want to represent that in a reference table.

Contributor (Author)

It's a good point. A simple solution could be to add a field to the reference table model to define the schema, with the schema defined in terms of PySpark types. However, if there is no 1-to-1 mapping between PySpark types and Delta Lake types, it will not be a complete solution. What do you think?

Collaborator

Yeah, for the generated tables it does sound like having it written in PySpark makes sense, since the schema should be able to be passed to spark.createDataFrame() to get it to interpret the types correctly. 👍
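For illustration, such a schema field could look like the following; a sketch with illustrative column names, where an explicit StructType pins down the integer widths and decimal precision that bare Python values cannot express, and can be passed straight to spark.createDataFrame():

from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    DecimalType, IntegerType, LongType, ShortType, StringType,
    StructField, StructType,
)

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('letter', StringType()),
    StructField('a_short', ShortType()),          # int16
    StructField('number', IntegerType()),         # int32
    StructField('a_long', LongType()),            # int64
    StructField('a_decimal', DecimalType(10, 2)), # fixed precision and scale
])

df = spark.createDataFrame([('a', 1, 2, 3, Decimal('1.10'))], schema=schema)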

@@ -0,0 +1,6 @@
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
Collaborator

Here the reader version is 1, but in the Python file it was labelled as 2.

Contributor (Author)

Good point. Should we even place the protocol version in the reference table model if it is in the log anyway? If we do, I should implement a check that they match.
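Something like the following, perhaps; a sketch where the function name is illustrative and the log layout follows the protocol action shown above:

import json
from pathlib import Path


def assert_protocol_matches(table_root: str, expected_reader_version: int) -> None:
    # the protocol action is recorded in the first delta log entry
    log_file = Path(table_root) / '_delta_log' / ('0' * 20 + '.json')
    for line in log_file.read_text().splitlines():
        action = json.loads(line)
        if 'protocol' in action:
            assert action['protocol']['minReaderVersion'] == expected_reader_version
            return
    raise AssertionError('no protocol action found in the first log entry')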

Contributor (Author)

What do you think? Should the protocol version be part of the table model?

@wjones127 (Collaborator), Sep 2, 2022

Yes, it should be part of the table model. Two reasons:

  1. Test implementors may wish to filter the tables based on the protocol version, only testing the versions they support (though preferably they should verify that they error on the ones they don't support); see the sketch after this list.
  2. We should test that readers correctly detect when a table upgrades its protocol when enabling certain features.
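For example, a connector's test harness could filter on that field; a minimal sketch, where the RefTable stand-in and the sample table list are hypothetical:

import pytest


class RefTable:
    # minimal stand-in for the reference table model under discussion
    def __init__(self, name, reader_protocol_version):
        self.name = name
        self.reader_protocol_version = reader_protocol_version


TABLES = [RefTable('my_table_v1', 1), RefTable('my_table_v2', 2)]
SUPPORTED_READER_VERSION = 1


@pytest.mark.parametrize('table', TABLES, ids=lambda t: t.name)
def test_read_reference_table(table):
    if table.reader_protocol_version > SUPPORTED_READER_VERSION:
        # preferably, also verify the reader errors cleanly on unsupported versions
        pytest.skip(f'reader protocol v{table.reader_protocol_version} unsupported')
    # ... actual read-and-compare logic goes here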

@tdas commented Sep 6, 2022

@edmondo1984 can you give a description to the PR? It's hard to review such an extensive PR without a summary of all the changes.

@tdas commented Sep 6, 2022

Also, there does not seem to be a README for developers to understand how to develop on this.

"partition_keys": [
"letter"
],
"reader_protocol_version": 2,
Collaborator

This protocol version does not match what is in the delta log.


@wjones127 (Collaborator) left a comment

Looked through again. Looks like the schema part is still a WIP?

@MrPowers merged commit 617a073 into delta-incubator:master on Oct 4, 2022