Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SparkSession.Catalog #231

Merged
merged 40 commits into from
Sep 24, 2019
Merged

Implement SparkSession.Catalog #231

merged 40 commits into from
Sep 24, 2019

Conversation

GoEddie
Copy link
Contributor

@GoEddie GoEddie commented Aug 30, 2019

This is for #226

We are excited to review your PR.

So we can do the best job, please check:

  • There's a descriptive title that will make sense to other developers some time from now.
  • There's associated issues. All PR's should have issue(s) associated - unless a trivial self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

Implementing SparkSession.Catalog, this includes most of the catalog functions except any deprecated ones and some of the experimental functions.

@GoEddie GoEddie changed the title WIP: Implement Catalog Implement SparkSession.Catalog Sep 2, 2019
@imback82 imback82 self-requested a review September 3, 2019 16:25
@imback82 imback82 added the enhancement New feature or request label Sep 3, 2019
using System.Collections.Generic;
using System.IO;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these usings still needed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test uses Path.Combine to build a path, I know in other places the code just manually creates a path but I think Path.Combine is better (I was going to do another pr converting the manually built strings to Path.Combine)

Assert.IsType<DataFrame>(catalog.ListFunctions());
Assert.IsType<DataFrame>(catalog.ListFunctions("default"));

var table = catalog.CreateTable("users", Path.Combine(TestEnvironment.ResourceDirectory, "users.parquet"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surround your test and put something like the following at the beginning of CatalogFunctions()

using (var tempDir = new TemporaryDirectory())
{
...
}

Then you should be able to do something like

DataFrame table = catalog.CreateTable("users", Path.Combine(tempDir.Path, "users.parquet"));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GoEddie what do you think about @suhsteve 's suggestion so that we clean up the resources per test (or suite)

@GoEddie
Copy link
Contributor Author

GoEddie commented Sep 11, 2019

Just to clarify, is the guidance to copy from the comments from spark docs or describe them in more detail?

In some feedback here I have been asked to copy from the spark codebase and some to add more detail.

I don't mind which but just need some guidance!

"copy from spark": #231 (comment)
"add more detail": #231 (comment)

Example "tableExists":

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/catalog/Catalog.html#tableExists-java.lang.String-

public abstract boolean tableExists(String dbName,
                                    String tableName)
Check if the table or view with the specified name exists in the specified database.
Parameters:
dbName - is a name that designates a database.
tableName - is an unqualified name that designates a table.
Returns:
(undocumented)

I have added a return type and can add something like "bool - true if the table exists, false if it doesn't" - the meaning of the return should be obvious??

Looking throughthe codebase some places are more descriptive and some less:

        /// <summary>
        /// Returns a sort expression based on the descending order of the column.
        /// </summary>
        /// <param name="columnName">Column name</param>
        /// <returns>Column object</returns>
        public static Column Desc(string columnName)
        {
            return ApplyFunction("desc", columnName);
        }
        /// <summary>
        /// Apply boolean AND operator with the given column.
        /// </summary>
        /// <param name="other">Column to apply AND operator</param>
        /// <returns>New column after applying the operator</returns>

Let me know and i'll fix them all up :)

@imback82
Copy link
Contributor

Just to clarify, is the guidance to copy from the comments from spark docs or describe them in more detail?

Thanks for bringing this up @GoEddie. Looks like there was an inconsistency in the review process.

What I typically do about comments when new APIs are introduced is to start from the Spark comments and update them as fit (not all their comments are great either. :)). I understand that there are already inconsistencies in comments in our codebase. Since it's hard to go back and fix all the time, we try to make the bar a little higher on the new PRs. If you think review comments are not reasonable, please include me in the discussion so I can resolve them.

BTW, We also publish API references based on the comments: https://docs.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.dataframe?view=spark-dotnet.

Thanks for your contribution @GoEddie!

@GoEddie GoEddie changed the title Implement SparkSession.Catalog WIP: Implement SparkSession.Catalog Sep 16, 2019
@GoEddie
Copy link
Contributor Author

GoEddie commented Sep 16, 2019

Thanks @imback82 i'll make sure to start with the spark comments and add to them if they need it.

@GoEddie GoEddie changed the title WIP: Implement SparkSession.Catalog Implement SparkSession.Catalog Sep 17, 2019
@GoEddie GoEddie changed the title Implement SparkSession.Catalog WIP:Implement SparkSession.Catalog Sep 17, 2019
@GoEddie GoEddie changed the title WIP:Implement SparkSession.Catalog Implement SparkSession.Catalog Sep 19, 2019
@GoEddie
Copy link
Contributor Author

GoEddie commented Sep 19, 2019

I think that is all the comments dealt with - if your not happy or want changes let me know :)

@imback82
Copy link
Contributor

Thanks @GoEddie, I will review this soon.

@GoEddie
Copy link
Contributor Author

GoEddie commented Sep 20, 2019

Great let me know if you want any changes :)

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have few minor comments. Great work!

Assert.IsType<DataFrame>(catalog.ListFunctions());
Assert.IsType<DataFrame>(catalog.ListFunctions("default"));

var table = catalog.CreateTable("users", Path.Combine(TestEnvironment.ResourceDirectory, "users.parquet"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GoEddie what do you think about @suhsteve 's suggestion so that we clean up the resources per test (or suite)

@GoEddie
Copy link
Contributor Author

GoEddie commented Sep 22, 2019

Thanks @imback82 for reviewing - I think that is everything - let me know if there are any other changes :)

For this one #231 (comment) the suggestion was when I was having trouble setting the path to spark-warehouse - this is reading the users.parquet file from the resources directory so there aren't any resources to clean up here.

@@ -568,7 +571,7 @@ public void TestSignaturesV2_3_X()
Assert.IsType<Column>(JsonTuple(col, "a"));
Assert.IsType<Column>(JsonTuple(col, "a", "b"));

var options = new Dictionary<string, string>() { { "hello", "world" } };
var options = new Dictionary<string, string>() {{"hello", "world"}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why this formatting changed? I think the existing formatting is fine. (Ctrl + K, D gives me the existing formatting). Could you double check the latest commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resharper got a bit excited :)

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@imback82 imback82 merged commit 6b13a1c into dotnet:master Sep 24, 2019
@GoEddie GoEddie deleted the feature/Catalog branch January 15, 2020 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants