Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFusion] - Add show and show_limit function for DataFrame #923

Merged
merged 10 commits into from
Aug 24, 2021

Conversation

francis-du
Copy link
Contributor

@francis-du francis-du commented Aug 22, 2021

Which issue does this PR close?

Closes #937

Rationale for this change

Collect the query results for the user in the show function and print the results to the console.

If the user wants to preview the results, they only needs to call the show function.

What changes are included in this PR?

Add show function implementation for DataFrame.

Are there any user-facing changes?

Users can directly print the query results by calling the show function.

eg:

async fn main() -> Result<()> {
  let mut ctx = ExecutionContext::new();
  let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
  df.show().await?;
  Ok(())
}

+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 22, 2021
/// # Ok(())
/// # }
/// ```
async fn show(&self) -> Result<()>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could have similar behavior to Spark, where the default for show is to just show the first 20 rows.

def show(numRows: Int): Unit = show(numRows, truncate = true)

def show(): Unit = show(20)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFusion df has limit function to limit the number of rows, so I think it is just to add the default 20 rows, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a call to df.limit(20) prior to showing the output would work well

Copy link
Contributor Author

@francis-du francis-du Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be inferred from the logical plan whether there is a limit number. If there is no limit, set it to the default value of 20. If there is, then the default limit value is not set.

This is an implementation:

async fn show(&self) -> Result<()> {
    let mut num = 20;
    if let LogicalPlan::Limit { n, input: _ } = self.to_logical_plan() {
        num = n;
    }
    let results = self.limit(num)?.collect().await?;
    Ok(pretty::print_batches(&results)?)
}

The user can call the limit function to pass in the number of lines, if limit is not used, only the default 20 lines will be printed.

eg:

df.limit(10)?.show().await?; // limit 10

df.show().await?; // default limit 20

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an implementation:

I agree that looks perfect

Copy link
Contributor Author

@francis-du francis-du Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, i pushed it, please review these changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be a pain but this seems confusing to me and maybe it would be better to revert to your original code and document that the user can limit output by using df.limit(20).show(). Alternatively, we could add a show_limit(limit: usize) alternate method where the user can specify how many rows they would like.

The issue I have with this is that it shows 20 rows by default and if the user wants more then they need to add a limit, which is counterintuitive because limit normally reduces the number of rows. Also, this code only works if the final operator is a limit, so it won't work constantly if the limit is wrapped in a sort, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I greet, I will rewrite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defer to @andygrove

@@ -20,7 +20,7 @@
# This file is git pre-commit hook.
#
# Soft link it as git hook under top dir of apache arrow git repository:
# $ ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these pre-commit changes intentional here? This seems unrelated to adding the show function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I found that pre-commit does not work, so I changed it, do I need to submit another PR?

Copy link
Contributor

@alamb alamb Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend a separate PR in future cases -- mostly so we can tag the original author of the pre-commit and have a discussion about what to do there without holding up this PR

This change looks uncontentious, but since I don't use the pre-commit.sh hook I am not sure how to test it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just run ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit and exec git commit

@@ -223,6 +223,21 @@ pub trait DataFrame: Send + Sync {
/// ```
async fn collect(&self) -> Result<Vec<RecordBatch>>;

/// Print results.
///
/// ```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add no_run here and remove the code comment there, so the doc string can still be tested to make sure it always compiles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the code comment of the previous function, do I need to add it here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, ok, never mind then :)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great -- thank you @francis-du !

@francis-du
Copy link
Contributor Author

I think this looks great -- thank you @francis-du !

By the way,should I need to add show function in example code?

@loic-sharma
Copy link
Contributor

The main README may be a good candidate here: https://github.com/apache/arrow-datafusion#example-usage

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @francis-du

@github-actions github-actions bot added ballista documentation Improvements or additions to documentation labels Aug 24, 2021
@francis-du francis-du requested review from andygrove and alamb August 24, 2021 05:57

// execute and print results
let results: Vec<RecordBatch> = df.collect().await?;
print_batches(&results)?;
df.show_limit(100).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb alamb changed the title [DataFusion] - Support show function for DataFrame [DataFusion] - Add show and show_limit function for DataFrame Aug 24, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @francis-du !

I enabled the CI checks to run and once they have passed I think this is good to me.

@francis-du
Copy link
Contributor Author

Thank you @francis-du !

I enabled the CI checks to run and once they have passed I think this is good to me.

Thanks.

@alamb alamb merged commit 5871207 into apache:master Aug 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add "show" support to DataFrame API
5 participants