[DataFusion] - Add show and show_limit function for DataFrame #923

francis-du · 2021-08-22T10:18:16Z

Which issue does this PR close?

Closes #937

Rationale for this change

Collect the query results for the user in the show function and print the results to the console.

If the user wants to preview the results, they only need to call the show function.

What changes are included in this PR?

Add show function implementation for DataFrame.

Are there any user-facing changes?

Users can directly print the query results by calling the show function.

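e.g.:

use datafusion::error::Result;
use datafusion::prelude::*;

// An async main needs a runtime; tokio is the one DataFusion uses.
#[tokio::main]
async fn main() -> Result<()> {
  let mut ctx = ExecutionContext::new();
  let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
  // Collect the results and print them to the console.
  df.show().await?;
  Ok(())
}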

+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
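
To make the output format above concrete, here is a minimal, self-contained sketch of how such a bordered table could be rendered from string cells. The helper `format_table` is hypothetical and exists only for illustration; it is not the function this PR calls (the PR relies on Arrow's pretty printer), and it assumes every row has the same number of cells as the header.

```rust
/// Render a header and rows as an ASCII table in the style shown above.
/// Hypothetical helper for illustration; not DataFusion's or Arrow's printer.
fn format_table(header: &[&str], rows: &[Vec<String>]) -> String {
    // Each column is as wide as its widest cell, header included.
    let mut widths: Vec<usize> = header.iter().map(|h| h.len()).collect();
    for row in rows {
        for (w, cell) in widths.iter_mut().zip(row) {
            *w = (*w).max(cell.len());
        }
    }

    // Border line: "+", then dashes covering the cell plus one space of
    // padding on each side, then "+", repeated per column.
    let separator: String = {
        let mut s = String::from("+");
        for w in &widths {
            s.push_str(&"-".repeat(w + 2));
            s.push('+');
        }
        s
    };

    // Content line: each cell left-aligned and padded to its column width.
    let render = |cells: Vec<&str>| -> String {
        let mut s = String::from("|");
        for (cell, w) in cells.iter().zip(&widths) {
            s.push_str(&format!(" {:<width$} |", cell, width = *w));
        }
        s
    };

    let mut lines = vec![separator.clone(), render(header.to_vec()), separator.clone()];
    for row in rows {
        lines.push(render(row.iter().map(String::as_str).collect()));
    }
    lines.push(separator);
    lines.join("\n")
}
```

Calling `format_table(&["a", "b", "c"], ...)` with the single row `1, 2, 3` yields the same five lines as the table in the PR description.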

andygrove · 2021-08-22T15:01:43Z

datafusion/src/dataframe.rs

+    /// # Ok(())
+    /// # }
+    /// ```
+    async fn show(&self) -> Result<()>;


It would be nice if we could have similar behavior to Spark, where the default for show is to just show the first 20 rows.

def show(numRows: Int): Unit = show(numRows, truncate = true)
def show(): Unit = show(20)

DataFusion's DataFrame already has a limit function to cap the number of rows, so I think we only need to add a default of 20 rows. What do you think?

I think adding a call to df.limit(20) prior to showing the output would work well

I think we can infer from the logical plan whether a limit is already set. If there is no limit, use the default value of 20. If there is one, the default is not applied.

This is an implementation:

async fn show(&self) -> Result<()> {
    let mut num = 20;
    if let LogicalPlan::Limit { n, input: _ } = self.to_logical_plan() {
        num = n;
    }
    let results = self.limit(num)?.collect().await?;
    Ok(pretty::print_batches(&results)?)
}

The user can call the limit function to pass in the number of rows. If limit is not used, only the default 20 rows will be printed.

e.g.:

df.limit(10)?.show().await?; // limit 10
df.show().await?; // default limit 20


I agree that looks perfect

OK, I pushed it. Please review these changes.

Sorry to be a pain but this seems confusing to me and maybe it would be better to revert to your original code and document that the user can limit output by using df.limit(20).show(). Alternatively, we could add a show_limit(limit: usize) alternate method where the user can specify how many rows they would like.

The issue I have with this is that it shows 20 rows by default and if the user wants more then they need to add a limit, which is counterintuitive because limit normally reduces the number of rows. Also, this code only works if the final operator is a limit, so it won't work consistently if the limit is wrapped in a sort, for example.
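
The second objection can be illustrated with a small, self-contained sketch. The `Plan` enum below is a toy stand-in invented for this example, not DataFusion's `LogicalPlan`, and `rows_shown` is a hypothetical helper that mirrors the proposed check: it looks only at the root of the plan, so a limit nested under a sort goes unnoticed and the default of 20 is used instead.

```rust
/// Toy stand-in for a logical plan; NOT DataFusion's `LogicalPlan`.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    Scan,
    Limit { n: usize, input: Box<Plan> },
    Sort { input: Box<Plan> },
}

/// Mirrors the proposed check: it only recognizes a limit at the plan root.
fn top_level_limit(plan: &Plan) -> Option<usize> {
    if let Plan::Limit { n, .. } = plan {
        Some(*n)
    } else {
        None
    }
}

/// Row count `show` would use under the proposal: the user's limit when it
/// is the final operator, otherwise the default of 20.
fn rows_shown(plan: &Plan) -> usize {
    top_level_limit(plan).unwrap_or(20)
}
```

With these definitions, a plan whose root is `Limit { n: 50, .. }` shows 50 rows, while the same limit wrapped in a `Sort` falls back to 20, even though the user asked for 50 in both cases.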

I agree, I will rewrite it.

I defer to @andygrove

andygrove · 2021-08-22T15:02:40Z

pre-commit.sh

@@ -20,7 +20,7 @@
 # This file is git pre-commit hook.
 #
 # Soft link it as git hook under top dir of apache arrow git repository:
-# $ ln -s  ../../rust/pre-commit.sh .git/hooks/pre-commit


Are these pre-commit changes intentional here? This seems unrelated to adding the show function.

Yes, I found that the pre-commit hook does not work, so I changed it. Do I need to submit another PR?

I would recommend a separate PR in future cases -- mostly so we can tag the original author of the pre-commit and have a discussion about what to do there without holding up this PR

This change looks uncontentious, but since I don't use the pre-commit.sh hook I am not sure how to test it

Just run ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit and exec git commit

houqp · 2021-08-22T16:22:21Z

datafusion/src/dataframe.rs

@@ -223,6 +223,21 @@ pub trait DataFrame: Send + Sync {
    /// ```
    async fn collect(&self) -> Result<Vec<RecordBatch>>;

+    /// Print results.
+    ///
+    /// ```


We could add no_run here and remove the code comment there, so the doc string can still be tested to make sure it always compiles.

I copied the code comment of the previous function, do I need to add it here?

ha, ok, never mind then :)

alamb

I think this looks great -- thank you @francis-du !

francis-du · 2021-08-23T13:03:48Z

I think this looks great -- thank you @francis-du !

By the way, should I add the show function to the example code?

loic-sharma · 2021-08-23T17:52:49Z

The main README may be a good candidate here: https://github.com/apache/arrow-datafusion#example-usage

andygrove

LGTM. Thanks @francis-du

alamb · 2021-08-24T11:52:06Z

README.md


  // execute and print results
-  let results: Vec<RecordBatch> = df.collect().await?;
-  print_batches(&results)?;
+  df.show_limit(100).await?;
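
The README change relies on the split the review converged on. As a rough sketch of those semantics, assuming `show` prints every row (as in the original code) and `show_limit(n)` applies a limit before printing, the toy type below models the two methods. `Frame` is a hypothetical in-memory stand-in, not DataFusion's `DataFrame`, and it returns the rendered text instead of printing so the behavior is easy to check.

```rust
/// Toy in-memory frame for illustration; NOT DataFusion's `DataFrame`.
struct Frame {
    rows: Vec<i64>,
}

impl Frame {
    /// Keep at most `n` rows, like a LIMIT operator.
    fn limit(&self, n: usize) -> Frame {
        Frame {
            rows: self.rows.iter().take(n).copied().collect(),
        }
    }

    /// Assumed `show` semantics: render every row, one per line.
    fn show(&self) -> String {
        self.rows
            .iter()
            .map(|r| r.to_string())
            .collect::<Vec<_>>()
            .join("\n")
    }

    /// Assumed `show_limit` semantics: apply the limit, then render.
    fn show_limit(&self, n: usize) -> String {
        self.limit(n).show()
    }
}
```

Under these assumptions, `show_limit(n)` is shorthand for `limit(n)` followed by `show()`, and the caller always states the row cap explicitly instead of relying on a hidden default.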


alamb

Thank you @francis-du !

I enabled the CI checks to run and once they have passed I think this is good to me.

francis-du · 2021-08-24T12:07:27Z

Thank you @francis-du !

I enabled the CI checks to run and once they have passed I think this is good to me.

Thanks.

github-actions bot added the datafusion label (Changes in the datafusion crate) Aug 22, 2021

francis-du force-pushed the df_show branch from 3b8f861 to b20ed43 August 22, 2021 10:46

feat: support show function for DataFrame

47d3844

francis-du force-pushed the df_show branch from b20ed43 to 47d3844 August 22, 2021 10:49

francis-du added 3 commits August 22, 2021 18:54

fix: fix docs comments

7549269

fix: fix typo

8706c4e

fix: fix pre-commit

00b1a9c

andygrove reviewed Aug 22, 2021


houqp reviewed Aug 22, 2021


fix: fix code format

7aebc4f

alamb approved these changes Aug 23, 2021


francis-du added 2 commits August 23, 2021 20:15

fix: improve show function implementation

41d0691

fix: change match pattern to 'if let' single pattern

8eae085

fix: Rewrite show function impl and add a new show_limit function

d76baed

andygrove approved these changes Aug 24, 2021


fix: Add the show function to the sample code

ffcb3a6

github-actions bot added the ballista and documentation (Improvements or additions to documentation) labels Aug 24, 2021

fix: fix cargo test error

52905ed

francis-du requested review from andygrove and alamb August 24, 2021 05:57

alamb reviewed Aug 24, 2021


alamb changed the title from "[DataFusion] - Support show function for DataFrame" to "[DataFusion] - Add show and show_limit function for DataFrame" Aug 24, 2021

alamb approved these changes Aug 24, 2021


alamb merged commit 5871207 into apache:master Aug 24, 2021
