-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFusion] - Add show and show_limit function for DataFrame #923
Conversation
/// # Ok(()) | ||
/// # } | ||
/// ``` | ||
async fn show(&self) -> Result<()>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we could have similar behavior to Spark, where the default for show
is to just show the first 20 rows.
def show(numRows: Int): Unit = show(numRows, truncate = true)
def show(): Unit = show(20)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFusion df has limit function to limit the number of rows, so I think it is just to add the default 20 rows, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding a call to df.limit(20)
prior to showing the output would work well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be inferred from the logical plan whether there is a limit number. If there is no limit, set it to the default value of 20. If there is, then the default limit value is not set.
This is an implementation:
async fn show(&self) -> Result<()> {
let mut num = 20;
if let LogicalPlan::Limit { n, input: _ } = self.to_logical_plan() {
num = n;
}
let results = self.limit(num)?.collect().await?;
Ok(pretty::print_batches(&results)?)
}
The user can call the limit function to pass in the number of lines, if limit is not used, only the default 20 lines will be printed.
eg:
df.limit(10)?.show().await?; // limit 10
df.show().await?; // default limit 20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an implementation:
I agree that looks perfect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, i pushed it, please review these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be a pain but this seems confusing to me and maybe it would be better to revert to your original code and document that the user can limit output by using df.limit(20).show()
. Alternatively, we could add a show_limit(limit: usize)
alternate method where the user can specify how many rows they would like.
The issue I have with this is that it shows 20 rows by default and if the user wants more then they need to add a limit, which is counterintuitive because limit normally reduces the number of rows. Also, this code only works if the final operator is a limit, so it won't work constantly if the limit is wrapped in a sort, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I greet, I will rewrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I defer to @andygrove
@@ -20,7 +20,7 @@ | |||
# This file is git pre-commit hook. | |||
# | |||
# Soft link it as git hook under top dir of apache arrow git repository: | |||
# $ ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these pre-commit changes intentional here? This seems unrelated to adding the show
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I found that pre-commit does not work, so I changed it, do I need to submit another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend a separate PR in future cases -- mostly so we can tag the original author of the pre-commit and have a discussion about what to do there without holding up this PR
This change looks uncontentious, but since I don't use the pre-commit.sh hook I am not sure how to test it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just run ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit
and exec git commit
@@ -223,6 +223,21 @@ pub trait DataFrame: Send + Sync { | |||
/// ``` | |||
async fn collect(&self) -> Result<Vec<RecordBatch>>; | |||
|
|||
/// Print results. | |||
/// | |||
/// ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add no_run
here and remove the code comment there, so the doc string can still be tested to make sure it always compiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied the code comment of the previous function, do I need to add it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha, ok, never mind then :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great -- thank you @francis-du !
By the way,should I need to add show function in example code? |
The main README may be a good candidate here: https://github.com/apache/arrow-datafusion#example-usage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @francis-du
|
||
// execute and print results | ||
let results: Vec<RecordBatch> = df.collect().await?; | ||
print_batches(&results)?; | ||
df.show_limit(100).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @francis-du !
I enabled the CI checks to run and once they have passed I think this is good to me.
Thanks. |
Which issue does this PR close?
Closes #937
Rationale for this change
Collect the query results for the user in the show function and print the results to the console.
If the user wants to preview the results, they only needs to call the
show
function.What changes are included in this PR?
Add
show
function implementation for DataFrame.Are there any user-facing changes?
Users can directly print the query results by calling the show function.
eg:
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+