Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve and test dataframe API examples in docs #11290

Merged
merged 11 commits into from
Jul 9, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 5, 2024

Which issue does this PR close?

Part of #11172

Rationale for this change

I am trying to make the documentation better and examples easier to find

I would like to consolidate the examples for using DataFrames from examples https://github.com/apache/datafusion/tree/main/datafusion-examples/examples (e.g dataframe.rs and dataframe_in_memory and dataframe_output

However, first I wanted to make sure that the existing DataFrame API docs were in good shape -- and it turns out they needed some attention to get the examples compiling

So to keep the size of this PR small, I started with some improvements to the initial content.

I will actually consolidate some of the dataframe examples in a follow on PR

What changes are included in this PR?

  1. Run examples as part of doctests
  2. Make examples run
  3. Improve dataframe documentation while I was in there

Are these changes tested?

Yes, now these examples run as part of the CI

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 5, 2024

`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface, and is a thin wrapper over LogicalPlan that adds functionality for building and executing those plans.

```rust
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this example to the end as I think it makes more sense once you see the DataFrame in action

);

#[cfg(doctest)]
doc_comment::doctest!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runs the examples as part of cargo test --doc

@@ -29,16 +29,15 @@ using the [`SessionContext::sql`] method. For lower level control such as
preventing DDL, you can use [`SessionContext::sql_with_options`] or the
[`SessionState`] APIs

[`sessioncontext`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html
Copy link
Contributor Author

@alamb alamb Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this down to the bottom of the doc so the link for SessionContext::sql wasn't duplicated (there was a doc warning about this)

@alamb alamb force-pushed the alamb/more_exmaples branch from 7d3c37c to d6540db Compare July 6, 2024 10:21
@alamb alamb added the documentation Improvements or additions to documentation label Jul 6, 2024

#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these examples are now more verbose, they all stand on their own, which I think @efredine suggested would be an improvement

Copy link
Contributor

@efredine efredine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learnt a lot reading and reviewing this ;-). Left some suggestions as seen through the eyes of a new user like me.

You can use `collect` or `execute_stream` to execute the query.
DataFusion [`DataFrame`]s are modeled after the [Pandas DataFrame] interface,
and is implemented as thin wrapper over a [`LogicalPlan`] that adds
functionality for building and executing those plans.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok taking advantage of my relative ignorance to share some things that might be confusing to new users like me.

I think it might be better to start with a short section called Reading a Dataframe? Because that's probably the first thing people will want to do. It should show a simple example of reading a csv and displaying it. We just want to establish that when you read from a file the thing you get back is a dataframe! Then I wonder if the next section might be called Generating a New Dataframe? And I think it might be preferable to have the SQL example second rather than first?

Interestingly, the distinction between a "table" and a "dataframe" is hazy to me. There is also some sort of subtle distinction going on here between a dataframe as a thing that contains some data (which is how I think about it mentally when I read from a csv) and a thing which contains an executable plan that performs a transformation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great feedback -- I looked around and the more basic introduction seems like it is in https://datafusion.apache.org/user-guide/dataframe.html (the "user guide"). I'll add some text that points there and rearrange the content (as well as make a PR to clean up that page)


#[tokio::main]
async fn main() -> Result<()>{
let ctx = SessionContext::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - this comment, really clarified the distinction between a table and a dataframe for me. In essence, the simplest possible dataframe is one that scans a table and that table can be in a file or in memory. I think this might be worth including in the introduction. Maybe worth consistently using scan when reading a file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify and add some additional comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example really drives it home.

And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).

// read the contents of a CSV file into a DataFrame
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
// execute the query and collect the results as a Vec<RecordBatch>
let batches = df.collect().await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the next example, it might be worth iterating the batch here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jul 7, 2024
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @efredine for the insightful feedback - I have tried to address all comments. I would appreciate knowing what you think of the new version if you get a chance

Copy link
Contributor

@efredine efredine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great!

Feel free to tag me on these example changes. I share you view that reviewing and refining documentation and examples is high impact and it's a great way for me to continue learning more.

docs/source/library-user-guide/using-the-dataframe-api.md Outdated Show resolved Hide resolved
docs/source/library-user-guide/using-the-dataframe-api.md Outdated Show resolved Hide resolved

#[tokio::main]
async fn main() -> Result<()>{
let ctx = SessionContext::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example really drives it home.

And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).

@alamb alamb added the documentation Improvements or additions to documentation label Jul 7, 2024
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jul 7, 2024
@alamb
Copy link
Contributor Author

alamb commented Jul 7, 2024

And in the near future we'll be able to turn it back into SQL which probably wouldn't belong here but is cool all the same ;-).

I actually think we can do it now:

// create a logical plan from a SQL string and then programmatically add new filters
let df = ctx
// Use SQL to read some data from the parquet file
.sql(
"SELECT int_col, double_col, CAST(date_string_col as VARCHAR) \
FROM alltypes_plain",
)
.await?
// Add id > 1 and tinyint_col < double_col filter
.filter(
col("id")
.gt(lit(1))
.and(col("tinyint_col").lt(col("double_col"))),
)?;
let sql = plan_to_sql(df.logical_plan())?.to_string();
assert_eq!(
sql,
r#"SELECT alltypes_plain.int_col, alltypes_plain.double_col, CAST(alltypes_plain.date_string_col AS VARCHAR) FROM alltypes_plain WHERE ((alltypes_plain.id > 1) AND (alltypes_plain.tinyint_col < alltypes_plain.double_col))"#
);
Ok(())

Maybe we should add a function to DataFrame to make it easier to find 🤔

@alamb
Copy link
Contributor Author

alamb commented Jul 7, 2024

Feel free to tag me on these example changes. I share you view that reviewing and refining documentation and examples is high impact and it's a great way for me to continue learning more.

Thank you very much @efredine -- I couldn't agree more.

Now I just need to find more time (it takes me much longer to update docs typically than it does to work on code :) )

// ---|-------------
// 1 | 9000
// 2 | 8000
// 3 | 7000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in-memory examples are concise and its easy to get the gist of what's going on. But it also throws people in to the deep end of the Arrow format which lacks a gentle introduction IMO. The Arrow-rs documentation gets immediately into the weeds!

It's likely that many users might never even need to know or access the arrow format directly. They will just read and write to csv or parquet.

I don't think this needs to change, but perhaps what's missing is a section on how and when to use the Arrow format? A gentler introduction to Record Batches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a gentle arrow introduction would be awesome -- here is a ticke tracking such a thing upstream: apache/arrow-rs#4071

I actually think the basic content / structure could be copied from https://jorgecarleitao.github.io/arrow2/main/guide/ with the examples being updated to reflect arrow-rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add a small section in the DataFusion docs about record batches as well - filed #11336 to track that idea

@alamb alamb marked this pull request as ready for review July 8, 2024 16:19
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb

@alamb
Copy link
Contributor Author

alamb commented Jul 9, 2024

Thank you @efredine and @comphead -- I think we'll have more to do to improve the docs, but this is a step forward I think so merging it in. I'll keep working to improve other sections over time

@alamb alamb merged commit 1e0c06e into apache:main Jul 9, 2024
24 checks passed
@alamb alamb deleted the alamb/more_exmaples branch July 9, 2024 21:45
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Improve and test dataframe API examples in docs

* Update introduction with pointer to user guide

* Make example consistent

* Make read_csv comment consistent

* clarifications

* prettier + tweaks

* Update docs/source/library-user-guide/using-the-dataframe-api.md

Co-authored-by: Eric Fredine <[email protected]>

* Update docs/source/library-user-guide/using-the-dataframe-api.md

Co-authored-by: Eric Fredine <[email protected]>

---------

Co-authored-by: Eric Fredine <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* Improve and test dataframe API examples in docs

* Update introduction with pointer to user guide

* Make example consistent

* Make read_csv comment consistent

* clarifications

* prettier + tweaks

* Update docs/source/library-user-guide/using-the-dataframe-api.md

Co-authored-by: Eric Fredine <[email protected]>

* Update docs/source/library-user-guide/using-the-dataframe-api.md

Co-authored-by: Eric Fredine <[email protected]>

---------

Co-authored-by: Eric Fredine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants