-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Library Guide: Add Using the DataFrame API #8319
Conversation
Signed-off-by: veeupup <[email protected]>
dfc074d
to
aaf1d1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Veeupup -- this is a great addition. I left a few suggestions, but I also think we could make those changes as a follow on PR.
|
||
You can also serialize `DataFrame` to a file. For now, `Datafusion` supports write `DataFrame` to `csv`, `json` and `parquet`. | ||
|
||
Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the DataFrame API calls collect -- instead I think it uses the streaming APIs
Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file. | |
When writing a file, DataFusion will execute the DataFrame and stream the results to a file. |
|
||
## Transform between LogicalPlan and DataFrame | ||
|
||
As it is showed above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is showed above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them. | |
As shown above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them. |
Coming Soon | ||
## What is a DataFrame | ||
|
||
`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over LogicalPlan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over LogicalPlan. | |
`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface, and is a thin wrapper over LogicalPlan that adds functionality for building and executing those plans. |
} | ||
``` | ||
|
||
For both `DataFrame` and `LogicalPlan`, you can build the query manually, such as: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both `DataFrame` and `LogicalPlan`, you can build the query manually, such as: | |
You can build up `DataFrame`s using its methods, similarly to building `LogicalPlan`s using `LogicalPlanBuilder`: |
```rust | ||
let df = ctx.table("users").await?; | ||
|
||
let new_df = df.select(vec![col("id"), col("bank_account")])? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let new_df = df.select(vec![col("id"), col("bank_account")])? | |
// Create a new DataFrame sorted by `id`, `bank_account` | |
let new_df = df.select(vec![col("id"), col("bank_account")])? |
|
||
You can manually call the `DataFrame` API or automatically generate a `DataFrame` through the SQL query planner just like: | ||
|
||
use `sql` to construct `DataFrame`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use `sql` to construct `DataFrame`: | |
For example, to use `sql` to construct `DataFrame`: |
let dataframe = ctx.sql("SELECT * FROM users;").await?; | ||
``` | ||
|
||
construct `DataFrame` manually |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
construct `DataFrame` manually | |
To construct `DataFrame` using the API: |
|
||
## Collect / Streaming Exec | ||
|
||
When you have a `DataFrame`, you may want to access the results of the internal `LogicalPlan`. You can do this by using `collect` to retrieve all outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you have a `DataFrame`, you may want to access the results of the internal `LogicalPlan`. You can do this by using `collect` to retrieve all outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`. | |
DataFusion `DataFrame`s are "lazy", meaning they do not do any processing until they are executed, which allows for additional optimizations. | |
When you have a `DataFrame`, you can run it in one of three ways: | |
1. `collect` which executes the query and buffers all the output into a `Vec<RecordBatch>` | |
2. `streaming_exec`, which begins executions and returns a `SendableRecordBatchStream` which incrementally computes output on each call to `next()` | |
3. `cache` which executes the query and buffers the output into a new in memory DataFrame. |
let batches = df.collect().await?; | ||
``` | ||
|
||
You can also use stream output to iterate the `RecordBatch` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also use stream output to iterate the `RecordBatch` | |
You can also use stream output to incrementally generate output one `RecordBatch` at a time |
|
||
Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file. | ||
|
||
For example, if you write it to a csv_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if you write it to a csv_file | |
For example, to write a csv_file |
Signed-off-by: veeupup <[email protected]>
Thank you for your detailed comments! Very helpful! @alamb I have changed its content as your comments : ) |
Thanks again @Veeupup |
Signed-off-by: veeupup [email protected]
Which issue does this PR close?
Closes #7305
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?