Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parquet_metadata for datafusion-cli #8413

Merged
merged 4 commits into from
Dec 7, 2023

Conversation

Veeupup
Copy link
Contributor

@Veeupup Veeupup commented Dec 4, 2023

Which issue does this PR close?

Closes #8367

Function behaves just like duckdb.

The above is from duckdb, and the below is from datafusion.

image

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Veeupup
Copy link
Contributor Author

Veeupup commented Dec 4, 2023

@alamb PTAL : )

how do you like the display format?

@alamb
Copy link
Contributor

alamb commented Dec 4, 2023

Thank you @Veeupup -- I will check this out later today!

Copy link
Contributor

@simonvandel simonvandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool feature!

datafusion-cli/src/functions.rs Outdated Show resolved Hide resolved
datafusion-cli/src/functions.rs Show resolved Hide resolved
datafusion-cli/src/functions.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Veeupup and @simonvandel -- this is a great start.

I would like to suggest that we follow DuckDB for parquet_metadata and not reinvent the wheel / formatting here.

That that means one output row per file, per row group
Each row has 23 columns

Here is an example of what duckdbs output looks like:

D select * from read_parquet('./parquet-testing/data/int64_decimal.parquet');
┌───────────────┐
│     value     │
│ decimal(10,2) │
├───────────────┤
│          1.00 │
│          2.00 │
│          3.00 │
│          4.00 │
│          5.00 │
│          6.00 │
│          7.00 │
│          8.00 │
│          9.00 │
│         10.00 │
│         11.00 │
│         12.00 │
│         13.00 │
│         14.00 │
│         15.00 │
│         16.00 │
│         17.00 │
│         18.00 │
│         19.00 │
│         20.00 │
│         21.00 │
│         22.00 │
│         23.00 │
│         24.00 │
├───────────────┤
│    24 rows    │
└───────────────┘
D .mode markdown
D select * from parquet_metadata('./parquet-testing/data/int64_decimal.parquet');
file_name row_group_id row_group_num_rows row_group_num_columns row_group_bytes column_id file_offset num_values path_in_schema type stats_min stats_max stats_null_count stats_distinct_count stats_min_value stats_max_value compression encodings index_page_offset dictionary_page_offset data_page_offset total_compressed_size total_uncompressed_size
./parquet-testing/data/int64_decimal.parquet 0 24 1 241 0 24 value INT64 1.00 24.00 0 UNCOMPRESSED BIT_PACKED, RLE, PLAIN 4 241 241
D

Here is the schema

D describe select * from parquet_metadata('./parquet-testing/data/int64_decimal.parquet');



|       column_name       | column_type | null | key | default | extra |
|-------------------------|-------------|------|-----|---------|-------|
| file_name               | VARCHAR     | YES  |     |         |       |
| row_group_id            | BIGINT      | YES  |     |         |       |
| row_group_num_rows      | BIGINT      | YES  |     |         |       |
| row_group_num_columns   | BIGINT      | YES  |     |         |       |
| row_group_bytes         | BIGINT      | YES  |     |         |       |
| column_id               | BIGINT      | YES  |     |         |       |
| file_offset             | BIGINT      | YES  |     |         |       |
| num_values              | BIGINT      | YES  |     |         |       |
| path_in_schema          | VARCHAR     | YES  |     |         |       |
| type                    | VARCHAR     | YES  |     |         |       |
| stats_min               | VARCHAR     | YES  |     |         |       |
| stats_max               | VARCHAR     | YES  |     |         |       |
| stats_null_count        | BIGINT      | YES  |     |         |       |
| stats_distinct_count    | BIGINT      | YES  |     |         |       |
| stats_min_value         | VARCHAR     | YES  |     |         |       |
| stats_max_value         | VARCHAR     | YES  |     |         |       |
| compression             | VARCHAR     | YES  |     |         |       |
| encodings               | VARCHAR     | YES  |     |         |       |
| index_page_offset       | BIGINT      | YES  |     |         |       |
| dictionary_page_offset  | BIGINT      | YES  |     |         |       |
| data_page_offset        | BIGINT      | YES  |     |         |       |
| total_compressed_size   | BIGINT      | YES  |     |         |       |
| total_uncompressed_size | BIGINT      | YES  |     |         |       |

.iter()
.map(|rg| {
format!(
"num_columns: {}, num_rows: {}, total_byte_size: {}, sorting_columns: {:?}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every row group should have the same number of columns, for what it is worth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that each column in each row_group will be one row in duckdb, so there will be Num(cols) * Num(row_group) rows

image

I'll just mock the duckdb display format then ~

@Veeupup
Copy link
Contributor Author

Veeupup commented Dec 6, 2023

@alamb @simonvandel Hi! I made parquet_metadata behave just like duckdb!

The above is from duckdb, and the below is from datafusion. You can click on the picture to see more clearly!

image

Signed-off-by: veeupup <[email protected]>
@Veeupup Veeupup force-pushed the cli_parquet_metadata branch from dfbd127 to 5516189 Compare December 6, 2023 14:30
@Veeupup Veeupup requested review from alamb and simonvandel December 6, 2023 14:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this out and it is quite neat. Thank you @Veeupup . This and all your previous PRs have been a joy to review.

I think this could be merged as is and we could make improvements as a follow on, as it is new functionality. However, it would also be neat to fix them prior to merge

I left suggestions about what needed to be fixed

Request 1: Can we make string literal arguments work? Something like

❯ select * from parquet_metadata('/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet');
Error during planning: parquet_metadata requires string argument as its input

Request 2 Do not panic when statistics are not set

❯ select * from parquet_metadata("/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet");
thread 'main' panicked at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-49.0.0/src/file/statistics.rs:474:27:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I highlighted the problem I think in comments

Request 3: Add some tests

It would be really nice if you could add some regression tests to make sure we don't break this in the future. For example, could we read and verify the data that is read from the two files (one that has min/max stats set and one that does not?)

Thanks again

datafusion-cli/src/functions.rs Outdated Show resolved Hide resolved
datafusion-cli/src/functions.rs Show resolved Hide resolved
@@ -185,6 +186,8 @@ pub async fn main() -> Result<()> {
ctx.state().catalog_list(),
ctx.state_weak_ref(),
)));
// register `parquet_metadata` table function to get metadata from parquet files
ctx.register_udtf("parquet_metadata", Arc::new(ParquetMetadataFunc {}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

@Veeupup Veeupup force-pushed the cli_parquet_metadata branch from d237d66 to 5f7d8c3 Compare December 7, 2023 14:36
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you @Veeupup -- really nice work

@alamb alamb merged commit 6767ea3 into apache:main Dec 7, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 7, 2023

This feature was so cool I started using it, and I found one suggestion for improvement already: #8464

appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
* Support parquet_metadata for datafusion-cli

Signed-off-by: veeupup <[email protected]>

* make tomlfmt happy

* display like duckdb

Signed-off-by: veeupup <[email protected]>

* add test & fix single quote

---------

Signed-off-by: veeupup <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement parquet_metadata function in datafusion-cli
3 participants