
Investigate downloading partial parquet data for preview #7043

Open
JasonYeMSFT opened this issue Jul 7, 2023 · 10 comments
Labels: 💡 feature request (New feature or request) · 💡 file previewing (Feature request for file previewing)
Milestone: Future
@JasonYeMSFT (Contributor)

Suggested in #6990. We should look into downloading partial parquet data for preview instead of always downloading the entire file.

@MRayermannMSFT MRayermannMSFT added this to the Future milestone Jul 10, 2023
@MRayermannMSFT MRayermannMSFT added 💡 feature request New feature or request 💡 file previewing Feature request for file previewing labels Jul 10, 2023
@keen85 commented Oct 16, 2023

This would be really helpful, since in a Lakehouse architecture parquet files usually get really large.
Downloading the complete file instead of just a few row groups takes time, creates unnecessary traffic, and incurs egress costs.
@JasonYeMSFT, any chance you will implement this? 😅

@keen85 commented Mar 9, 2024

@JasonYeMSFT, @MRayermannMSFT, @craxal, just wanted to reiterate that this feature would be highly appreciated.
Especially in the context of Microsoft Fabric and Azure Synapse, large parquet files are very common. For previewing the content, it would be perfectly fine to read and display just a small excerpt of the file.

@sbarnoud commented Mar 9, 2024

I actually use DuckDB and its Azure extension for that. It works like a charm and brings you full SQL semantics on your Parquet files.
However, a simple preview would still be an interesting feature.

@craxal (Contributor) commented Mar 11, 2024

@keen85 Thank you for the feedback. We appreciate it!

Here's an overview of the problem we face with this feature. Storage Explorer's preview is based on the principle of reading the first ~10 MiB of data from a blob and parsing that data for display. For most file formats, this is perfectly fine.

Parquet files are different, though. They are column-based, not row-based. This means that if you want to see even just one complete row, you have to read from essentially random locations in the file, which isn't possible when you only have the first ~10 MiB. The only way to allow random reads from the file is to download the entire file.

The alternative is to show partial rows (rows where not all the columns have been downloaded). However, this likely isn't very useful for large files with many rows. Because the format is columnar, you would end up downloading only one column's worth of data for hundreds of thousands of rows, so the chances of being able to preview the data you want are slim.

@sbarnoud commented Mar 11, 2024

@craxal, to read a parquet file, you have to:

  1. First read the footer (the footer length is stored in the last 8 bytes of the file, along with a "magic" marker)
  2. Analyze the footer to get the offset of any block
  3. Choose a block (a row group); the info is in the metadata
  4. Read at most one block (for example, the smallest)

FYI: the default Parquet block size is 128 MB (so 128 MB plus the footer size is the upper limit on what needs to be read).

I agree with you: you won't visualize the data with a single read of 10 MB, but you can do it with several (relatively) small reads (up to 128 MB + footer), which is much better than downloading the entire file (see the sketch below).
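For illustration, steps 1 and 2 map directly onto ranged blob reads. Here is a minimal sketch in Python using the azure-storage-blob SDK; the account URL, container, blob name, and credential are placeholders:

```python
import struct

from azure.storage.blob import BlobClient

# Placeholder account/container/blob; any credential accepted by the SDK works.
blob = BlobClient(
    account_url="https://myaccount.blob.core.windows.net",
    container_name="data",
    blob_name="big.parquet",
    credential="<account-key-or-sas>",
)

size = blob.get_blob_properties().size

# Step 1: a parquet file ends with a 4-byte little-endian footer length
# followed by the 4-byte magic "PAR1".
tail = blob.download_blob(offset=size - 8, length=8).readall()
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1", "not a parquet file"

# Step 2: fetch only the footer (thrift-encoded FileMetaData). It contains
# the offset and size of every row group, so one more ranged read
# (steps 3-4) is enough to pull a single row group for preview.
footer = blob.download_blob(offset=size - 8 - footer_len, length=footer_len).readall()
```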

See my comment in the ticket where we first speak about this feature: #6990 (comment)

One other possible solution: use DuckDB; it does the job perfectly.

@keen85 commented Mar 11, 2024

@sbarnoud can you please elaborate on your proposal of using DuckDB?
Do you propose using DuckDB inside of Azure Storage Explorer to perform the preview? I'd like that!

But copying the file URL from Azure Storage Explorer, opening up Azure Data Studio, writing some lines of code, pasting in the file path, and then running DuckDB would just be a workaround, IMHO.

@craxal (Contributor) commented Mar 11, 2024

How does DuckDB solve the "don't download the entire file" problem? The issue at hand isn't really a data processing problem. It's a file format problem. And from what I'm seeing, DuckDB just takes data from a file and converts it into an SQLite database. We don't need full-fledged database structures and semantics. For this issue, we're just trying to get enough information from a file to preview it in a tabular format.

@sbarnoud commented Mar 11, 2024
DuckDB is a C++ library that runs on Windows (and Linux, ...) and can be integrated into any application. It comes with full SQL semantics on Parquet files. With the Azure extension (also a C++ library), you get full support for Blob storage (native ADLS support will come later; however, I have tested it on ADLS Gen2 files, and it already works).

So, IMO, it is not so complex to bind Storage Explorer directly to DuckDB.

The preview, in that case, could be a free-form SQL command on the file.

I'm actually using DuckDB as a JDBC driver in DBeaver, and from DBeaver I can run any SQL query I want on ADLS. I use the same approach directly in some of my APIs.
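As an illustration of that workflow, here is a minimal sketch using DuckDB's Python API with its azure extension; the connection string and the azure:// path are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL azure; LOAD azure;")
# Placeholder: a standard connection string with an account key or SAS token.
con.execute("SET azure_storage_connection_string = '<connection-string>';")

# DuckDB range-reads the footer first, then only the row groups it needs,
# so a LIMIT query like this never downloads the whole blob.
rows = con.execute(
    "SELECT * FROM read_parquet('azure://data/big.parquet') LIMIT 20"
).fetchall()
print(rows)
```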

@sbarnoud commented Mar 11, 2024

@craxal DuckDB solves it exactly as I explained above. DuckDB knows the structure of Parquet: it first reads the last 8 bytes to get the offset of the footer, then reads the footer, then analyzes the metadata to get the offset and length of any block.

As you know, the Storage Account API allows you to read data given an offset and a length. It's as simple as that.

Then, once you read ONLY the expected length from the expected offset, you get all the rows in that block (also called a row group).

I'm not telling you to use DuckDB. I'm telling you that DuckDB does NOT download the entire file; it reads just what is needed, exactly like Spark or any SQL engine. Fortunately, you don't need to download an entire file to analyze it; otherwise Parquet wouldn't be so popular. (Can you imagine Spark downloading entire files on every executor? No, of course not...)
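That range-read pattern is also available off the shelf. For example, a minimal sketch using pyarrow over adlfs (an fsspec filesystem for Azure; the account name, credential, and path are placeholders): opening the ParquetFile reads only the footer, and read_row_group fetches a single row group.

```python
import adlfs
import pyarrow.parquet as pq

# Placeholder account/credential; adlfs exposes Azure Blob / ADLS Gen2 as an
# fsspec filesystem that supports seek + ranged reads.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", credential="<key-or-sas>")

with fs.open("data/big.parquet", "rb") as f:
    pf = pq.ParquetFile(f)  # reads only the footer metadata
    print(pf.metadata.num_rows, pf.metadata.num_row_groups)
    table = pf.read_row_group(0)  # ranged reads for one row group only

print(table.slice(0, 20))  # preview the first 20 rows
```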
