
Investigate downloading partial parquet data for preview #7043

Open
JasonYeMSFT opened this issue Jul 7, 2023 · 10 comments
Labels: 💡 feature request (New feature or request) · 💡 file previewing (Feature request for file previewing)
Milestone: Future
@JasonYeMSFT (Contributor)

Suggested in #6990. We should look into downloading partial parquet data for preview instead of always downloading the entire file.

@MRayermannMSFT MRayermannMSFT added this to the Future milestone Jul 10, 2023
@MRayermannMSFT MRayermannMSFT added 💡 feature request New feature or request 💡 file previewing Feature request for file previewing labels Jul 10, 2023
@keen85 commented Oct 16, 2023

This would be really helpful, since in a Lakehouse architecture parquet files usually get really large.
Downloading the complete file instead of just a few row groups takes time, creates unnecessary traffic, and incurs egress costs.
@JasonYeMSFT, any chance you will implement this? 😅

@keen85 commented Mar 9, 2024

@JasonYeMSFT, @MRayermannMSFT, @craxal, just wanted to reiterate that this feature would be highly appreciated.
Especially in the context of Microsoft Fabric and Azure Synapse, large parquet files are very common. For previewing the content, it would be perfectly fine to read and display just a small excerpt of the file.

@sbarnoud commented Mar 9, 2024

I actually use DuckDB and its Azure extension for that. It works like a charm and brings you full SQL semantics on your Parquet files.
However, a simple preview would still be an interesting feature.

@craxal (Contributor) commented Mar 11, 2024

@keen85 Thank you for the feedback. We appreciate it!

Here's an overview of the problem we face with this feature. Storage Explorer's preview is based on the principle of reading the first ~10 MiB of data from a blob and parsing that data for display. For most file formats, this is perfectly fine.

Parquet files are different, though. They are column-based, not row-based. This means that if you want to see even just one complete row, you have to read from essentially random locations in the file, which isn't possible when you only have the first ~10 MiB. The only way to allow random reads from the file is to download the entire file.

The alternative is to show partial rows (rows where not all the columns have been downloaded). However, this likely isn't very useful for large files with many rows. Because the format is columnar, you would end up downloading only one column's worth of data for hundreds of thousands of rows, so the chances of being able to preview the data you want are slim.

@sbarnoud commented Mar 11, 2024

@craxal, to read a parquet file, you have to:

  1. First read the footer (the footer length is stored in the last 8 bytes of the file, along with a "magic" marker)
  2. Analyze the footer to get the offset of any block
  3. Choose a block (a row group); the info is in the metadata
  4. Read at most one block (for example, the smallest)

FYI: the default Parquet block size is 128 MB (so 128 MB plus the footer size is the upper limit on what needs to be read).

I agree with you: you won't visualize the data with a single read of 10 MB, but you can do it with several (relatively) small reads (up to 128 MB + footer), which is much better than downloading the entire file (see the sketch below).
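For illustration, steps 1 and 2 map directly onto ranged blob reads. Here is a minimal sketch in Python using the azure-storage-blob SDK; the account URL, container, blob name, and credential are placeholders:

```python
import struct

from azure.storage.blob import BlobClient

# Placeholder account/container/blob; any credential accepted by the SDK works.
blob = BlobClient(
    account_url="https://myaccount.blob.core.windows.net",
    container_name="data",
    blob_name="big.parquet",
    credential="<account-key-or-sas>",
)

size = blob.get_blob_properties().size

# Step 1: a parquet file ends with a 4-byte little-endian footer length
# followed by the 4-byte magic "PAR1".
tail = blob.download_blob(offset=size - 8, length=8).readall()
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1", "not a parquet file"

# Step 2: fetch only the footer (thrift-encoded FileMetaData). It contains
# the offset and size of every row group, so one more ranged read
# (steps 3-4) is enough to pull a single row group for preview.
footer = blob.download_blob(offset=size - 8 - footer_len, length=footer_len).readall()
```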

See my comment in the ticket where we first speak about this feature: #6990 (comment)

One other possible solution: use DuckDB; it does the job perfectly.

@keen85 commented Mar 11, 2024

@sbarnoud can you please elaborate on your proposal of using DuckDB?
Do you propose using DuckDB inside of Azure Storage Explorer to perform the preview? I'd like that!

But copying the file URL from Azure Storage Explorer, opening up Azure Data Studio, writing some lines of code, pasting in the file path, and then running DuckDB would just be a workaround, IMHO.

@craxal (Contributor) commented Mar 11, 2024

How does DuckDB solve the "don't download the entire file" problem? The issue at hand isn't really a data processing problem. It's a file format problem. And from what I'm seeing, DuckDB just takes data from a file and converts it into an SQLite database. We don't need full-fledged database structures and semantics. For this issue, we're just trying to get enough information from a file to preview it in a tabular format.

@sbarnoud commented Mar 11, 2024
DuckDB is a C++ library that runs on Windows (and Linux, ...) and can be integrated into any application. It comes with full SQL semantics on Parquet files. With the Azure extension (also a C++ library), you get full support for Blob storage (native ADLS support will come later; however, I have tested it on ADLS Gen2 files, and it already works).

So, IMO, it is not so complex to bind Storage Explorer directly to DuckDB.

The preview, in that case, could be a free-form SQL command on the file.

I'm actually using DuckDB as a JDBC driver in DBeaver, and from DBeaver I can run any SQL query I want on ADLS. I use the same approach directly in some of my APIs.
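As an illustration of that workflow, here is a minimal sketch using DuckDB's Python API with its azure extension; the connection string and the azure:// path are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL azure; LOAD azure;")
# Placeholder: a standard connection string with an account key or SAS token.
con.execute("SET azure_storage_connection_string = '<connection-string>';")

# DuckDB range-reads the footer first, then only the row groups it needs,
# so a LIMIT query like this never downloads the whole blob.
rows = con.execute(
    "SELECT * FROM read_parquet('azure://data/big.parquet') LIMIT 20"
).fetchall()
print(rows)
```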

@sbarnoud commented Mar 11, 2024

@craxal DuckDB solves it exactly as I explained above. DuckDB knows the structure of Parquet: it first reads the last 8 bytes to get the offset of the footer, then reads the footer, then analyzes the metadata to get the offset and length of any block.

As you know, the Storage Account API allows you to read data given an offset and a length. It's as simple as that.

Then, once you read ONLY the expected length from the expected offset, you get all the rows in that block (also called a row group).

I'm not telling you to use DuckDB. I'm telling you that DuckDB does NOT download the entire file; it reads just what is needed, exactly like Spark or any SQL engine. Fortunately, you don't need to download an entire file to analyze it; otherwise Parquet wouldn't be so popular. (Can you imagine Spark downloading entire files on every executor? No, of course not...)
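That range-read pattern is also available off the shelf. For example, a minimal sketch using pyarrow over adlfs (an fsspec filesystem for Azure; the account name, credential, and path are placeholders): opening the ParquetFile reads only the footer, and read_row_group fetches a single row group.

```python
import adlfs
import pyarrow.parquet as pq

# Placeholder account/credential; adlfs exposes Azure Blob / ADLS Gen2 as an
# fsspec filesystem that supports seek + ranged reads.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", credential="<key-or-sas>")

with fs.open("data/big.parquet", "rb") as f:
    pf = pq.ParquetFile(f)  # reads only the footer metadata
    print(pf.metadata.num_rows, pf.metadata.num_row_groups)
    table = pf.read_row_group(0)  # ranged reads for one row group only

print(table.slice(0, 20))  # preview the first 20 rows
```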
