Investigate downloading partial parquet data for preview #7043
Comments
This would be really helpful, since in a Lakehouse architecture parquet files usually get really large.
@JasonYeMSFT, @MRayermannMSFT, @craxal, just wanted to renew my statement that this feature would be highly appreciated.
I actually use DuckDB and its Azure extension for that. It works like a charm and brings you full SQL semantics on your Parquet files.
@keen85 Thank you for the feedback. We appreciate it! Here's an overview of the problem we face with this feature. Storage Explorer's preview is based on the principle of reading the first ~10 MiB of data from a blob and parsing that data for display. For most file formats, this is perfectly fine. Parquet files are different, though. They are column-based, not row-based. This means that if you want to see even just one complete row, you'd have to read from essentially random locations in the file, which isn't possible when you only have the first ~10 MiB. The only way to allow random reads from the file is to download the entire file. The alternative is to show partial rows (rows where not all the columns are downloaded). However, this likely isn't very useful for large files with many rows. Because the format is columnar, you would end up downloading only one column's worth of data for hundreds of thousands of rows, so the chances of being able to preview the data you want are slim.
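For illustration, a minimal sketch of that prefix-read principle, assuming the @azure/storage-blob JavaScript SDK; the client, helper names, and 10 MiB budget here are placeholders, not Storage Explorer's actual implementation:

```typescript
import { BlobClient } from "@azure/storage-blob";

// Collect a Node readable stream into a single Buffer.
async function streamToBuffer(stream: NodeJS.ReadableStream): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Download only the first `count` bytes of the blob with a single ranged read.
// Fine for row-oriented formats (CSV, NDJSON, ...), but useless on its own
// for parquet, whose metadata lives at the END of the file.
async function readPrefix(blob: BlobClient, count = 10 * 1024 * 1024): Promise<Buffer> {
  const response = await blob.download(0, count);
  return streamToBuffer(response.readableStreamBody!);
}
```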
@craxal, to read a parquet file, you have to:

1. read the last 8 bytes, which hold the 4-byte footer length followed by the "PAR1" magic number;
2. read the footer (the file metadata), which sits immediately before those 8 bytes;
3. use that metadata to get the offset and length of each row group, and fetch only the row groups you need with ranged reads (see the sketch below).
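A minimal sketch of steps 1 and 2, under the same assumptions as the earlier sketch (and reusing its `streamToBuffer` helper):

```typescript
// Read a parquet file's footer with two ranged reads: first the 8-byte tail
// (4-byte little-endian footer length + "PAR1" magic), then the footer
// itself, which sits immediately before that tail.
async function readParquetFooter(blob: BlobClient): Promise<Buffer> {
  const { contentLength } = await blob.getProperties();
  if (contentLength === undefined) {
    throw new Error("blob size unknown");
  }

  const tailResponse = await blob.download(contentLength - 8, 8);
  const tail = await streamToBuffer(tailResponse.readableStreamBody!);
  if (tail.toString("ascii", 4, 8) !== "PAR1") {
    throw new Error("not a parquet file");
  }
  const footerLength = tail.readUInt32LE(0);

  const footerResponse = await blob.download(
    contentLength - 8 - footerLength,
    footerLength
  );
  // Thrift-encoded FileMetaData; decoding it (step 3) needs a parquet/Thrift library.
  return streamToBuffer(footerResponse.readableStreamBody!);
}
```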
FYI: the default Parquet block size is 128 MB (so that, plus the footer size, should be the upper limit to read). I agree with you: you won't visualize the data with a single 10 MB read, but you can do it with several (relatively) small reads (up to 128 MB plus the footer), which is much better than the entire file. See my comment in the ticket where we first discussed this feature: #6990 (comment). One other possible solution: use DuckDB, it does the job perfectly.
@sbarnoud, can you please elaborate on your proposal to use DuckDB? But copying the file URL from Azure Storage Explorer, opening up Azure Data Studio, writing some lines of code, pasting in the file path, and then running DuckDB can only be a workaround, IMHO.
How does DuckDB solve the "don't download the entire file" problem? The issue at hand isn't really a data processing problem. It's a file format problem. And from what I'm seeing, DuckDB just takes data from a file and converts it into an SQLite database. We don't need full-fledged database structures and semantics. For this issue, we're just trying to get enough information from a file to preview it in a tabular format. |
DuckDB is a C++ library running on Windows (and Linux, ...) that you can integrate into any application. It comes with full SQL semantics on Parquet files. With the Azure extension (also a C++ library), you have full support for Blob storage (native ADLS support will come later; however, I have tested it on ADLS Gen2 files, and it already works). So, IMO, it is not so complex to bind Storage Explorer directly to DuckDB. The preview, in that case, could be a free-form SQL command on the file. Actually, I'm using DuckDB as a JDBC driver in DBeaver, and from DBeaver I can run any SQL query I want on ADLS. I use the same directly in some of my APIs.
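To make the suggestion concrete, here is a sketch of such a query from Node, assuming the classic `duckdb` bindings and the Azure extension's connection-string setting; the connection string and blob path are placeholders:

```typescript
import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");
const con = db.connect();

// Load the Azure extension and point it at the storage account.
con.exec("INSTALL azure; LOAD azure;");
con.exec("SET azure_storage_connection_string = '<connection-string>';");

// DuckDB reads the footer, then fetches only the byte ranges this query
// touches; it never downloads the whole blob.
con.all(
  "SELECT * FROM read_parquet('azure://mycontainer/path/to/file.parquet') LIMIT 100;",
  (err, rows) => {
    if (err) throw err;
    console.table(rows);
  }
);
```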
@craxal DuckDB solves it exactly as I already explained. DuckDB knows the structure of Parquet: it reads the last 8 bytes first to get the offset of the footer, then reads the footer, then analyzes the metadata to get the offset and length of any block. As you know, the Storage Account API allows you to read data given an offset and a length. It's as simple as that. Once you read ONLY the expected length from the expected offset, you get all the rows in that block (also called a row group). I'm not telling you to use DuckDB; I'm telling you that DuckDB does NOT download the entire file, it reads just what is needed, exactly like Spark or any SQL engine. Fortunately, you do not need to download an entire file to analyze it; otherwise Parquet wouldn't be so popular (can you imagine Spark downloading entire files on every executor? No, of course not...)
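In Storage SDK terms, once the footer metadata has been decoded (with a parquet/Thrift library, omitted here), each row group is just one more ranged read. A sketch, where `rowGroupOffset` and `rowGroupLength` are hypothetical values taken from the decoded metadata, and `BlobClient` and `streamToBuffer` come from the earlier sketches:

```typescript
// Fetch exactly one row group: read `rowGroupLength` bytes starting at
// `rowGroupOffset`, as reported by the decoded footer metadata.
async function readRowGroup(
  blob: BlobClient,
  rowGroupOffset: number,
  rowGroupLength: number
): Promise<Buffer> {
  const response = await blob.download(rowGroupOffset, rowGroupLength);
  return streamToBuffer(response.readableStreamBody!);
}
```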
Suggested in #6990. We should look into downloading partial parquet data for preview instead of always downloading the entire file.