
fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema #13750

Merged · 7 commits merged into apache:main on Dec 18, 2024

Conversation

@Blizzara (Contributor) commented on Dec 12, 2024:

Which issue does this PR close?

Ignore empty files in ListingTable. Input datasets can sometimes contain empty files (i.e. 0 bytes), and trying to treat them like normal files fails when e.g. reading parquet metadata.

Closes #13737.

Rationale for this change

Empty files cannot contribute anything to the table other than breakage, so ignoring them is pretty much strictly better. This also aligns with the behavior of e.g. Spark.

What changes are included in this PR?

Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema

One thing I'm not sure about: it seems pruned_partition_list is also used when writing to a table, in `let file_list_stream = pruned_partition_list(`. Is it a problem to ignore empty files there?

Are these changes tested?

Added an empty file to the existing tests for pruned_partition_list, and added new tests for list_partitions. table.rs didn't seem to have any tests for schema inference, so I didn't add any for it.

Also tested in our production (well, testing) system.

Are there any user-facing changes?

Reading input ListingTables containing empty files now succeeds.

@github-actions bot added the `core` (Core DataFusion crate) label on Dec 12, 2024
#[tokio::test]
async fn test_csv_parallel_empty_file(n_partitions: usize) -> Result<()> {
@Blizzara (author) commented:

I assume there was no special reason to test parallelism for empty files? Now that we just skip empty files altogether there is no parallelism; the test does pass with the repartition settings as well, but they seemed irrelevant so I cleaned them away. I can add them back if there's a reason to keep them!

A contributor replied:

I think this is fine.

Another contributor commented:

I did some research and it seems to have been added in #6801 by @2010YOUY01. As long as the code works with empty files (i.e. doesn't throw an error or go into an infinite loop), I think we are good.

Thus I suggest leaving at least one test where we set the repartition file sizes/min file size to 0 and make sure nothing bad happens.

@Blizzara (author) replied:

done! 3f58b02

#[tokio::test]
async fn it_can_read_empty_ndjson_in_parallel(n_partitions: usize) -> Result<()> {
@Blizzara (author) commented:

Same here.

let query = "select * from empty where random() > 0.5;";
let query_result = ctx.sql(query).await?.collect().await?;
let actual_partitions = count_query_csv_partitions(&ctx, query).await?;
@Blizzara (author) commented on Dec 13, 2024:

This checks that the plan has a CsvExec node, which we no longer have (the plan is now a TableScan: empty / EmptyExec).

result
.objects
.into_iter()
.filter(|object_meta| object_meta.size > 0)
@Blizzara (author) commented:

Needed for (hive-style) partitioned reads.

@@ -418,6 +424,7 @@ pub async fn pruned_partition_list<'a>(
table_path
.list_all_files(ctx, store, file_extension)
.await?
.try_filter(|object_meta| futures::future::ready(object_meta.size > 0))
@Blizzara (author) commented:

Needed for non-(hive-style) partitioned reads.

@@ -470,6 +470,8 @@ impl ListingOptions {
let files: Vec<_> = table_path
.list_all_files(state, store.as_ref(), &self.file_extension)
.await?
// Empty files cannot affect schema but may throw when trying to read for it
.try_filter(|object_meta| future::ready(object_meta.size > 0))
@Blizzara (author) commented on Dec 13, 2024:

Needed for all (parquet) reads that don't provide a schema.

@@ -2251,6 +2251,59 @@ mod tests {
scan_format(state, &*format, &testdata, file_name, projection, limit).await
}

/// Test that 0-byte files don't break while reading
#[tokio::test]
async fn test_read_empty_parquet() -> Result<()> {
@Blizzara (author) commented:

These validate the two reading modes for parquet, plus the schema inference change. Reverting any of those three changes will cause one or both of these tests to fail.

@alamb (Contributor) left a review:

Thank you @Blizzara -- I think this PR makes sense to me 🙏

I think it would be good to leave at least one "repartition a bunch of zero-sized files" test in to make sure that doesn't trigger some corner case, but otherwise this looks good to me.


@@ -671,6 +680,106 @@ mod tests {
);
}

fn describe_partition(partition: &Partition) -> (&str, usize, Vec<&str>) {
A reviewer commented:

A comment here explaining what the &str / usize / Vec<&str> mean would be nice for future readers.

@Blizzara (author) replied:

done! 3f58b02

@goldmedal (Contributor) left a review:

Thanks @Blizzara. It looks good to me 👍

@goldmedal goldmedal merged commit 1fc7769 into apache:main Dec 18, 2024
25 checks passed
@goldmedal (Contributor) commented:

Thanks @Blizzara and @alamb for reviewing

@Blizzara Blizzara deleted the avo/listing-table-ignore-empty-files branch December 18, 2024 16:17
Closes: Ignore empty (parquet) files when using ListingTable (#13737)