Enable bloom filters by default on read #10299

hiltontj · 2024-04-29T14:54:30Z

Is your feature request related to a problem or challenge?

When reading from parquet files, bloom filters are not enabled by default. It is not immediately obvious that they are not being used when performing queries, so there may be users out there who are not aware that bloom filters in their parquet files are being ignored.

Part of the issue, however, is that the default behaviour looks to be shared between read and write operations.

Describe the solution you'd like

It would be ideal if bloom filters were enabled by default on read. We should be careful, however, as I do not think they should be enabled by default on write, where, depending on how they are configured, their inclusion can be expensive.

Describe alternatives you've considered

Currently, the bloom filters can be enabled, but must be done explicitly. For example, with datafusion-cli, which uses the default configuration, one must enable the setting via the environment, e.g.,

DATAFUSION_EXECUTION_PARQUET_BLOOM_FILTER_ENABLED=true datafusion-cli

or by setting it explicity, e.g.,

SET datafusion.execution.parquet.bloom_filter_enabled=true;

This may not work for everyone, however, since it may cause problems by writing with bloom filters enabled.

Additional context

Bloom filters are disabled by default here: https://github.com/apache/datafusion/blob/37.1.0/datafusion/common/src/config.rs#L398-L399

This setting is ultimately used to prune row groups on read here: https://github.com/apache/datafusion/blob/37.1.0/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L531-L545

It looks like this setting is also applied on write here: https://github.com/apache/datafusion/blob/37.1.0/datafusion/common/src/file_options/parquet_writer.rs#L68

There is an existing SLT test that explicitly enables this setting when performing a query here: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/predicates.slt#L509-L547, however, I do not see any tests that are using this setting on write.

The text was updated successfully, but these errors were encountered:

alamb · 2024-04-29T15:18:39Z

Thank you @hiltontj -- this is a great description. I agree it seems like a good idea to enable using bloom filters on read when they are present in the file. The test coverage is pretty good as far as I know these and I believe several users are already using the bloom filters

@hengfeiyang -- as you contributed the initial implementation in #7821, do you know of any reason we shouldn't enable bloom filters by default?

@Ted-Jiang or @progval -- I think you have worked on this feature recently as well -- do you know of any reason we shouldn't change the default?

alamb · 2024-04-29T15:19:55Z

(BTW this is a wonderfully clear ticket in my mind -- thank you @hiltontj )

hengfeiyang · 2024-04-29T15:24:49Z

@alamb As an experiment feature, we disable it as default on first implementation. But i think we can enable this feature by default now.

we already enabled as default in our production.

progval · 2024-04-29T16:39:19Z

I don't see any reason not to enable them. Ditto with datafusion.execution.parquet.pushdown_filters.

lewiszlw · 2024-04-30T06:07:53Z

I created a pr to split parquet bloom filter config and enable bloom filter on read by default #10306.

hiltontj added the enhancement New feature or request label Apr 29, 2024

lewiszlw mentioned this issue Apr 30, 2024

Split parquet bloom filter config and enable bloom filter on read by default #10306

Merged

alamb closed this as completed in #10306 May 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable bloom filters by default on read #10299

Enable bloom filters by default on read #10299

hiltontj commented Apr 29, 2024

alamb commented Apr 29, 2024

alamb commented Apr 29, 2024

hengfeiyang commented Apr 29, 2024 •

edited

Loading

progval commented Apr 29, 2024

lewiszlw commented Apr 30, 2024

Enable bloom filters by default on read #10299

Enable bloom filters by default on read #10299

Comments

hiltontj commented Apr 29, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Apr 29, 2024

alamb commented Apr 29, 2024

hengfeiyang commented Apr 29, 2024 • edited Loading

progval commented Apr 29, 2024

lewiszlw commented Apr 30, 2024

hengfeiyang commented Apr 29, 2024 •

edited

Loading