
Allow scans with the number of partitions exceeding the limit #14225

Merged · 3 commits · Oct 7, 2022

Conversation

arhimondr
Contributor

Description

Treat such scans as scans over non-partitioned tables, without applying partition pruning. This is needed to avoid storing too much information in HiveTableHandle.

Non-technical explanation

Allow a single table scan to read more partitions than the configured limit

Release notes

( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:


@arhimondr
Contributor Author

This is an attempt to remove the limitation on the number of partitions per scan. The change doesn't remove the setting that defines the limit; instead, when the number of partitions is higher than the limit, the system avoids loading partitions eagerly. Certain optimizations (such as partition pruning) are not possible when partitions cannot be loaded in memory.

It also changes how HiveSplitSource works to minimize the memory footprint when the number of partitions is high.
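
To illustrate the approach, here is a toy sketch (not the PR's code; Metastore, Partition, and lazyPartitions are hypothetical stand-ins): only the partition names stay resident, and full partition objects are materialized one at a time as splits are generated.

import com.google.common.collect.Iterators;

import java.util.Iterator;
import java.util.List;

class LazyPartitionEnumeration
{
    record Partition(String name) {}

    interface Metastore
    {
        Partition getPartition(String name);
    }

    // Only the partition names (small strings) are held in memory; each Partition object,
    // which in the real connector carries a map of partition keys, is created lazily.
    static Iterator<Partition> lazyPartitions(Metastore metastore, List<String> partitionNames)
    {
        return Iterators.transform(partitionNames.iterator(), metastore::getPartition);
    }

    public static void main(String[] args)
    {
        Metastore metastore = Partition::new;
        lazyPartitions(metastore, List.of("ds=2022-01-01", "ds=2022-01-02"))
                .forEachRemaining(partition -> System.out.println(partition.name()));
    }
}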

@electrum @findepi @Praveen2112 Could you please take a look and let me know what you think?

@raunaqmorarka
Member

hive.max-partitions-per-scan could be used by admins to block queries that scan a huge number of partitions. If we're going to allow running such queries now without partition pruning, that totally changes the meaning of that config, and it might not be the behaviour an admin wants.
Could we define a separate config for the threshold on the number of partitions after which we avoid loading partitions eagerly, and use that to cap memory usage? If someone wants to allow queries over a large number of partitions, they can raise hive.max-partitions-per-scan in their config.

@arhimondr
Contributor Author

The hive.max-partitions-per-scan setting was added in 2016 by @electrum to avoid memory pressure on the coordinator. This change resolves that memory pressure by not enumerating partitions eagerly in one shot. Since the underlying problem is resolved, I wonder whether a configured limit is necessary at all. I agree that the property name may need to change, and I'm open to suggestions.

@hashhar
Member

hashhar commented Sep 21, 2022

One practical way this config is used is to prevent people from running queries that scan too many partitions (to control costs and to force users to add predicates on partition columns so queries are efficient).

For cost there's now a better option, query.max-scan-physical-bytes (and the query_max_scan_physical_bytes session property), which was added in 339/341.

I don't think anyone would still actually want to enforce limits based on the number of partitions, since that's fairly arbitrary, but I agree with @raunaqmorarka that we should not re-purpose existing configs.

Ideally we can mark the existing config as @DefunctConfig so that a proper error is thrown for people who have it set, and we can add a new config for the new behaviour to act as a kill switch for some time while the new implementation is proven out.
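
For reference, a minimal sketch of what marking the property defunct could look like (assuming the usual airlift configuration setup; this is not a proposed patch, and the class shown is illustrative):

import io.airlift.configuration.DefunctConfig;

// With the property listed as defunct, airlift's configuration framework rejects a catalog
// that still sets it, instead of silently ignoring the value.
@DefunctConfig("hive.max-partitions-per-scan")
public class ExampleHiveConfig
{
    // remaining configuration properties ...
}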

@findepi
Member

findepi commented Sep 21, 2022

I don't think anyone would still actually want to enforce limits based on the number of partitions, since that's fairly arbitrary

Looking at configs like hive.query-partition-filter-required, hive.max-partitions-per-scan is an even more reasonable safety net preventing queries from accessing "too much".

cc @JamesRTaylor

we should not re-purpose existing configs.

Not like this, yes

Ideally we can mark the existing config as @DefunctConfig so that a proper error is thrown for people who have it set

Do we have any migration path? Do we need any?

@raunaqmorarka
Member

For cost there's now a better option, query.max-scan-physical-bytes

query.max-scan-physical-bytes is useful, but the problem with it is that the limit is enforced only after the specified amount of data has already been read. This could waste a lot of resources before killing the query. hive.max-partitions-per-scan has the advantage of stopping the query before we spend worker resources to scan the data.

The hive.max-partitions-per-scan setting was added in 2016 to avoid memory pressure on the coordinator

I don't think users take into account the intent of the original author when using a config. In this case there is no documentation or description available to see that this had anything to do with memory. So anyone already using it can't possibly know that this was not meant to be used as a way to block queries touching a large number of partitions.

I think we should look into deprecating and removing hive.max-partitions-per-scan as a separate issue if we're convinced that it's not useful anymore. Maybe start a discussion in Slack to see if there are people relying on its current behaviour.
We can have a new config for switching to lazy loading of partitions beyond some threshold, and assume that those who want to query a large number of partitions will raise hive.max-partitions-per-scan to a larger value to get past the errors it throws.

I'm also wondering whether this trade-off of reducing coordinator memory usage but giving up on partition pruning makes sense. If someone wanted to run queries on a large number of partitions and coordinator memory was the limiting factor, why wouldn't they get a bigger coordinator instead of incurring the higher cost of running the query without partition pruning? It seems cheaper to upgrade one node rather than consume a lot more resources on the workers.

Is it possible that we can still prune splits on the workers using the predicate on partition columns? E.g. for dynamic partition pruning we have HivePageSourceProvider#shouldSkipSplit; could we fall back to similar split pruning for static predicates as well when partition pruning on the coordinator was skipped, to keep memory usage low?
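
A rough sketch of that idea (this is not Trino's shouldSkipSplit; the String-keyed TupleDomain and the helper below are simplifications for illustration):

import io.trino.spi.predicate.Domain;
import io.trino.spi.predicate.NullableValue;
import io.trino.spi.predicate.TupleDomain;

import java.util.Map;

final class StaticPartitionPruning
{
    private StaticPartitionPruning() {}

    // Returns true when the split's partition key values cannot satisfy the static predicate
    // on partition columns, so the worker could drop the split without reading any data.
    static boolean shouldSkipSplit(Map<String, NullableValue> partitionKeys, TupleDomain<String> partitionPredicate)
    {
        return partitionPredicate.getDomains()
                .map(domains -> domains.entrySet().stream()
                        .anyMatch(entry -> {
                            NullableValue value = partitionKeys.get(entry.getKey());
                            return value != null && !entry.getValue().includesNullableValue(value.getValue());
                        }))
                // TupleDomain.none() means the predicate is unsatisfiable: skip every split
                .orElse(true);
    }
}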

@arhimondr force-pushed the remove-partition-limit branch from 28bcba0 to 3148f77 on September 23, 2022 at 18:59
@arhimondr
Contributor Author

Thanks everybody for the feedback.

I updated the PR, preserving hive.max-partitions-per-scan. That property can still be used to enforce a limit on the maximum number of partitions per scan.

Instead, I introduced a new property, hive.max-partitions-for-eager-load. It controls how many partitions can be loaded eagerly on the coordinator and is set to 100,000 by default.

This PR also makes it possible to scan tables with more partitions than the value set by hive.max-partitions-for-eager-load, by refactoring HiveSplitManager to avoid loading partitions eagerly. However, when the number of partitions exceeds that threshold, some optimizations (such as filter pushdown) are not performed. While this is not ideal (and we should probably improve it), the PR is an incremental improvement that allows the engine to be used on tables with a high number of partitions, although with potentially reduced efficiency.
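
For reference, a sketch of the shape the two properties take (assuming a typical airlift-style config class; the descriptions are illustrative, and the 1,000,000 per-scan default reflects the change in this PR):

import io.airlift.configuration.Config;
import io.airlift.configuration.ConfigDescription;

public class PartitionLimitsConfig
{
    private int maxPartitionsPerScan = 1_000_000;
    private int maxPartitionsForEagerLoad = 100_000;

    public int getMaxPartitionsPerScan()
    {
        return maxPartitionsPerScan;
    }

    // Hard limit: a scan over more partitions than this fails
    @Config("hive.max-partitions-per-scan")
    @ConfigDescription("Maximum number of partitions a single table scan may read")
    public PartitionLimitsConfig setMaxPartitionsPerScan(int maxPartitionsPerScan)
    {
        this.maxPartitionsPerScan = maxPartitionsPerScan;
        return this;
    }

    public int getMaxPartitionsForEagerLoad()
    {
        return maxPartitionsForEagerLoad;
    }

    // Above this threshold partitions are enumerated lazily and coordinator-side
    // partition pruning is skipped
    @Config("hive.max-partitions-for-eager-load")
    @ConfigDescription("Maximum number of partitions loaded eagerly on the coordinator")
    public PartitionLimitsConfig setMaxPartitionsForEagerLoad(int maxPartitionsForEagerLoad)
    {
        this.maxPartitionsForEagerLoad = maxPartitionsForEagerLoad;
        return this;
    }
}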

Please take another look.

@@ -2722,7 +2722,9 @@ public OptionalLong executeDelete(ConnectorSession session, ConnectorTableHandle
         metastore.truncateUnpartitionedTable(session, handle.getSchemaName(), handle.getTableName());
     }
     else {
-        for (HivePartition hivePartition : partitionManager.getOrLoadPartitions(metastore, handle)) {
+        Iterator<HivePartition> partitions = partitionManager.getPartitions(metastore, handle);
Member

Do we skip the partition check during DELETE?

Contributor Author

Yeah, I don't think there's a real reason to limit the number of partitions for DELETE. Also, the property name says "scan".

@@ -221,9 +227,6 @@ public ConnectorSplitSource getSplits(
         // validate bucket bucketed execution
         Optional<HiveBucketHandle> bucketHandle = hiveTable.getBucketHandle();

-        // sort partitions
Member

Don't we have to sort them?

Contributor Author

It looks like it is not necessary. It could be that the partitions were sorted for debugging convenience. @electrum / @dain, do you remember why it might be necessary?

addLoaderIfNecessary();
}

private void addLoaderIfNecessary()
Member

nit: Can we extract this as a preparatory commit?

Contributor Author

What specifically? The commit is "Avoid loading partitions eagerly in HiveSplitManager", and the changes to BackgroundHiveSplitLoader are necessary to achieve that.

@@ -70,7 +70,7 @@
     private boolean singleStatementWritesOnly;

     private DataSize maxSplitSize = DataSize.of(64, MEGABYTE);
-    private int maxPartitionsPerScan = 100_000;
+    private int maxPartitionsPerScan = 1_000_000;
Member

Won't it impact memory? IIRC this check was added to restrict the memory usage of HivePartition, which internally has a Map for the partition keys.

cc: @findepi / @electrum

Contributor Author

The "Avoid loading partitions eagerly in HiveSplitManager" commit improves split enumeration to avoid parsing/loading partitions eagerly, so memory should no longer be a concern. The property is now mostly for system administrators to prevent unnecessarily large scans.

ReentrantLock lock = new ReentrantLock();
lock.lock();
try {
executor.execute(() -> checkState(!lock.isHeldByCurrentThread(), "executor is a direct executor"));
Member

Do we actually need the lock? This seems to depend on the fact that the exception is thrown in the calling thread, so we could simply do

executor.execute(() -> {
    throw new IllegalArgumentException("executor is a direct executor");
});

Contributor Author

I thought about that. I'm worried that if an executor is configured to handle uncaught exceptions (for example, by logging them), it will create unnecessary noise in the output.
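
A self-contained illustration of why the lock-based check sidesteps that concern (assumed test shape, not the PR's code): with a real executor the checkState passes quietly on the pool thread, so no exception ever reaches an uncaught-exception handler, while a direct executor runs the task on the calling thread, which still holds the lock, and fails synchronously inside execute().

import com.google.common.util.concurrent.MoreExecutors;

import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

import static com.google.common.base.Preconditions.checkState;

class DirectExecutorCheck
{
    static void assertNotDirectExecutor(Executor executor)
    {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            executor.execute(() -> checkState(!lock.isHeldByCurrentThread(), "executor is a direct executor"));
        }
        finally {
            lock.unlock();
        }
    }

    public static void main(String[] args)
    {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        assertNotDirectExecutor(pool);                            // passes: task runs on the pool thread
        pool.shutdown();

        assertNotDirectExecutor(MoreExecutors.directExecutor());  // throws IllegalStateException
    }
}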

Contributor Author

[screenshot attached]

@arhimondr force-pushed the remove-partition-limit branch from 3148f77 to 1161609 on October 6, 2022 at 10:31
@arhimondr
Contributor Author

Updated

@Chaho12
Member

Chaho12 commented Apr 27, 2023

Thanks for a very interesting feature. We were hoping for exactly this kind of feature to limit the maximum number of partitions.

Before this, we were using ByteBuddy to hook the HivePartitionManager class to prevent users from running abusive queries and slowing down the cluster, by limiting the maximum number of partitions they can query in a single query.

We simply check and return an error if the number of partitions read is greater than the configured value, for specific tables only (tables that we are sure have a huge amount of data per partition).
