Limit memory used by fault tolerant scheduler on coordinator #10877

Merged · 2 commits into trinodb:master · Feb 9, 2022

Conversation

arhimondr (Contributor)

No description provided.

cla-bot added the cla-signed label on Feb 1, 2022
arhimondr mentioned this pull request on Feb 1, 2022
arhimondr force-pushed the task-descriptor-storage branch from e65d467 to 5a4ee73 on February 1, 2022
long result = retainedSizeInBytes;
if (result == 0) {
result = INSTANCE_SIZE
+ estimatedSizeOf(asMap(splits), PlanNodeId::getRetainedSizeInBytes, splits -> estimatedSizeOf(splits, Split::getRetainedSizeInBytes))
losipiuk (Member) commented on Feb 8, 2022:

It would be nice to add support for estimatedSizeOf with Multimap to airlift. The change from Multimap to ListMultimap does not seem super natural.

arhimondr (Contributor, author) replied:

Multimap is a Guava class. The SizeOf class is in the slice project, which doesn't have a Guava dependency.
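
A minimal sketch of what such a helper could look like if it lived in a module that does have a Guava dependency. The class and method below are hypothetical (they are not part of airlift) and simply delegate to the Map and List overloads of SizeOf.estimatedSizeOf already used in the hunk above:

```java
import com.google.common.collect.ListMultimap;
import com.google.common.collect.Multimaps;
import io.airlift.slice.SizeOf;

import java.util.function.ToLongFunction;

// Hypothetical helper, not part of airlift: estimate the retained size of a
// ListMultimap by viewing it as a Map<K, List<V>> and reusing the existing
// SizeOf.estimatedSizeOf overloads for maps and lists.
final class MultimapSizeOf
{
    private MultimapSizeOf() {}

    static <K, V> long estimatedSizeOf(ListMultimap<K, V> multimap, ToLongFunction<K> keySize, ToLongFunction<V> valueSize)
    {
        return SizeOf.estimatedSizeOf(
                Multimaps.asMap(multimap),
                keySize,
                values -> SizeOf.estimatedSizeOf(values, valueSize));
    }
}
```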

@@ -104,4 +106,18 @@ public NodeMemoryConfig setHeapHeadroom(DataSize heapHeadroom)
this.heapHeadroom = heapHeadroom;
return this;
}

@NotNull
public DataSize getMaxTaskDescriptorStorageMemory()
Member:

NodeMemoryConfig is more about workers, right? Would MemoryManagerConfig be a better place for this one?

arhimondr (Contributor, author):

It looks like MemoryManagerConfig is specific to ClusterMemoryManager. But I agree, NodeMemoryConfig feels more specific to workers. I'm not sure this setting is worth a separate configuration class. What do you think about moving it to QueryManagerConfig? That class already contains many configuration parameters related to query and split scheduling.

Member:

Fine for me

arhimondr (Contributor, author):

Discussed offline. Moved to QueryManagerConfig.

@@ -42,6 +42,8 @@

private DataSize heapHeadroom = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.3));

private DataSize maxTaskDescriptorStorageMemory = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.3));
Member:

Feels like a lot.

arhimondr (Contributor, author):

Discussed offline. Reduced to 0.15 for now.
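
For illustration, a minimal, self-contained sketch of how the setting could look after both review outcomes (moved to QueryManagerConfig, default reduced to 15% of the available heap). The config property name and the trimmed class body are assumptions, not necessarily what was merged:

```java
import io.airlift.configuration.Config;
import io.airlift.units.DataSize;

import javax.validation.constraints.NotNull;

// Trimmed, hypothetical fragment of QueryManagerConfig; only the parts relevant
// to this discussion are shown.
public class QueryManagerConfig
{
    // mirrors the AVAILABLE_HEAP_MEMORY constant referenced in the hunk above
    private static final long AVAILABLE_HEAP_MEMORY = Runtime.getRuntime().maxMemory();

    // default reduced from 30% to 15% of the available heap after review
    private DataSize maxTaskDescriptorStorageMemory = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.15));

    @NotNull
    public DataSize getMaxTaskDescriptorStorageMemory()
    {
        return maxTaskDescriptorStorageMemory;
    }

    // the property name below is a guess for illustration only
    @Config("fault-tolerant-execution-task-descriptor-storage-max-memory")
    public QueryManagerConfig setMaxTaskDescriptorStorageMemory(DataSize maxTaskDescriptorStorageMemory)
    {
        this.maxTaskDescriptorStorageMemory = maxTaskDescriptorStorageMemory;
        return this;
    }
}
```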

return reservedBytes;
}

@NotThreadSafe
Member:

Nit: not important, as this is an internal class.

arhimondr (Contributor, author):

I just thought it wouldn't hurt to declare the intention explicitly, so nobody is tempted to use this object outside the lock or to try to enforce synchronization in the object itself.
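
As a small illustration of that intent (hypothetical names, not the PR's actual classes), the annotation documents that the inner bookkeeping object relies entirely on the enclosing class's lock:

```java
import javax.annotation.concurrent.GuardedBy;
import javax.annotation.concurrent.NotThreadSafe;

// Hypothetical illustration: the outer class owns the lock; the inner class
// declares explicitly that it is not safe to use on its own.
public class StorageHolder
{
    @GuardedBy("this")
    private final Reservations reservations = new Reservations();

    public synchronized void reserve(long delta)
    {
        // every access to the @NotThreadSafe object happens under the outer lock
        reservations.add(delta);
    }

    @NotThreadSafe
    private static class Reservations
    {
        private long reservedBytes;

        void add(long delta)
        {
            reservedBytes += delta;
        }
    }
}
```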

{
reservedBytes += delta;
while (reservedBytes > maxTaskDescriptorStorageMemoryInBytes) {
// drop a query that uses the most storage
Member:

I am not sure killing the biggest query is the best approach. The queries that use the most memory are the biggest, and have probably been executing for a long time already. Killing such a query is costly. I would rather kill the queries that have made the least progress (the newest ones).

Member:

Also, I am not a big fan of entangling the storage and the killing logic. But I must agree it makes the interfaces simpler.

arhimondr (Contributor, author):

> I am not sure killing the biggest query is the best approach. The queries that use the most memory are the biggest, and have probably been executing for a long time already. Killing such a query is costly. I would rather kill the queries that have made the least progress (the newest ones).

I was also thinking about this. On the other hand, killing any query other than the one with the most metadata buffered might be confusing. It is easier to communicate to a user that their query was killed because it scans very large tables, or tables with a lot of small files (which results in high memory utilization during split enumeration), than to communicate that their query was killed because another query running on the cluster scans large tables.

Regardless, it is still just a stopgap approach. If we see that queries get killed because of this limitation and it becomes problematic, we will invest in spilling to prevent failures of this kind altogether.

> Also, I am not a big fan of entangling the storage and the killing logic. But I must agree it makes the interfaces simpler.

Killing is just a temporary solution. The end goal is to have spilling implemented in the storage. Thus I'm not sure it is worth overthinking how queries are killed, since eventually we don't want queries to be killed at all.
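
For illustration, a minimal sketch of the reserve-and-evict behavior under discussion, with hypothetical names and without the actual failure propagation; the merged implementation may differ:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tracks per-query reservations and, when the global limit
// is exceeded, drops the query that currently uses the most storage.
public class TaskDescriptorStorageSketch
{
    private final long maxTaskDescriptorStorageMemoryInBytes;
    private final Map<String, Long> reservationsByQuery = new HashMap<>();
    private long reservedBytes;

    public TaskDescriptorStorageSketch(long maxTaskDescriptorStorageMemoryInBytes)
    {
        this.maxTaskDescriptorStorageMemoryInBytes = maxTaskDescriptorStorageMemoryInBytes;
    }

    public synchronized void reserve(String queryId, long delta)
    {
        reservationsByQuery.merge(queryId, delta, Long::sum);
        reservedBytes += delta;
        while (reservedBytes > maxTaskDescriptorStorageMemoryInBytes) {
            // drop the query that uses the most storage
            String victim = reservationsByQuery.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElseThrow();
            reservedBytes -= reservationsByQuery.remove(victim);
            // the real storage would also fail the victim query with a dedicated
            // error code explaining that the task descriptor limit was exceeded
        }
    }
}
```

The trade-off debated above lives entirely in the victim selection: this sketch picks the query with the largest reservation, which is the easiest policy to explain to users, whereas picking the newest query would waste the least completed work.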

losipiuk (Member) left a comment:

Looks good. Thanks.

arhimondr force-pushed the task-descriptor-storage branch from 5a4ee73 to 6348ebc on February 9, 2022
arhimondr (Contributor, author):

Updated

losipiuk (Member) commented on Feb 9, 2022:

ci / test (:trino-hive) (pull_request) #10772

losipiuk (Member) commented on Feb 9, 2022:

The other one is #10773

losipiuk merged commit 0862aaa into trinodb:master on Feb 9, 2022
github-actions bot added this to the 371 milestone on Feb 10, 2022