Limit memory used by fault tolerant scheduler on coordinator #10877
Conversation
Force-pushed from e65d467 to 5a4ee73
long result = retainedSizeInBytes;
if (result == 0) {
    result = INSTANCE_SIZE
            + estimatedSizeOf(asMap(splits), PlanNodeId::getRetainedSizeInBytes, splits -> estimatedSizeOf(splits, Split::getRetainedSizeInBytes))
It would be nice to add support for estimatedSizeOf with Multimap to airlift. The change from Multimap to ListMultimap does not seem super natural.
Multimap is a Guava class. The SizeOf class is in the slice project, which doesn't have a Guava dependency.
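
For illustration, a hedged sketch of what a Multimap-aware estimatedSizeOf overload could look like. Since SizeOf lives in the slice project, which has no Guava dependency, a helper like this would have to live in a module that sees both Guava and airlift (for example, inside Trino itself). The class and method names are hypothetical, and per-entry object overhead is omitted for brevity:

import com.google.common.collect.Multimap;

import java.util.Collection;
import java.util.Map;
import java.util.function.ToLongFunction;

public final class MoreSizeOf
{
    private MoreSizeOf() {}

    // Hypothetical helper: walks the Map<K, Collection<V>> view of the multimap
    // and sums the estimated sizes of keys and values.
    public static <K, V> long estimatedSizeOf(Multimap<K, V> multimap, ToLongFunction<K> keySize, ToLongFunction<V> valueSize)
    {
        long result = 0;
        for (Map.Entry<K, Collection<V>> entry : multimap.asMap().entrySet()) {
            result += keySize.applyAsLong(entry.getKey());
            for (V value : entry.getValue()) {
                result += valueSize.applyAsLong(value);
            }
        }
        return result;
    }
}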
@@ -104,4 +106,18 @@ public NodeMemoryConfig setHeapHeadroom(DataSize heapHeadroom)
        this.heapHeadroom = heapHeadroom;
        return this;
    }

    @NotNull
    public DataSize getMaxTaskDescriptorStorageMemory()
NodeMemoryConfig is more about workers, right? Would MemoryManagerConfig be a better place for this one?
It looks like MemoryManagerConfig is specific to ClusterMemoryManager. But I agree, NodeMemoryConfig feels more specific to workers. I'm not sure this config is worth a separate configuration class. What do you think about moving it to QueryManagerConfig? That class currently contains many configuration parameters related to query and split scheduling.
Fine for me
Discussed offline. Moved to QueryManagerConfig.
@@ -42,6 +42,8 @@

    private DataSize heapHeadroom = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.3));

    private DataSize maxTaskDescriptorStorageMemory = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.3));
Feels like a lot.
Discussed offline. Reduced to 0.15 for now.
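
For context, a minimal sketch of what the relocated setting might look like in QueryManagerConfig, assuming an illustrative property name (the name and exact shape used in the PR may differ); the default reflects the 0.15 heap factor mentioned above:

import io.airlift.configuration.Config;
import io.airlift.units.DataSize;

import javax.validation.constraints.NotNull;

public class QueryManagerConfig
{
    private static final long AVAILABLE_HEAP_MEMORY = Runtime.getRuntime().maxMemory();

    // default reduced from 0.3 to 0.15 of the available heap, per the discussion above
    private DataSize maxTaskDescriptorStorageMemory = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.15));

    @NotNull
    public DataSize getMaxTaskDescriptorStorageMemory()
    {
        return maxTaskDescriptorStorageMemory;
    }

    @Config("query.max-task-descriptor-storage-memory") // property name is an assumption
    public QueryManagerConfig setMaxTaskDescriptorStorageMemory(DataSize maxTaskDescriptorStorageMemory)
    {
        this.maxTaskDescriptorStorageMemory = maxTaskDescriptorStorageMemory;
        return this;
    }
}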
core/trino-main/src/main/java/io/trino/execution/scheduler/TaskDescriptorStorage.java (outdated review thread, resolved)
        return reservedBytes;
    }

    @NotThreadSafe
nit: not important, as this is an internal class.
I just thought it wouldn't hurt to declare the intention explicitly, so that nobody is tempted to use this object outside the lock or to try to enforce synchronization in the object itself.
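
A minimal sketch of the pattern under discussion, with illustrative names: the internal helper does no locking of its own, and the @NotThreadSafe annotation documents that all access must happen under the enclosing object's lock.

import java.util.HashMap;
import java.util.Map;

import javax.annotation.concurrent.GuardedBy;
import javax.annotation.concurrent.NotThreadSafe;

public class StorageHolder
{
    @GuardedBy("this")
    private final Map<String, TaskDescriptors> storages = new HashMap<>();

    public synchronized void reserve(String queryId, long delta)
    {
        // the helper is only ever touched while holding this object's lock
        storages.computeIfAbsent(queryId, ignored -> new TaskDescriptors()).reserve(delta);
    }

    // Internal helper: performs no synchronization; the annotation makes the
    // external-locking requirement explicit.
    @NotThreadSafe
    private static class TaskDescriptors
    {
        private long reservedBytes;

        public void reserve(long delta)
        {
            reservedBytes += delta;
        }
    }
}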
core/trino-main/src/test/java/io/trino/execution/scheduler/TestTaskDescriptorStorage.java (outdated review thread, resolved)
    {
        reservedBytes += delta;
        while (reservedBytes > maxTaskDescriptorStorageMemoryInBytes) {
            // drop a query that uses the most storage
I am not sure killing the biggest query is the best approach. The queries which use the most memory are the biggest, and have probably already been executing for a long time. Killing such a query is costly. I would rather kill the queries which have made the least progress so far (the newest ones).
Also, I am not a big fan of entangling the storage and killing logic. But I must agree it makes the interfaces simpler.
> I am not sure killing the biggest query is the best approach. The queries which use the most memory are the biggest, and have probably already been executing for a long time. Killing such a query is costly. I would rather kill the queries which have made the least progress so far (the newest ones).
I was also thinking about this. On the other hand, killing any query other than the one that has the most metadata buffered might be confusing. It is easier to communicate to a user that their query got killed because it scans very large tables or tables with a lot of small files (which results in high memory utilization during split enumeration) than to communicate that their query was killed because some other query running on the cluster scans large tables.
Regardless, it is still just a stop-gap approach. If we see that queries are being killed because of this limitation and it is problematic, we will invest in spilling to prevent failures of this kind altogether.
> Also, I am not a big fan of entangling the storage and killing logic. But I must agree it makes the interfaces simpler.
Killing is just a temporary solution. The end goal is to have spilling capabilities implemented in the storage. Thus I'm not sure it is worth overthinking how the queries are killed, as eventually we don't want queries to be killed at all.
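
To make the discussed policy concrete, here is a minimal, self-contained sketch of the "drop the query that uses the most storage" loop; all names are illustrative and this is not the PR's actual implementation:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class TaskDescriptorStorageSketch
{
    private final long maxBytes;
    private final Map<String, Long> reservedByQuery = new HashMap<>();
    private long reservedBytes;

    public TaskDescriptorStorageSketch(long maxBytes)
    {
        this.maxBytes = maxBytes;
    }

    public synchronized void reserve(String queryId, long delta)
    {
        reservedByQuery.merge(queryId, delta, Long::sum);
        reservedBytes += delta;
        while (reservedBytes > maxBytes) {
            // drop the query currently holding the most task descriptor memory
            String largest = reservedByQuery.entrySet().stream()
                    .max(Comparator.comparingLong(entry -> entry.getValue()))
                    .map(Map.Entry::getKey)
                    .orElseThrow();
            reservedBytes -= reservedByQuery.remove(largest);
            onEvicted(largest);
        }
    }

    // hook where a real implementation would fail the evicted query with a
    // capacity-exceeded error
    protected void onEvicted(String queryId) {}
}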
core/trino-main/src/main/java/io/trino/execution/scheduler/TaskDescriptorStorage.java (outdated review thread, resolved)
Looks good. Thanks.
core/trino-main/src/main/java/io/trino/execution/scheduler/TaskDescriptorStorage.java (outdated review thread, resolved)
core/trino-main/src/test/java/io/trino/execution/scheduler/TestTaskDescriptorStorage.java (outdated review thread, resolved)
core/trino-main/src/test/java/io/trino/execution/scheduler/TestTaskDescriptorStorage.java (outdated review thread, resolved)
Force-pushed from 5a4ee73 to 6348ebc
Updated
ci / test (:trino-hive) (pull_request): #10772
The other one is #10773