Limit memory used by fault tolerant scheduler on coordinator #10877
Conversation
Force-pushed from e65d467 to 5a4ee73
long result = retainedSizeInBytes;
if (result == 0) {
    result = INSTANCE_SIZE
            + estimatedSizeOf(asMap(splits), PlanNodeId::getRetainedSizeInBytes, splits -> estimatedSizeOf(splits, Split::getRetainedSizeInBytes))
It would be nice to add support for estimatedSizeOf with Multimap to airlift. The change from Multimap to ListMultimap does not seem super natural.
Multimap is a Guava class. The SizeOf class is in the slice project, which doesn't have a Guava dependency.
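
For illustration, a hedged sketch of what a Multimap-aware estimatedSizeOf overload could look like. Since SizeOf lives in the slice project, which has no Guava dependency, a helper like this would have to live in a module that sees both Guava and airlift (for example, inside Trino itself). The class and method names are hypothetical, and per-entry object overhead is omitted for brevity:

import com.google.common.collect.Multimap;

import java.util.Collection;
import java.util.Map;
import java.util.function.ToLongFunction;

public final class MoreSizeOf
{
    private MoreSizeOf() {}

    // Hypothetical helper: walks the Map<K, Collection<V>> view of the multimap
    // and sums the estimated sizes of keys and values.
    public static <K, V> long estimatedSizeOf(Multimap<K, V> multimap, ToLongFunction<K> keySize, ToLongFunction<V> valueSize)
    {
        long result = 0;
        for (Map.Entry<K, Collection<V>> entry : multimap.asMap().entrySet()) {
            result += keySize.applyAsLong(entry.getKey());
            for (V value : entry.getValue()) {
                result += valueSize.applyAsLong(value);
            }
        }
        return result;
    }
}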
@@ -104,4 +106,18 @@ public NodeMemoryConfig setHeapHeadroom(DataSize heapHeadroom)
        this.heapHeadroom = heapHeadroom;
        return this;
    }

    @NotNull
    public DataSize getMaxTaskDescriptorStorageMemory()
NodeMemoryConfig is more about workers, right? Would MemoryManagerConfig be a better place for this one?
It looks like MemoryManagerConfig is specific to ClusterMemoryManager. But I agree, NodeMemoryConfig feels more specific to workers. I'm not sure this config is worth a separate configuration class. What do you think about moving it to QueryManagerConfig? That class currently contains many configuration parameters related to query and split scheduling.
Fine for me
Discussed offline. Moved to QueryManagerConfig.
@@ -42,6 +42,8 @@

    private DataSize heapHeadroom = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.3));

    private DataSize maxTaskDescriptorStorageMemory = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.3));
Feels like a lot.
Discussed offline. Reduced to 0.15 for now.
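
For context, a minimal sketch of what the relocated setting might look like in QueryManagerConfig, assuming an illustrative property name (the name and exact shape used in the PR may differ); the default reflects the 0.15 heap factor mentioned above:

import io.airlift.configuration.Config;
import io.airlift.units.DataSize;

import javax.validation.constraints.NotNull;

public class QueryManagerConfig
{
    private static final long AVAILABLE_HEAP_MEMORY = Runtime.getRuntime().maxMemory();

    // default reduced from 0.3 to 0.15 of the available heap, per the discussion above
    private DataSize maxTaskDescriptorStorageMemory = DataSize.ofBytes(Math.round(AVAILABLE_HEAP_MEMORY * 0.15));

    @NotNull
    public DataSize getMaxTaskDescriptorStorageMemory()
    {
        return maxTaskDescriptorStorageMemory;
    }

    @Config("query.max-task-descriptor-storage-memory") // property name is an assumption
    public QueryManagerConfig setMaxTaskDescriptorStorageMemory(DataSize maxTaskDescriptorStorageMemory)
    {
        this.maxTaskDescriptorStorageMemory = maxTaskDescriptorStorageMemory;
        return this;
    }
}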
core/trino-main/src/main/java/io/trino/execution/scheduler/TaskDescriptorStorage.java (outdated review thread, resolved)
        return reservedBytes;
    }

    @NotThreadSafe
nit: not important, as this is an internal class.
I just thought it wouldn't hurt to declare the intention explicitly, so that nobody is tempted to use this object outside the lock or to try to enforce synchronization in the object itself.
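
A minimal sketch of the pattern under discussion, with illustrative names: the internal helper does no locking of its own, and the @NotThreadSafe annotation documents that all access must happen under the enclosing object's lock.

import java.util.HashMap;
import java.util.Map;

import javax.annotation.concurrent.GuardedBy;
import javax.annotation.concurrent.NotThreadSafe;

public class StorageHolder
{
    @GuardedBy("this")
    private final Map<String, TaskDescriptors> storages = new HashMap<>();

    public synchronized void reserve(String queryId, long delta)
    {
        // the helper is only ever touched while holding this object's lock
        storages.computeIfAbsent(queryId, ignored -> new TaskDescriptors()).reserve(delta);
    }

    // Internal helper: performs no synchronization; the annotation makes the
    // external-locking requirement explicit.
    @NotThreadSafe
    private static class TaskDescriptors
    {
        private long reservedBytes;

        public void reserve(long delta)
        {
            reservedBytes += delta;
        }
    }
}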
core/trino-main/src/test/java/io/trino/execution/scheduler/TestTaskDescriptorStorage.java (outdated review thread, resolved)
    {
        reservedBytes += delta;
        while (reservedBytes > maxTaskDescriptorStorageMemoryInBytes) {
            // drop a query that uses the most storage
I am not sure killing the biggest query is the best approach. The queries which use the most memory are the biggest, and have probably already been executing for a long time. Killing such a query is costly. I would rather kill the queries which have made the least progress so far (the newest ones).
Also, I am not a big fan of entangling the storage and killing logic. But I must agree it makes the interfaces simpler.
> I am not sure killing the biggest query is the best approach. The queries which use the most memory are the biggest, and have probably already been executing for a long time. Killing such a query is costly. I would rather kill the queries which have made the least progress so far (the newest ones).
I was also thinking about this. On the other hand, killing any query other than the one that has the most metadata buffered might be confusing. It is easier to communicate to a user that their query got killed because it scans very large tables or tables with a lot of small files (which results in high memory utilization during split enumeration) than to communicate that their query was killed because some other query running on the cluster scans large tables.
Regardless, it is still just a stop-gap approach. If we see that queries are being killed because of this limitation and it is problematic, we will invest in spilling to prevent failures of this kind altogether.
> Also, I am not a big fan of entangling the storage and killing logic. But I must agree it makes the interfaces simpler.
Killing is just a temporary solution. The end goal is to have spilling capabilities implemented in the storage. Thus I'm not sure it is worth overthinking how the queries are killed, as eventually we don't want queries to be killed at all.
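
To make the discussed policy concrete, here is a minimal, self-contained sketch of the "drop the query that uses the most storage" loop; all names are illustrative and this is not the PR's actual implementation:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class TaskDescriptorStorageSketch
{
    private final long maxBytes;
    private final Map<String, Long> reservedByQuery = new HashMap<>();
    private long reservedBytes;

    public TaskDescriptorStorageSketch(long maxBytes)
    {
        this.maxBytes = maxBytes;
    }

    public synchronized void reserve(String queryId, long delta)
    {
        reservedByQuery.merge(queryId, delta, Long::sum);
        reservedBytes += delta;
        while (reservedBytes > maxBytes) {
            // drop the query currently holding the most task descriptor memory
            String largest = reservedByQuery.entrySet().stream()
                    .max(Comparator.comparingLong(entry -> entry.getValue()))
                    .map(Map.Entry::getKey)
                    .orElseThrow();
            reservedBytes -= reservedByQuery.remove(largest);
            onEvicted(largest);
        }
    }

    // hook where a real implementation would fail the evicted query with a
    // capacity-exceeded error
    protected void onEvicted(String queryId) {}
}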
core/trino-main/src/main/java/io/trino/execution/scheduler/TaskDescriptorStorage.java (outdated review thread, resolved)
Looks good. Thanks.
core/trino-main/src/main/java/io/trino/execution/scheduler/TaskDescriptorStorage.java (outdated review thread, resolved)
core/trino-main/src/test/java/io/trino/execution/scheduler/TestTaskDescriptorStorage.java (outdated review thread, resolved)
core/trino-main/src/test/java/io/trino/execution/scheduler/TestTaskDescriptorStorage.java (outdated review thread, resolved)
Force-pushed from 5a4ee73 to 6348ebc
Updated
ci / test (:trino-hive) (pull_request): #10772
The other one is #10773