YARN-11073. Avoid unnecessary preemption for tiny queues under certain corner cases #4110

Merged 2 commits into apache:trunk on May 13, 2022

Contributor

@jchenjc jchenjc commented Mar 26, 2022

Description of PR (YARN-11073)

When running a Hive job in a low-capacity queue on an idle cluster, preemption kicked in and preempted the job's containers even though no other job was running or competing for resources.

Let's take this scenario as an example:

cluster resource : <Memory:168GB, VCores:48>
queue_low: min_capacity 1%
queue_mid: min_capacity 19%
queue_high: min_capacity 80%
CapacityScheduler with DRF

During the FIFO preemption candidate selection process, the preemptableAmountCalculator first needs to run "computeIdealAllocation", which depends on each queue's guaranteed/min capacity. A queue's guaranteed capacity is currently calculated as "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed capacity of queue_low is:
queue_low: <Memory:(168*0.01)GB, VCores:(48*0.01)> = <Memory:1.68GB, VCores:0.48>. However, since the Resource object holds only long values, these double values are cast to long and truncated, so the final result becomes <Memory:1GB, VCores:0>.
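
The truncation can be reproduced with plain arithmetic. The following is a minimal sketch, not Hadoop code: the class and method names are hypothetical, and the units (GB and VCores) follow the example above rather than the internal representation of the Resource object.

```java
final class TruncationSketch {
    // Hypothetical helper: scale a resource amount by a capacity fraction and
    // truncate toward zero, which is what a double-to-long cast does.
    static long guaranteed(long total, double absCapacity) {
        return (long) (total * absCapacity);
    }

    public static void main(String[] args) {
        long memoryGb = 168; // cluster memory from the example scenario
        long vcores = 48;    // cluster VCores from the example scenario

        // queue_low is configured with min_capacity 1%.
        System.out.println("queue_low memory = " + guaranteed(memoryGb, 0.01)); // 1
        System.out.println("queue_low vcores = " + guaranteed(vcores, 0.01));   // 0
    }
}
```

The cast discards the fractional part instead of rounding, so any queue whose share of a resource is below one unit ends up with a guarantee of zero for that resource.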

Because queue_low's guaranteed capacity is truncated to 0 VCores, its normalized guaranteed capacity based on active queues is also 0 under the current algorithm in "resetCapacity". This eventually leads to the continuous preemption of job containers running in queue_low.

To work around this corner case, "resetCapacity" needs to consider three new scenarios:

  • If the sum of absoluteCapacity/minCapacity of all active queues is zero, normalize their guaranteed capacity evenly: 1.0f / num_of_queues.
  • If the sum of the pre-normalized guaranteed capacity values (MB or VCores) of all active queues is zero, meaning there may be several queues like queue_low whose capacity values were truncated to 0, normalize evenly as in the first scenario (if they are all tiny, the difference is negligible, for example 1% vs. 1.2%).
  • If one of the active queues has a zero pre-normalized guaranteed capacity value but its absoluteCapacity/minCapacity is not zero, normalize by the weight of each queue's configured absoluteCapacity/minCapacity: minCapacity / (sum_of_min_capacity_of_active_queues). This makes sure queue_low gets a small but fair normalized value when queue_mid is also active.
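
The three scenarios can be sketched for a single resource dimension as follows. This is an illustration under stated assumptions, not the code in the patch: the class, method, and parameter names are hypothetical, the real "resetCapacity" works on TempQueuePerPartition and Resource objects, and the final fallback branch (weighing by guaranteed amount) is an assumption about the default behavior.

```java
final class NormalizeSketch {
    /**
     * Hypothetical stand-in for the normalization step; not the patch's code.
     * absCapacity[i] is queue i's configured capacity fraction and
     * guaranteed[i] is its truncated guaranteed amount for one resource
     * dimension (for example VCores). Both arrays cover active queues only.
     */
    static double[] normalize(double[] absCapacity, long[] guaranteed) {
        int n = absCapacity.length;
        double[] normalized = new double[n];
        double capacitySum = 0.0;
        long guaranteedSum = 0;
        boolean truncatedToZero = false;
        for (int i = 0; i < n; i++) {
            capacitySum += absCapacity[i];
            guaranteedSum += guaranteed[i];
            if (guaranteed[i] == 0 && absCapacity[i] > 0) {
                truncatedToZero = true;
            }
        }
        for (int i = 0; i < n; i++) {
            if (capacitySum == 0 || guaranteedSum == 0) {
                // Scenarios 1 and 2: nothing to weigh by, so split evenly.
                normalized[i] = 1.0 / n;
            } else if (truncatedToZero) {
                // Scenario 3: weigh by configured capacity instead.
                normalized[i] = absCapacity[i] / capacitySum;
            } else {
                // Assumed default: weigh by guaranteed amount.
                normalized[i] = (double) guaranteed[i] / guaranteedSum;
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        // queue_low (1%) and queue_mid (19%) active; VCores truncated to 0 and 9.
        double[] result = normalize(new double[] {0.01, 0.19}, new long[] {0, 9});
        System.out.printf("queue_low=%.2f queue_mid=%.2f%n", result[0], result[1]);
    }
}
```

With queue_low and queue_mid both active, the third scenario applies and queue_low receives 0.01 / (0.01 + 0.19), about 5% of the normalized guarantee, instead of 0.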

How was this patch tested?

The patch was tested on a small cluster with queues configured the same as in the scenario described above. It was verified that:

  • containers running in a low-capacity queue were not preempted when the cluster was idle
  • preemption kicked in properly in the low-capacity queue when the cluster was busy (heavy usage in high-capacity queues)

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@aajisaka

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 37s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 33m 38s trunk passed
+1 💚 compile 1m 6s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 58s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 47s trunk passed
+1 💚 mvnsite 1m 3s trunk passed
+1 💚 javadoc 0m 52s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 47s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 56s trunk passed
+1 💚 shadedclient 21m 10s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 51s the patch passed
+1 💚 compile 0m 58s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 0m 58s the patch passed
+1 💚 compile 0m 49s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 0m 49s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 38s /results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 17 unchanged - 0 fixed = 19 total (was 17)
+1 💚 mvnsite 0m 53s the patch passed
+1 💚 javadoc 0m 40s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 38s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 55s the patch passed
+1 💚 shadedclient 21m 1s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 96m 46s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 asflicense 0m 34s The patch does not generate ASF License warnings.
188m 12s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4110/1/artifact/out/Dockerfile
GITHUB PR #4110
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux a395380cea8d 4.15.0-161-generic #169-Ubuntu SMP Fri Oct 15 13:41:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 27ab492
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4110/1/testReport/
Max. process+thread count 946 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4110/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 36m 25s trunk passed
+1 💚 compile 1m 13s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 57s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 46s trunk passed
+1 💚 mvnsite 1m 2s trunk passed
+1 💚 javadoc 0m 49s trunk passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 42s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 58s trunk passed
+1 💚 shadedclient 24m 45s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 53s the patch passed
+1 💚 compile 1m 1s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 1m 1s the patch passed
+1 💚 compile 0m 51s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 0m 51s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 39s the patch passed
+1 💚 mvnsite 0m 53s the patch passed
+1 💚 javadoc 0m 41s the patch passed with JDK Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 37s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 2m 1s the patch passed
+1 💚 shadedclient 24m 26s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 100m 43s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 asflicense 0m 30s The patch does not generate ASF License warnings.
201m 37s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4110/2/artifact/out/Dockerfile
GITHUB PR #4110
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux e0e5b678d10f 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / f9303d2
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4110/2/testReport/
Max. process+thread count 938 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4110/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

Member

@aajisaka aajisaka left a comment

Thank you for opening the PR. I'm +1.

If you are willing to write a test case based on my comment, I can review the tests as well in this PR. If not, I can try to write a test case in a separate JIRA.

   * @param queues
   *          the list of queues to consider
   * @param ignoreGuar
   *          ignore guarantee.
   */
-  private void resetCapacity(Resource clusterResource,
-      Collection<TempQueuePerPartition> queues, boolean ignoreGuar) {
+  private void resetCapacity(Collection<TempQueuePerPartition> queues,

For testing, I think we can make this method static package-private and add @VisibleForTesting annotation. That way we can call this method directly from test class. Note that the package of the annotation must be "org.apache.hadoop.thirdparty.com.google.common.annotations".

@aajisaka aajisaka merged commit d2c9eb6 into apache:trunk May 13, 2022
@aajisaka aajisaka changed the title YARN-11073: avoid unnecessary preemption for tiny queues under certain corner cases YARN-11073. Avoid unnecessary preemption for tiny queues under certain corner cases May 13, 2022
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…n corner cases (apache#4110)

Co-authored-by: Jian Chen <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>