
Enable the scheduler by default #5463

Merged: 13 commits merged into apache:master on Mar 6, 2024
Conversation

@style95 (Member) commented on Feb 11, 2024

Description

This change enables the scheduler by default.
It is necessary for releasing the next version of OpenWhisk Core.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-commenter commented on Feb 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.75%. Comparing base (5529cc4) to head (e257423).

❗ Current head e257423 differs from pull request most recent head bcf8d2f. Consider uploading reports for the commit bcf8d2f to get more accurate results

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #5463      +/-   ##
==========================================
+ Coverage   76.26%   77.75%   +1.48%     
==========================================
  Files         234      241       +7     
  Lines       14386    14650     +264     
  Branches      640      644       +4     
==========================================
+ Hits        10972    11391     +419     
+ Misses       3414     3259     -155     
```


@style95 (Member, Author) commented on Feb 14, 2024

I realized that the scheduler tests need to be split into unit tests and system tests to run successfully.
Some tests in the scheduler tests workflow manipulate etcd data, which interferes with other tests.

@style95 (Member, Author) commented on Feb 19, 2024

Strangely, the unit tests failed only in the GitHub Actions runner.

@style95 (Member, Author) commented on Feb 23, 2024

Now only two workflows remain.

@style95 (Member, Author) left a comment:

It's ready for review.

```diff
@@ -29,7 +29,7 @@ on:
 env:
   # openwhisk env
   TEST_SUITE: System
-  ANSIBLE_CMD: "ansible-playbook -i environments/local -e docker_image_prefix=testing"
+  ANSIBLE_CMD: "ansible-playbook -i environments/local -e docker_image_prefix=testing -e container_pool_akka_client=false"
```
@style95 (Member, Author):
The akka client is not well aligned with the FunctionPollingContainerProxy.
So I changed this to the traditional apache HTTP client.

```diff
@@ -65,6 +65,8 @@ jobs:
         sudo rm -rf "$AGENT_TOOLSDIRECTORY"
       - name: Check free space
         run: df -h
+      - name: Disable the scheduler
+        run: "./tools/github/disable-scheduler.sh"
```
@style95 (Member, Author):
The standalone mode is only supported with ShardingPoolBalancer, so we need to disable the scheduler for standalone tests.

@dgrove-oss (Member):
No objections to the change, but do you think the standalone mode could be supported with the new scheduler or should we be looking to deprecate standalone mode? Long term, I'd think we would want to consolidate down to a single scheduler (the new one) to simplify maintenance.

@style95 (Member, Author):

Yes, it would be great to discuss that, and we can handle it in another issue if necessary.
I just didn't want to postpone the release of the new version of OpenWhisk Core because of this.

I am still unclear whether we can support the standalone mode with the scheduler, or even whether enabling the scheduler in the standalone mode would be efficient.
The scheduler works as a queue for actions and serves activations according to requests from containers (invokers).
Currently, in the standalone mode a single controller acts as both a controller and an invoker.
If we put the scheduler into it, it would balance loads, provision containers, buffer activations, and send them to containers. We could make it work, but I am not sure that making one component responsible for all of these roles aligns well with the goal of the standalone mode. I expect the standalone mode to be used in environments with limited resources, such as IoT machines, and I am concerned that it might be too complex and require too many resources in such a circumstance.


**ansible/environments/local/group_vars/all**
```yaml
scheduler_enable: true
scheduler_enable: false
```
@style95 (Member, Author):
Now, the scheduler is enabled by default.

```diff
@@ -23,12 +23,12 @@ whisk.spi {
   MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
   ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
   LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
-  LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer
   EntitlementSpiProvider = org.apache.openwhisk.core.entitlement.LocalEntitlementProvider
+  LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.FPCPoolBalancer
```
@style95 (Member, Author):
Scheduler-related SPI providers are used by default.

```diff
@@ -31,7 +31,7 @@ whisk {
   timeout-addon = 1m

   fpc {
-    use-per-min-throttles = false
+    use-per-min-throttles = true
```
@style95 (Member, Author):
This is required to pass per-minute throttling tests.

```diff
@@ -229,68 +229,6 @@ class ThrottleTests
       settleThrottles(alreadyWaited)
     }
   }
-
-  it should "throttle 'concurrent' activations of one action" in withAssetCleaner(wskprops) { (wp, assetHelper) =>
```
@style95 (Member, Author):
This test assumes the ShardingPoolBalancer is used.
The notion of concurrency in the FPC scheduler differs from the ShardingPoolBalancer's.

The sharding pool balancer counts the number of in-flight activations and throttles requests based on the concurrent invocation limit.
In the new scheduler, on the other hand, the concurrency limit bounds the number of concurrent containers for a given action, and throttling works based on the processing power of the existing containers.
When the maximum number of containers is already in use and another action is invoked for the first time, no additional container can be created, so the request is rejected with 429 Too Many Requests.

Please refer to this comment: https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/FPCPoolBalancer.scala#L683

In this regard, I removed this test case.
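The contrast between the two throttling models described above can be sketched as follows. This is an illustrative model with hypothetical names, not the actual OpenWhisk implementation:

```scala
// Illustrative sketch of the two throttling models (hypothetical names,
// not the actual OpenWhisk code).
object ThrottleModels {
  // ShardingContainerPoolBalancer style: count in-flight activations and
  // throttle once the concurrent-invocation limit is reached.
  def shardingThrottle(inFlightActivations: Int, concurrentLimit: Int): Boolean =
    inFlightActivations >= concurrentLimit

  // FPC scheduler style: the limit bounds concurrent *containers* per action.
  // A request is rejected with 429 only when the container quota is exhausted
  // and the action has no existing container that could serve it.
  def fpcThrottle(containersInUse: Int, containerLimit: Int, hasExistingContainer: Boolean): Boolean =
    containersInUse >= containerLimit && !hasExistingContainer
}
```

Under the FPC model in this sketch, an action that already has warm containers is not rejected outright; its requests are handled by the processing power of the existing containers, as described in the comment above.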

@rabbah (Member):

@style95 is there an alternate test that validates the behavior described in the comments?

@style95 (Member, Author) commented on Feb 26, 2024:

It's not the same, but there are two related tests.
The load balancer gets throttling flags from etcd and rejects or accepts requests accordingly.
So this test covers the behavior of the load balancer, which decides based on the throttling flags:
https://github.com/apache/openwhisk/blob/master/tests/src/test/scala/org/apache/openwhisk/core/controller/test/FPCEntitlementTests.scala

And the MemoryQueue populates these throttling flags through its state transitions.
So this test verifies that the memory queue sets the throttling key when it changes to a throttled state:
https://github.com/apache/openwhisk/blob/master/tests/src/test/scala/org/apache/openwhisk/core/scheduler/queue/test/MemoryQueueFlowTests.scala#L363

There are also some other tests related to throttling, such as:

```scala
val runs = (1 to requestCount).map { _ =>
  Future {
    // expect only 1 activation concurrently (within the 1 second timeout implemented in concurrent.js)
    wsk.action.invoke(name, Map("requestCount" -> 1.toJson), blocking = true)
  }
}
```
@style95 (Member, Author):

This test case also assumes the sharding pool balancer is used.
It expects multiple containers to be created, with each container receiving only one request.
But with the new scheduler, one container could receive multiple activations.
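The difference can be sketched with a toy model. The names are hypothetical, and it simply assumes that under the new scheduler one container can serve several activations back to back:

```scala
// Toy model of activation dispatch (hypothetical names, for illustration only).
object DispatchModels {
  // Push model (ShardingContainerPoolBalancer): each concurrent activation
  // is pushed to its own container.
  def containersUsedPush(concurrentActivations: Int): Int =
    concurrentActivations

  // Pull model (new scheduler): containers fetch activations from the
  // action's queue, so one container can serve several of them.
  def containersUsedPull(concurrentActivations: Int, activationsPerContainer: Int): Int =
    math.ceil(concurrentActivations.toDouble / activationsPerContainer).toInt
}
```

This is why a test that asserts a one-to-one mapping between activations and containers no longer holds under the pull-based scheduler.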

```
MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer
```
@style95 (Member, Author):

This script is to enable the ShardingPoolBalancer for the standalone tests.

@style95 force-pushed the enable-scheduler-by-default branch from 116a730 to bcf8d2f on February 24, 2024 at 07:58
@dgrove-oss (Member) left a comment:

Thanks for the detailed rationale for the changes!


@rabbah (Member) left a comment:

LGTM, had a question about the tests being removed (specifically if there are comparable tests for the new scheduler, if that makes sense).


@dubee (Member) left a comment:

@rabbah, @style95, are there docs explaining how the scheduler works?

@bdoyle0182 (Contributor) commented on Feb 26, 2024:

LGTM. I think if we're going to continue working on this scheduler, the next major needed architectural improvement is to distribute function traffic across schedulers. As it is right now since a queue to handle activations for a specific action is assigned to a single scheduler, the max throughput of an action is limited to the max cpu / network throughput / kafka consumer reads of that scheduler node.

@style95 (Member, Author) commented on Feb 27, 2024:

> are there docs explaining how the scheduler works?

@dubee
You can refer to this document: https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
It might be slightly outdated, but it explains most of the major things.

@style95 (Member, Author) commented on Feb 27, 2024:

> LGTM. I think if we're going to continue working on this scheduler, the next major needed architectural improvement is to distribute function traffic across schedulers. As it is right now since a queue to handle activations for a specific action is assigned to a single scheduler, the max throughput of an action is limited to the max cpu / network throughput / kafka consumer reads of that scheduler node.

Exactly, merging this PR should be a starting point.
I think there is still room for improvement in many directions.

@style95 (Member, Author) commented on Feb 28, 2024:

Thank you all for the reviews.
I will merge this at the end of this week. Please share any concerns or comments at any time.
After merging this, I will look into the openwhisk-deploy-kube repo part.

@style95 merged commit 4fac03a into apache:master on Mar 6, 2024. 6 checks passed.
7 participants