Enable the scheduler by default #5463
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #5463 +/- ##
==========================================
+ Coverage 76.26% 77.75% +1.48%
==========================================
Files 234 241 +7
Lines 14386 14650 +264
Branches 640 644 +4
==========================================
+ Hits 10972 11391 +419
+ Misses 3414 3259 -155

☔ View full report in Codecov by Sentry.
I realized that it is necessary to split the scheduler tests into unit tests and system tests for them to run successfully.
Strangely, the unit tests only failed in the GitHub Actions runner.
Now only two workflows are remaining.
It's ready for review.
@@ -29,7 +29,7 @@ on:
env:
# openwhisk env
TEST_SUITE: System
ANSIBLE_CMD: "ansible-playbook -i environments/local -e docker_image_prefix=testing"
ANSIBLE_CMD: "ansible-playbook -i environments/local -e docker_image_prefix=testing -e container_pool_akka_client=false"
The Akka client is not well aligned with the FunctionPollingContainerProxy,
so I changed this to the traditional Apache HTTP client.
@@ -65,6 +65,8 @@ jobs:
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
- name: Check free space
run: df -h
- name: Disable the scheduler
run: "./tools/github/disable-scheduler.sh"
The standalone mode is only supported with ShardingPoolBalancer, so we need to disable the scheduler for standalone tests.
No objections to the change, but do you think the standalone mode could be supported with the new scheduler or should we be looking to deprecate standalone mode? Long term, I'd think we would want to consolidate down to a single scheduler (the new one) to simplify maintenance.
Yes, that would be great to discuss and we can handle it in another issue if necessary.
I just wanted not to postpone the release of the new version of OpenWhisk Core due to this.
I am still unclear whether we can support the standalone mode with the scheduler, or whether it would even be efficient to enable the scheduler in standalone mode.
The scheduler works as a queue for actions and serves activations according to requests from containers (invokers).
Currently, in standalone mode a single controller acts as both the controller and the invoker at the same time.
If we put the scheduler into it as well, that one component would balance loads, provision containers, buffer activations, and send them to containers. We could make it work, but I am not sure that having a single component in charge of all of these roles aligns well with the goal of the standalone mode. I expect the standalone mode to be used in environments with limited resources, such as IoT machines, and I am concerned it might be too complex and require too many resources in such circumstances.
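To make the pull model described above concrete, here is a minimal sketch; it is not the actual scheduler code, and `Activation`, `ActionQueue`, and `poll` are made-up names. Activations are buffered in a per-action queue, and idle containers pull work from it rather than having work pushed to them.

```scala
import scala.collection.mutable

// Per-action queue: the scheduler buffers activations and containers pull work
// from it when they become idle, instead of having work pushed to them.
final case class Activation(id: String, action: String)

final class ActionQueue {
  private val buffer = mutable.Queue.empty[Activation]

  // The front end enqueues an activation for this action.
  def enqueue(a: Activation): Unit = synchronized { buffer.enqueue(a); () }

  // An idle container asks for the next activation (the pull side of the model).
  def poll(): Option[Activation] = synchronized {
    if (buffer.nonEmpty) Some(buffer.dequeue()) else None
  }

  def size: Int = synchronized(buffer.size)
}

object PullModelSketch extends App {
  val queue = new ActionQueue
  (1 to 3).foreach(i => queue.enqueue(Activation(s"act-$i", "hello")))

  // Two idle containers pull one activation each; one activation stays buffered.
  println(queue.poll()) // Some(Activation(act-1,hello))
  println(queue.poll()) // Some(Activation(act-2,hello))
  println(s"still buffered: ${queue.size}") // still buffered: 1
}
```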
**ansible/environments/local/group_vars/all**
```yaml
scheduler_enable: true
scheduler_enable: false
```
Now, the scheduler is enabled by default.
@@ -23,12 +23,12 @@ whisk.spi {
MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer
EntitlementSpiProvider = org.apache.openwhisk.core.entitlement.LocalEntitlementProvider
LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.FPCPoolBalancer
Scheduler-related SPI providers are used by default.
@@ -31,7 +31,7 @@ whisk {
timeout-addon = 1m

fpc {
use-per-min-throttles = false
use-per-min-throttles = true
This is required to pass per-minute throttling tests.
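For readers unfamiliar with what this flag gates, here is a purely illustrative sketch of per-minute throttling; the `PerMinuteThrottler` class, its method names, and the limit are made up and are not the actual OpenWhisk implementation.

```scala
import scala.collection.mutable

// Counts invocations per namespace within the current minute window and rejects
// anything above the configured per-minute limit.
final class PerMinuteThrottler(limitPerMinute: Int) {
  private val counts = mutable.Map.empty[(String, Long), Int]

  def allow(namespace: String, nowMillis: Long): Boolean = synchronized {
    val minuteWindow = nowMillis / 60000L
    val key = (namespace, minuteWindow)
    val seen = counts.getOrElse(key, 0)
    if (seen < limitPerMinute) { counts.update(key, seen + 1); true }
    else false // over the per-minute limit; would surface to the caller as a 429
  }
}

object PerMinuteThrottleSketch extends App {
  val throttler = new PerMinuteThrottler(limitPerMinute = 2)
  val now = System.currentTimeMillis()
  println(throttler.allow("guest", now)) // true
  println(throttler.allow("guest", now)) // true
  println(throttler.allow("guest", now)) // false: third call within the same minute
}
```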
@@ -229,68 +229,6 @@ class ThrottleTests
settleThrottles(alreadyWaited)
}
}

it should "throttle 'concurrent' activations of one action" in withAssetCleaner(wskprops) { (wp, assetHelper) =>
This test assumes the ShardingPoolBalancer is used.
The concurrency model in the FPC scheduler differs from the ShardingPoolBalancer's.
The sharding pool balancer counts the number of in-flight activations and throttles requests based on the concurrent invocation limit.
In the new scheduler, on the other hand, the concurrent limit means the number of concurrent containers for a given action, and throttling works based on the processing power of the existing containers.
When the maximum number of containers is already in use and another action is invoked for the first time, no more containers can be created, so the request is rejected with 429 Too Many Requests.
Please refer to this comment: https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/FPCPoolBalancer.scala#L683
For this reason, I removed this test case.
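As a rough illustration of the container-count based throttling described above: the sketch below is not the actual FPCPoolBalancer logic, and `admit`, `containersOf`, and the limits are invented names.

```scala
// Decides whether a new invocation is admitted based on how many containers exist,
// not on how many activations are currently in flight.
final case class ThrottleDecision(allowed: Boolean, statusCode: Int)

object ContainerCountThrottleSketch extends App {

  // containersOf: number of running containers per action in the namespace.
  def admit(action: String, containersOf: Map[String, Int], namespaceContainerLimit: Int): ThrottleDecision = {
    val containersInUse = containersOf.values.sum
    val actionHasContainer = containersOf.getOrElse(action, 0) > 0

    if (actionHasContainer) {
      // Existing containers keep serving this action; bursts are absorbed by its queue.
      ThrottleDecision(allowed = true, statusCode = 200)
    } else if (containersInUse < namespaceContainerLimit) {
      // There is still room to provision a first container for this action.
      ThrottleDecision(allowed = true, statusCode = 200)
    } else {
      // No container for this action yet and no capacity to create one: reject.
      ThrottleDecision(allowed = false, statusCode = 429)
    }
  }

  // The namespace limit of 2 containers is fully used by other actions, so the very
  // first invocation of "newAction" is rejected with 429 Too Many Requests.
  println(admit("newAction", Map("actionA" -> 1, "actionB" -> 1), namespaceContainerLimit = 2))
}
```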
@style95 is there an alternate test that validates the behavior described in the comments?
It's not the same, but there are two related tests.
The load balancer gets throttling flags from ETCD and rejects or accepts requests according to them.
So this test covers the load balancer's behavior of deciding based on the throttling flags:
https://github.com/apache/openwhisk/blob/master/tests/src/test/scala/org/apache/openwhisk/core/controller/test/FPCEntitlementTests.scala
And the MemoryQueue populates these throttling flags as part of its state transitions.
So this test covers whether the memory queue sets the throttling key when it transitions to a throttled state:
https://github.com/apache/openwhisk/blob/master/tests/src/test/scala/org/apache/openwhisk/core/scheduler/queue/test/MemoryQueueFlowTests.scala#L363
There are also some other tests related to throttling like
val runs = (1 to requestCount).map { _ =>
  Future {
    //expect only 1 activation concurrently (within the 1 second timeout implemented in concurrent.js)
    wsk.action.invoke(name, Map("requestCount" -> 1.toJson), blocking = true)
This test case also assumes the sharding pool balancer is used.
It expects multiple containers to be created, with each of them receiving only one request.
But with the new scheduler, one container could receive multiple activations.
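A made-up illustration of why the removed assertion no longer has to hold; the activation and container ids below are invented, not real test output.

```scala
object OneContainerManyActivationsSketch extends App {
  final case class ActivationRecord(activationId: String, containerId: String)

  // Invented results: a warm container may serve several of the "concurrent"
  // activations one after another under a pull-based scheduler.
  val results = List(
    ActivationRecord("act-1", "container-A"),
    ActivationRecord("act-2", "container-A"), // the same container served a second activation
    ActivationRecord("act-3", "container-B"))

  val distinctContainers = results.map(_.containerId).distinct.size
  // The removed test effectively required one container per activation
  // (distinctContainers == results.size), which no longer has to hold.
  println(s"$distinctContainers containers served ${results.size} activations")
}
```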
MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer |
This script is to enable the ShardingPoolBalancer for the standalone tests.
Thanks for the detailed rationale for the changes!
LGTM, had a question about the tests being removed (specifically if there are comparable tests for the new scheduler, if that makes sense).
LGTM. I think if we're going to continue working on this scheduler, the next major architectural improvement needed is to distribute a single function's traffic across schedulers. As it stands, since the queue handling activations for a specific action is assigned to a single scheduler, the maximum throughput of an action is limited by the CPU, network throughput, and Kafka consumer reads of that one scheduler node.
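A hypothetical sketch of the constraint being described: each action's queue is owned by exactly one scheduler, so all of that action's traffic funnels through that node. The hashing strategy and names below are illustrative and are not the actual assignment code.

```scala
final case class SchedulerEndpoint(host: String)

object QueueAssignmentSketch extends App {

  // Each action's queue is owned by exactly one scheduler, so every invocation of
  // that action is routed to the same node.
  def schedulerFor(fullyQualifiedAction: String, schedulers: IndexedSeq[SchedulerEndpoint]): SchedulerEndpoint =
    schedulers(math.abs(fullyQualifiedAction.hashCode) % schedulers.size)

  val schedulers = IndexedSeq(SchedulerEndpoint("scheduler0"), SchedulerEndpoint("scheduler1"))

  // No matter how hot "guest/hello" becomes, its throughput is bounded by the
  // CPU, network, and Kafka consumer capacity of this one endpoint.
  println(schedulerFor("guest/hello", schedulers))
  println(schedulerFor("guest/hello", schedulers)) // the same endpoint every time
}
```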
@dubee
Exactly, merging this PR should be a starting point.
Thank you all for the reviews.
Description
This is to enable the scheduler by default.
This is necessary in order to release the next version of OpenWhisk Core.
Related issue and scope
My changes affect the following components
Types of changes
Checklist: