
Enable the scheduler by default #5463

Merged: 13 commits merged into apache:master on Mar 6, 2024
Conversation

@style95 (Member) commented on Feb 11, 2024

Description

This change enables the scheduler by default.
It is necessary for releasing the next version of OpenWhisk Core.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-commenter commented on Feb 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.75%. Comparing base (5529cc4) to head (e257423).

❗ Current head e257423 differs from pull request most recent head bcf8d2f. Consider uploading reports for the commit bcf8d2f to get more accurate results

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #5463      +/-   ##
==========================================
+ Coverage   76.26%   77.75%   +1.48%     
==========================================
  Files         234      241       +7     
  Lines       14386    14650     +264     
  Branches      640      644       +4     
==========================================
+ Hits        10972    11391     +419     
+ Misses       3414     3259     -155     
```


@style95 (Member, Author) commented on Feb 14, 2024

I realized that the scheduler tests need to be split into unit tests and system tests to run successfully.
Some tests in the scheduler tests workflow manipulate etcd data, which interferes with other tests.

@style95 (Member, Author) commented on Feb 19, 2024

Strangely, the unit tests failed only in the GitHub Actions runner.

@style95 (Member, Author) commented on Feb 23, 2024

Now only two workflows remain.

@style95 (Member, Author) left a comment:

It's ready for review.

```diff
@@ -29,7 +29,7 @@ on:
 env:
   # openwhisk env
   TEST_SUITE: System
-  ANSIBLE_CMD: "ansible-playbook -i environments/local -e docker_image_prefix=testing"
+  ANSIBLE_CMD: "ansible-playbook -i environments/local -e docker_image_prefix=testing -e container_pool_akka_client=false"
```
@style95 (Member, Author):
The akka client is not well aligned with the FunctionPollingContainerProxy.
So I changed this to the traditional apache HTTP client.

```diff
@@ -65,6 +65,8 @@ jobs:
         sudo rm -rf "$AGENT_TOOLSDIRECTORY"
       - name: Check free space
         run: df -h
+      - name: Disable the scheduler
+        run: "./tools/github/disable-scheduler.sh"
```
@style95 (Member, Author):
The standalone mode is only supported with ShardingPoolBalancer, so we need to disable the scheduler for standalone tests.

@dgrove-oss (Member):
No objections to the change, but do you think the standalone mode could be supported with the new scheduler or should we be looking to deprecate standalone mode? Long term, I'd think we would want to consolidate down to a single scheduler (the new one) to simplify maintenance.

@style95 (Member, Author):

Yes, it would be great to discuss that, and we can handle it in another issue if necessary.
I just didn't want to postpone the release of the new version of OpenWhisk Core because of this.

I am still unclear whether we can support the standalone mode with the scheduler, or even whether enabling the scheduler in the standalone mode would be efficient.
The scheduler works as a queue for actions and serves activations according to requests from containers (invokers).
Currently, in the standalone mode a single controller acts as both a controller and an invoker.
If we put the scheduler into it, it would balance loads, provision containers, buffer activations, and send them to containers. We could make it work, but I am not sure that making one component responsible for all of these roles aligns well with the goal of the standalone mode. I expect the standalone mode to be used in environments with limited resources, such as IoT machines, and I am concerned that it might be too complex and require too many resources in such a circumstance.


**ansible/environments/local/group_vars/all**
```yaml
scheduler_enable: true
scheduler_enable: false
```
@style95 (Member, Author):
Now, the scheduler is enabled by default.

```diff
@@ -23,12 +23,12 @@ whisk.spi {
   MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
   ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
   LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
-  LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer
   EntitlementSpiProvider = org.apache.openwhisk.core.entitlement.LocalEntitlementProvider
+  LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.FPCPoolBalancer
```
@style95 (Member, Author):
Scheduler-related SPI providers are used by default.

```diff
@@ -31,7 +31,7 @@ whisk {
   timeout-addon = 1m

   fpc {
-    use-per-min-throttles = false
+    use-per-min-throttles = true
```
@style95 (Member, Author):
This is required to pass per-minute throttling tests.

```diff
@@ -229,68 +229,6 @@ class ThrottleTests
       settleThrottles(alreadyWaited)
     }
   }
-
-  it should "throttle 'concurrent' activations of one action" in withAssetCleaner(wskprops) { (wp, assetHelper) =>
```
@style95 (Member, Author):
This test assumes the ShardingPoolBalancer is used.
The notion of concurrency in the FPC scheduler differs from the ShardingPoolBalancer's.

The sharding pool balancer counts the number of in-flight activations and throttles requests based on the concurrent invocation limit.
In the new scheduler, on the other hand, the concurrency limit bounds the number of concurrent containers for a given action, and throttling works based on the processing power of the existing containers.
When the maximum number of containers is already in use and another action is invoked for the first time, no additional container can be created, so the request is rejected with 429 Too Many Requests.

Please refer to this comment: https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/FPCPoolBalancer.scala#L683

In this regard, I removed this test case.
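The contrast between the two throttling models described above can be sketched as follows. This is an illustrative model with hypothetical names, not the actual OpenWhisk implementation:

```scala
// Illustrative sketch of the two throttling models (hypothetical names,
// not the actual OpenWhisk code).
object ThrottleModels {
  // ShardingContainerPoolBalancer style: count in-flight activations and
  // throttle once the concurrent-invocation limit is reached.
  def shardingThrottle(inFlightActivations: Int, concurrentLimit: Int): Boolean =
    inFlightActivations >= concurrentLimit

  // FPC scheduler style: the limit bounds concurrent *containers* per action.
  // A request is rejected with 429 only when the container quota is exhausted
  // and the action has no existing container that could serve it.
  def fpcThrottle(containersInUse: Int, containerLimit: Int, hasExistingContainer: Boolean): Boolean =
    containersInUse >= containerLimit && !hasExistingContainer
}
```

Under the FPC model in this sketch, an action that already has warm containers is not rejected outright; its requests are handled by the processing power of the existing containers, as described in the comment above.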

@rabbah (Member):

@style95 is there an alternate test that validates the behavior described in the comments?

@style95 (Member, Author) commented on Feb 26, 2024:

It's not the same, but there are two related tests.
The load balancer gets throttling flags from etcd and rejects or accepts requests accordingly.
So this test covers the behavior of the load balancer, which decides based on the throttling flags:
https://github.com/apache/openwhisk/blob/master/tests/src/test/scala/org/apache/openwhisk/core/controller/test/FPCEntitlementTests.scala

And the MemoryQueue populates these throttling flags through its state transitions.
So this test verifies that the memory queue sets the throttling key when it changes to a throttled state:
https://github.com/apache/openwhisk/blob/master/tests/src/test/scala/org/apache/openwhisk/core/scheduler/queue/test/MemoryQueueFlowTests.scala#L363

There are also some other tests related to throttling, such as:

```scala
val runs = (1 to requestCount).map { _ =>
  Future {
    // expect only 1 activation concurrently (within the 1 second timeout implemented in concurrent.js)
    wsk.action.invoke(name, Map("requestCount" -> 1.toJson), blocking = true)
  }
}
```
@style95 (Member, Author):

This test case also assumes the sharding pool balancer is used.
It expects multiple containers to be created, with each container receiving only one request.
But with the new scheduler, one container could receive multiple activations.
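The difference can be sketched with a toy model. The names are hypothetical, and it simply assumes that under the new scheduler one container can serve several activations back to back:

```scala
// Toy model of activation dispatch (hypothetical names, for illustration only).
object DispatchModels {
  // Push model (ShardingContainerPoolBalancer): each concurrent activation
  // is pushed to its own container.
  def containersUsedPush(concurrentActivations: Int): Int =
    concurrentActivations

  // Pull model (new scheduler): containers fetch activations from the
  // action's queue, so one container can serve several of them.
  def containersUsedPull(concurrentActivations: Int, activationsPerContainer: Int): Int =
    math.ceil(concurrentActivations.toDouble / activationsPerContainer).toInt
}
```

This is why a test that asserts a one-to-one mapping between activations and containers no longer holds under the pull-based scheduler.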

```
MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer
```
@style95 (Member, Author):

This script is to enable the ShardingPoolBalancer for the standalone tests.

@style95 force-pushed the enable-scheduler-by-default branch from 116a730 to bcf8d2f on February 24, 2024 at 07:58
@dgrove-oss (Member) left a comment:

Thanks for the detailed rationale for the changes!


@rabbah (Member) left a comment:

LGTM, had a question about the tests being removed (specifically if there are comparable tests for the new scheduler, if that makes sense).


@dubee (Member) left a comment:

@rabbah, @style95, are there docs explaining how the scheduler works?

@bdoyle0182 (Contributor) commented on Feb 26, 2024:

LGTM. I think if we're going to continue working on this scheduler, the next major needed architectural improvement is to distribute function traffic across schedulers. As it is right now since a queue to handle activations for a specific action is assigned to a single scheduler, the max throughput of an action is limited to the max cpu / network throughput / kafka consumer reads of that scheduler node.

@style95 (Member, Author) commented on Feb 27, 2024:

> are there docs explaining how the scheduler works?

@dubee
You can refer to this document: https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
It might be slightly outdated, but it explains most of the major things.

@style95 (Member, Author) commented on Feb 27, 2024:

> LGTM. I think if we're going to continue working on this scheduler, the next major needed architectural improvement is to distribute function traffic across schedulers. As it is right now since a queue to handle activations for a specific action is assigned to a single scheduler, the max throughput of an action is limited to the max cpu / network throughput / kafka consumer reads of that scheduler node.

Exactly, merging this PR should be a starting point.
I think there is still room for improvement in many directions.

@style95 (Member, Author) commented on Feb 28, 2024:

Thank you all for the reviews.
I will merge this at the end of this week. Please share any concerns or comments at any time.
After merging this, I will look into the openwhisk-deploy-kube repo part.

@style95 merged commit 4fac03a into apache:master on Mar 6, 2024. 6 checks passed.
7 participants