
Support proactive checks in overload manager Api #18256

Merged: 11 commits merged into envoyproxy:main on Dec 2, 2021

Conversation

@nezdolik (Member) commented Sep 24, 2021

Signed-off-by: Kateryna Nezdolii [email protected]

This PR adds proactive checks to the overload manager / thread local state, so that callers of the overload manager API can check resource usage instantaneously, on demand. This is the first PR of a series (full code here: #15707); the next PR will add a new proactive monitor for downstream connections.

Commit Message: Support reactive checks in overload manager Api
Additional Description:
Risk Level: Low/Medium
Testing:
Docs Changes: NA
Release Notes:
Platform Specific Features:
Fixes #12419
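
A rough caller-side sketch of the proactive check flow described above, pieced together from the test code later in this PR (treat names and signatures as illustrative rather than the final API):

// Illustrative only: a caller asks the thread local overload state for quota before
// doing work (e.g. accepting a downstream connection), and releases it afterwards.
auto& overload_state = overload_manager.getThreadLocalOverloadState();
const bool allowed = overload_state.tryAllocateResource(
    Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 1);
if (!allowed) {
  // Over the configured limit: reject instead of proceeding.
  return;
}
// ... do the work that consumes the resource ...
overload_state.tryDeallocateResource(
    Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 1);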

@nezdolik requested a review from htuch as a code owner on September 24, 2021, 15:01
@nezdolik (Member Author) commented Sep 24, 2021

@antoniovicente @mattklein123 @KBaichoo as discussed some time ago in #15707, I broke the huge change into smaller parts.

@nezdolik changed the title from "Support reactive checks in overload manager Api" to "Support proactive checks in overload manager Api" on Sep 24, 2021
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik (Member Author):

Need to update overload_integration_test to support proactive resource monitors as well.

@antoniovicente (Contributor):

Unfortunately, I currently don't have a lot of cycles for Envoy reviews. Assigning to Kevin to do a first pass.

@KBaichoo (Contributor) left a comment:

Thanks for working on this. Here's a first pass.

Is there a doc on the rest of the chain of PRs that I can read to understand the bigger picture? Thanks

/**
* Returns current resource usage tracked by monitor.
*/
virtual int64_t currentResourceUsage() const PURE;
Contributor:

Naive question: are the semantics of this to block on that read, or will implementors get the most recent read?

Member Author:

Callers will get the most recent read; added an explanation to the API docs.
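
For intuition, a self-contained sketch (not the actual Envoy implementation; method names only mirror the interface quoted above) of a monitor whose current usage is a lock-free read of the last committed value rather than a blocking re-measurement:

#include <atomic>
#include <cstdint>

// Illustrative sketch of a proactive monitor backed by an atomic counter.
class AtomicProactiveMonitorSketch {
public:
  explicit AtomicProactiveMonitorSketch(int64_t max) : max_(max) {}

  bool tryAllocateResource(int64_t increment) {
    int64_t current = current_.load();
    // Commit the increment only while it stays within the configured maximum;
    // compare_exchange_weak reloads `current` on failure, so the limit is re-checked.
    while (current + increment <= max_) {
      if (current_.compare_exchange_weak(current, current + increment)) {
        return true;
      }
    }
    return false;
  }

  bool tryDeallocateResource(int64_t decrement) {
    int64_t current = current_.load();
    while (current >= decrement) {
      if (current_.compare_exchange_weak(current, current - decrement)) {
        return true;
      }
    }
    return false;
  }

  // Non-blocking: returns the most recently committed value, as discussed in this thread.
  int64_t currentResourceUsage() const { return current_.load(); }

private:
  const int64_t max_;
  std::atomic<int64_t> current_{0};
};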


class ProactiveResourceMonitor {
public:
ProactiveResourceMonitor() = default;
Contributor:

Seems like we're using both uint64_t and int64_t in this interface; perhaps we can standardize on one to avoid lots of casts. Thoughts?

Member Author:

switched all to int64_t

OverloadProactiveResourceName::GlobalDownstreamMaxConnections}};
};

using OverloadProactiveResourceNames = ConstSingleton<OverloadProactiveResourceNameValues>;
Contributor:

The enum and this alias are almost the same apart from an 's'. Perhaps there's a better name?

@@ -46,6 +67,32 @@ class ThreadLocalOverloadState : public ThreadLocal::ThreadLocalObject {
public:
// Get a thread-local reference to the value for the given action key.
virtual const OverloadActionState& getState(const std::string& action) PURE;
/**
* Invokes corresponding resource monitor to allocate resource for given resource monitor in
Contributor:

Invokes the corresponding

same below

@@ -46,6 +67,32 @@ class ThreadLocalOverloadState : public ThreadLocal::ThreadLocalObject {
public:
// Get a thread-local reference to the value for the given action key.
virtual const OverloadActionState& getState(const std::string& action) PURE;
/**
* Invokes corresponding resource monitor to allocate resource for given resource monitor in
* thread safe manner. Returns true if there is enough resource quota available and allocation has
Contributor:

in a thread safe manner.

Same below.

createProactiveResourceMonitor(const Protobuf::Message& config,
ResourceMonitorFactoryContext& context) PURE;

std::string category() const override { return "envoy.resource_monitors"; }
Contributor:

What's the difference between this factory and the resource monitor factory? (Sorry if I'm being dense, they just look the same and I wonder whether they could be collapsed.) Thanks

Member Author:

The only difference is the return type of the newly created monitor. We could instead have had two methods within ResourceMonitorFactory, one creating a regular and one a proactive resource monitor. The downside there is that each resource monitor config factory would always need to provide implementations for both methods, and one of them would always be "unimplemented". I see that as more evil (more code duplication) than duplicating the factory class here.
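
A sketch of the two factory interfaces being compared here, approximated from the diff context above (exact base class and pointer typedef names are assumptions):

// Existing factory: creates regular (periodically flushed) resource monitors.
class ResourceMonitorFactory : public Config::TypedFactory {
public:
  virtual ResourceMonitorPtr
  createResourceMonitor(const Protobuf::Message& config,
                        ResourceMonitorFactoryContext& context) PURE;

  std::string category() const override { return "envoy.resource_monitors"; }
};

// New factory: identical shape, but returns a proactive monitor, so each extension
// only implements the variant matching the monitor type it actually provides.
class ProactiveResourceMonitorFactory : public Config::TypedFactory {
public:
  virtual ProactiveResourceMonitorPtr
  createProactiveResourceMonitor(const Protobuf::Message& config,
                                 ResourceMonitorFactoryContext& context) PURE;

  std::string category() const override { return "envoy.resource_monitors"; }
};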

} else {
ENVOY_LOG_MISC(warn, " {Failed to allocate unknown proactive resource }");
// Resource monitor is not configured, pass through mode.
return true;
Contributor:

This seems not to be what the API specifies; i.e., I think the API says to return false if the resource is not registered.

See: https://github.com/envoyproxy/envoy/pull/18256/files#diff-2f26eb90eacceaaace91ff4f8dd5084b053cf741d7facc156b11b34c2c5c867dR73

Member Author:

good catch!
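
For reference, a sketch of the corrected behaviour (container and member names here are approximations of the surrounding diff, not verbatim Envoy code):

bool ThreadLocalOverloadStateImpl::tryAllocateResource(
    OverloadProactiveResourceName resource_name, int64_t increment) {
  const auto proactive_resource = proactive_resources_.find(resource_name);
  if (proactive_resource == proactive_resources_.end()) {
    // Unconfigured proactive resource: fail the allocation instead of silently
    // passing it through, matching the documented API contract.
    ENVOY_LOG_MISC(warn, "Failed to allocate unknown proactive resource");
    return false;
  }
  return proactive_resource->second.tryAllocateResource(increment);
}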

Comment on lines 51 to 55
if (proactive_resource->second.tryAllocateResource(increment)) {
return true;
} else {
return false;
}
Contributor:

Could just do return proactive_resource->second.tryAllocateResource(increment);

// proactive and regular resource monitors in configuration API. But internally we will maintain
// two distinct collections of proactive and regular resources. Proactive resources are not
// subject to periodic flushes and can be recalculated/updated on demand by invoking
// `tryAllocateResource/tryDeallocateResource` via thread local overload state.
Contributor:

This comment was very helpful; is there a doc somewhere where I can read up on these differences some more? Thanks!

Member Author:

@KBaichoo we don't have a design doc that describes the latest proposed implementation for proactive checks in detail; there were multiple iterations. I will prep a doc, good idea.
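
As a rough illustration of the split described in that code comment (field and type names below are illustrative, not the actual Envoy members):

// Illustrative only: the overload manager keeps the two kinds of monitors apart.
struct OverloadManagerStateSketch {
  // Regular monitors: sampled on the periodic flush timer; their readings drive
  // overload actions through the existing pressure/threshold machinery.
  std::vector<ResourceMonitorPtr> regular_monitors_;

  // Proactive monitors: never flushed; updated synchronously whenever a caller
  // invokes tryAllocateResource()/tryDeallocateResource() on the thread local
  // overload state.
  absl::flat_hash_map<OverloadProactiveResourceName, ProactiveResourceMonitorPtr>
      proactive_monitors_;
};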

@KBaichoo (Contributor) commented Oct 5, 2021

/wait

Kateryna Nezdolii added 2 commits October 15, 2021 14:11
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik (Member Author) commented Oct 15, 2021

@KBaichoo documented the motivation and technical solution here: https://docs.google.com/document/d/1utDdgePeX8jTTYazi4FO-M73niQk8BQ_6cvrm0bHeIo/edit?usp=sharing. Please let me know if it is detailed enough.

@nezdolik (Member Author):

Spent time working on the integration test and then realised it's not possible yet to have an overload integration test for proactive checks, because there is no client code using proactive checks yet.

Kateryna Nezdolii added 2 commits October 15, 2021 20:11
Signed-off-by: Kateryna Nezdolii <[email protected]>
KBaichoo previously approved these changes Oct 15, 2021
@KBaichoo (Contributor) left a comment:

Thanks for putting together a design doc; added some comments. This LGTM.

absl::optional<std::reference_wrapper<ResourceUpdateCallbacks>> callbacks_;
};

class FakeProactiveResourceMonitor : public ProactiveResourceMonitor {
Contributor:

Maybe consider refactoring this and the factory like FakeResourceMonitor. See:

class FakeResourceMonitor : public Server::ResourceMonitor {

Member Author:

@KBaichoo should refactoring go in the same PR or a separate one, wdyt? (I would make it a separate one.)

Contributor:

Whoops, missed this. I think a separate one sgtm.

@nezdolik (Member Author):

The clang-tidy check fails with an error which I don't think is related to this change:

clang-tidy check failed, potentially fixed by clang-apply-replacements:
Diagnostics:
- BuildDirectory: /build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy
  DiagnosticMessage:
    FileOffset: 313
    FilePath: ./source/common/common/hash.h
    Message: '''xxhash.h'' file not found'
    Replacements: []
  DiagnosticName: clang-diagnostic-error
  Level: Error
  Ranges:
  - {FileOffset: 313, FilePath: ./source/common/common/hash.h, Length: 10}

@nezdolik (Member Author):

/retest

@repokitteh-read-only:

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18256 (comment) was created by @nezdolik.

see: more, trace.

@yanavlasov self-assigned this on Nov 10, 2021
@rojkov (Member) commented Nov 15, 2021

The CI failure does indeed look unrelated. Could you perhaps merge the main branch again and see if it helps?

/wait

Kateryna Nezdolii added 2 commits November 19, 2021 10:56
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik (Member Author):

/retest

@repokitteh-read-only:

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18256 (comment) was created by @nezdolik.

see: more, trace.

@rojkov (Member) commented Nov 19, 2021

This time there is a new CI problem: #19060.

@nezdolik (Member Author):

/retest

@repokitteh-read-only:

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18256 (comment) was created by @nezdolik.

see: more, trace.

KBaichoo previously approved these changes Nov 22, 2021
@KBaichoo (Contributor) left a comment:

Thanks for bearing with us @nezdolik. @yanavlasov were you planning to review this as well? This LGTM apart from the nits pointed out.

OverloadProactiveResourceName::GlobalDownstreamMaxConnections));
bool resource_allocated = manager->getThreadLocalOverloadState().tryAllocateResource(
Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 1);
EXPECT_EQ(true, resource_allocated);
Contributor:

nit: EXPECT_TRUE()

EXPECT_EQ(true, resource_allocated);
resource_allocated = manager->getThreadLocalOverloadState().tryAllocateResource(
Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 3);
EXPECT_EQ(false, resource_allocated);
Contributor:

nit: EXPECT_FALSE()


bool resource_deallocated = manager->getThreadLocalOverloadState().tryDeallocateResource(
Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 1);
EXPECT_EQ(true, resource_deallocated);
Contributor:

likewise
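
With the three nits above applied, the assertions in the test read as follows (same calls as in the diff, only the matchers change):

bool resource_allocated = manager->getThreadLocalOverloadState().tryAllocateResource(
    Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 1);
EXPECT_TRUE(resource_allocated);

resource_allocated = manager->getThreadLocalOverloadState().tryAllocateResource(
    Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 3);
EXPECT_FALSE(resource_allocated);

bool resource_deallocated = manager->getThreadLocalOverloadState().tryDeallocateResource(
    Server::OverloadProactiveResourceName::GlobalDownstreamMaxConnections, 1);
EXPECT_TRUE(resource_deallocated);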

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik (Member Author):

Thanks @KBaichoo @yanavlasov, applied review comments.

@nezdolik (Member Author) commented Dec 1, 2021

@yanavlasov would you be able to take a look?

@yanavlasov merged commit 4b5eee6 into envoyproxy:main on Dec 2, 2021
Successfully merging this pull request may close these issues.

overload manager: overload signals based on number of downstream connections and active requests