-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify heartbeat jobs and tasks for simplicity #9258
Conversation
func AfterJob(j Job, after func(*beat.Event, []Job, error) (*beat.Event, []Job, error)) Job { | ||
|
||
return CreateNamedJob( | ||
j.Name(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just copy the name, or proxy the execution here? I'm thinking there's a good argument for either one, but the copy is a little simpler. These names should be pretty stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the other option you propose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option is to have name be a function rather than a field. Every time the name was asked for we'd invoke the child job. That would allow names to be more dynamic and lazy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH, I don't know if we actually need that. This is def simpler.
return CreateNamedJob( | ||
job.Name(), | ||
func() (*beat.Event, []Job, error) { | ||
start := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the original behavior may have been incorrect, starting the timer when the Job was created, not when it was invoked. This behavior, at any rate, is def correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments and questions. I think the part I didn't fully follow is the move around of the settings.
jobName := jobName(typ, scheme, endpoint.Host, endpoint.Ports) | ||
settings := monitors.MakeHostJobSettings(jobName, endpoint.Host, mode).WithFields(fields) | ||
jobID := jobID(typ, scheme, endpoint.Host, endpoint.Ports) | ||
settings := monitors.MakeHostJobSettings(jobID, endpoint.Host, mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for myself: Trying to understand why WithFields
was moved from there to line 204.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because those fields used to be part of the job settings. A general theme in this patch is applying Job level stuff higher up in the call stack, whereas before those options were passed in to a deep point, then used from annotate
. The wrapping strategy now used is more clear IMHO.
func AfterJob(j Job, after func(*beat.Event, []Job, error) (*beat.Event, []Job, error)) Job { | ||
|
||
return CreateNamedJob( | ||
j.Name(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the other option you propose.
heartbeat/monitors/monitor.go
Outdated
// will cause it to run with the given scheduler until Stop() is called. | ||
type Monitor struct { | ||
name string | ||
config *common.Config | ||
registrar *pluginsReg | ||
uniqueName string | ||
scheduler *scheduler.Scheduler | ||
jobTasks []*task | ||
jobTasks []*configuredJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially confusing that we have a variable jobTasks
with the task part inside? Not sure i have good alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree that's really confusing, that's a holdover. I think we should just call it configuredJobs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or just jobs
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I don't like about jobs
is that then we can call jobs[0].job
. I'm going with configuredJobs
for now. Glad to discuss further if you think that isn't a great reason.
"status": status, | ||
}, | ||
MergeEventFields(event, common.MapStr{ | ||
"error": look.Reason(err), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the event look like this creates? Just want to make sure error
is an object and not a keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns an object with two fields type
, and message
. This should be identical to the existing behavior (and we have tests backing this)
return MakeCont(func() (common.MapStr, []TaskRunner, error) { | ||
event, cont, err := r.Run() | ||
if event != nil { | ||
event = event.Clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this clone here is important to prevent races but I can't find it in the code anymore in MergeEventFields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I'm not clear if we actually do need this. I could understand us cloning the passed in fields
, but cloning the event seems weird. That should not have any shared values.
I don't think we need the clone.
@urso your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make to double check this. Can you also check when this was introduced and the PR comments for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was introduced in #6950 , which did this to prevent races among shared structures in the case continuations were created. After a good amount of spelunking I don't see any races.
However, reasoning about those races is really hard to do, you basically need to bootstrap all the code into your head and assume that you understand it correctly.
That said, I don't think the approach of fixing it in this spot is necessarily correct. This assumes that WithFields
is the only place mutation happens.
I'd like to think about this for a bit and see if there's a more bullet-proof solution I can think of (short of moving toward immutable datastructures).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you see as disadvantages of 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I don't think 1 and 2 are competing proposals. These are separate issues with separate solutions. I don't see any disadvantages with that strategy aside from it always being a bit awkward for functions to mutate their arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your suggestion to move forward and not reintroduce the potential race?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to re-introduce the clone in this PR and then introduce a subsequent PR that restructures things so as to make it unnecessary. That would basically mean rewriting all the function signatures for everything here. I got pretty far with it on a new branch but I don't think it's worth the effort right at this moment, and I think it'd make this PR far too unwieldy to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on your approach. Like this we can also discuss the follow up changes separately.
Pinging @elastic/uptime |
@urso @ruflin I wound up discovering that it was simple enough just to finish the refactor of method signatures. All tests are passing. TBH this PR turned into a monster. It was intended as a supporting PR for the AWS autodiscovery which may not happen. It's still a good improvement in code quality, but I wouldn't have started it if I'd known we wouldn't release that feature. Given that it's here, however, it is worth a merge. |
Looks like some updates to the tests are needed. |
OK, looks like there was a small behavior change where we still logged an |
Can you rebase again on master. There was an issue introduce in master which made the build fail which should be fixed now. Like this also Travis should run through. Overall LGTM. |
In heartbeat we have separate notions of jobs and tasks. This can make following the code and sharing the code among monitors confusing. This patch unifies the two. There is a tiny amount of extra execution overhead. Tasks only passed a single `common.MapStr`, where we now pass a `*beat.Event`, however, the benefit is that we have a more consistent and simple codebase.
This lets us reason about concurrency more easily since jobs are now explicitly about mutation. This should also, theoretically, cut down on garbage. Practically we still use a lot of map merging however.
Woohoo! Looks like, with the exception of the broken mac workers, CI is happy! |
Unify heartbeat jobs and tasks for simplicity In heartbeat we have separate notions of jobs and tasks. This can make following the code and sharing the code among monitors confusing. This patch unifies the two. There is a tiny amount of extra execution overhead. Tasks only passed a single `common.MapStr`, where we now pass a `*beat.Event`, however, the benefit is that we have a more consistent and simple codebase. Also makes events arguments to Jobs rather than return values. This lets us reason about concurrency more easily since jobs are now explicitly about mutation, and accidentally sharing data between tasks is made more difficult. (cherry picked from commit 37a5b1f)
Unify heartbeat jobs and tasks for simplicity In heartbeat we have separate notions of jobs and tasks. This can make following the code and sharing the code among monitors confusing. This patch unifies the two. There is a tiny amount of extra execution overhead. Tasks only passed a single `common.MapStr`, where we now pass a `*beat.Event`, however, the benefit is that we have a more consistent and simple codebase. Also makes events arguments to Jobs rather than return values. This lets us reason about concurrency more easily since jobs are now explicitly about mutation, and accidentally sharing data between tasks is made more difficult. (cherry picked from commit 37a5b1f)
In heartbeat we have separate notions of jobs and tasks.
One thing that's hard to do today is to invoke one monitor from another due to the distinction. A monitor has a top-level
job
that spawns manytasks
. For a task to spawn ajob
without this patch would require an awkward and confusing unwrap / rewrap. This paves the way for enhanced multi-part jobs like #9122 . The other thing this does is move the hoisting of errors from Job return values to fields up into the highest level creation functions. This was required for the composition of jobs as in #9122. It's also cleaner, since we use theerror
return value for errors up until we no longer can.Just as important as the composition benefits are the code readibility benefits. This patch makes the code easier to follow. By removing these different notions of Job and Task navigating the execution of a monitor is made simpler since there are fewer types to reason about.
There is probably a tiny amount of extra execution overhead. Tasks only passed a single
common.MapStr
, where we now pass a*beat.Event
, however, the benefit is that we have a more consistent and simple codebase.