Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify heartbeat jobs and tasks for simplicity #9258

Merged
merged 7 commits into from
Dec 19, 2018

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Nov 27, 2018

In heartbeat we have separate notions of jobs and tasks.

One thing that's hard to do today is to invoke one monitor from another due to the distinction. A monitor has a top-level job that spawns many tasks. For a task to spawn a job without this patch would require an awkward and confusing unwrap / rewrap. This paves the way for enhanced multi-part jobs like #9122 . The other thing this does is move the hoisting of errors from Job return values to fields up into the highest level creation functions. This was required for the composition of jobs as in #9122. It's also cleaner, since we use the error return value for errors up until we no longer can.

Just as important as the composition benefits are the code readibility benefits. This patch makes the code easier to follow. By removing these different notions of Job and Task navigating the execution of a monitor is made simpler since there are fewer types to reason about.

There is probably a tiny amount of extra execution overhead. Tasks only passed a single common.MapStr, where we now pass a *beat.Event, however, the benefit is that we have a more consistent and simple codebase.

@andrewvc andrewvc added in progress Pull request is currently in progress. Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team labels Nov 27, 2018
heartbeat/monitors/util.go Show resolved Hide resolved
heartbeat/monitors/util.go Show resolved Hide resolved
heartbeat/monitors/util.go Show resolved Hide resolved
heartbeat/monitors/util.go Show resolved Hide resolved
libbeat/cmd/instance/imports.go Outdated Show resolved Hide resolved
heartbeat/monitors/job.go Outdated Show resolved Hide resolved
heartbeat/monitors/job.go Show resolved Hide resolved
heartbeat/monitors/job.go Outdated Show resolved Hide resolved
heartbeat/monitors/job.go Show resolved Hide resolved
heartbeat/monitors/active/http/task.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/dialchain/builder.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/dialchain/builder.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/http/task.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/http/task.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/http/http.go Outdated Show resolved Hide resolved
func AfterJob(j Job, after func(*beat.Event, []Job, error) (*beat.Event, []Job, error)) Job {

return CreateNamedJob(
j.Name(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just copy the name, or proxy the execution here? I'm thinking there's a good argument for either one, but the copy is a little simpler. These names should be pretty stable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the other option you propose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option is to have name be a function rather than a field. Every time the name was asked for we'd invoke the child job. That would allow names to be more dynamic and lazy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH, I don't know if we actually need that. This is def simpler.

return CreateNamedJob(
job.Name(),
func() (*beat.Event, []Job, error) {
start := time.Now()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the original behavior may have been incorrect, starting the timer when the Job was created, not when it was invoked. This behavior, at any rate, is def correct.

@andrewvc andrewvc added review and removed in progress Pull request is currently in progress. labels Nov 27, 2018
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments and questions. I think the part I didn't fully follow is the move around of the settings.

jobName := jobName(typ, scheme, endpoint.Host, endpoint.Ports)
settings := monitors.MakeHostJobSettings(jobName, endpoint.Host, mode).WithFields(fields)
jobID := jobID(typ, scheme, endpoint.Host, endpoint.Ports)
settings := monitors.MakeHostJobSettings(jobID, endpoint.Host, mode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for myself: Trying to understand why WithFields was moved from there to line 204.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because those fields used to be part of the job settings. A general theme in this patch is applying Job level stuff higher up in the call stack, whereas before those options were passed in to a deep point, then used from annotate. The wrapping strategy now used is more clear IMHO.

heartbeat/monitors/active/dialchain/net.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/dialchain/socks5.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/http/http.go Outdated Show resolved Hide resolved
heartbeat/monitors/active/http/task.go Outdated Show resolved Hide resolved
func AfterJob(j Job, after func(*beat.Event, []Job, error) (*beat.Event, []Job, error)) Job {

return CreateNamedJob(
j.Name(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the other option you propose.

// will cause it to run with the given scheduler until Stop() is called.
type Monitor struct {
name string
config *common.Config
registrar *pluginsReg
uniqueName string
scheduler *scheduler.Scheduler
jobTasks []*task
jobTasks []*configuredJob
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially confusing that we have a variable jobTasks with the task part inside? Not sure i have good alternatives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that's really confusing, that's a holdover. I think we should just call it configuredJobs

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or just jobs :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I don't like about jobs is that then we can call jobs[0].job. I'm going with configuredJobs for now. Glad to discuss further if you think that isn't a great reason.

"status": status,
},
MergeEventFields(event, common.MapStr{
"error": look.Reason(err),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the event look like this creates? Just want to make sure error is an object and not a keyword.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an object with two fields type, and message. This should be identical to the existing behavior (and we have tests backing this)

heartbeat/monitors/util.go Outdated Show resolved Hide resolved
return MakeCont(func() (common.MapStr, []TaskRunner, error) {
event, cont, err := r.Run()
if event != nil {
event = event.Clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this clone here is important to prevent races but I can't find it in the code anymore in MergeEventFields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm not clear if we actually do need this. I could understand us cloning the passed in fields, but cloning the event seems weird. That should not have any shared values.

I don't think we need the clone.

@urso your thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make to double check this. Can you also check when this was introduced and the PR comments for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was introduced in #6950 , which did this to prevent races among shared structures in the case continuations were created. After a good amount of spelunking I don't see any races.

However, reasoning about those races is really hard to do, you basically need to bootstrap all the code into your head and assume that you understand it correctly.

That said, I don't think the approach of fixing it in this spot is necessarily correct. This assumes that WithFields is the only place mutation happens.

I'd like to think about this for a bit and see if there's a more bullet-proof solution I can think of (short of moving toward immutable datastructures).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you see as disadvantages of 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I don't think 1 and 2 are competing proposals. These are separate issues with separate solutions. I don't see any disadvantages with that strategy aside from it always being a bit awkward for functions to mutate their arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your suggestion to move forward and not reintroduce the potential race?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to re-introduce the clone in this PR and then introduce a subsequent PR that restructures things so as to make it unnecessary. That would basically mean rewriting all the function signatures for everything here. I got pretty far with it on a new branch but I don't think it's worth the effort right at this moment, and I think it'd make this PR far too unwieldy to review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on your approach. Like this we can also discuss the follow up changes separately.

heartbeat/monitors/task.go Outdated Show resolved Hide resolved
@andrewvc
Copy link
Contributor Author

@urso @ruflin pushed a set of improvements incorporating your feedback. Awaiting y'alls input on a few outstanding things before we can wrap this one up.

@andrewvc andrewvc closed this Nov 29, 2018
@andrewvc andrewvc reopened this Nov 29, 2018
@ruflin ruflin added Team:obs-ds-hosted-services Label for the Observability Hosted Services team and removed Team:obs-ds-hosted-services Label for the Observability Hosted Services team labels Dec 3, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/uptime

heartbeat/monitors/job.go Outdated Show resolved Hide resolved
heartbeat/monitors/job.go Outdated Show resolved Hide resolved
heartbeat/monitors/job.go Outdated Show resolved Hide resolved
heartbeat/monitors/job.go Outdated Show resolved Hide resolved
@andrewvc
Copy link
Contributor Author

@urso @ruflin I wound up discovering that it was simple enough just to finish the refactor of method signatures. All tests are passing.

TBH this PR turned into a monster. It was intended as a supporting PR for the AWS autodiscovery which may not happen. It's still a good improvement in code quality, but I wouldn't have started it if I'd known we wouldn't release that feature. Given that it's here, however, it is worth a merge.

@ruflin
Copy link
Contributor

ruflin commented Dec 14, 2018

Looks like some updates to the tests are needed.

@andrewvc
Copy link
Contributor Author

OK, looks like there was a small behavior change where we still logged an http.rtt.total field in the case of an I/O error. Fixed with the latest push.

@ruflin
Copy link
Contributor

ruflin commented Dec 18, 2018

Can you rebase again on master. There was an issue introduce in master which made the build fail which should be fixed now. Like this also Travis should run through.

Overall LGTM.

In heartbeat we have separate notions of jobs and tasks.

This can make following the code and sharing the code among monitors confusing. This patch unifies the two. There is a tiny amount of extra execution overhead.

Tasks only passed a single `common.MapStr`, where we now pass a `*beat.Event`, however, the benefit is that we have a more consistent and simple codebase.
This lets us reason about concurrency more easily since jobs are now explicitly
about mutation. This should also, theoretically, cut down on garbage. Practically
we still use a lot of map merging however.
@andrewvc
Copy link
Contributor Author

Woohoo! Looks like, with the exception of the broken mac workers, CI is happy!

@andrewvc andrewvc merged commit 37a5b1f into elastic:master Dec 19, 2018
andrewvc added a commit to andrewvc/beats that referenced this pull request Dec 21, 2018
Unify heartbeat jobs and tasks for simplicity

In heartbeat we have separate notions of jobs and tasks.

This can make following the code and sharing the code among monitors confusing. This patch unifies the two. There is a tiny amount of extra execution overhead.

Tasks only passed a single `common.MapStr`, where we now pass a `*beat.Event`, however, the benefit is that we have a more consistent and simple codebase.

Also makes events arguments to Jobs rather than return values.

This lets us reason about concurrency more easily since jobs are now explicitly
about mutation, and accidentally sharing data between tasks is made more difficult.

(cherry picked from commit 37a5b1f)
andrewvc added a commit that referenced this pull request Dec 21, 2018
Unify heartbeat jobs and tasks for simplicity

In heartbeat we have separate notions of jobs and tasks.

This can make following the code and sharing the code among monitors confusing. This patch unifies the two. There is a tiny amount of extra execution overhead.

Tasks only passed a single `common.MapStr`, where we now pass a `*beat.Event`, however, the benefit is that we have a more consistent and simple codebase.

Also makes events arguments to Jobs rather than return values.

This lets us reason about concurrency more easily since jobs are now explicitly
about mutation, and accidentally sharing data between tasks is made more difficult.

(cherry picked from commit 37a5b1f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants