
Discussion: Adopting a pattern/library for managing long lived go-routines #5810

Open
hannahhoward opened this issue Nov 30, 2018 · 18 comments
Labels
need/community-input Needs input from the wider community

Comments

@hannahhoward
Contributor

hannahhoward commented Nov 30, 2018

The issue

In many parts of go-ipfs, we launch goroutines that run for a long time -- essentially for as long as the ipfs daemon is running, or for the lifespan of a command or session. These are usually run-loops with a structure like this:

for {
  select {
  case someData := <-someChannel:
    doSomething(someData)
  case ...
  case <-doneIndicator:
    return
  }
}

We need to track these goroutines enough to make sure they eventually quit -- otherwise we're leaking memory. We might also need to track them so we can restart them if something goes wrong.

Appeal to authority:
https://dave.cheney.net/2016/12/22/never-start-a-goroutine-without-knowing-how-it-will-stop
https://rakyll.org/leakingctx/

Prior Art

  1. OS Processes - The operating system has a notion of spawning, tracking, and closing long-lived routines -- they're called processes!
  2. Erlang/OTP - Other concurrency-oriented languages are much more explicit in calling out these long-lived routines and providing mechanisms for spawning and managing them. Erlang maintains supervision trees -- essentially workers who do work (usually a GenServer or other Behavior) and supervisors who track, close, and restart those workers (see the rough Go sketch after this list).
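
For readers less familiar with OTP, here's a rough, hypothetical Go sketch of the supervisor idea (names are made up; restart strategies and backoff policy are glossed over): a supervisor owns a worker, restarts it when it fails, and gives up only when it is itself told to stop.

// supervise restarts worker whenever it exits with an error, and stops
// restarting once ctx is cancelled. Assumes "context" and "time" are imported.
func supervise(ctx context.Context, worker func(context.Context) error) {
	for {
		if err := worker(ctx); err == nil {
			return // clean exit: nothing to restart
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // crude fixed backoff before restarting
		}
	}
}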

Possible Solutions

Contexts -- a.k.a -- What we're (mostly) doing now

The most common pattern in the existing code is to just use contexts, scoped either to the lifespan of the daemon or to the lifespan of an incoming API request.

Benefits:

  1. Contexts are standard
  2. We're already using them

Downsides:

  1. The context API's semantics are a bit weird for the goal here. If I want to start a goroutine and have the ability to shut it down, I'd do something like this:

Start up childRoutine:

childCtx, childCancel = context.WithCancel(parentCtx)
go childRoutine(childCtx)

Shut down childRoutine:

childCancel()

childRoutine:

func childRoutine(ctx context.Context) {
   for {
      select {
      ...
      case <-ctx.Done():
         return
      }
   }
}

It works, it's just a bit clunky:

  1. How do I know what my cancel function will actually cancel? Especially if I want to pass it around. A cancel function is a pretty weird way to refer to a goroutine I might want to kill.
  2. I have to rely on my child routine to do the right thing and listen for the cancel. No SIGKILL here :)
  3. Cancel is more of a SIGTERM without a way to wait for it to be handled -- after I call cancel, my child routine will shut down IN THE FUTURE.
  4. If there are hierarchies of routines, they are very implicit.

One possible solution would simply be to acknowledge that we're already doing this and establish a best-practices doc of some sort to avoid some of the clunky inconsistencies in the code.
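
For illustration, such a doc might standardize on a small helper like this (hypothetical name) that always pairs the context's cancel with a WaitGroup, so the owner can both ask a goroutine to stop and wait for it to actually return (addressing downside 3 above):

// goWithLifetime starts run as a goroutine whose lifetime is bounded by ctx
// and registers it on wg so the owner can wait for it to finish.
func goWithLifetime(ctx context.Context, wg *sync.WaitGroup, run func(context.Context)) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		run(ctx)
	}()
}

// usage
childCtx, childCancel := context.WithCancel(parentCtx)
var wg sync.WaitGroup
goWithLifetime(childCtx, &wg, childRoutine)
...
childCancel() // ask the child to stop
wg.Wait()     // block until it has actually stopped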

GoProcess

GoProcess is a lightweight library written by our fearless leader @jbenet and used sporadically in parts of the project. It's primarily inspired by the OS model for spawning and managing processes.

Benefits:

  1. Syntax is cleaner than contexts, without adding a whole lot more (rough usage sketch at the end of this section)
  2. It's already in use in parts of go-bitswap
  3. Pretty lightweight

Downsides:

  1. It's only able to spawn processes and shut them down in groups; it lacks some of the more classic supervision patterns of the Erlang/OTP model
  2. Not widely adopted -- we'd be using a non-standard library, and Juan is likely too busy to maintain it, so we'd be doing that ourselves
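
For flavor, a typical goprocess run-loop looks roughly like this (check the goprocess docs for the exact API; someChannel and doSomething are the placeholders from above):

proc := goprocess.Go(func(p goprocess.Process) {
	for {
		select {
		case someData := <-someChannel:
			doSomething(someData)
		case <-p.Closing(): // the process has been asked to shut down
			return
		}
	}
})
...
proc.Close() // returns once the run-loop (and any child processes) have finished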

go-sup

go-sup implements supervision trees for Go and was created by our own @warpfork. In comparison to goprocess, it provides more control over starting and managing processes, provides some mechanism for tracking errors, and includes some basic behaviors for each task.

Benefits:

  1. More control and tracking of processes
  2. Handles errors more explicitly, and could be extended to handle restarts
  3. Uses idioms from Erlang/OTP, which was the gold standard for concurrent programming at least until Go and Rust showed up :)
  4. Unlike Juan, @warpfork is actively contributing to the team and might be able to be more responsive.

Downsides:

  1. Non-standard and not widely adopted, like goprocess. Under the hood, it also uses other libraries @warpfork wrote.
  2. Syntax and concepts are more heavyweight, so it requires more spinning up on how it works.

Maybe @warpfork can elaborate on his library in the comments.

Use Widely Adopted Library

There are a couple of more widely adopted libraries available for doing supervision:
Suture - an implementation of supervision trees for Go, with ~800 stars, stability, a blog post, and docs
ProtoActor - a full implementation of the actor model for Go, with ~2400 stars, created by the author of Akka.NET (a well-known actor-model library)

Benefits:

  1. Maturity & Adoption
  2. Don't have to handle development

Downsides:

  1. A third-party dependency not maintained by us, and all that entails

Suture in particular looks relatively lightweight and reasonable.
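
For a sense of its shape: a Suture service is just a value with Serve and Stop methods that the supervisor runs and restarts. A rough sketch (verify the exact interface against Suture's docs before relying on it):

type worker struct {
	stop chan struct{}
}

func (w *worker) Serve() {
	for {
		select {
		case <-w.stop:
			return
		// case someData := <-someChannel: doSomething(someData)
		}
	}
}

func (w *worker) Stop() { close(w.stop) }

sup := suture.NewSimple("root")
sup.Add(&worker{stop: make(chan struct{})})
sup.ServeBackground() // supervises the worker, restarting it if it fails
...
sup.Stop()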

Discuss!

hannahhoward added the need/community-input (Needs input from the wider community) label on Nov 30, 2018
@dbaarda

dbaarda commented Nov 30, 2018

This reminded me of the following article:

https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/

It is a bit over-hyped for something that is basically a pretty simple design pattern, but he does argue it well. It should be fairly simple to implement a Go equivalent of his trio library to make using this pattern easier.
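
For the curious, a minimal Go analogue of the article's "nursery" isn't much code (hypothetical sketch; assumes "context" and "sync" are imported): goroutines may only be spawned inside a scope that refuses to return until all of them have returned.

// WithNursery runs body, handing it a spawn function; it does not return
// until every goroutine started via spawn has returned.
func WithNursery(ctx context.Context, body func(spawn func(task func(context.Context)))) {
	var wg sync.WaitGroup
	spawn := func(task func(context.Context)) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			task(ctx)
		}()
	}
	body(spawn)
	wg.Wait()
}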

@hannahhoward
Contributor Author

hannahhoward commented Nov 30, 2018

Ok so someone pointed out to me that the line:

Unlike Juan, @warpfork is actively contributing to the team and might be able to be more responsive.

Could be read as saying @jbenet is not contributing to IPFS even though he invented it!

I just mean Juan is contributing now by leading all of Protocol Labs, IPFS, and all associated projects so working on a process library probably isn't in his current wheelhouse.

Oops! Thank you Juan for making IPFS and leading us forward, don't fire me. 😉

@schomatis
Contributor

This reminded me of the following article;

https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/

Great article, thanks for sharing it, very well crafted argument. I think we should have a similar mindset when designing our concurrent operations no matter what framework we end up using.

@b5
Contributor

b5 commented Dec 4, 2018

@hannahhoward this summary of goroutine management is delightful. To add one point: I think @lanzafame & @warpfork hit on a key concern while I was reading through go-sup (which this issue pointed me to!): it sounds like suture doesn't make use of context, which might make observability a little tougher.

@Stebalien
Member

There are really two different things we need:

  1. Subprocesses: managing the lifetime of long-running goroutines. For example, workers, one-off tasks, etc. This will likely have a recursive structure where every process manages its subprocesses.
  2. Services: dependency resolution, in-order shutdown, etc. For example, the datastore, namesys, routing, etc. This will likely have a flat/centralized structure where we'll have a centralized service/dependency manager.

Unfortunately, we currently use goprocess (designed for subprocesses) when trying to manage services. We've been looking into ways to fix this in go-libp2p (e.g., by using some form of dependency injection system) but we haven't found a system that supports both complex dependency injection and in-order shutdown.

Key requirements in both cases:

  1. Allows us to use other go libraries without having to modify them.
  2. Allows other projects to use go-ipfs without having to adopt our choice of service/process manager.

To this end, I'd like to find a system that just defines a "process" to be something with a Close() error method (and maybe an optional Context() context.Context method).
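
In code, that minimal contract is basically just io.Closer (the Service names here are hypothetical):

// The minimal "process": anything we can shut down.
type Service interface {
	io.Closer // Close() error
}

// Optionally, for observability:
type ServiceWithContext interface {
	io.Closer
	Context() context.Context
}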


Really, for subprocess management, I think goprocess is pretty good. However, it's still pretty invasive so I'd like to try paring it down to something like:

// Process is a concrete process type that should be used *internally* by a process. We tend to
type Process struct{}

func (p *Process) AddChild(c io.Closer) {}
func (p *Process) Go(fn func(p *Process)) {}
func (p *Process) Context() context.Context { return nil }
func (p *Process) Close() error { return nil }

And adding new features only as needed instead of adding anything that we think we might want someday.

We should also look through: https://github.com/avelino/awesome-go#goroutines

@hannahhoward
Contributor Author

This is perhaps a side discussion, but one thing I've also been thinking about is goroutine architecture. Since my background is almost all functional programming and also Erlang, where processes have isolated memory, I find myself wanting to lean on channels over traditional concurrency primitives like mutexes. Then I read this: https://github.com/golang/go/wiki/MutexOrChannel which makes me think I'm doing it wrong. Just pointing out there's managing goroutines and then there's the patterns you use to actually build goroutines.

@Stebalien
Member

Note from #5868: I'd like to stop using contexts for process cancellation. They were designed for aborting requests where either:

  1. You know that the request has been aborted because you're waiting for the handler to return.
  2. You don't care.

This is why we ended up with goprocess.

@warpfork
Member

Can I refine that to "I'd like to stop using contexts for process cancellation while not having a way to wait for the cancelled process to fully return"?

IMO the contexts-for-cancellation ship has sailed. That is how the entire rest of the go ecosystem works at this point. And it does, for the most part, do cancellation.

The useful part is the additional semantic of having a consistent way to wait on things to return after they've acknowledged the cancel and cleaned up. And we can have that as additional systems built to work well with Context.

@Stebalien
Member

IMO the contexts-for-cancellation ship has sailed. That is how the entire rest of the go ecosystem works at this point. And it does, for the most part, do cancellation.

Not quite. Everyone else uses contexts for request cancellation (and sometimes worker cancellation). However, we're passing them to every service. Unfortunately, this is causing issues like #5738 (can't shut down in order because we're shutting down by canceling a global context).

@Stebalien
Member

Stebalien commented May 11, 2019

Experiment:

func NewMonitor(ctx context.Context) *Monitor {
	ctx, cancel := context.WithCancel(ctx)
	return &Monitor{
		context: ctx,
		cancel:  cancel,
	}
}

type Monitor struct {
	parent    *Monitor
	context   context.Context
	cancel    context.CancelFunc
	wg        sync.WaitGroup
	closeOnce sync.Once
}

func (p *Monitor) Child() *Monitor {
	p.wg.Add(1)
	child := NewMonitor(p.context)
	child.parent = p
	return child
}

func (p *Monitor) Close() error {
	p.closeOnce.Do(func() {
		p.cancel()
		p.wg.Wait()
		if p.parent != nil {
			p.parent.wg.Done()
			p.parent = nil
		}
	})
	return nil
}

func (p *Monitor) Go(fn func(ctx context.Context)) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		fn(p.context)
	}()
}

This doesn't allow anything fancy like true sub-processes but it's a start. Unfortunately, I then tried applying this to go-libp2p-swarm and failed completely (we use context + wg there).
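
A usage sketch of the Monitor above (note that Close on a parent blocks until each child Monitor has also been Closed, because Child bumps the parent's WaitGroup):

root := NewMonitor(context.Background())

root.Go(func(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		// case someData := <-someChannel: doSomething(someData)
		}
	}
})

child := root.Child()
child.Go(func(ctx context.Context) { <-ctx.Done() })

...
child.Close() // cancels the child's context and waits for its goroutines
root.Close()  // cancels the shared context, waits for root's goroutines and children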

@hannahhoward
Contributor Author

hannahhoward commented May 13, 2019

Just something to consider if we wanted to remove the assumption of context from the mix:

func NewMonitor() *Monitor {
	return &Monitor{
		children: make([]childRoutine, 0),
	}
}

type Monitor struct {
	children  []childRoutine
	wg        sync.WaitGroup
	closeOnce sync.Once
}

func (p *Monitor) Add(execute func(), interrupt func()) {
	p.children = append(p.children, childRoutine{execute, interrupt})
}

func (p *Monitor) Start() {
	p.wg.Add(len(p.children))
	for _, cr := range p.children {
		go func(cr childRoutine) {
			defer p.wg.Done()
			cr.execute()
		}(cr)
	}
}
func (p *Monitor) Shutdown() error {
	p.closeOnce.Do(func() {
		for _, cr := range p.children {
			cr.interrupt()
		}
		p.wg.Wait()
	})
	return nil
}

type childRoutine struct {
	execute   func()
	interrupt func()
}
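
And a usage sketch, where each child supplies its own interrupt mechanism instead of listening on a context:

m := NewMonitor()

stop := make(chan struct{})
m.Add(
	func() { // execute: the run-loop
		for {
			select {
			case <-stop:
				return
			// case someData := <-someChannel: doSomething(someData)
			}
		}
	},
	func() { close(stop) }, // interrupt: ask the run-loop to return
)

m.Start()
...
m.Shutdown() // interrupts every child, then waits for them all to return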

@Stebalien
Member

Given that Go uses contexts everywhere, I'd rather just embrace them.

@lanzafame
Contributor

lanzafame commented May 14, 2019 via email

@hannahhoward
Contributor Author

hannahhoward commented May 15, 2019

It seems like if we're embracing context, it'd be really cool if we could make it so the supervision can be passed transparently through the context, i.e. an API that looks like:

ctx := context.Background()
ctx, cancel, waitForShutdown := ourlibrary.WithMonitoring(ctx)
...
ourlibrary.Go(ctx, func(ctx context.Context) {
...
})
...
subCtx, subCancel, subWaitForShutdown := ourlibrary.WithMonitoring(ctx)
// only necessary if you want to wait on a smaller scale -- otherwise, context.WithCancel works fine
...
cancel()          // normal cancellation
waitForShutdown() // waits for all monitored routines and child contexts to fully shut down

Does this make sense?
I realize it will probably end up abusing context.WithValue under the hood, but it could make for some easy crossing of API boundaries without a lot of ceremony.
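
To make that concrete, here is one possible (hypothetical) shape for WithMonitoring and Go, stashing the monitor in the context via WithValue as suspected; it deliberately omits the harder part of propagating waits up through nested monitors:

type monitorKey struct{}

type monitor struct {
	wg sync.WaitGroup
}

// WithMonitoring derives a cancellable context that also carries a monitor,
// and returns a waitForShutdown function backed by the monitor's WaitGroup.
func WithMonitoring(ctx context.Context) (context.Context, context.CancelFunc, func()) {
	ctx, cancel := context.WithCancel(ctx)
	m := &monitor{}
	return context.WithValue(ctx, monitorKey{}, m), cancel, m.wg.Wait
}

// Go starts fn and, if ctx carries a monitor, registers it so that
// waitForShutdown blocks until fn has returned.
func Go(ctx context.Context, fn func(context.Context)) {
	m, _ := ctx.Value(monitorKey{}).(*monitor)
	if m != nil {
		m.wg.Add(1)
	}
	go func() {
		if m != nil {
			defer m.wg.Done()
		}
		fn(ctx)
	}()
}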

@Stebalien
Member

Yeah, I agree we should be stashing some kind of monitor/process in the context.

@Stebalien
Member

A concern I just brought up in the meeting WRT stashing something in the context: I'm worried about introducing accidental dependencies between services, where my service decides to wait for your service to stop.

However, thinking about it a bit, I don't think this'll actually be all that much of an issue given that any sub-processes using that context will be canceled anyways (i.e., if we use the wrong context, bad shit will happen regardless).


My other concern is the inverse: we need to be careful about somehow dropping the monitor without realizing it.

ctx, cancel := context.WithCancel(context.Background())
go func() {
  <-ctxWithMonitor.Done()
  cancel()
}()

This is a common pattern when joining multiple contexts. The solution with your API above would be:

ctx, cancel, wait := ourlibrary.WithMonitoring(context.Background())
ourlibrary.Go(ctxWithMonitor, func(ctx context.Context) {
  <-ctx.Done()
  cancel()
  wait()
})

Which is probably fine.

@b5
Contributor

b5 commented May 23, 2019

Do these stashed-monitoring processes need to work across golang-to-golang process boundaries?

I'm thinking of go-core-http-api calls to a go-ipfs process as an example.

If so, it might be worth having ourlibrary be a superset of / interoperate with tools that encode context details into network metadata (HTTP headers, libp2p... something's) to cut down on the awful boilerplate of reconstructing contexts across network boundaries.

@Stebalien
Member

Do these stashed-monitoring processes need to work across golang-to-golang process boundaries?

I think you'd just wait for the HTTP request to finish on one side and block the HTTP request from finishing on the other. There isn't really any metadata here.
