
Discussion: Adopting a pattern/library for managing long lived go-routines #5810

Open
hannahhoward opened this issue Nov 30, 2018 · 18 comments
Labels
need/community-input Needs input from the wider community

Comments

@hannahhoward
Contributor

hannahhoward commented Nov 30, 2018

The issue

In many parts of go-ipfs, we launch goroutines that run for a long time -- essentially for as long as the ipfs daemon is running, or for the lifespan of a command or session. These are usually run-loops with a structure like this:

for {
  select {
  case someData := <-someChannel:
    doSomething(someData)
  case ...
  case <-doneIndicator:
    return
  }
}

We need to track these goroutines enough to make sure they eventually quit -- otherwise we're leaking memory. We might also need to track them so we can restart them if something goes wrong.

Appeal to authority:
https://dave.cheney.net/2016/12/22/never-start-a-goroutine-without-knowing-how-it-will-stop
https://rakyll.org/leakingctx/

Prior Art

  1. OS Processes - The operating system has a notion of spawning, tracking, and closing long-lived routines -- they're called processes!
  2. Erlang/OTP - Other concurrency-oriented languages are much more explicit in calling out these long-lived routines and providing mechanisms for spawning and managing them. Erlang maintains supervision trees -- essentially workers who do work (usually a GenServer or other Behavior) and supervisors who track, close, and restart those workers (see the rough Go sketch after this list).
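
For readers less familiar with OTP, here's a rough, hypothetical Go sketch of the supervisor idea (names are made up; restart strategies and backoff policy are glossed over): a supervisor owns a worker, restarts it when it fails, and gives up only when it is itself told to stop.

// supervise restarts worker whenever it exits with an error, and stops
// restarting once ctx is cancelled. Assumes "context" and "time" are imported.
func supervise(ctx context.Context, worker func(context.Context) error) {
	for {
		if err := worker(ctx); err == nil {
			return // clean exit: nothing to restart
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second): // crude fixed backoff before restarting
		}
	}
}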

Possible Solutions

Contexts -- a.k.a -- What we're (mostly) doing now

The most common pattern in the existing code is to just use contexts, scoped either to the lifespan of the daemon or to the lifespan of an incoming API request.

Benefits:

  1. Contexts are standard
  2. We're already using them

Downsides:

  1. The context API's semantics are a bit weird for the goal here. If I want to start a goroutine and have the ability to shut it down, I'd do something like this:

Start up childRoutine:

childCtx, childCancel = context.WithCancel(parentCtx)
go childRoutine(childCtx)

Shut down childRoutine:

childCancel()

childRoutine:

func childRoutine(ctx context.Context) {
   for {
      select {
      ...
      case <-ctx.Done():
         return
      }
   }
}

It works, it's just a bit clunky:

  1. How do I know what my cancel function will actually cancel? Especially if I want to pass it around. A cancel function is a pretty weird way to refer to a goroutine I might want to kill.
  2. I have to rely on my child routine to do the right thing and listen for the cancel. No SIGKILL here :)
  3. Cancel is more of a SIGTERM without a way to wait for it to be handled -- after I call cancel, my child routine will shut down IN THE FUTURE.
  4. If there are hierarchies of routines, they are very implicit.

One possible solution would simply be to acknowledge that we're already doing this and establish a best-practices doc of some sort to avoid some of the clunky inconsistencies in the code.
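
For illustration, such a doc might standardize on a small helper like this (hypothetical name) that always pairs the context's cancel with a WaitGroup, so the owner can both ask a goroutine to stop and wait for it to actually return (addressing downside 3 above):

// goWithLifetime starts run as a goroutine whose lifetime is bounded by ctx
// and registers it on wg so the owner can wait for it to finish.
func goWithLifetime(ctx context.Context, wg *sync.WaitGroup, run func(context.Context)) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		run(ctx)
	}()
}

// usage
childCtx, childCancel := context.WithCancel(parentCtx)
var wg sync.WaitGroup
goWithLifetime(childCtx, &wg, childRoutine)
...
childCancel() // ask the child to stop
wg.Wait()     // block until it has actually stopped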

GoProcess

GoProcess is a lightweight library written by our fearless leader @jbenet and used sporadically in parts of the project. It's primarily inspired by the OS model for spawning and managing processes.

Benefits:

  1. Syntax is cleaner than contexts, without adding a whole lot more (rough usage sketch at the end of this section)
  2. It's already in use in parts of go-bitswap
  3. Pretty lightweight

Downsides:

  1. It's only able to spawn processes and shut them down in groups; it lacks some of the more classic supervision patterns of the Erlang/OTP model
  2. Not widely adopted -- we'd be using a non-standard library, and Juan is likely too busy to maintain it, so we'd be doing that ourselves
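
For flavor, a typical goprocess run-loop looks roughly like this (check the goprocess docs for the exact API; someChannel and doSomething are the placeholders from above):

proc := goprocess.Go(func(p goprocess.Process) {
	for {
		select {
		case someData := <-someChannel:
			doSomething(someData)
		case <-p.Closing(): // the process has been asked to shut down
			return
		}
	}
})
...
proc.Close() // returns once the run-loop (and any child processes) have finished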

go-sup

go-sup implements supervision trees for Go and was created by our own @warpfork. In comparison to goprocess, it provides more control over starting and managing processes, provides some mechanism for tracking errors, and includes some basic behaviors for each task.

Benefits:

  1. More control and tracking of processes
  2. Handles errors more explicitly, and could be extended to handle restarts
  3. Uses idioms from Erlang/OTP, which was the gold standard for concurrent programming at least until Go and Rust showed up :)
  4. Unlike Juan, @warpfork is actively contributing to the team and might be able to be more responsive.

Downsides:

  1. Non-standard and not widely adopted, like goprocess. Under the hood, it also uses other libraries @warpfork wrote.
  2. Syntax and concepts are more heavyweight, so it requires more spinning up on how it works.

Maybe @warpfork can elaborate on his library in the comments.

Use Widely Adopted Library

There are a couple of more widely adopted libraries available for doing supervision:
Suture - an implementation of supervision trees for Go, with ~800 stars, stability, a blog post, and docs
ProtoActor - a full implementation of the actor model for Go, with ~2400 stars, created by the author of Akka.NET (a well-known actor-model library)

Benefits:

  1. Maturity & Adoption
  2. Don't have to handle development

Downsides:

  1. A third-party dependency not maintained by us, and all that entails

Suture in particular looks relatively lightweight and reasonable.
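
For a sense of its shape: a Suture service is just a value with Serve and Stop methods that the supervisor runs and restarts. A rough sketch (verify the exact interface against Suture's docs before relying on it):

type worker struct {
	stop chan struct{}
}

func (w *worker) Serve() {
	for {
		select {
		case <-w.stop:
			return
		// case someData := <-someChannel: doSomething(someData)
		}
	}
}

func (w *worker) Stop() { close(w.stop) }

sup := suture.NewSimple("root")
sup.Add(&worker{stop: make(chan struct{})})
sup.ServeBackground() // supervises the worker, restarting it if it fails
...
sup.Stop()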

Discuss!

hannahhoward added the need/community-input (Needs input from the wider community) label on Nov 30, 2018
@dbaarda

dbaarda commented Nov 30, 2018

This reminded me of the following article:

https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/

It is a bit over-hyped for something that is basically a pretty simple design pattern, but he does argue it well. It should be fairly simple to implement a Go equivalent of his trio library to make using this pattern easier.
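
For the curious, a minimal Go analogue of the article's "nursery" isn't much code (hypothetical sketch; assumes "context" and "sync" are imported): goroutines may only be spawned inside a scope that refuses to return until all of them have returned.

// WithNursery runs body, handing it a spawn function; it does not return
// until every goroutine started via spawn has returned.
func WithNursery(ctx context.Context, body func(spawn func(task func(context.Context)))) {
	var wg sync.WaitGroup
	spawn := func(task func(context.Context)) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			task(ctx)
		}()
	}
	body(spawn)
	wg.Wait()
}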

@hannahhoward
Contributor Author

hannahhoward commented Nov 30, 2018

Ok so someone pointed out to me that the line:

Unlike Juan, @warpfork is actively contributing to the team and might be able to be more responsive.

Could be read as saying @jbenet is not contributing to IPFS even though he invented it!

I just mean Juan is contributing now by leading all of Protocol Labs, IPFS, and all associated projects so working on a process library probably isn't in his current wheelhouse.

Oops! Thank you Juan for making IPFS and leading us forward, don't fire me. 😉

@schomatis
Contributor

This reminded me of the following article;

https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/

Great article, thanks for sharing it, very well crafted argument. I think we should have a similar mindset when designing our concurrent operations no matter what framework we end up using.

@b5
Contributor

b5 commented Dec 4, 2018

@hannahhoward this summary of goroutine management is delightful. To add one point: I think @lanzafame & @warpfork hit on a key concern while I was reading through go-sup (which this issue pointed me to!): it sounds like suture doesn't make use of context, which might make observability a little tougher.

@Stebalien
Member

There are really two different things we need:

  1. Subprocesses: managing the lifetime of long-running goroutines. For example, workers, one-off tasks, etc. This will likely have a recursive structure where every process manages its subprocesses.
  2. Services: dependency resolution, in-order shutdown, etc. For example, the datastore, namesys, routing, etc. This will likely have a flat/centralized structure where we'll have a centralized service/dependency manager.

Unfortunately, we currently use goprocess (designed for subprocesses) when trying to manage services. We've been looking into ways to fix this in go-libp2p (e.g., by using some form of dependency injection system) but we haven't found a system that supports both complex dependency injection and in-order shutdown.

Key requirements in both cases:

  1. Allows us to use other go libraries without having to modify them.
  2. Allows other projects to use go-ipfs without having to adopt our choice of service/process manager.

To this end, I'd like to find a system that just defines a "process" to be something with a Close() error method (and maybe an optional Context() context.Context method).
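
In code, that minimal contract is basically just io.Closer (the Service names here are hypothetical):

// The minimal "process": anything we can shut down.
type Service interface {
	io.Closer // Close() error
}

// Optionally, for observability:
type ServiceWithContext interface {
	io.Closer
	Context() context.Context
}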


Really, for subprocess management, I think goprocess is pretty good. However, it's still pretty invasive so I'd like to try paring it down to something like:

// Process is a concrete process type that should be used *internally* by a process. We tend to
type Process struct{}

func (p *Process) AddChild(c io.Closer) {}
func (p *Process) Go(fn func(p *Process)) {}
func (p *Process) Context() context.Context { return nil }
func (p *Process) Close() error { return nil }

And adding new features only as needed instead of adding anything that we think we might want someday.

We should also look through: https://github.com/avelino/awesome-go#goroutines

@hannahhoward
Contributor Author

This is perhaps a side discussion, but one thing I've also been thinking about is goroutine architecture. Since my background is almost all functional programming and also Erlang, where processes have isolated memory, I find myself wanting to lean on channels over traditional concurrency primitives like mutexes. Then I read this: https://github.com/golang/go/wiki/MutexOrChannel which makes me think I'm doing it wrong. Just pointing out there's managing goroutines and then there's the patterns you use to actually build goroutines.

@Stebalien
Member

Note from #5868: I'd like to stop using contexts for process cancellation. They were designed for aborting requests where either:

  1. You know that the request has been aborted because you're waiting for the handler to return.
  2. You don't care.

This is why we ended up with goprocess.

@warpfork
Member

Can I refine that to "I'd like to stop using contexts for process cancellation while not having a way to wait for the cancelled process to fully return"?

IMO the contexts-for-cancellation ship has sailed. That is how the entire rest of the go ecosystem works at this point. And it does, for the most part, do cancellation.

The useful part is the additional semantic of having a consistent way to wait on things to return after they've acknowledged the cancel and cleaned up. And we can have that as additional systems built to work well with Context.

@Stebalien
Member

IMO the contexts-for-cancellation ship has sailed. That is how the entire rest of the go ecosystem works at this point. And it does, for the most part, do cancellation.

Not quite. Everyone else uses contexts for request cancellation (and sometimes worker cancellation). However, we're passing them to every service. Unfortunately, this is causing issues like #5738 (can't shut down in order because we're shutting down by canceling a global context).

@Stebalien
Member

Stebalien commented May 11, 2019

Experiment:

func NewMonitor(ctx context.Context) *Monitor {
	ctx, cancel := context.WithCancel(ctx)
	return &Monitor{
		context: ctx,
		cancel:  cancel,
	}
}

type Monitor struct {
	parent    *Monitor
	context   context.Context
	cancel    context.CancelFunc
	wg        sync.WaitGroup
	closeOnce sync.Once
}

func (p *Monitor) Child() *Monitor {
	p.wg.Add(1)
	child := NewMonitor(p.context)
	child.parent = p
	return child
}

func (p *Monitor) Close() error {
	p.closeOnce.Do(func() {
		p.cancel()
		p.wg.Wait()
		if p.parent != nil {
			p.parent.wg.Done()
			p.parent = nil
		}
	})
	return nil
}

func (p *Monitor) Go(fn func(ctx context.Context)) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		fn(p.context)
	}()
}

This doesn't allow anything fancy like true sub-processes but it's a start. Unfortunately, I then tried applying this to go-libp2p-swarm and failed completely (we use context + wg there).
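
A usage sketch of the Monitor above (note that Close on a parent blocks until each child Monitor has also been Closed, because Child bumps the parent's WaitGroup):

root := NewMonitor(context.Background())

root.Go(func(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		// case someData := <-someChannel: doSomething(someData)
		}
	}
})

child := root.Child()
child.Go(func(ctx context.Context) { <-ctx.Done() })

...
child.Close() // cancels the child's context and waits for its goroutines
root.Close()  // cancels the shared context, waits for root's goroutines and children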

@hannahhoward
Contributor Author

hannahhoward commented May 13, 2019

Just something to consider if we wanted to remove the assumption of context from the mix:

func NewMonitor() *Monitor {
	return &Monitor{
		children: make([]childRoutine, 0),
	}
}

type Monitor struct {
	children  []childRoutine
	wg        sync.WaitGroup
	closeOnce sync.Once
}

func (p *Monitor) Add(execute func(), interrupt func()) {
	p.children = append(p.children, childRoutine{execute, interrupt})
}

func (p *Monitor) Start() {
	p.wg.Add(len(p.children))
	for _, cr := range p.children {
		go func(cr childRoutine) {
			defer p.wg.Done()
			cr.execute()
		}(cr)
	}
}
func (p *Monitor) Shutdown() error {
	p.closeOnce.Do(func() {
		for _, cr := range p.children {
			cr.interrupt()
		}
		p.wg.Wait()
	})
	return nil
}

type childRoutine struct {
	execute   func()
	interrupt func()
}
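
And a usage sketch, where each child supplies its own interrupt mechanism instead of listening on a context:

m := NewMonitor()

stop := make(chan struct{})
m.Add(
	func() { // execute: the run-loop
		for {
			select {
			case <-stop:
				return
			// case someData := <-someChannel: doSomething(someData)
			}
		}
	},
	func() { close(stop) }, // interrupt: ask the run-loop to return
)

m.Start()
...
m.Shutdown() // interrupts every child, then waits for them all to return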

@Stebalien
Member

Given that Go uses contexts everywhere, I'd rather just embrace them.

@lanzafame
Contributor

lanzafame commented May 14, 2019 via email

@hannahhoward
Contributor Author

hannahhoward commented May 15, 2019

It seems like if we're embracing context, it'd be really cool if we could make it so the supervision can be passed transparently through the context, i.e. an API that looks like:

ctx := context.Background()
ctx, cancel, waitForShutdown := ourlibrary.WithMonitoring(ctx)
...
ourlibrary.Go(ctx, func(ctx context.Context) {
...
})
...
subCtx, subCancel, subWaitForShutdown := ourlibrary.WithMonitoring(ctx)
// only necessary if you want to wait on a smaller scale -- otherwise, context.WithCancel works fine
...
cancel()          // normal cancellation
waitForShutdown() // waits for all monitored routines and child contexts to fully shut down

Does this make sense?
I realize it will probably end up abusing context.WithValue under the hood, but it could make for some easy crossing of API boundaries without a lot of ceremony.
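
To make that concrete, here is one possible (hypothetical) shape for WithMonitoring and Go, stashing the monitor in the context via WithValue as suspected; it deliberately omits the harder part of propagating waits up through nested monitors:

type monitorKey struct{}

type monitor struct {
	wg sync.WaitGroup
}

// WithMonitoring derives a cancellable context that also carries a monitor,
// and returns a waitForShutdown function backed by the monitor's WaitGroup.
func WithMonitoring(ctx context.Context) (context.Context, context.CancelFunc, func()) {
	ctx, cancel := context.WithCancel(ctx)
	m := &monitor{}
	return context.WithValue(ctx, monitorKey{}, m), cancel, m.wg.Wait
}

// Go starts fn and, if ctx carries a monitor, registers it so that
// waitForShutdown blocks until fn has returned.
func Go(ctx context.Context, fn func(context.Context)) {
	m, _ := ctx.Value(monitorKey{}).(*monitor)
	if m != nil {
		m.wg.Add(1)
	}
	go func() {
		if m != nil {
			defer m.wg.Done()
		}
		fn(ctx)
	}()
}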

@Stebalien
Member

Yeah, I agree we should be stashing some kind of monitor/process in the context.

@Stebalien
Member

A concern I just brought up in the meeting WRT stashing something in the context: I'm worried about introducing accidental dependencies between services, where my service decides to wait for your service to stop.

However, thinking about it a bit, I don't think this'll actually be all that much of an issue given that any sub-processes using that context will be canceled anyways (i.e., if we use the wrong context, bad shit will happen regardless).


My other concern is the inverse: we need to be careful about somehow dropping the monitor without realizing it.

ctx, cancel := context.WithCancel(context.Background())
go func() {
  <-ctxWithMonitor.Done()
  cancel()
}()

This is a common pattern when joining multiple contexts. The solution with your API above would be:

ctx, cancel, wait := ourlibrary.WithMonitoring(context.Background())
ourlibrary.Go(ctxWithMonitor, func(ctx context.Context) {
  <-ctx.Done()
  cancel()
  wait()
})

Which is probably fine.

@b5
Contributor

b5 commented May 23, 2019

Do these stashed-monitoring processes need to work across golang-to-golang process boundaries?

I'm thinking of go-core-http-api calls to a go-ipfs process as an example.

If so, it might be worth having ourlibrary be a superset of / interoperate with tools that encode context details into network metadata (HTTP headers, libp2p... something's) to cut down on the awful boilerplate of reconstructing contexts across network boundaries.

@Stebalien
Member

Do these stashed-monitoring processes need to work across golang-to-golang process boundaries?

I think you'd just wait for the HTTP request to finish on one side and block the HTTP request from finishing on the other. There isn't really any metadata here.
