Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netconf channel stuck in ReadUntilPrompt #178

Closed
netixx opened this issue Apr 8, 2024 · 6 comments · Fixed by #179
Closed

Netconf channel stuck in ReadUntilPrompt #178

netixx opened this issue Apr 8, 2024 · 6 comments · Fixed by #179

Comments

@netixx
Copy link
Contributor

netixx commented Apr 8, 2024

Hello,

In my quest to squash all remaining memory leaks, I found one more:
image

I am now using the Custom Transport (which in my case is more or less the same as standard transport), and I am seeing some goroutines stuck in ReadUntilPrompt, which in turn is leaving some memory unreleased.

It looks like it's stuck in an infinite loop here:

// ReadUntilPrompt reads bytes out of the channel Q object until the channel PromptPattern regex
// pattern is seen in the output. Once that pattern is seen, all read bytes are returned.
func (c *Channel) ReadUntilPrompt() ([]byte, error) {
	var rb []byte

	for {
		nb, err := c.Read()
		if err != nil {
			return nil, err
		}

		if nb == nil {
			time.Sleep(c.ReadDelay)

			continue
		}

		rb = append(rb, nb...)

		if c.PromptPattern.Match(processReadBuf(rb, c.PromptSearchDepth)) {
			return rb, nil
		}
	}
}

I was expecting the ReadUntilPrompt to respect options.WithTimeoutOps(120 * time.Second), parameter, but looking at the code it doesn't seem to actually apply to the Open phase (which is causing the issue in my case).

I was thinking we could start a timer (or a context) like in GetPrompt() ?
And maybe this should apply for all Read call (so that no call to Read can last more than timeout seconds).

This made me think about contexts as well, shouldn't all the network bound operations of scrapligo accept a context as a first argument ? That way we can control as a user with context.WithTimeout what duration we all for each phase/call ?

@carlmontanari
Copy link
Contributor

bam, another one :D haha we'll get them all cleaned up one day 🤣

makes sense and I will tackle this this week and report back! regarding the context thing... yea... it probably "should" be done like that to be idiomatic. it isn't because it was more or less 1:1 translated from python when I first wrote things. I also had some notion that I wanted it to stay as similar/same as python things.

so.. for now I think I wont do anything about this context business, but it is on my mind/radar for sure.

@netixx
Copy link
Contributor Author

netixx commented Apr 8, 2024

okay :)
Honestly I guessed that you already thought about contexts, and had a reason not to have done it yet, but I figured I should mention it anyways for completeness.

My guess is that there are probably other Channel operations that could also use a timer to prevent indefinite execution time, but I am not sure which ones...

If it helps I could write a test for that (emulating a verrrry slow ssh peer) ?

@carlmontanari
Copy link
Contributor

yeah, tests are always welcome of course!

I think in general there is just timers on the get prompt and send input methods and nothing/not much else (maybe auth I guess). which I guess leads to the question: are you calling read until prompt directly or via something else?

I guess (thinking quickly so this is maybe very wrong) we need to send a done channel into readuntilprompt/any/etc. so that when we time them out from something like "send input" we can send the done signal to them. lemme know if that makes sense/sounds reasonable :)

@netixx
Copy link
Contributor Author

netixx commented Apr 8, 2024

I am not calling it directly, the call graph is Open > processServerCapabilities > ReadUntilPrompt.

func (d *Driver) processServerCapabilities() error {
	b, err := d.Channel.ReadUntilPrompt()
	if err != nil {
		return err
	}

It's actually possible that it's the router that never finishes it's capabilities (we've seen that before: carlmontanari/scrapli#233), but nevertheless, we should terminate open before the defined timeout.

@netixx
Copy link
Contributor Author

netixx commented Apr 9, 2024

I pushed a test and a couple of test-fixtures to https://github.com/netixx/scrapligo/blob/fix/open-timeout/driver/netconf/open_test.go

I set timeoutOps to 1 second, and give the fake driver 5s to complete the open/close sequence.

I added one test for fail (truncated capabilities) and one test for success using the start of one of the other test-fixture.

@carlmontanari
Copy link
Contributor

sorry for the delay! lets follow up in #179 -- I cherry picked your tests and tweaked a bit, but it passes nicely now and I think I got all the other read until X stuff properly closing out now too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants