Optimize /proc traversal (again) #497

Closed
wants to merge 4 commits into from

Conversation

@inercia (Contributor) commented Sep 16, 2015

This is the same code found in #450, but with the caching library replaced. This should fix #486.

In addition to the changes in #450, this PR also caches some file contents, in particular /proc/<pid>/[comm, cmdline], since these do not change during the lifetime of a process (this is not always true, but I don't think we should care about those corner cases).


From the original #450 description:

The objective of this PR is to reduce the number of syscalls when collecting info from /proc. This PR does that by:

  • reading the required files (/proc/*/[cmdline, comm, stat, fd/*], /proc/net/tcp*) once per cycle (spy.interval), parsing and caching the processes/connections information
  • keeping a cache of open files for /proc/*/[cmdline, comm, stat] so we do not open()/close() these files so often (a sketch of this idea follows the list)
  • removing unnecessary stat()s when reading /proc directories
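For illustration, a minimal sketch of the kind of open-file cache the second bullet describes. The names here (fileCache and its methods) are made up for this sketch; the PR's actual code is structured differently:

```go
package proc

import (
	"os"
	"sync"
)

// fileCache keeps /proc files open between cycles, so each cycle can
// re-read them instead of paying for open()+read()+close() every time.
type fileCache struct {
	mu    sync.Mutex
	files map[string]*os.File
}

func newFileCache() *fileCache {
	return &fileCache{files: map[string]*os.File{}}
}

// Read returns the current contents of path, reusing a cached handle when
// one exists. /proc files can be re-read by seeking back to offset 0.
func (c *fileCache) Read(path string, buf []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	f, ok := c.files[path]
	if !ok {
		var err error
		if f, err = os.Open(path); err != nil {
			return 0, err
		}
		c.files[path] = f
	}
	if _, err := f.Seek(0, 0); err != nil {
		return 0, err
	}
	return f.Read(buf)
}

// Forget closes and drops the handle, e.g. when the process has exited.
func (c *fileCache) Forget(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if f, ok := c.files[path]; ok {
		f.Close()
		delete(c.files, path)
	}
}
```

Opening the file is only needed when no handle is cached; otherwise seeking back to the start lets the same descriptor be re-read each cycle.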

Changes:

  • the required code for reading/parsing /proc has been moved from procspy to /probe/proc
  • the Walker has been replaced by a proc.Reader that provides both Processes() and Connections(). Users can walk the list of processes/connections as before by passing a function to these methods. A CachingReader can be used to cache this information and serve it until the next Update(). The code for reading and parsing connections and processes has been simplified (a rough sketch of the interface follows this list).
  • a CachingReader is initialized in /probe/main.go and used by both endpoint.Reporter and process.Reporter for reporting on processes and connections; it is refreshed periodically by calling Update() every spy.interval seconds
  • unit tests have been improved by mocking things like file system operations
  • things like Proc and Process have been unified, and many things from /probe/process have been moved to /probe/proc
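A rough sketch of the interface described above, inferred only from this description; the actual names, fields and signatures in the PR may differ:

```go
package proc

// Process and Connection carry the fields parsed out of /proc.
// The field sets here are illustrative, not the PR's exact ones.
type Process struct {
	PID, PPID int
	Comm      string
	Cmdline   string
}

type Connection struct {
	LocalAddr, RemoteAddr string
	LocalPort, RemotePort uint16
	Proc                  *Process
}

// Reader replaces the old Walker: callers walk processes/connections by
// passing a function, as before.
type Reader interface {
	Processes(func(Process)) error
	Connections(func(Connection)) error
	// Update re-reads /proc; a caching implementation serves Processes()
	// and Connections() from the data gathered by the last Update().
	Update() error
}
```

probe/main.go would then hold a single caching Reader, share it between endpoint.Reporter and process.Reporter, and call Update() from a ticker firing every spy.interval.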

This is part of the solution to #284

@inercia force-pushed the scope-486 branch 2 times, most recently from 8c3ed08 to eeb96d8 on September 16, 2015 11:38
@inercia (Contributor, Author) commented Sep 16, 2015

@tomwilkie @peterbourgon please take a look at this when you have some time. The only change since #450 is the cache.go file.


@peterbourgon (Contributor)

Thanks for the change. Can we have a unit test which fails using bluele/gcache and passes using hashicorp/golang-lru? Otherwise, it's hard to have confidence this is a fix — and hard to prevent regressions.

Also, let's avoid the tech debt of leaving bluele/gcache behind in other packages: proactively eliminate the dependency throughout the repo and make the required changes in this PR.

@tomwilkie (Contributor)

@inercia please give me a chance to test & review this before submitting. I should be able to look at it first thing next week.

@tomwilkie (Contributor)

@inercia I'm happy to remove the use of gcache in ids.go as part of this PR, if it helps.

@inercia (Contributor, Author) commented Sep 17, 2015

@peterbourgon The problem with adding a unit test is that I never had a problem with bluele/gcache myself, and I couldn't reproduce the crash described in #486 either. Looking at the code in bluele/gcache I noticed that it fires the eviction callback in a new goroutine (go evictionFunction()), which could be problematic when evicting many cache items (the hashicorp library does not do that). This would be difficult to cover with unit tests anyway, so I think the best way to test this fix is to reproduce the scenario where we originally found the crash.
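For reference, a minimal sketch of how an eviction hook is wired up with hashicorp/golang-lru: the callback runs in the goroutine that performed the insert that triggered the eviction, not in a freshly spawned one. The cached file used below is only an example:

```go
package main

import (
	"fmt"
	"os"

	lru "github.com/hashicorp/golang-lru"
)

type cachedFile struct{ f *os.File }

func main() {
	// The eviction callback runs synchronously in the goroutine whose Add
	// caused the eviction, so closing the file here does not race with a
	// background eviction goroutine.
	cache, err := lru.NewWithEvict(128, func(key, value interface{}) {
		if cf, ok := value.(*cachedFile); ok {
			cf.f.Close()
		}
	})
	if err != nil {
		panic(err)
	}

	f, err := os.Open("/proc/self/comm")
	if err != nil {
		panic(err)
	}
	cache.Add("self/comm", &cachedFile{f: f})
	fmt.Println(cache.Len())
}
```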

@tomwilkie I will wait for an LGTM from you.

@inercia (Contributor, Author) commented Sep 17, 2015

@tomwilkie @peterbourgon would you like to replace gcache in this PR or in a different one?

@peterbourgon (Contributor)

I would much rather replace gcache in this PR than another one, to avoid creating technical debt. I imagine @tomwilkie will have exactly the opposite opinion, to avoid conflating multiple concerns into a single PR.

@inercia (Contributor, Author) commented Sep 17, 2015

@peterbourgon The last two commits remove all the references to gcache (there is a remaining reference in experimental, but I guess that's not important, right?)

return "", errNotFound
entry, ok := val.(*resolverCacheEntry)
if !ok {
panic(fmt.Sprintf("unknown entry type in cache: %+v", val))

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

@tomwilkie (Contributor)

There are too many races / leaks in this PR.

Going forward I'd drop all the caching, and focus on making a single walk of the proc filesystem, gathering both processes and connections at the same time, and porting the existing procspy code onto that.

We can potentially add caching on top of that.
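A rough sketch of what such a single combined walk could look like, under the assumption that attributing connections only needs a map from socket inode to PID; the helper name walkProcOnce is made up for this sketch:

```go
package proc

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// walkProcOnce reads /proc a single time, returning both the per-PID comm
// names (for the process report) and a socket-inode-to-PID map that lets
// connections parsed from /proc/net/tcp* be attributed to processes.
func walkProcOnce(procRoot string) (map[int]string, map[uint64]int, error) {
	comms := map[int]string{}
	inodeToPID := map[uint64]int{}

	entries, err := ioutil.ReadDir(procRoot)
	if err != nil {
		return nil, nil, err
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a <pid> directory
		}
		if b, err := ioutil.ReadFile(filepath.Join(procRoot, e.Name(), "comm")); err == nil {
			comms[pid] = strings.TrimSpace(string(b))
		}
		// Socket fds are symlinks of the form "socket:[12345]".
		fds, _ := ioutil.ReadDir(filepath.Join(procRoot, e.Name(), "fd"))
		for _, fd := range fds {
			link, err := os.Readlink(filepath.Join(procRoot, e.Name(), "fd", fd.Name()))
			if err != nil || !strings.HasPrefix(link, "socket:[") || !strings.HasSuffix(link, "]") {
				continue
			}
			if inode, err := strconv.ParseUint(link[len("socket:["):len(link)-1], 10, 64); err == nil {
				inodeToPID[inode] = pid
			}
		}
	}
	return comms, inodeToPID, nil
}
```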

@inercia (Contributor, Author) commented Sep 22, 2015

@tomwilkie the usage pattern for this code is to call Update() once a second, so there are no concurrency problems in practice. I've been reluctant to add locks to the cache entries because of YAGNI; in fact, I think we could remove all the mutexes in this code and replace them with some coarse locks in Connections()/Processes(). Anyway, as you have noticed, there is a missing lock when evicting entries in the cache, and I think fixing that should be enough to make the cache goroutine-safe.

Caching file handles and contents reduces the number of syscalls dramatically, and that is a central part of this PR's objective. I will fix the second /proc traversal; then let me know what you think.

@inercia (Contributor, Author) commented Sep 22, 2015

I've just realized that moving the Mutex from the cache entries to the Cache, given the expected usage pattern, reduces the number of Locks/Unlocks. I'll change it (a sketch of the idea follows).
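A minimal sketch of that change, with hypothetical names: a single mutex on the cache guards lookups, inserts and evictions, instead of one mutex per entry:

```go
package proc

import "sync"

// cacheEntry no longer carries its own mutex; all access goes through the
// cache-level lock below.
type cacheEntry struct {
	contents []byte
}

type cache struct {
	mu      sync.Mutex
	entries map[string]*cacheEntry
}

func newCache() *cache {
	return &cache{entries: map[string]*cacheEntry{}}
}

func (c *cache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok {
		return e.contents, true
	}
	return nil, false
}

func (c *cache) put(key string, contents []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = &cacheEntry{contents: contents}
}

// evict runs under the same lock, covering the missing-lock-on-eviction
// case mentioned above.
func (c *cache) evict(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}
```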

@inercia force-pushed the scope-486 branch 3 times, most recently from a1ea316 to 21446c5 on September 23, 2015 12:33
@inercia (Contributor, Author) commented Sep 23, 2015

Please take another look @tomwilkie


@tomwilkie (Contributor)

What's the motivation for separate proc and process directories in probe?

@tomwilkie tomwilkie assigned inercia and unassigned tomwilkie Sep 24, 2015
@inercia (Contributor, Author) commented Sep 24, 2015

proc is the library for reading from /proc, while process is the processes reporter. The proc library is used by several reporters. Maybe proc could be moved to some other location; I'm open to suggestions.

@tomwilkie (Contributor)

Yes, please merge these two directories, keeping the original directory.

@inercia force-pushed the scope-486 branch 3 times, most recently from 3fd37ab to 86595eb on September 25, 2015 10:39

Commits:
  • …`probe`.
  • Cache processes and connections when reading from the /proc
  • Keep a cache of open files, reducing the number of open/close cycles in the /proc dir
  • Also cache some file contents
  • …processes in the /proc directory by the name.
@inercia (Contributor, Author) commented Sep 25, 2015

I've merged the proc and process directories. I've also merged the Reader and CachingReader structs (it was becoming a pain to keep the same interface for both), and added the missing tcp/tcp6 reads for namespaces other than the global one.

I've also added an experimental feature where the Linux /proc reader will skip uninteresting processes, based on the process name found in comm and a list of well-known system processes/tasks. This reduces the number of syscalls dramatically. This would be the new perf graph:

[profile graph]
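A minimal sketch of the filtering idea, with an illustrative blocklist; the actual list and wiring in the PR may differ:

```go
package proc

import (
	"io/ioutil"
	"path/filepath"
	"strings"
)

// uninterestingComms is an illustrative blocklist of well-known kernel
// threads / system tasks; the real list may differ.
var uninterestingComms = []string{
	"ksoftirqd",
	"kworker",
	"migration",
	"watchdog",
	"rcu_sched",
}

// interesting reads /proc/<pid>/comm (one cheap read) and decides whether
// the rest of the per-process files are worth reading at all.
func interesting(procRoot, pid string) bool {
	b, err := ioutil.ReadFile(filepath.Join(procRoot, pid, "comm"))
	if err != nil {
		return false // process probably exited; skip it
	}
	comm := strings.TrimSpace(string(b))
	for _, prefix := range uninterestingComms {
		if strings.HasPrefix(comm, prefix) {
			return false
		}
	}
	return true
}
```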

@peterbourgon or @tomwilkie , would you mind taking a look at this?

@tomwilkie (Contributor)

Alvaro - I've had a think about this PR, and a chat with Peter, and I think it's best if we take a different approach. This is a big change to the very core of Scope, and the risks associated with getting it wrong are large. I want to avoid rewriting large swathes of procspy, as the current code is mature and its behaviour is sometimes subtle.

So instead of proceeding with this PR, I suggest we take a much more incremental approach:

  • 'vendor' in procspy as is, into probe/proc (do this in 1 PR, without any other changes)
  • remove any unused / dead code, in another PR
  • make minimal adaptations to procspy to avoid the double walk of /proc, in another PR
  • evaluate adding a cache on top of this, again, in a fresh PR

From there we can evaluate other optimisations one PR at a time. Please avoid needless renaming to make this process go faster.

Thank you for working on this, we do appreciate your contribution here.

@inercia (Contributor, Author) commented Sep 30, 2015

No problem, I'll split the changes into several PRs. I'll close this one, then.

@inercia inercia closed this Sep 30, 2015
@tomwilkie tomwilkie deleted the scope-486 branch October 27, 2015 17:13

Successfully merging this pull request may close these issues.

Panic in proc reader