Optimize /proc traversal (again) #497

Closed
wants to merge 4 commits into from

Conversation

@inercia (Contributor) commented Sep 16, 2015

This is the same code found in #450, but with the caching library replaced. This should fix #486.

In addition to the changes in #450, this PR also caches some file contents, in particular /proc/<pid>/[comm, cmdline], since these do not change during the lifetime of a process (this is not always true, but I don't think we should care about those corner cases).


From the original #450 description:

The objective of this PR is to reduce the number of syscalls when collecting info from /proc. This PR does that by:

  • reading the required files (/proc/*/[cmdline, comm, stat, fd/*], /proc/net/tcp*) once per cycle (spy.interval), parsing and caching the processes/connections information
  • keeping a cache of open files for /proc/*/[cmdline, comm, stat] so we do not open()/close() these files so often (a sketch of this idea follows the list)
  • removing unnecessary stat()s when reading /proc directories
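For illustration, a minimal sketch of the kind of open-file cache the second bullet describes. The names here (fileCache and its methods) are made up for this sketch; the PR's actual code is structured differently:

```go
package proc

import (
	"os"
	"sync"
)

// fileCache keeps /proc files open between cycles, so each cycle can
// re-read them instead of paying for open()+read()+close() every time.
type fileCache struct {
	mu    sync.Mutex
	files map[string]*os.File
}

func newFileCache() *fileCache {
	return &fileCache{files: map[string]*os.File{}}
}

// Read returns the current contents of path, reusing a cached handle when
// one exists. /proc files can be re-read by seeking back to offset 0.
func (c *fileCache) Read(path string, buf []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	f, ok := c.files[path]
	if !ok {
		var err error
		if f, err = os.Open(path); err != nil {
			return 0, err
		}
		c.files[path] = f
	}
	if _, err := f.Seek(0, 0); err != nil {
		return 0, err
	}
	return f.Read(buf)
}

// Forget closes and drops the handle, e.g. when the process has exited.
func (c *fileCache) Forget(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if f, ok := c.files[path]; ok {
		f.Close()
		delete(c.files, path)
	}
}
```

Opening the file is only needed when no handle is cached; otherwise seeking back to the start lets the same descriptor be re-read each cycle.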

Changes:

  • the required code for reading/parsing /proc has been moved from procspy to /probe/proc
  • the Walker has been replaced by a proc.Reader that provides both Processes() and Connections(). Users can walk the list of processes/connections as before by passing a function to these methods. A CachingReader can be used to cache this information and serve it until the next Update(). The code for reading and parsing connections and processes has been simplified (a rough sketch of the interface follows this list).
  • a CachingReader is initialized in /probe/main.go and used by both endpoint.Reporter and process.Reporter for reporting on processes and connections; it is refreshed periodically by calling Update() every spy.interval seconds
  • unit tests have been improved by mocking things like file system operations
  • things like Proc and Process have been unified, and many things from /probe/process have been moved to /probe/proc
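A rough sketch of the interface described above, inferred only from this description; the actual names, fields and signatures in the PR may differ:

```go
package proc

// Process and Connection carry the fields parsed out of /proc.
// The field sets here are illustrative, not the PR's exact ones.
type Process struct {
	PID, PPID int
	Comm      string
	Cmdline   string
}

type Connection struct {
	LocalAddr, RemoteAddr string
	LocalPort, RemotePort uint16
	Proc                  *Process
}

// Reader replaces the old Walker: callers walk processes/connections by
// passing a function, as before.
type Reader interface {
	Processes(func(Process)) error
	Connections(func(Connection)) error
	// Update re-reads /proc; a caching implementation serves Processes()
	// and Connections() from the data gathered by the last Update().
	Update() error
}
```

probe/main.go would then hold a single caching Reader, share it between endpoint.Reporter and process.Reporter, and call Update() from a ticker firing every spy.interval.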

This is part of the solution to #284

@inercia force-pushed the scope-486 branch 2 times, most recently from 8c3ed08 to eeb96d8 on September 16, 2015 11:38
@inercia (Contributor, Author) commented Sep 16, 2015

@tomwilkie @peterbourgon please take a look at this when you have some time. The only change since #450 is the cache.go file.


@peterbourgon (Contributor)

Thanks for the change. Can we have a unit test which fails using bluele/gcache and passes using hashicorp/golang-lru? Otherwise, it's hard to have confidence this is a fix — and hard to prevent regressions.

Also, let's avoid the tech debt of leaving bluele/gcache behind in other packages: proactively eliminate the dependency throughout the repo and make the required changes in this PR.

@tomwilkie (Contributor)

@inercia please give me a chance to test & review this before submitting. I should be able to look at it first thing next week.

@tomwilkie (Contributor)

@inercia I'm happy to remove the use of gcache in ids.go as part of this PR, if it helps.

@inercia (Contributor, Author) commented Sep 17, 2015

@peterbourgon The problem with adding a unit test is that I never had a problem with bluele/gcache myself, and I couldn't reproduce the crash described in #486 either. Looking at the code in bluele/gcache I noticed that it fires the eviction callback in a new goroutine (go evictionFunction()), which could be problematic when evicting many cache items (the hashicorp library does not do that). This would be difficult to cover with unit tests anyway, so I think the best way to test this fix is to reproduce the scenario where we originally found the crash.
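For reference, a minimal sketch of how an eviction hook is wired up with hashicorp/golang-lru: the callback runs in the goroutine that performed the insert that triggered the eviction, not in a freshly spawned one. The cached file used below is only an example:

```go
package main

import (
	"fmt"
	"os"

	lru "github.com/hashicorp/golang-lru"
)

type cachedFile struct{ f *os.File }

func main() {
	// The eviction callback runs synchronously in the goroutine whose Add
	// caused the eviction, so closing the file here does not race with a
	// background eviction goroutine.
	cache, err := lru.NewWithEvict(128, func(key, value interface{}) {
		if cf, ok := value.(*cachedFile); ok {
			cf.f.Close()
		}
	})
	if err != nil {
		panic(err)
	}

	f, err := os.Open("/proc/self/comm")
	if err != nil {
		panic(err)
	}
	cache.Add("self/comm", &cachedFile{f: f})
	fmt.Println(cache.Len())
}
```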

@tomwilkie I will wait for an LGTM from you.

@inercia (Contributor, Author) commented Sep 17, 2015

@tomwilkie @peterbourgon would you like to replace gcache in this PR or in a different one?

@peterbourgon (Contributor)

I would much rather replace gcache in this PR than another one, to avoid creating technical debt. I imagine @tomwilkie will have exactly the opposite opinion, to avoid conflating multiple concerns into a single PR.

@inercia (Contributor, Author) commented Sep 17, 2015

@peterbourgon The last two commits remove all the references to gcache (there is a remaining reference in experimental, but I guess that's not important, right?)

return "", errNotFound
entry, ok := val.(*resolverCacheEntry)
if !ok {
panic(fmt.Sprintf("unknown entry type in cache: %+v", val))

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

@tomwilkie (Contributor)

There are too many races / leaks in this PR.

Going forward I'd drop all the caching, and focus on making a single walk of the proc filesystem, gathering both processes and connections at the same time, and porting the existing procspy code onto that.

We can potentially add caching on top of that.
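A rough sketch of what such a single combined walk could look like, under the assumption that attributing connections only needs a map from socket inode to PID; the helper name walkProcOnce is made up for this sketch:

```go
package proc

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// walkProcOnce reads /proc a single time, returning both the per-PID comm
// names (for the process report) and a socket-inode-to-PID map that lets
// connections parsed from /proc/net/tcp* be attributed to processes.
func walkProcOnce(procRoot string) (map[int]string, map[uint64]int, error) {
	comms := map[int]string{}
	inodeToPID := map[uint64]int{}

	entries, err := ioutil.ReadDir(procRoot)
	if err != nil {
		return nil, nil, err
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a <pid> directory
		}
		if b, err := ioutil.ReadFile(filepath.Join(procRoot, e.Name(), "comm")); err == nil {
			comms[pid] = strings.TrimSpace(string(b))
		}
		// Socket fds are symlinks of the form "socket:[12345]".
		fds, _ := ioutil.ReadDir(filepath.Join(procRoot, e.Name(), "fd"))
		for _, fd := range fds {
			link, err := os.Readlink(filepath.Join(procRoot, e.Name(), "fd", fd.Name()))
			if err != nil || !strings.HasPrefix(link, "socket:[") || !strings.HasSuffix(link, "]") {
				continue
			}
			if inode, err := strconv.ParseUint(link[len("socket:["):len(link)-1], 10, 64); err == nil {
				inodeToPID[inode] = pid
			}
		}
	}
	return comms, inodeToPID, nil
}
```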

@inercia (Contributor, Author) commented Sep 22, 2015

@tomwilkie the usage pattern for this code is to call Update() once a second, so there are no concurrency problems in practice. I've been reluctant to add locks to the cache entries because of YAGNI; in fact, I think we could remove all the mutexes in this code and replace them with some coarse locks in Connections()/Processes(). Anyway, as you have noticed, there is a missing lock when evicting entries in the cache, and I think fixing that should be enough to make the cache goroutine-safe.

Caching file handles and contents reduces the number of syscalls dramatically, and that is a central part of this PR's objective. I will fix the second /proc traversal; then let me know what you think.

@inercia (Contributor, Author) commented Sep 22, 2015

I've just realized that moving the Mutex from the cache entries to the Cache, given the expected usage pattern, reduces the number of Locks/Unlocks. I'll change it (a sketch of the idea follows).
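A minimal sketch of that change, with hypothetical names: a single mutex on the cache guards lookups, inserts and evictions, instead of one mutex per entry:

```go
package proc

import "sync"

// cacheEntry no longer carries its own mutex; all access goes through the
// cache-level lock below.
type cacheEntry struct {
	contents []byte
}

type cache struct {
	mu      sync.Mutex
	entries map[string]*cacheEntry
}

func newCache() *cache {
	return &cache{entries: map[string]*cacheEntry{}}
}

func (c *cache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok {
		return e.contents, true
	}
	return nil, false
}

func (c *cache) put(key string, contents []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = &cacheEntry{contents: contents}
}

// evict runs under the same lock, covering the missing-lock-on-eviction
// case mentioned above.
func (c *cache) evict(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}
```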

@inercia force-pushed the scope-486 branch 3 times, most recently from a1ea316 to 21446c5 on September 23, 2015 12:33
@inercia (Contributor, Author) commented Sep 23, 2015

Please take another look @tomwilkie


@tomwilkie (Contributor)

What's the motivation for separate proc and process directories in probe?

@tomwilkie tomwilkie assigned inercia and unassigned tomwilkie Sep 24, 2015
@inercia (Contributor, Author) commented Sep 24, 2015

proc is the library for reading from /proc, while process is the processes reporter. The proc library is used by several reporters. Maybe proc could be moved to some other location; I'm open to suggestions.

@tomwilkie (Contributor)

Yes, please merge these two directories, keeping the original directory.

@inercia force-pushed the scope-486 branch 3 times, most recently from 3fd37ab to 86595eb on September 25, 2015 10:39

Commits:
  • …`probe`.
  • Cache processes and connections when reading from the /proc
  • Keep a cache of open files, reducing the number of open/close cycles in the /proc dir
  • Also cache some file contents
  • …processes in the /proc directory by the name.
@inercia (Contributor, Author) commented Sep 25, 2015

I've merged the proc and process directories. I've also merged the Reader and CachingReader structs (it was becoming a pain to keep the same interface for both), and added the missing tcp/tcp6 reads for namespaces other than the global one.

I've also added an experimental feature where the Linux /proc reader will skip uninteresting processes, based on the process name found in comm and a list of well-known system processes/tasks. This reduces the number of syscalls dramatically. This would be the new perf graph:

[profile graph]
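A minimal sketch of the filtering idea, with an illustrative blocklist; the actual list and wiring in the PR may differ:

```go
package proc

import (
	"io/ioutil"
	"path/filepath"
	"strings"
)

// uninterestingComms is an illustrative blocklist of well-known kernel
// threads / system tasks; the real list may differ.
var uninterestingComms = []string{
	"ksoftirqd",
	"kworker",
	"migration",
	"watchdog",
	"rcu_sched",
}

// interesting reads /proc/<pid>/comm (one cheap read) and decides whether
// the rest of the per-process files are worth reading at all.
func interesting(procRoot, pid string) bool {
	b, err := ioutil.ReadFile(filepath.Join(procRoot, pid, "comm"))
	if err != nil {
		return false // process probably exited; skip it
	}
	comm := strings.TrimSpace(string(b))
	for _, prefix := range uninterestingComms {
		if strings.HasPrefix(comm, prefix) {
			return false
		}
	}
	return true
}
```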

@peterbourgon or @tomwilkie , would you mind taking a look at this?

@tomwilkie (Contributor)

Alvaro - I've had a think about this PR, and a chat with Peter, and I think it's best if we take a different approach. This is a big change to the very core of Scope, and the risks associated with getting it wrong are large. I want to avoid rewriting large swathes of procspy, as the current code is mature and its behaviour is sometimes subtle.

So instead of proceeding with this PR, I suggest we take a much more incremental approach:

  • 'vendor' in procspy as is, into probe/proc (do this in 1 PR, without any other changes)
  • remove any unused / dead code, in another PR
  • make minimal adaptations to procspy to avoid the double walk of /proc, in another PR
  • evaluate adding a cache on top of this, again, in a fresh PR

From there we can evaluate other optimisations one PR at a time. Please avoid needless renaming to make this process go faster.

Thank you for working on this, we do appreciate your contribution here.

@inercia (Contributor, Author) commented Sep 30, 2015

No problem, I'll split the changes into several PRs. I'll close this one, then.

@inercia inercia closed this Sep 30, 2015
@tomwilkie tomwilkie deleted the scope-486 branch October 27, 2015 17:13

Successfully merging this pull request may close these issues.

Panic in proc reader