
Probe using 70% CPU #812

Closed
tomwilkie opened this issue Jan 13, 2016 · 24 comments

@tomwilkie
Contributor

@2opremio reports the probe is using 70% CPU on the service hosts.

@tomwilkie tomwilkie added this to the 0.12.0 milestone Jan 13, 2016
@2opremio
Contributor

I restarted the probes with profiling (--probe.http.listen :4041) and curled http://localhost:4041/debug/pprof/profile from one of the probes: http://filebin.ca/2TJeW7veYKFA/profile.gz

@tomwilkie
Contributor Author

[cpu profile graph]

@tomwilkie
Contributor Author

On a cursory analysis, the probes are spending:

  • 30% of the CPU reading the /proc/PID/net/tcp files
  • maybe <10% of CPU on GC (<3s of 30s) according to that trace
  • 10% on copying the report structure

I suspect fixing the copying will also fix the GC, but I can't think of anything we can do to 'fix' the /proc CPU usage, as this has already been heavily optimised (we only read the tcp files for net namespaces we haven't seen before).

I guess we can decrease the polling frequency / proc spy interval, and maybe even come up with some rate-limiting scheme (i.e. only read 10/s, which would give us ~constant CPU usage, and at larger scales would mean the UI updates less frequently).
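For illustration, a minimal sketch of that rate-limiting idea (hypothetical names, not Scope's actual code): a ticker caps the /proc reads at 10/s, so the CPU cost stays roughly constant however many namespaces the host has.

package main

import (
    "fmt"
    "time"
)

// readProcNetTCP stands in for the real /proc/<pid>/net/tcp reader.
func readProcNetTCP(ns string) {
    fmt.Println("reading namespace", ns)
}

func main() {
    namespaces := []string{"ns1", "ns2", "ns3"}

    // At most 10 reads per second: roughly constant CPU cost, at the price
    // of the UI updating less frequently on large hosts.
    limiter := time.NewTicker(time.Second / 10)
    defer limiter.Stop()

    for _, ns := range namespaces {
        <-limiter.C // wait for the next read slot
        readProcNetTCP(ns)
    }
}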

@tomwilkie
Contributor Author

@2opremio tried that suggestion (./scope -probe.spy.interval=3s) and confirmed CPU dropped to 20%, as expected.

@2opremio
Contributor

2opremio commented Feb 4, 2016

The situation improves considerably after #898. Now the probes (2ed3968) are using 30-40% CPU, with peaks of 50%.

Here are two CPU profiles obtained from two different probes:

pprof.localhost_4041.samples.cpu.001.pb.gz
[profile graph: probe]

pprof.localhost_4041.samples.cpu.002.pb.gz
[profile graph: probe2]

  • The probes are spending a long time encoding reports. A non-negligible amount of time (10% in one of the profiles) is spent sending gob types, which can be brought to zero through any of the solutions proposed in App using lots of CPU #854 (comment).
  • Although the reads of /proc/*/net/tcp{,6} are no longer the bottleneck thanks to Fix reading of network namespace inodes #898, the stat()s of /proc/*/fd/* are still quite expensive (10% and 35% in the profiles above). So we should maybe reconsider [WIP] Rate limit reading expensive files from /proc #814, which I should have tested with more open files: sudo bash -c "echo /proc/*/fd/* | wc -w" prints numbers up to ~20K on the service machines, whereas my own machine only has ~1K.
  • Kubernetes support (which is enabled in the probes) takes around 10% of the time in the second profile (I think this is the cause of the 10% CPU peaks), almost exclusively matching pods:
$ go tool pprof  ~/home-vagrant/scope-tested pprof.localhost_4041.samples.cpu.002.pb.gz 
Entering interactive mode (type "help" for commands)
(pprof) list podTopology.func1
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/probe/kubernetes.(*Reporter).podTopology.func1 in /go/src/github.com/weaveworks/scope/probe/kubernetes/reporter.go
      10ms      1.16s (flat, cum)  9.85% of Total
         .          .     51:}
         .          .     52:
         .          .     53:func (r *Reporter) podTopology(services []Service) (report.Topology, report.Topology, error) {
         .          .     54:   pods, containers := report.MakeTopology(), report.MakeTopology()
         .          .     55:   err := r.client.WalkPods(func(p Pod) error {
      10ms       10ms     56:           for _, service := range services {
         .      1.10s     57:                   if service.Selector().Matches(p.Labels()) {
         .          .     58:                           p.AddServiceID(service.ID())
         .          .     59:                   }
         .          .     60:           }
         .          .     61:           nodeID := report.MakePodNodeID(p.Namespace(), p.Name())
         .       40ms     62:           pods = pods.AddNode(nodeID, p.GetNode())
         .          .     63:
         .          .     64:           container := report.MakeNodeWith(map[string]string{
         .          .     65:                   PodID:     p.ID(),
         .          .     66:                   Namespace: p.Namespace(),
         .       10ms     67:           }).WithParents(report.EmptySets.Add(report.Pod, report.MakeStringSet(nodeID)))
         .          .     68:           for _, containerID := range p.ContainerIDs() {
         .          .     69:                   containers.AddNode(report.MakeContainerNodeID(containerID), container)
         .          .     70:           }
         .          .     71:           return nil
         .          .     72:   })
(pprof) list Selector
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/probe/kubernetes.(*service).Selector in /go/src/github.com/weaveworks/scope/probe/kubernetes/service.go
         0      1.04s (flat, cum)  8.83% of Total
         .          .     47:
         .          .     48:func (s *service) Selector() labels.Selector {
         .          .     49:   if s.Spec.Selector == nil {
         .          .     50:           return labels.Nothing()
         .          .     51:   }
         .      1.04s     52:   return labels.SelectorFromSet(labels.Set(s.Spec.Selector))
         .          .     53:}
         .          .     54:
         .          .     55:func (s *service) GetNode() report.Node {
         .          .     56:   return report.MakeNodeWith(map[string]string{
         .          .     57:           ServiceID:      s.ID(),

@paulbellamy Any ideas on how to improve this? SelectorFromSet seems to be prohibitively expensive, mostly due to the regular expression matching (maybe we should fix that):

(pprof) list SelectorFromSet
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels.SelectorFromSet in /go/src/github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels/selector.go
      30ms      1.04s (flat, cum)  8.83% of Total
         .          .    719:func SelectorFromSet(ls Set) Selector {
         .          .    720:   if ls == nil {
         .          .    721:           return LabelSelector{}
         .          .    722:   }
         .          .    723:   var requirements []Requirement
      20ms       30ms    724:   for label, value := range ls {
         .      890ms    725:           if r, err := NewRequirement(label, EqualsOperator, sets.NewString(value)); err != nil {
         .          .    726:                   //TODO: double check errors when input comes from serialization?
         .          .    727:                   return LabelSelector{}
         .          .    728:           } else {
      10ms       80ms    729:                   requirements = append(requirements, *r)
         .          .    730:           }
         .          .    731:   }
         .       40ms    732:   return LabelSelector(requirements)
         .          .    733:}
(pprof) list NewRequirement
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels.NewRequirement in /go/src/github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels/selector.go
      20ms      660ms (flat, cum)  5.60% of Total
         .          .    105:// (5) The key is invalid due to its length, or sequence
         .          .    106://     of characters. See validateLabelKey for more details.
         .          .    107://
         .          .    108:// The empty string is a valid value in the input values set.
         .          .    109:func NewRequirement(key string, op Operator, vals sets.String) (*Requirement, error) {
      10ms      260ms    110:   if err := validateLabelKey(key); err != nil {
         .          .    111:           return nil, err
         .          .    112:   }
         .          .    113:   switch op {
         .          .    114:   case InOperator, NotInOperator:
         .          .    115:           if len(vals) == 0 {
         .          .    116:                   return nil, fmt.Errorf("for 'in', 'notin' operators, values set can't be empty")
         .          .    117:           }
      10ms       10ms    118:   case EqualsOperator, DoubleEqualsOperator, NotEqualsOperator:
         .          .    119:           if len(vals) != 1 {
         .          .    120:                   return nil, fmt.Errorf("exact-match compatibility requires one single value")
         .          .    121:           }
         .       10ms    122:   case ExistsOperator, DoesNotExistOperator:
         .          .    123:           if len(vals) != 0 {
         .          .    124:                   return nil, fmt.Errorf("values set must be empty for exists and does not exist")
         .          .    125:           }
         .          .    126:   default:
         .          .    127:           return nil, fmt.Errorf("operator '%v' is not recognized", op)
         .          .    128:   }
         .          .    129:
         .       20ms    130:   for v := range vals {
         .      320ms    131:           if err := validateLabelValue(v); err != nil {
         .          .    132:                   return nil, err
         .          .    133:           }
         .          .    134:   }
         .       40ms    135:   return &Requirement{key: key, operator: op, strValues: vals}, nil
         .          .    136:}
         .          .    137:
         .          .    138:// Matches returns true if the Requirement matches the input Labels.
         .          .    139:// There is a match in the following cases:
         .          .    140:// (1) The operator is Exists and Labels has the Requirement's key.
(pprof) 
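One possible mitigation (a sketch of the pattern, not Scope's actual fix): build each service's selector once per report and reuse it for every pod, instead of calling Selector()/SelectorFromSet once per service per pod inside WalkPods. A self-contained illustration using the same labels package as the listing above; the service names and label sets are made up.

package main

import (
    "fmt"

    "k8s.io/kubernetes/pkg/labels"
)

func main() {
    // Construct each service's selector once, up front. In podTopology this
    // would mean hoisting service.Selector() out of the WalkPods callback.
    serviceSelectors := map[string]labels.Selector{
        "frontend": labels.SelectorFromSet(labels.Set{"app": "frontend"}),
        "redis":    labels.SelectorFromSet(labels.Set{"app": "redis"}),
    }

    podLabels := labels.Set{"app": "frontend", "tier": "web"} // made-up pod labels

    // Matching against a prebuilt selector is cheap; selector construction
    // (the label validation in NewRequirement) was the hot path in the profile.
    for name, sel := range serviceSelectors {
        if sel.Matches(podLabels) {
            fmt.Println("pod matches service", name)
        }
    }
}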

@tomwilkie
Contributor Author

We made the gob encoding once-per-report a while back (it used to be once-per-report-per-app). That was a significant improvement. If we want to do gob-over-websocket, we'll have to go back to a separate encoder per app (as different app connections will have different lifecycles), which I worry will be more of a regression in CPU usage.
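To make the trade-off concrete, here is a standalone sketch (not Scope code; the Report type is made up): gob sends type descriptors once per encoder/stream, so a long-lived per-connection encoder pays that cost only on the first report, while a fresh encoder per report re-transmits it every time.

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
)

// Report is a made-up stand-in for Scope's report structure.
type Report struct {
    Hostname string
    Nodes    map[string]string
}

func main() {
    r := Report{Hostname: "probe-1", Nodes: map[string]string{"node": "x"}}

    // One long-lived encoder per stream: the type descriptor goes out with
    // the first value only, so subsequent encodes are much smaller.
    var stream bytes.Buffer
    enc := gob.NewEncoder(&stream)
    enc.Encode(r) // errors ignored for brevity
    first := stream.Len()
    enc.Encode(r)
    second := stream.Len() - first

    // A fresh encoder per report (as with one-shot HTTP posts) pays the
    // type-descriptor cost on every single report.
    var oneShot bytes.Buffer
    gob.NewEncoder(&oneShot).Encode(r)

    fmt.Printf("first=%dB subsequent=%dB fresh-encoder=%dB\n", first, second, oneShot.Len())
}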

How many probes are there? How many apps? I suspect we might just be hitting the O(n*m) nature of this architecture. Going for 2 apps and N probes should be an improvement?

Other things you highlight also look good:

  • rate limit reading of FDs, although due to the race with /proc/PID/net/tcp this will have to look like a periodic reading of namespaces, with some dynamic feedback based on the number of fds in a namespace, which might be complicated
  • kubernetes optimisations

Thanks for working on this fons!

@2opremio
Contributor

2opremio commented Feb 4, 2016

> We made the gob encoding once-per-report a while back (it used to be once-per-report-per-app). That was a significant improvement. If we want to do gob-over-websocket, we'll have to go back to a separate encoder per app (as different app connections will have different lifecycles), which I worry will be more of a regression in CPU usage.

Thanks, I hadn't thought about the (apps * probes) scenario. This use case involves multiple probes (5) reporting to a single app (which should be the most common one ... if we were not doing (apps*probes) in standalone mode).

However, take into account that, with websockets, the type-sending part of the encoding (apparently the most expensive part) will only be done once per app. With HTTP, it's done once per report. My intuition says that, with gob, up to a certain number (N) of reports it will be cheaper to encode a report multiple times with an existing encoder than to encode it only once with a new encoder each time.

Either way, the fact that we need to make clear-cut performance choices for the gob codec between ...

  • Apps connecting to multiple probes
  • Probes connecting to multiple apps.

... is another sign that gob is a bad fit due to its stateful nature. We should probably be using a codec which lets you optimize globally instead of per-stream.

> How many probes are there? How many apps? I suspect we might just be hitting the O(n*m) nature of this architecture. Going for 2 apps and N probes should be an improvement?

5 probes, 1 app. Yep, splitting the app would help performance-wise, but it would make Scope unusable on Kubernetes, since the user normally doesn't have a say in where the apps end up.

Using a tree structure with intermediate report-merging nodes would also do, but it may prove tricky to deploy. Either way, I would say 5 probes for a single app should be more than reasonable.

> Thanks for working on this fons!

No problem, thanks for the feedback

The rate limiting from #912 helps quite a bit

[profile graph: probe_rate_limit]

Now the profile is dominated by garbage collection and the kubernetes problem.

@2opremio
Contributor

2opremio commented Feb 4, 2016

After cherry-picking @paul's https://github.com/weaveworks/scope/tree/k8s-efficiency on top of the rate limiting, the CPU consumption is between 25% and 35%:

[profile graph: probe_rate_limit_k8s]

Now the profile is completely dominated by encoding, copying and garbage collection. I am going to work on the generated JSON encoding, clean up and get #912 in, and then see how we can improve the garbage collection dominance if it's still there.

@2opremio
Contributor

After the codec improvements, the probes (a38db2e) still consume ~20% CPU in the service.

pprof.localhost:4041.samples.cpu.001.pb.gz

[profile graph: probe]

The two main places to improve are:

  • Report copying. It consumes a lot of CPU and generates a lot of objects, putting a lot of pressure on the GC.
  • The codec. Yes, it still takes 17% of the time, but I think the solution here is simply to reduce the size of the reports or their frequency. I have created Send incremental reports probe->app #985.

@2opremio
Contributor

@tomwilkie @paulbellamy @peterbourgon Any ideas on how to reduce or optimize the copying?

@peterbourgon
Contributor

We can save a ton of allocs and CPU time by leveraging sync.Pool. For sure we can use pools of []byte or *bytes.Buffer when we serialize reports (and deserialize on the other end). That's easy and may get us some quick wins.
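For example, a minimal sketch of the buffer-pool idea (not Scope's actual encoding path; the encode helper and types are hypothetical):

package main

import (
    "bytes"
    "encoding/gob"
    "fmt"
    "sync"
)

// bufPool hands out reusable buffers so each report serialization reuses
// previously allocated memory instead of growing a fresh buffer every time.
var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

func encode(v interface{}) ([]byte, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset() // drop old contents but keep the underlying capacity
    defer bufPool.Put(buf)

    if err := gob.NewEncoder(buf).Encode(v); err != nil {
        return nil, err
    }
    // Copy out before the buffer goes back to the pool, since the pool may
    // hand it to another goroutine straight away.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out, nil
}

func main() {
    b, err := encode(map[string]string{"hostname": "probe-1"})
    fmt.Println(len(b), err)
}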

Beyond that, it'd be great to use pools for the reports themselves, i.e. don't allocate a new report and fill it, but fetch one from the pool, reset it, and write into it. But we'd need to implement a Reset method, and I'm not sure how simple that would be given all of our nested maps. This also assumes that resetting a map will maintain its allocated capacity; I think that's true, but without a cap to test I'm not positive.

Other thoughts?

@paulbellamy
Contributor

It's a bit unclear to me why copying is taking that much CPU and making that much garbage. Copy should be almost a no-op. Unless there is stuff left which is being copied? From the graph it looks a bit like Metrics?

Alternatively we could ditch the whole immutability/copying thing and rethink how the pipeline works. Go doesn't exactly make working with immutable datastructures easy/elegant anyway.

@peterbourgon
Contributor

> Alternatively we could ditch the whole immutability/copying thing and rethink how the pipeline works. Go doesn't exactly make working with immutable datastructures easy/elegant anyway.

👍 to this. The current report structure and pipeline architecture was designed for ease of use and expediency. If it's no longer easy to use (i.e. maintain and mutate with changing product requirements) and it's a performance bottleneck with no obvious fix, it's absolutely appropriate to revisit the overall design.

@2opremio
Contributor

> We can save a ton of allocs and CPU time by leveraging sync.Pool. For sure we can use pools of []byte or *bytes.Buffer when we serialize reports (and deserialize on the other end).

AFAIK the new codec library already uses pools.

However, I believe that the fromIntermediate/toIntermediate methods used for custom serialization (e.g. the ones needed for mndrix/ps-based types like LatestMap) may be expensive.

> Beyond that, it'd be great to use pools for the reports themselves

This sounds more like it!

@peterbourgon
Contributor

> This sounds more like it!

Unfortunately, after a bit of research and discussion with the community, I don't think the current report structure is going to lend itself to Pooling very well. In the end it's a collection of maps, and maps only 'reset' in linear time, i.e.

for _, topo := range report.Topologies() {
    for k := range topo {
        delete(topo, k) // each key has to be removed individually: O(len(topo))
    }
}

It's worth a profile, of course, but the cost of reset increases as the map gets bigger, making it less and less appealing vs. GC. My intuition is that if we want to Pool reports, we need to get rid of most if not all of the maps, in favor of slices or something else — IMO worth considering together with @paulbellamy's bag-of-nodes approach to the pipelines.
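As a small illustration of the slices-vs-maps point (a sketch with made-up types, not a proposal for the actual report structures): truncating a slice to zero length keeps its backing array, so 'reset' is constant time, whereas a map has to be emptied key by key.

package main

import "fmt"

// Node stands in for Scope's report node type.
type Node struct{ ID string }

// nodeList resets in constant time: truncating keeps the backing array.
type nodeList []Node

func (l *nodeList) Reset() { *l = (*l)[:0] }

// nodeMap resets in linear time: every key has to be deleted individually.
type nodeMap map[string]Node

func (m nodeMap) Reset() {
    for k := range m {
        delete(m, k)
    }
}

func main() {
    l := nodeList{{ID: "a"}, {ID: "b"}}
    m := nodeMap{"a": {ID: "a"}, "b": {ID: "b"}}
    l.Reset()
    m.Reset()
    fmt.Println(cap(l), len(l), len(m)) // the slice keeps its capacity: 2 0 0
}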

@2opremio
Contributor

Actually, after having a closer look at the profile, one of the nodes (17.76%) spent in the codec comes not from Encoding but from Decoding, so it can't originate from the reports being sent to the app. 10% of that time is spent reading http bodies, so the decoding itself is only 7%.

Docker/kubernetes info? It's hard to tell because apparently it's being done in a separate goroutine. I'll investigate, but we may be polling something way too often.

If needed, to accelerate the decoding of that information we can always generate Selfers for those types in the vendored directory.

@2opremio
Contributor

Also, to get a better sense of where we can reduce garbage generation, I just found out that pprof lets you obtain heap profiles with --alloc_objects. This will tell you how many objects have been allocated overall (not just in use). I will give that a try.

@2opremio
Contributor

> Docker/kubernetes info? It's hard to tell because apparently it's being done in a separate goroutine. I'll investigate, but we may be polling something way too often.

Yep, it's docker stats being gathered:

(pprof) list StartGatheringStats
Total: 6.87s
ROUTINE ======================== github.com/weaveworks/scope/probe/docker.(*container).StartGatheringStats.func1 in /go/src/github.com/weaveworks/scope/probe/docker/container.go
         0      1.23s (flat, cum) 17.90% of Total
         .          .    202:           c.statsConn = nil
         .          .    203:       }()
         .          .    204:
         .          .    205:       var stats docker.Stats
         .          .    206:       decoder := codec.NewDecoder(resp.Body, &codec.JsonHandle{})
         .      1.23s    207:       for err := decoder.Decode(&stats); err != io.EOF; err = decoder.Decode(&stats) {
         .          .    208:           if err != nil {
         .          .    209:               log.Errorf("docker container: error reading event, did container stop? %v", err)
         .          .    210:               return
         .          .    211:           }
         .          .    212:

And I don't think it was done in a separate goroutine; pprof just omitted the parents in the tree.

@paulbellamy
Contributor

go-dockerclient (and the docker API) has a streaming stats call that might reduce this.
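For reference, a minimal sketch of the streaming call being referred to, assuming the Client.Stats/StatsOptions shape documented for go-dockerclient (the container ID is a placeholder, error handling reduced to logging):

package main

import (
    "log"

    docker "github.com/fsouza/go-dockerclient"
)

func main() {
    client, err := docker.NewClient("unix:///var/run/docker.sock")
    if err != nil {
        log.Fatal(err)
    }

    statsCh := make(chan *docker.Stats)
    done := make(chan bool)

    go func() {
        // Stats blocks, pushing decoded stats onto statsCh until done is
        // closed or the container exits, then closes statsCh.
        err := client.Stats(docker.StatsOptions{
            ID:     "some-container-id", // placeholder
            Stats:  statsCh,
            Stream: true,
            Done:   done,
        })
        if err != nil {
            log.Printf("stats stream ended: %v", err)
        }
    }()

    for s := range statsCh {
        log.Printf("cpu total usage: %d", s.CPUStats.CPUUsage.TotalUsage)
    }
}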

@tomwilkie
Contributor Author

Nah, their code is basically identical.

On Friday, 19 February 2016, Paul Bellamy [email protected] wrote:

> go-dockerclient (and docker API) has a streaming stats thing https://godoc.org/github.com/fsouza/go-dockerclient#Client.Stats that might reduce this.



@2opremio
Contributor

Here's the --alloc_objects heap profile for a probe (d744278) running in the service:

pprof.localhost:4041.alloc_objects.alloc_space.003.pb.gz

[profile graph: probe_obj_allocs]

@2opremio
Contributor

After #1000, the probes are running stably under 20% (whilst kubelet 1.1 is running at >=70% on all nodes) and the garbage generated is tolerable. However, the /proc reads/stats are up at 26%. I will try to tweak the background reader a bit to see if I can lower that number.

CPU profile:

pprof.localhost:4041.samples.cpu.002.pb.gz

[profile graph: probe]

Objects allocations profile:

pprof.localhost:4041.alloc_objects.alloc_space.001.pb.gz

[profile graph: app_object_alloc]

@2opremio
Contributor

After making another measurement and looking at the syscalls (-focus 'Syscall$'), I can see that the Syscall load is shared between the connection walker (probe/endpoint/procspy/proc_linux.go) and the process walker (probe/process/walker_linux.go), which after #961 also inspects the fd directories:

pprof.localhost:4041.samples.cpu.003.pb.gz

[profile graph: probe_syscall]

@2opremio
Contributor

Closing in favor of #1010 and #1011
