Client APIs for reporting resource usage of host and allocations #1189

Merged
merged 69 commits into from
May 31, 2016
Commits
1cae57a
Add the Stats api to driverhandle
diptanu Apr 28, 2016
01e0ae7
Added a client API to display resource usage of an allocation
diptanu Apr 29, 2016
a485a38
Added cpu stats
diptanu Apr 29, 2016
50250b1
Added the nomad stats command
diptanu Apr 29, 2016
eda53e3
Added a ringbuff datastructure
diptanu May 9, 2016
f390261
Reporting time series of stats
diptanu May 9, 2016
445b181
Updated gopsutil
diptanu May 9, 2016
fe8f640
Collecting host stats
diptanu May 9, 2016
7d8196d
Added some documentation
diptanu May 9, 2016
1f12e90
Returning nil if peek is called before any value is enqueued
diptanu May 9, 2016
3f0c42c
Making the conversion to Stats simpler
diptanu May 9, 2016
72c60d6
Added some docs to resource stats endpoint
diptanu May 9, 2016
0fdff61
Implemented stats for raw_exec
diptanu May 11, 2016
c85b4de
Adding a query param to return time series of stats
diptanu May 18, 2016
9a851a1
Added the PidStats method on the executor
diptanu May 19, 2016
ea1370d
Fixed implementation of the docker stats
diptanu May 19, 2016
30cbfe1
Implemented cpu stats
diptanu May 19, 2016
f693545
Vendoring go-ps
diptanu May 19, 2016
cbe6f6a
updating the CpuStats api
diptanu May 19, 2016
7f016a7
Reporting percentage usage of cpu in nomad stats
diptanu May 19, 2016
0af3e7e
Using humanize for showing memory
diptanu May 19, 2016
9806867
Implemented nomad cpu percentage calculator
diptanu May 20, 2016
b7158be
Added locks to RingBuf
diptanu May 20, 2016
df68129
Added some docs
diptanu May 20, 2016
3f336f4
Added missing vendored dependencies
diptanu May 20, 2016
16f298f
Fixed the percentage calculation for cgroups
diptanu May 20, 2016
3dc28bd
Stopping stats collection of tasks which have been destroyed
diptanu May 21, 2016
458b701
Added a test for calculating cpu stats
diptanu May 21, 2016
3192e31
Renamed CpuUsage to CpuStats
diptanu May 21, 2016
6132ccc
Added pidstats in task resource usage struct
diptanu May 21, 2016
33f2d0c
Added a stats api for retrieving node stats
diptanu May 22, 2016
aee9db0
Showing host resource usage stats
diptanu May 22, 2016
1183037
Added uptime to node stats
diptanu May 22, 2016
4491c2b
Added disk usage to node status
diptanu May 22, 2016
b273eb8
Making task a flag in the stats command
diptanu May 24, 2016
b755ab9
Changed the stats endpoints
diptanu May 24, 2016
2b1f389
Acquiring locks before iterating allocations and tasks
diptanu May 24, 2016
95a3ca8
Changed the signature of ResourceUsageTS
diptanu May 25, 2016
cf8861e
Renamed monitorUsage method
diptanu May 25, 2016
bf6c034
Making the stats collection interval and number of data points to kee…
diptanu May 25, 2016
584c1e3
Incorporated review comments for executor
diptanu May 25, 2016
3a2cce2
Simplified the docker stats collection
diptanu May 25, 2016
31af4e0
Changed signature of Allocation Stats Reporter
diptanu May 25, 2016
73f0594
Refactored the api for NewHostStatsCollector
diptanu May 25, 2016
3a8b152
Added comments
diptanu May 25, 2016
f59ad3d
Fixed the logic of scanpids
diptanu May 25, 2016
2c52338
simplified the stats method in basic executor
diptanu May 26, 2016
71d3361
creating the host cpu percent calculator lazily
diptanu May 26, 2016
c99733e
Fixed the compilation on linux
diptanu May 26, 2016
7870912
Fixed the node status cli command
diptanu May 26, 2016
25be2da
Using the api client for querying nomad client endpoints
diptanu May 26, 2016
1c635ea
Fixing the alloc runner tests
diptanu May 26, 2016
b51b08b
Added a test to validate we are collecting stats
diptanu May 26, 2016
03c9d94
Making the call to Stats on a go-routine
diptanu May 26, 2016
f765e82
Stopping the metrics collector timers using defer and starting to col…
diptanu May 26, 2016
f16191c
comments
diptanu May 26, 2016
455c759
Fixed the docs for the stats command
diptanu May 26, 2016
2571ea6
Initializing the ring buffer with no cells
diptanu May 27, 2016
0b868e0
Initializing the ring buffer with no cells
diptanu May 27, 2016
68ec1f3
Implemented the resource usage ts since a time
diptanu May 27, 2016
15e79c3
Changing the api of the stats endpoints
diptanu May 27, 2016
06f8d58
Making the cli use new apis
diptanu May 27, 2016
4611540
Added a test for the clients stats endpoint
diptanu May 28, 2016
0782803
Added a test for alloc stats api endpoint
diptanu May 28, 2016
993675d
Added a test for docker
diptanu May 28, 2016
951b553
Added a test to make sure task runner is collecting stats
diptanu May 29, 2016
c760d59
Renamed error message in alloc endpoint
diptanu May 29, 2016
e00e203
updated the docker system dependency
diptanu May 29, 2016
f1a5348
Fixed a test
diptanu May 30, 2016
27 changes: 27 additions & 0 deletions api/allocations.go
@@ -1,8 +1,11 @@
package api

import (
"fmt"
"sort"
"time"

"github.com/hashicorp/go-cleanhttp"
)

// Allocations is used to query the alloc-related endpoints.
@@ -40,6 +43,30 @@ func (a *Allocations) Info(allocID string, q *QueryOptions) (*Allocation, *Query
return &resp, qm, nil
}

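// Stats returns the resource usage of each task in the allocation by querying
// the client node that runs it.
func (a *Allocations) Stats(alloc *Allocation, q *QueryOptions) (map[string]*TaskResourceUsage, error) {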
node, _, err := a.client.Nodes().Info(alloc.NodeID, q)
if err != nil {
return nil, err
}
if node.HTTPAddr == "" {
return nil, fmt.Errorf("http addr of the node where alloc %q is running is not advertised", alloc.ID)
}
client, err := NewClient(&Config{
Address: fmt.Sprintf("http://%s", node.HTTPAddr),
HttpClient: cleanhttp.DefaultClient(),
})
if err != nil {
return nil, err
}
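resp := make(map[string][]*TaskResourceUsage)
// Surface query failures instead of silently returning an empty result
if _, err := client.query("/v1/client/allocation/"+alloc.ID+"/stats", &resp, nil); err != nil {
return nil, err
}
res := make(map[string]*TaskResourceUsage)
for task, ru := range resp {
// Skip tasks for which the client returned no samples
if len(ru) == 0 {
continue
}
res[task] = ru[0]
}
return res, nil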
}

// Allocation is used for serialization of allocations.
type Allocation struct {
ID string
Expand Down
58 changes: 58 additions & 0 deletions api/nodes.go
@@ -1,8 +1,11 @@
package api

import (
"fmt"
"sort"
"strconv"

"github.com/hashicorp/go-cleanhttp"
)

// Nodes is used to query node-related API endpoints
@@ -71,6 +74,29 @@ func (n *Nodes) ForceEvaluate(nodeID string, q *WriteOptions) (string, *WriteMet
return resp.EvalID, wm, nil
}

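// Stats returns the resource usage stats of the host running the given node by
// querying that node directly.
func (n *Nodes) Stats(nodeID string, q *QueryOptions) (*HostStats, error) {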
node, _, err := n.client.Nodes().Info(nodeID, q)
if err != nil {
return nil, err
}
if node.HTTPAddr == "" {
return nil, fmt.Errorf("http addr of the node %q is not advertised", nodeID)
}
client, err := NewClient(&Config{
Address: fmt.Sprintf("http://%s", node.HTTPAddr),
HttpClient: cleanhttp.DefaultClient(),
})
if err != nil {
return nil, err
}
var resp []HostStats
if _, err := client.query("/v1/client/stats/", &resp, nil); err != nil {
return nil, err
}

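// Guard against an empty response before indexing into it
if len(resp) == 0 {
return nil, fmt.Errorf("node %q returned no host stats", nodeID)
}
return &resp[0], nil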
}

// Node is used to deserialize a node entry.
type Node struct {
ID string
@@ -90,6 +116,38 @@ type Node struct {
ModifyIndex uint64
}

// HostStats represents resource usage stats of the host running a Nomad client
type HostStats struct {
Memory *HostMemoryStats
CPU []*HostCPUStats
DiskStats []*HostDiskStats
Uptime uint64
}

type HostMemoryStats struct {
Total uint64
Available uint64
Used uint64
Free uint64
}

type HostCPUStats struct {
CPU string
User float64
System float64
Idle float64
}

type HostDiskStats struct {
Device string
Mountpoint string
Size uint64
Used uint64
Available uint64
UsedPercent float64
InodesUsedPercent float64
}
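
A caller of the new node stats endpoint receives a list of samples and typically wants a derived figure such as memory utilization. The following is a minimal sketch under stated assumptions: the JSON payload shape is inferred from the `[]HostStats` response type in the diff, the structs are trimmed copies, and `latestHostStats` and `memoryUsedPercent` are hypothetical helpers that are not part of the Nomad API.

```go
package main

import (
	"encoding/json"
	"errors"
)

// HostMemoryStats mirrors the memory fields of the struct added in the diff.
type HostMemoryStats struct {
	Total     uint64
	Available uint64
	Used      uint64
	Free      uint64
}

// HostStats is a trimmed copy that keeps only the fields used in this sketch.
type HostStats struct {
	Memory *HostMemoryStats
	Uptime uint64
}

// latestHostStats (hypothetical helper) decodes a JSON array of samples and
// returns the first one, or an error if the array is empty.
func latestHostStats(payload []byte) (*HostStats, error) {
	var samples []HostStats
	if err := json.Unmarshal(payload, &samples); err != nil {
		return nil, err
	}
	if len(samples) == 0 {
		return nil, errors.New("no host stats in payload")
	}
	return &samples[0], nil
}

// memoryUsedPercent (hypothetical helper) reports used memory as a percentage
// of total memory. It returns 0 when no memory stats are available.
func memoryUsedPercent(m *HostMemoryStats) float64 {
	if m == nil || m.Total == 0 {
		return 0
	}
	return float64(m.Used) / float64(m.Total) * 100
}
```

Checking the sample count before indexing keeps the helper safe when the node has not collected anything yet.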

// NodeListStub is a subset of information returned during
// node list operations.
type NodeListStub struct {
Expand Down
33 changes: 33 additions & 0 deletions api/tasks.go
@@ -4,6 +4,39 @@ import (
"time"
)

// MemoryStats holds memory usage related stats
type MemoryStats struct {
RSS uint64
Cache uint64
Swap uint64
MaxUsage uint64
KernelUsage uint64
KernelMaxUsage uint64
}

// CpuStats holds cpu usage related stats
type CpuStats struct {
SystemMode float64
UserMode float64
ThrottledPeriods uint64
ThrottledTime uint64
Percent float64
}

// ResourceUsage holds information related to cpu and memory stats
type ResourceUsage struct {
MemoryStats *MemoryStats
CpuStats *CpuStats
}

// TaskResourceUsage holds aggregated resource usage of all processes in a Task
// and the resource usage of the individual pids
type TaskResourceUsage struct {
ResourceUsage *ResourceUsage
Timestamp int64
Pids map[string]*ResourceUsage
}
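
`TaskResourceUsage` carries both an aggregate and a per-pid breakdown. As a rough illustration of how a consumer might walk the `Pids` map, the sketch below sums RSS and CPU percent across processes. The structs are trimmed copies of the ones in the diff, `sumPids` is a hypothetical helper, and nothing in the source confirms that the aggregate `ResourceUsage` always equals this sum.

```go
package main

// Trimmed copies of the structs from api/tasks.go, keeping only the fields
// that this sketch reads.
type MemoryStats struct {
	RSS uint64
}

type CpuStats struct {
	Percent float64
}

type ResourceUsage struct {
	MemoryStats *MemoryStats
	CpuStats    *CpuStats
}

type TaskResourceUsage struct {
	ResourceUsage *ResourceUsage
	Timestamp     int64
	Pids          map[string]*ResourceUsage
}

// sumPids (hypothetical helper) adds up the RSS and CPU percent of every pid
// in the task, skipping entries whose stats are missing.
func sumPids(tru *TaskResourceUsage) (rss uint64, cpuPercent float64) {
	if tru == nil {
		return 0, 0
	}
	for _, ru := range tru.Pids {
		if ru == nil {
			continue
		}
		if ru.MemoryStats != nil {
			rss += ru.MemoryStats.RSS
		}
		if ru.CpuStats != nil {
			cpuPercent += ru.CpuStats.Percent
		}
	}
	return rss, cpuPercent
}
```

The nil checks matter because every level of these structs is a pointer, so a driver that reports only memory or only CPU would otherwise cause a panic.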

// RestartPolicy defines how the Nomad client restarts
// tasks in a taskgroup when they fail
type RestartPolicy struct {
Expand Down
34 changes: 34 additions & 0 deletions client/alloc_runner.go
@@ -28,6 +28,11 @@ const (
// AllocStateUpdater is used to update the status of an allocation
type AllocStateUpdater func(alloc *structs.Allocation)

// AllocStatsReporter exposes stats related APIs of an allocation runner
type AllocStatsReporter interface {
AllocStats() map[string]TaskStatsReporter
}

// AllocRunner is used to wrap an allocation and provide the execution context.
type AllocRunner struct {
config *config.Config
@@ -471,6 +476,35 @@ func (r *AllocRunner) Update(update *structs.Allocation) {
}
}

// StatsReporter returns an interface to query resource usage statistics of an
// allocation
func (r *AllocRunner) StatsReporter() AllocStatsReporter {
return r
}

// AllocStats returns the stats reporter of all the tasks running in the
// allocation
func (r *AllocRunner) AllocStats() map[string]TaskStatsReporter {
r.taskLock.RLock()
defer r.taskLock.RUnlock()
res := make(map[string]TaskStatsReporter)
Review comment (Contributor): Use the lock

for task, tr := range r.tasks {
res[task] = tr.StatsReporter()
}
return res
}

// TaskStats returns the stats reporter for a specific task running in the
// allocation
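func (r *AllocRunner) TaskStats(task string) (TaskStatsReporter, error) {
// Hold the task lock while reading the tasks map, as AllocStats does
r.taskLock.RLock()
defer r.taskLock.RUnlock()
tr, ok := r.tasks[task]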
if !ok {
return nil, fmt.Errorf("task %q not running in allocation %v", task, r.alloc.ID)
}

return tr.StatsReporter(), nil
}

// shouldUpdate takes the AllocModifyIndex of an allocation sent from the server and
// checks if the current running allocation is behind and should be updated.
func (r *AllocRunner) shouldUpdate(serverIndex uint64) bool {
Expand Down
135 changes: 126 additions & 9 deletions client/client.go
@@ -18,6 +18,7 @@ import (
"github.com/hashicorp/nomad/client/consul"
"github.com/hashicorp/nomad/client/driver"
"github.com/hashicorp/nomad/client/fingerprint"
"github.com/hashicorp/nomad/client/stats"
"github.com/hashicorp/nomad/nomad"
"github.com/hashicorp/nomad/nomad/structs"
"github.com/mitchellh/hashstructure"
@@ -76,11 +77,27 @@ const (
// DefaultConfig returns the default configuration
func DefaultConfig() *config.Config {
return &config.Config{
LogOutput: os.Stderr,
Region: "global",
StatsDataPoints: 60,
StatsCollectionInterval: 1 * time.Second,
}
}

// ClientStatsReporter exposes all the APIs related to resource usage of a Nomad
// Client
type ClientStatsReporter interface {
// AllocStats returns a map of alloc ids and their corresponding stats
// collector
AllocStats() map[string]AllocStatsReporter
Review comment (Contributor): Comments

// HostStats returns resource usage stats for the host
HostStats() []*stats.HostStats

// HostStatsTS returns a time series of host resource usage stats
HostStatsTS(since int64) []*stats.HostStats
}

// Client is used to implement the client interaction with Nomad. Clients
// are expected to register as a schedulable node to the servers, and to
// run allocations as determined by the servers.
@@ -116,6 +133,11 @@ type Client struct {

consulService *consul.ConsulService

// HostStatsCollector collects host resource usage stats
hostStatsCollector *stats.HostStatsCollector
Review comment (Contributor): Comments

Review comment (Contributor): Can you get rid of this in the struct and just instantiate it in monitorUsage

Reply (Contributor Author): The advantage of this being here is that if we can't create the host stats collector for some reason the client won't start. If we move this down, the client will run but we might not be able to collect stats after the client has started. And I think we shouldn't run the client if we can't create the stats collector since down the line, resource usage stats might feed into scheduling decisions.

resourceUsage *stats.RingBuff
resourceUsageLock sync.RWMutex

shutdown bool
shutdownCh chan struct{}
shutdownLock sync.Mutex
@@ -126,15 +148,22 @@ func NewClient(cfg *config.Config) (*Client, error) {
// Create a logger
logger := log.New(cfg.LogOutput, "", log.LstdFlags)

resourceUsage, err := stats.NewRingBuff(cfg.StatsDataPoints)
if err != nil {
return nil, err
}

// Create the client
c := &Client{
config: cfg,
start: time.Now(),
connPool: nomad.NewPool(cfg.LogOutput, clientRPCCache, clientMaxStreams, nil),
logger: logger,
hostStatsCollector: stats.NewHostStatsCollector(),
resourceUsage: resourceUsage,
allocs: make(map[string]*AllocRunner),
allocUpdates: make(chan *structs.Allocation, 64),
shutdownCh: make(chan struct{}),
}

// Initialize the client
@@ -189,6 +218,9 @@ func NewClient(cfg *config.Config) (*Client, error) {
// Start the client!
go c.run()

// Start collecting stats
go c.collectHostStats()

// Start the consul sync
go c.syncConsul()

@@ -394,6 +426,67 @@ func (c *Client) Node() *structs.Node {
return c.config.Node
}

// StatsReporter exposes the various APIs related to resource usage of a Nomad
// client
func (c *Client) StatsReporter() ClientStatsReporter {
return c
}

// AllocStats returns the stats reporters of all the allocations running on a
// Nomad client
func (c *Client) AllocStats() map[string]AllocStatsReporter {
res := make(map[string]AllocStatsReporter)
Review comment (Contributor): Use the lock when iterating over the allocs

Reply (Contributor Author): @dadgar I think I should use the getAllocRunners method, which locks and returns a snapshot of the alloc runners.

allocRunners := c.getAllocRunners()
for alloc, ar := range allocRunners {
res[alloc] = ar
Review comment (Contributor): res[alloc] = ar.StatsReporter()

}
return res
}

// HostStats returns all the stats related to a Nomad client
func (c *Client) HostStats() []*stats.HostStats {
c.resourceUsageLock.RLock()
defer c.resourceUsageLock.RUnlock()
val := c.resourceUsage.Peek()
Review comment (Contributor): What if val is nil?

Reply (Contributor Author): Then we return nil

ru, ok := val.(*stats.HostStats)
if !ok {
// Nothing has been collected yet
return nil
}
return []*stats.HostStats{ru}
}

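// HostStatsTS returns the host resource usage stats collected at or after the
// given timestamp
func (c *Client) HostStatsTS(since int64) []*stats.HostStats {
c.resourceUsageLock.RLock()
defer c.resourceUsageLock.RUnlock()

values := c.resourceUsage.Values()
// Nothing has been collected yet, so there is nothing to search
if len(values) == 0 {
return nil
}
low := 0
high := len(values) - 1
var idx int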

for {
mid := (low + high) >> 1
midVal, _ := values[mid].(*stats.HostStats)
if midVal.Timestamp < since {
low = mid + 1
} else if midVal.Timestamp > since {
high = mid - 1
} else if midVal.Timestamp == since {
idx = mid
break
}
if low > high {
idx = low
break
}
}
values = values[idx:]
ts := make([]*stats.HostStats, len(values))
for index, val := range values {
ru, _ := val.(*stats.HostStats)
ts[index] = ru
}
return ts

}
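
The hand-written binary search in HostStatsTS finds the first sample whose timestamp is not older than `since`. The same lookup can be sketched with `sort.Search` from the Go standard library. This is an illustration only, assuming the samples are ordered by ascending timestamp; `sample` and `samplesSince` are hypothetical stand-ins for `*stats.HostStats` and the method in the diff.

```go
package main

import "sort"

// sample is a hypothetical stand-in for *stats.HostStats that keeps only the
// timestamp.
type sample struct {
	Timestamp int64
}

// samplesSince returns the samples whose Timestamp is greater than or equal
// to ts. The input must be sorted by ascending Timestamp.
func samplesSince(samples []*sample, ts int64) []*sample {
	// sort.Search returns the smallest index for which the predicate is
	// true, or len(samples) if it is false for every index.
	idx := sort.Search(len(samples), func(i int) bool {
		return samples[i].Timestamp >= ts
	})
	return samples[idx:]
}
```

Because `sort.Search` returns `len(samples)` when no element matches, an empty input and a `ts` newer than every sample both produce an empty slice without any special casing.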

// GetAllocFS returns the AllocFS interface for the alloc dir of an allocation
func (c *Client) GetAllocFS(allocID string) (allocdir.AllocDirFS, error) {
ar, ok := c.allocs[allocID]
@@ -1227,3 +1320,27 @@ func (c *Client) syncConsul() {

}
}

// collectHostStats collects host resource usage stats periodically
func (c *Client) collectHostStats() {
// Start collecting host stats right away and then keep collecting every
// collection interval
next := time.NewTimer(0)
defer next.Stop()
for {
select {
case <-next.C:
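ru, err := c.hostStatsCollector.Collect()
if err != nil {
c.logger.Printf("[DEBUG] client: error fetching host resource usage stats: %v", err)
// Re-arm the timer so that a failed collection doesn't stop future ones
next.Reset(c.config.StatsCollectionInterval)
continue
}
// Enqueue mutates the buffer, so take the write lock
c.resourceUsageLock.Lock()
c.resourceUsage.Enqueue(ru)
c.resourceUsageLock.Unlock()
next.Reset(c.config.StatsCollectionInterval)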
case <-c.shutdownCh:
return
}
}
}
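
The collection loop stores samples in `stats.RingBuff`, whose implementation is not part of the files shown here. The sketch below shows how a buffer with the same three operations used in client.go (`Enqueue`, `Peek`, `Values`) might behave. It assumes a fixed capacity, overwriting of the oldest value, and a nil result from `Peek` on an empty buffer, as the commit message "Returning nil if peek is called before any value is enqueued" suggests. `ringBuffer` and `newRingBuffer` are hypothetical names and this is not the PR's implementation.

```go
package main

import (
	"errors"
	"sync"
)

// ringBuffer is a hypothetical fixed-capacity buffer that is safe for
// concurrent use. It is not the stats.RingBuff type from the PR.
type ringBuffer struct {
	mu    sync.RWMutex
	cells []interface{}
	head  int // index of the next cell to write
	count int // number of cells currently in use
}

// newRingBuffer creates a buffer that retains at most capacity values.
func newRingBuffer(capacity int) (*ringBuffer, error) {
	if capacity < 1 {
		return nil, errors.New("capacity must be at least 1")
	}
	return &ringBuffer{cells: make([]interface{}, capacity)}, nil
}

// Enqueue stores v, overwriting the oldest value once the buffer is full.
func (r *ringBuffer) Enqueue(v interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cells[r.head] = v
	r.head = (r.head + 1) % len(r.cells)
	if r.count < len(r.cells) {
		r.count++
	}
}

// Peek returns the most recently enqueued value, or nil if the buffer is
// empty.
func (r *ringBuffer) Peek() interface{} {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.count == 0 {
		return nil
	}
	return r.cells[(r.head-1+len(r.cells))%len(r.cells)]
}

// Values returns the stored values ordered from oldest to newest.
func (r *ringBuffer) Values() []interface{} {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]interface{}, 0, r.count)
	start := (r.head - r.count + len(r.cells)) % len(r.cells)
	for i := 0; i < r.count; i++ {
		out = append(out, r.cells[(start+i)%len(r.cells)])
	}
	return out
}
```

With the mutex inside the buffer, writers take the exclusive lock and readers share the read lock, so a caller does not need a second lock just to protect `Enqueue`.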