Add controller performance metrics #391

Merged
Changes from 12 commits
30 commits
9283e2f
Added GameServer control performance metrics;
dsmith111 Sep 15, 2022
466f2af
Merge branch 'main' of https://github.com/dsmith111/thundernetes into…
dsmith111 Sep 15, 2022
71fd931
Merge branch 'main' into smithdavi/add-controller-performance-metrics
dsmith111 Sep 15, 2022
f432e82
Update monitoring documentation;
dsmith111 Sep 15, 2022
baa71f4
Merge branch 'smithdavi/add-controller-performance-metrics' of https:…
dsmith111 Sep 15, 2022
41e68da
Update yaml
dsmith111 Sep 17, 2022
e9cd4fb
Fix capitalization
dsmith111 Sep 17, 2022
293efc1
Revert extra changes triggering installfile alret
dsmith111 Sep 17, 2022
dc28b13
Handle dereferencing;
dsmith111 Sep 18, 2022
c2a122e
Add pointers
dsmith111 Sep 19, 2022
38c7592
Decreasing time diff
dsmith111 Sep 19, 2022
dbd8b71
Merge branch 'main' into smithdavi/add-controller-performance-metrics
dsmith111 Sep 19, 2022
ecebe2b
PR Updates;
dsmith111 Sep 19, 2022
d8f0d88
Merge branch 'smithdavi/add-controller-performance-metrics' of github…
dsmith111 Sep 19, 2022
e41d915
Add patching exception
dsmith111 Sep 21, 2022
b7ed55e
Merge branch 'main' into smithdavi/add-controller-performance-metrics
dsmith111 Sep 21, 2022
0be080c
Change metric emission to nodeagent
dsmith111 Sep 25, 2022
20be9c0
Update dashboard
dsmith111 Sep 25, 2022
6e30235
Merge branch 'main' of github.com:dsmith111/thundernetes into smithda…
dsmith111 Sep 25, 2022
e0fc978
Revert test
dsmith111 Sep 25, 2022
a67fbb9
Cleanup deletes
dsmith111 Sep 25, 2022
0cd6e86
Minor tweaks
dsmith111 Sep 25, 2022
c4217f8
Conditional
dsmith111 Sep 25, 2022
77b1c88
PR Suggested changes
dsmith111 Sep 26, 2022
feab86c
Remove spacing added to gameserverbuild
dsmith111 Sep 26, 2022
d0dbb61
Remove empty line in nodeagent
dsmith111 Sep 26, 2022
b3e07d9
Renaming
dsmith111 Sep 26, 2022
fd2fbae
Update dashboard
dsmith111 Sep 26, 2022
9318c35
Remove metric
dsmith111 Sep 26, 2022
1861a24
Merge branch 'main' into smithdavi/add-controller-performance-metrics
dgkanatsios Sep 26, 2022
5 changes: 4 additions & 1 deletion .gitignore
@@ -21,4 +21,7 @@
installfilesdev


.uptodate
.uptodate

# vscode settings
.vscode
4 changes: 4 additions & 0 deletions docs/howtos/monitoring.md
@@ -86,9 +86,13 @@ There is a custom Grafana dashboard example that visualizes some of this data in
| connected_players | Gauge | nodeagent |
| gameservers_current_state_per_build | Gauge | controller-manager |
| gameservers_created_total | Counter | controller-manager |
| gameservers_create_duration | Gauge | controller-manager |
| gameservers_reconcile_standby_duration | Gauge | controller-manager |
| gameservers_sessionended_total | Counter | controller-manager |
| gameservers_crashed_total | Counter | controller-manager |
| gameservers_deleted_total | Counter | controller-manager |
| gameservers_end_duration | Gauge | controller-manager |
| gameservers_clean_up_duration | Gauge | controller-manager |
| allocations_total | Counter | controller-manager |

## More pictures
2 changes: 2 additions & 0 deletions pkg/operator/api/v1alpha1/gameserver_types.go
@@ -80,6 +80,8 @@ type GameServerStatus struct {
Health GameServerHealth `json:"health,omitempty"`
// State defines the state of the game server (Initializing, StandingBy, Active etc.)
State GameServerState `json:"state,omitempty"`
// PrevState is the previously known, manually set state
PrevState GameServerState `json:"prevState,omitempty"`
// PublicIP is the PublicIP of the game server
PublicIP string `json:"publicIP,omitempty"`
// Ports is a concatenated list of the ports this game server listens to
13 changes: 13 additions & 0 deletions pkg/operator/controllers/controller_utils.go
@@ -4,6 +4,7 @@ import (
"bytes"
"context"
"fmt"
"math"
"math/rand"
"strconv"
"strings"
@@ -71,6 +72,18 @@ func randString(n int) string {
return string(b)
}

// getStateDuration returns the elapsed time in milliseconds between startTime and endTime,
// falling back to the current time when endTime is nil
func getStateDuration(endTime *metav1.Time, startTime *metav1.Time) float64 {
var stateDuration float64
// If the end time is missing, use the current time
if endTime == nil {
stateDuration = math.Abs(float64(time.Since(startTime.Time).Milliseconds()))
} else {
stateDuration = math.Abs(float64(endTime.Time.Sub(startTime.Time).Milliseconds()))
}
return stateDuration
}

// GetNodeDetails returns the Public IP of the node and the node age in days
// if the Node does not have a Public IP, method returns the internal one
func GetNodeDetails(ctx context.Context, r client.Reader, nodeName string) (string, string, int, error) {
12 changes: 12 additions & 0 deletions pkg/operator/controllers/controller_utils_test.go
@@ -2,6 +2,7 @@ package controllers

import (
"fmt"
"time"

mpsv1alpha1 "github.com/playfab/thundernetes/pkg/operator/api/v1alpha1"
corev1 "k8s.io/api/core/v1"
@@ -190,5 +191,16 @@ var _ = Describe("Utilities tests", func() {
node.Labels[LabelGameServerNode] = "nottrue"
Expect(isNodeGameServerNode(node)).To(BeFalse())
})
It("should return a positive time duration", func() {
var startTime metav1.Time
startTime.Time = time.Now()

var endTime metav1.Time
endTime.Time = time.Now().Add(5 * time.Second)

Expect(getStateDuration(&endTime, &startTime)).To(BeAssignableToTypeOf(float64(0)))
Expect(getStateDuration(nil, &startTime)).To(BeAssignableToTypeOf(float64(0)))
Expect(getStateDuration(&endTime, &startTime)).To(BeNumerically(">=", float64(0)))
})
})
})
68 changes: 56 additions & 12 deletions pkg/operator/controllers/gameserverbuild_controller.go
@@ -23,6 +23,7 @@ import (
"runtime"
"sort"
"sync"
"time"

mpsv1alpha1 "github.com/playfab/thundernetes/pkg/operator/api/v1alpha1"
corev1 "k8s.io/api/core/v1"
@@ -140,6 +141,14 @@ func (r *GameServerBuildReconciler) Reconcile(ctx context.Context, req ctrl.Requ

// calculate counts by state so we can update .status accordingly
var activeCount, standingByCount, crashesCount, initializingCount, pendingCount int
// Gather sum of time taken to reach standingby phase and server count to produce the recent average gameserver initialization time
var timeToStandBySum float64
var recentStandingByCount int

// Gather current sum of estimated time taken to clean up crashed or pending deletion gameservers
var timeToDeleteBySum float64
var pendingCleanUpCount int

for i := 0; i < len(gameServers.Items); i++ {
gs := gameServers.Items[i]

@@ -149,16 +158,24 @@ func (r *GameServerBuildReconciler) Reconcile(ctx context.Context, req ctrl.Requ
initializingCount++
} else if gs.Status.State == mpsv1alpha1.GameServerStateStandingBy && gs.Status.Health == mpsv1alpha1.GameServerHealthy {
standingByCount++
if gs.Status.State != gs.Status.PrevState {
timeToStandBySum += getStateDuration(gs.Status.ReachedStandingByOn, &gs.CreationTimestamp)
recentStandingByCount++
}
} else if gs.Status.State == mpsv1alpha1.GameServerStateActive && gs.Status.Health == mpsv1alpha1.GameServerHealthy {
activeCount++
} else if gs.Status.State == mpsv1alpha1.GameServerStateGameCompleted && gs.Status.Health == mpsv1alpha1.GameServerHealthy {
// game server process exited with code 0
if err := r.Delete(ctx, &gs); err != nil {
return ctrl.Result{}, err
}

GameServersSessionEndedCounter.WithLabelValues(gsb.Name).Inc()
r.expectations.addGameServerToUnderDeletionMap(gsb.Name, gs.Name)
r.Recorder.Eventf(&gsb, corev1.EventTypeNormal, "Exited", "GameServer %s session completed", gs.Name)

pendingCleanUpCount++
timeToDeleteBySum += getStateDuration(gs.DeletionTimestamp, &gs.CreationTimestamp)
} else if gs.Status.State == mpsv1alpha1.GameServerStateCrashed {
// game server process exited with code != 0 (crashed)
crashesCount++
@@ -168,6 +185,9 @@ func (r *GameServerBuildReconciler) Reconcile(ctx context.Context, req ctrl.Requ
GameServersCrashedCounter.WithLabelValues(gsb.Name).Inc()
r.expectations.addGameServerToUnderDeletionMap(gsb.Name, gs.Name)
r.Recorder.Eventf(&gsb, corev1.EventTypeNormal, "Unhealthy", "GameServer %s was deleted because it became unhealthy, state: %s, health: %s", gs.Name, gs.Status.State, gs.Status.Health)

pendingCleanUpCount++
timeToDeleteBySum += getStateDuration(gs.DeletionTimestamp, &gs.CreationTimestamp)
} else if gs.Status.Health == mpsv1alpha1.GameServerUnhealthy {
// all cases where the game server was marked as Unhealthy
crashesCount++
@@ -177,25 +197,39 @@ func (r *GameServerBuildReconciler) Reconcile(ctx context.Context, req ctrl.Requ
GameServersUnhealthyCounter.WithLabelValues(gsb.Name).Inc()
r.expectations.addGameServerToUnderDeletionMap(gsb.Name, gs.Name)
r.Recorder.Eventf(&gsb, corev1.EventTypeNormal, "Crashed", "GameServer %s was deleted because it crashed, state: %s, health: %s", gs.Name, gs.Status.State, gs.Status.Health)

pendingCleanUpCount++
timeToDeleteBySum += getStateDuration(gs.DeletionTimestamp, &gs.CreationTimestamp)
}
if gs.Status.State != gs.Status.PrevState {
gs.Status.PrevState = gs.Status.State
}
}

if recentStandingByCount > 0 {
GameServersCreatedDuration.WithLabelValues(gsb.Name).Set(timeToStandBySum / float64(recentStandingByCount))
}

if pendingCleanUpCount > 0 {
GameServersCleanUpDuration.WithLabelValues(gsb.Name).Set(timeToDeleteBySum / float64(pendingCleanUpCount))
}
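Both averages above are only written when their sample count is positive, so the gauges keep their previous value on a reconcile with no samples. A minimal sketch of that guard, using a hypothetical `meanMillis` helper that is not part of the PR:

```go
package main

import "fmt"

// meanMillis returns the arithmetic mean of the samples and reports false
// when there are none, so a caller can leave a gauge at its previous value
// instead of dividing by zero.
func meanMillis(samples []float64) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples)), true
}

func main() {
	if avg, ok := meanMillis([]float64{1200, 1800, 3000}); ok {
		fmt.Println(avg) // 2000
	}
	_, ok := meanMillis(nil)
	fmt.Println(ok) // false
}
```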

// calculate the total amount of servers not in the active state
nonActiveGameServersCount := standingByCount + initializingCount + pendingCount

// Evaluate desired number of servers against actual
var totalNumberOfGameServersToDelete int = 0

// user has decreased standingBy numbers
if nonActiveGameServersCount > gsb.Spec.StandingBy {
- totalNumberOfGameServersToDelete := int(math.Min(float64(nonActiveGameServersCount-gsb.Spec.StandingBy), maxNumberOfGameServersToDelete))
- err := r.deleteNonActiveGameServers(ctx, &gsb, &gameServers, totalNumberOfGameServersToDelete)
- if err != nil {
- return ctrl.Result{}, err
- }
+ totalNumberOfGameServersToDelete += int(math.Min(float64(nonActiveGameServersCount-gsb.Spec.StandingBy), maxNumberOfGameServersToDelete))
}

- // we need to check if we are above the max
+ // we also need to check if we are above the max
// this can happen if the user modifies the spec.Max during the GameServerBuild's lifetime
if nonActiveGameServersCount+activeCount > gsb.Spec.Max {
- totalNumberOfGameServersToDelete := int(math.Min(float64(nonActiveGameServersCount+activeCount-gsb.Spec.Max), maxNumberOfGameServersToDelete))
+ totalNumberOfGameServersToDelete = int(math.Min(float64(totalNumberOfGameServersToDelete+(nonActiveGameServersCount+activeCount-gsb.Spec.Max)), maxNumberOfGameServersToDelete))
+ }
+ if totalNumberOfGameServersToDelete > 0 {
err := r.deleteNonActiveGameServers(ctx, &gsb, &gameServers, totalNumberOfGameServersToDelete)
if err != nil {
return ctrl.Result{}, err
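The reworked logic in this hunk folds the two scale-down rules (surplus over `StandingBy`, surplus over `Max`) into one count and issues a single delete call. The arithmetic can be sketched as a pure function; `gameServersToDelete`, `minInt`, and the parameter names below are hypothetical and only mirror the expressions in the diff.

```go
package main

import "fmt"

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// gameServersToDelete mirrors the arithmetic in the diff: the surplus over
// the standingBy target and the surplus over the max are added together,
// and each step is capped at deleteCap.
func gameServersToDelete(nonActive, active, standingBy, maxServers, deleteCap int) int {
	total := 0
	if nonActive > standingBy {
		total += minInt(nonActive-standingBy, deleteCap)
	}
	if nonActive+active > maxServers {
		total = minInt(total+(nonActive+active-maxServers), deleteCap)
	}
	return total
}

func main() {
	fmt.Println(gameServersToDelete(7, 3, 5, 20, 10))   // 2: only the standingBy surplus
	fmt.Println(gameServersToDelete(7, 5, 5, 10, 10))   // 4: both surpluses are summed
	fmt.Println(gameServersToDelete(50, 0, 5, 100, 20)) // 20: limited by the cap
}
```

In the second call the same two servers are counted under both rules, so the sum exceeds either surplus alone. Whether that is intended is not stated in the PR.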
@@ -207,13 +241,16 @@ func (r *GameServerBuildReconciler) Reconcile(ctx context.Context, req ctrl.Requ
// we attempt to create the missing number of game servers, but we don't want to create more than the max
// an error channel for the go routines to write errors
errCh := make(chan error, maxNumberOfGameServersToAdd)

// Time how long it takes to trigger new standby gameservers
standByReconcileStartTime := time.Now()
// a waitgroup for async create calls
var wg sync.WaitGroup
for i := 0; i < gsb.Spec.StandingBy-nonActiveGameServersCount &&
i+nonActiveGameServersCount+activeCount < gsb.Spec.Max &&
i < maxNumberOfGameServersToAdd; i++ {
wg.Add(1)
- go func() {
+ go func(standByStartTime time.Time) {
defer wg.Done()
newgs, err := NewGameServerForGameServerBuild(&gsb, r.PortRegistry)
if err != nil {
@@ -224,12 +261,15 @@ func (r *GameServerBuildReconciler) Reconcile(ctx context.Context, req ctrl.Requ
errCh <- err
return
}
newgs.Status.PrevState = mpsv1alpha1.GameServerStateInitializing
r.expectations.addGameServerToUnderCreationMap(gsb.Name, newgs.Name)
GameServersCreatedCounter.WithLabelValues(gsb.Name).Inc()
r.Recorder.Eventf(&gsb, corev1.EventTypeNormal, "Creating", "Creating GameServer %s", newgs.Name)
- }()
+ GameServersStandByReconcileDuration.WithLabelValues(gsb.Name).Set(float64(time.Since(standByStartTime).Milliseconds()))
+ }(standByReconcileStartTime)
}
wg.Wait()

if len(errCh) > 0 {
return ctrl.Result{}, <-errCh
}
@@ -325,6 +365,8 @@ func (r *GameServerBuildReconciler) deleteNonActiveGameServers(ctx context.Conte
// a waitgroup for async deletion calls
var wg sync.WaitGroup
deletionCalls := 0
deletionStartTime := time.Now()

// we sort the GameServers by state so that we can delete the ones that are empty state or Initializing before we delete the StandingBy ones (if needed)
// this is to make sure we don't fall below the desired number of StandingBy during scaling down
sort.Sort(ByState(gameServers.Items))
@@ -334,7 +376,7 @@ func (r *GameServerBuildReconciler) deleteNonActiveGameServers(ctx context.Conte
if gs.Status.State == "" || gs.Status.State == mpsv1alpha1.GameServerStateInitializing || gs.Status.State == mpsv1alpha1.GameServerStateStandingBy {
deletionCalls++
wg.Add(1)
- go func() {
+ go func(deletionStartTime time.Time) {
defer wg.Done()
if err := r.deleteGameServer(ctx, &gs); err != nil {
if apierrors.IsConflict(err) { // this GameServer has been updated, skip it
@@ -346,7 +388,9 @@ func (r *GameServerBuildReconciler) deleteNonActiveGameServers(ctx context.Conte
GameServersDeletedCounter.WithLabelValues(gsb.Name).Inc()
r.expectations.addGameServerToUnderDeletionMap(gsb.Name, gs.Name)
r.Recorder.Eventf(gsb, corev1.EventTypeNormal, "GameServer deleted", "GameServer %s deleted", gs.Name)
- }()
+ duration := time.Since(deletionStartTime).Milliseconds()
+ GameServersEndedDuration.WithLabelValues(gsb.Name).Set(float64(duration))
+ }(deletionStartTime)
}
}
wg.Wait()
32 changes: 32 additions & 0 deletions pkg/operator/controllers/metrics.go
@@ -24,6 +24,22 @@ var (
},
[]string{"BuildName"},
)
GameServersCreatedDuration = registry.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "thundernetes",
Name: "gameservers_create_duration",
Help: "Average time it took to create the newest set of GameServers",
},
[]string{"BuildName"},
)
GameServersStandByReconcileDuration = registry.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "thundernetes",
Name: "gameservers_reconcile_standby_duration",
Help: "Time it took to begin initialization for all new GameServers",
},
[]string{"BuildName"},
)
GameServersSessionEndedCounter = registry.NewCounterVec(
prometheus.CounterOpts{
Namespace: "thundernetes",
@@ -32,6 +48,22 @@ var (
},
[]string{"BuildName"},
)
GameServersEndedDuration = registry.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "thundernetes",
Name: "gameservers_end_duration",
Help: "Time it took to delete a set of non-active GameServers",
},
[]string{"BuildName"},
)
GameServersCleanUpDuration = registry.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "thundernetes",
Name: "gameservers_clean_up_duration",
Help: "Average time it took to clean up all completed or unhealthy GameServers",
},
[]string{"BuildName"},
)
GameServersCrashedCounter = registry.NewCounterVec(
prometheus.CounterOpts{
Namespace: "thundernetes",