manual updates
jmdeal committed Nov 30, 2024
1 parent b44bb90 commit e630feb
Showing 19 changed files with 798 additions and 25 deletions.
1 change: 1 addition & 0 deletions charts/karpenter/README.md
@@ -88,6 +88,7 @@ cosign verify public.ecr.aws/karpenter/karpenter:1.1.0 \
| settings.clusterCABundle | string | `""` | Cluster CA bundle for TLS configuration of provisioned nodes. If not set, this is taken from the controller's TLS configuration for the API server. |
| settings.clusterEndpoint | string | `""` | Cluster endpoint. If not set, will be discovered during startup (EKS only) |
| settings.clusterName | string | `""` | Cluster name. |
| settings.eksControlPlane | bool | `false` | Marking this true means that your cluster is running with an EKS control plane and Karpenter should attempt to discover cluster details from the DescribeCluster API |
| settings.featureGates | object | `{"nodeRepair":false,"spotToSpotConsolidation":false}` | Feature Gate configuration values. Feature Gates will follow the same graduation process and requirements as feature gates in Kubernetes. More information here https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-gates-for-alpha-or-beta-features |
| settings.featureGates.nodeRepair | bool | `false` | nodeRepair is ALPHA and is disabled by default. Setting this to true will enable node repair. |
| settings.featureGates.spotToSpotConsolidation | bool | `false` | spotToSpotConsolidation is ALPHA and is disabled by default. Setting this to true will enable spot replacement consolidation for both single and multi-node consolidation. |
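
As a rough illustration of how the new `settings.eksControlPlane` value might be passed at install time, the sketch below assembles the corresponding `--set` flags. The helper name `karpenter_settings_flags` is hypothetical and not part of the chart; only the two value keys come from the table above.

```shell
# Hypothetical helper: print the Helm --set flags for a cluster that runs
# on an EKS control plane. $1 is the cluster name supplied by the caller.
karpenter_settings_flags() {
  if [ -z "$1" ]; then
    echo "usage: karpenter_settings_flags <cluster-name>" >&2
    return 1
  fi
  # settings.clusterName and settings.eksControlPlane are the keys documented above.
  printf '%s\n' "--set settings.clusterName=$1 --set settings.eksControlPlane=true"
}
```

Assuming the chart is installed from `oci://public.ecr.aws/karpenter/karpenter`, the output could be appended to a command such as `helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version 1.1.0 $(karpenter_settings_flags "${CLUSTER_NAME}")`, left unquoted so the flags split into separate arguments.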
4 changes: 2 additions & 2 deletions website/content/en/docs/concepts/nodeclasses.md
@@ -207,7 +207,7 @@ status:
status: "True"
type: Ready
```
Refer to the [NodePool docs]({{<ref "./nodepools" >}}) for settings applicable to all providers. To explore various `EC2NodeClass` configurations, refer to the examples provided [in the Karpenter Github repository](https://github.com/aws/karpenter/blob/main/examples/v1/).
Refer to the [NodePool docs]({{<ref "./nodepools" >}}) for settings applicable to all providers. To explore various `EC2NodeClass` configurations, refer to the examples provided [in the Karpenter Github repository](https://github.com/aws/karpenter/blob/v1.1.0/examples/v1/).


## spec.kubelet
@@ -1041,7 +1041,7 @@ spec:
chown -R ec2-user ~ec2-user/.ssh
```

For more examples on configuring fields for different AMI families, see the [examples here](https://github.com/aws/karpenter/blob/main/examples/v1).
For more examples on configuring fields for different AMI families, see the [examples here](https://github.com/aws/karpenter/blob/v1.1.0/examples/v1).

Karpenter will merge the userData you specify with the default userData for that AMIFamily. See the [AMIFamily]({{< ref "#specamifamily" >}}) section for more details on these defaults. View the sections below to understand the different merge strategies for each AMIFamily.

16 changes: 8 additions & 8 deletions website/content/en/docs/concepts/nodepools.md
@@ -27,7 +27,7 @@ Here are things you should know about NodePools:
Objects for setting Kubelet features have been moved from the NodePool spec to the EC2NodeClasses spec, to not require other Karpenter providers to support those features.
{{% /alert %}}

For some example `NodePool` configurations, see the [examples in the Karpenter GitHub repository](https://github.com/aws/karpenter/blob/main/examples/v1/).
For some example `NodePool` configurations, see the [examples in the Karpenter GitHub repository](https://github.com/aws/karpenter/blob/v1.1.0/examples/v1/).

```yaml
apiVersion: karpenter.sh/v1
@@ -72,16 +72,16 @@ spec:
# Avoiding long-running Nodes helps to reduce security vulnerabilities as well as to reduce the chance of issues that can plague Nodes with long uptimes such as file fragmentation or memory leaks from system processes
# You can choose to disable expiration entirely by setting the string value 'Never' here

# Note: changing this value in the nodepool will drift the nodeclaims.
expireAfter: 720h | Never

# The amount of time that a node can be draining before it's forcibly deleted. A node begins draining when a delete call is made against it, starting
# its finalization flow. Pods with TerminationGracePeriodSeconds will be deleted preemptively before this terminationGracePeriod ends to give as much time to clean up as possible.
# If your pod's terminationGracePeriodSeconds is larger than this terminationGracePeriod, Karpenter may forcibly delete the pod
# before it has its full terminationGracePeriod to clean up.

# Note: changing this value in the nodepool will drift the nodeclaims.
terminationGracePeriod: 48h

# Requirements that constrain the parameters of provisioned nodes.
# These requirements are combined with pod.spec.topologySpreadConstraints, pod.spec.affinity.nodeAffinity, pod.spec.affinity.podAffinity, and pod.spec.nodeSelector rules.
@@ -183,12 +183,12 @@ See [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-evic
## spec.template.spec.startupTaints
Taints that are added to nodes to indicate that a certain condition must be met, such as starting an agent or setting up networking, before the node can be initialized.
These taints must be cleared before pods can be deployed to a node.
## spec.template.spec.expireAfter
The amount of time a Node can live on the cluster before being deleted by Karpenter. A Node will begin draining once its expiration has been reached.
## spec.template.spec.terminationGracePeriod
@@ -1,4 +1,4 @@
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/main/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > $TEMPOUT \
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
&& aws cloudformation deploy \
--stack-name "Karpenter-${CLUSTER_NAME}" \
--template-file "${TEMPOUT}" \
@@ -1,4 +1,4 @@
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/main/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
&& aws cloudformation deploy \
--stack-name "Karpenter-${CLUSTER_NAME}" \
--template-file "${TEMPOUT}" \
@@ -1,6 +1,6 @@
TEMPOUT="$(mktemp)"

curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/main/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v1.1.0/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}" \
&& aws cloudformation deploy \
--stack-name "Karpenter-${CLUSTER_NAME}" \
--template-file "${TEMPOUT}" \
@@ -1,8 +1,8 @@
kubectl create namespace "${KARPENTER_NAMESPACE}" || true
kubectl create -f \
"https://raw.githubusercontent.com/aws/karpenter-provider-aws/main/pkg/apis/crds/karpenter.sh_nodepools.yaml"
"https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodepools.yaml"
kubectl create -f \
"https://raw.githubusercontent.com/aws/karpenter-provider-aws/main/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml"
"https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml"
kubectl create -f \
"https://raw.githubusercontent.com/aws/karpenter-provider-aws/main/pkg/apis/crds/karpenter.sh_nodeclaims.yaml"
"https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodeclaims.yaml"
kubectl apply -f karpenter.yaml
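
The version-pinned URLs in the scripts above all share one shape, so a small helper can build them from a single version string. This is only a sketch: the function names `karpenter_raw_url` and `karpenter_crd_urls` are hypothetical, and the repository path and CRD file names are copied from the lines above.

```shell
# Hypothetical helper: build a raw.githubusercontent.com URL pinned to a
# Karpenter release tag.
# $1: version without the leading "v" (for example 1.1.0)
# $2: path of the file inside the repository
karpenter_raw_url() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: karpenter_raw_url <version> <path>" >&2
    return 1
  fi
  printf '%s\n' "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${1}/${2}"
}

# Print one URL per CRD referenced by the install script above.
karpenter_crd_urls() {
  for crd in karpenter.sh_nodepools karpenter.k8s.aws_ec2nodeclasses karpenter.sh_nodeclaims; do
    karpenter_raw_url "$1" "pkg/apis/crds/${crd}.yaml" || return 1
  done
}
```

One possible use is `karpenter_crd_urls "${KARPENTER_VERSION}" | xargs -n1 kubectl create -f`, which fails early when `KARPENTER_VERSION` is unset instead of requesting a malformed URL.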
248 changes: 248 additions & 0 deletions website/content/en/docs/reference/metrics.md
@@ -8,6 +8,226 @@ description: >
---
<!-- this document is generated from hack/docs/metrics_gen_docs.go -->
Karpenter makes several metrics available in Prometheus format to allow monitoring cluster provisioning status. By default, these metrics are available at `karpenter.karpenter.svc.cluster.local:8080/metrics`; the port is configurable via the `METRICS_PORT` environment variable, documented [here](../settings).
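
To check which of the metrics listed below a running controller actually exposes, the endpoint output can be reduced to a list of metric names. The following is a minimal sketch; the function name `karpenter_metric_names` is hypothetical, and it only matches the `karpenter_` prefix, so series such as `operator_*` or `controller_runtime_*` are deliberately left out.

```shell
# Hypothetical helper: read Prometheus text-format output on stdin and print
# the distinct metric names that start with "karpenter_".
karpenter_metric_names() {
  # Comment lines ("# HELP", "# TYPE") do not match the prefix and are dropped.
  # The sed expression strips the label set and sample value after the name.
  grep -E '^karpenter_' | sed -E 's/[{ ].*$//' | sort -u
}
```

Assuming the Service is named `karpenter` in the `karpenter` namespace, as the default address above suggests, one way to feed it is `kubectl port-forward -n karpenter svc/karpenter 8080:8080` in one terminal and `curl -s localhost:8080/metrics | karpenter_metric_names` in another.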
### `karpenter_ignored_pod_count`
Number of pods ignored during scheduling by Karpenter
- Stability Level: ALPHA

### `karpenter_build_info`
A metric with a constant '1' value labeled by version from which karpenter was built.
- Stability Level: STABLE

## Nodeclaims Metrics

### `karpenter_nodeclaims_termination_duration_seconds`
Duration of NodeClaim termination in seconds.
- Stability Level: BETA

### `karpenter_nodeclaims_terminated_total`
Number of nodeclaims terminated in total by Karpenter. Labeled by the owning nodepool.
- Stability Level: STABLE

### `karpenter_nodeclaims_instance_termination_duration_seconds`
Duration of CloudProvider Instance termination in seconds.
- Stability Level: BETA

### `karpenter_nodeclaims_disrupted_total`
Number of nodeclaims disrupted in total by Karpenter. Labeled by reason the nodeclaim was disrupted and the owning nodepool.
- Stability Level: ALPHA

### `karpenter_nodeclaims_created_total`
Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool.
- Stability Level: STABLE

## Nodes Metrics

### `karpenter_nodes_total_pod_requests`
Node total pod requests are the resources requested by pods bound to nodes, including the DaemonSet pods.
- Stability Level: BETA

### `karpenter_nodes_total_pod_limits`
Node total pod limits are the resources specified by pod limits, including the DaemonSet pods.
- Stability Level: BETA

### `karpenter_nodes_total_daemon_requests`
Node total daemon requests are the resources requested by DaemonSet pods bound to nodes.
- Stability Level: BETA

### `karpenter_nodes_total_daemon_limits`
Node total daemon limits are the resources specified by DaemonSet pod limits.
- Stability Level: BETA

### `karpenter_nodes_termination_duration_seconds`
The time taken between a node's deletion request and the removal of its finalizer
- Stability Level: BETA

### `karpenter_nodes_terminated_total`
Number of nodes terminated in total by Karpenter. Labeled by owning nodepool.
- Stability Level: STABLE

### `karpenter_nodes_system_overhead`
Node system daemon overhead is the resources reserved for system overhead: the difference between the node's capacity and allocatable values reported by the status.
- Stability Level: BETA

### `karpenter_nodes_lifetime_duration_seconds`
The lifetime duration of the nodes since creation.
- Stability Level: ALPHA

### `karpenter_nodes_eviction_requests_total`
The total number of eviction requests made by Karpenter
- Stability Level: ALPHA

### `karpenter_nodes_drained_total`
The total number of nodes drained by Karpenter
- Stability Level: ALPHA

### `karpenter_nodes_current_lifetime_seconds`
Node age in seconds
- Stability Level: ALPHA

### `karpenter_nodes_created_total`
Number of nodes created in total by Karpenter. Labeled by owning nodepool.
- Stability Level: STABLE

### `karpenter_nodes_allocatable`
Node allocatable are the resources allocatable by nodes.
- Stability Level: BETA

## Pods Metrics

### `karpenter_pods_state`
Pod state is the current state of pods. This metric can be used in several ways, as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type and pod phase.
- Stability Level: BETA

### `karpenter_pods_startup_duration_seconds`
The time from pod creation until the pod is running.
- Stability Level: STABLE

## Termination Metrics

### `operator_termination_duration_seconds`
The amount of time taken by an object to terminate completely.
- Stability Level: ALPHA

### `operator_termination_current_time_seconds`
The current amount of time in seconds that an object has been in terminating state.
- Stability Level: ALPHA

## Voluntary Disruption Metrics

### `karpenter_voluntary_disruption_queue_failures_total`
The number of times that an enqueued disruption decision failed. Labeled by disruption method.
- Stability Level: BETA

### `karpenter_voluntary_disruption_eligible_nodes`
Number of nodes eligible for disruption by Karpenter. Labeled by disruption reason.
- Stability Level: BETA

### `karpenter_voluntary_disruption_decisions_total`
Number of disruption decisions performed. Labeled by disruption decision, reason, and consolidation type.
- Stability Level: STABLE

### `karpenter_voluntary_disruption_decision_evaluation_duration_seconds`
Duration of the disruption decision evaluation process in seconds. Labeled by method and consolidation type.
- Stability Level: BETA

### `karpenter_voluntary_disruption_consolidation_timeouts_total`
Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type.
- Stability Level: BETA

## Scheduler Metrics

### `karpenter_scheduler_scheduling_duration_seconds`
Duration of scheduling simulations used for deprovisioning and provisioning in seconds.
- Stability Level: STABLE

### `karpenter_scheduler_queue_depth`
The number of pods currently waiting to be scheduled.
- Stability Level: BETA

## Nodepools Metrics

### `karpenter_nodepools_usage`
The amount of resources that have been provisioned for a nodepool. Labeled by nodepool name and resource type.
- Stability Level: ALPHA

### `karpenter_nodepools_limit`
Limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type.
- Stability Level: ALPHA

### `karpenter_nodepools_allowed_disruptions`
The number of nodes for a given NodePool that can be concurrently disrupting at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point.
- Stability Level: ALPHA

## Interruption Metrics

### `karpenter_interruption_received_messages_total`
Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable.
- Stability Level: STABLE

### `karpenter_interruption_message_queue_duration_seconds`
Amount of time an interruption message is on the queue before it is processed by karpenter.
- Stability Level: STABLE

### `karpenter_interruption_deleted_messages_total`
Count of messages deleted from the SQS queue.
- Stability Level: STABLE

## Cluster Metrics

### `karpenter_cluster_utilization_percent`
Utilization of allocatable resources by pod requests
- Stability Level: ALPHA

## Cluster State Metrics

### `karpenter_cluster_state_unsynced_time_seconds`
The time for which cluster state is not synced
- Stability Level: STABLE

### `karpenter_cluster_state_synced`
Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter's cluster state
- Stability Level: STABLE

### `karpenter_cluster_state_node_count`
Current count of nodes in cluster state
- Stability Level: STABLE

## Cloudprovider Metrics

### `karpenter_cloudprovider_instance_type_offering_price_estimate`
Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone.
- Stability Level: BETA

### `karpenter_cloudprovider_instance_type_offering_available`
Instance type offering availability, based on instance type, capacity type, and zone
- Stability Level: BETA

### `karpenter_cloudprovider_instance_type_memory_bytes`
Memory, in bytes, for a given instance type.
- Stability Level: BETA

### `karpenter_cloudprovider_instance_type_cpu_cores`
vCPU cores for a given instance type.
- Stability Level: BETA

### `karpenter_cloudprovider_errors_total`
Total number of errors returned from CloudProvider calls.
- Stability Level: BETA

### `karpenter_cloudprovider_duration_seconds`
Duration of cloud provider method calls. Labeled by the controller, method name and provider.
- Stability Level: BETA

## Cloudprovider Batcher Metrics

### `karpenter_cloudprovider_batcher_batch_time_seconds`
Duration of the batching window per batcher
- Stability Level: BETA

### `karpenter_cloudprovider_batcher_batch_size`
Size of the request batch per batcher
- Stability Level: BETA

## Controller Runtime Metrics

### `controller_runtime_terminal_reconcile_errors_total`
@@ -68,6 +288,34 @@ Current depth of workqueue
Total number of adds handled by workqueue
- Stability Level: STABLE

## Status Condition Metrics

### `operator_status_condition_transitions_total`
The count of transitions of a given object, type and status.
- Stability Level: BETA

### `operator_status_condition_transition_seconds`
The amount of time a condition was in a given state before transitioning. e.g. Alarm := P99(Updated=False) > 5 minutes
- Stability Level: BETA

### `operator_status_condition_current_status_seconds`
The current amount of time in seconds that a status condition has been in a specific state. e.g. Alarm := P99(Updated=Unknown) > 5 minutes
- Stability Level: BETA

### `operator_status_condition_count`
The number of conditions for a given object, type and status. e.g. Alarm := Available=False > 0
- Stability Level: BETA

## Client Go Metrics

### `client_go_request_total`
Number of HTTP requests, partitioned by status code and method.
- Stability Level: STABLE

### `client_go_request_duration_seconds`
Request latency in seconds. Broken down by verb, group, version, kind, and subresource.
- Stability Level: STABLE

## AWS SDK Go Metrics

### `aws_sdk_go_request_total`
7 changes: 7 additions & 0 deletions website/content/en/docs/upgrading/upgrade-guide.md
@@ -35,6 +35,13 @@ WHEN CREATING A NEW SECTION OF THE UPGRADE GUIDANCE FOR NEWER VERSIONS, ENSURE T

### Upgrading to `1.1.0`+

{{% alert title="Warning" color="warning" %}}
Karpenter `1.1.0` drops support for `v1beta1` APIs.
**Do not** upgrade to `1.1.0`+ without following the [Migration Guide]({{<ref "../../v1.0/upgrading/v1-migration.md#before-upgrading-to-v110">}}).
{{% /alert %}}

* Support for the `v1beta1` compatibility annotations has been dropped. Ensure you have completed migration before upgrading to `v1.1.0`. Refer to the [migration guide]({{<ref "../../v1.0/upgrading/v1-migration.md#kubelet-configuration-migration">}}) for more details.
* `nodeClassRef.group` and `nodeClassRef.kind` are strictly required. Ensure these values are set for all `NodePools` / `NodeClaims` before upgrading.
* Bottlerocket AMIFamily now supports `instanceStorePolicy: RAID0`. This means that Karpenter will auto-generate userData to RAID0 your instance store volumes (similar to AL2 and AL2023) when specifying this value.
* Note: This userData configuration is _only_ valid on Bottlerocket v1.22.0+. If you are using an earlier version of a Bottlerocket image (< v1.22.0) with `amiFamily: Bottlerocket` and `instanceStorePolicy: RAID0`, nodes will fail to join the cluster.
* The AWS Neuron accelerator well known name label (`karpenter.k8s.aws/instance-accelerator-name`) values now reflect their correct names of `trainium`, `inferentia`, and `inferentia2`. Previously, all Neuron accelerators were assigned the label name of `inferentia`.
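
Because `nodeClassRef.group` and `nodeClassRef.kind` become strictly required, it may help to list any objects that still lack them before upgrading. The sketch below is one possible check, not an official tool: the function name `missing_nodeclassref_fields` is hypothetical, and it assumes `kubectl` prints `<none>` for unset fields in `custom-columns` output.

```shell
# Hypothetical helper: read `kubectl get ... -o custom-columns` output on
# stdin (a header row, then NAME GROUP KIND columns) and print the names of
# objects whose group or kind is unset.
missing_nodeclassref_fields() {
  # NR > 1 skips the header; NF < 3 catches a column that printed as empty.
  awk 'NR > 1 && (NF < 3 || $2 == "<none>" || $3 == "<none>") { print $1 }'
}
```

For NodePools, the input could come from `kubectl get nodepools -o custom-columns='NAME:.metadata.name,GROUP:.spec.template.spec.nodeClassRef.group,KIND:.spec.template.spec.nodeClassRef.kind'`; for NodeClaims the fields sit under `.spec.nodeClassRef` instead. An empty result suggests every object already sets both values.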