Observability stack #11
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
|
||
# Observability Stack | ||
|
||
This directory contains an observability stack built on Grafana tooling. | ||
|
||
## Caveats | ||
1) Reliance on the ref-implementation for SSO | ||
- To work around this, remove the `auth.generic_oauth` section from `prometheus.yaml` and delete the `grafana-config.yaml` and `grafana-external-secret.yaml` files. | ||
2) Use of `tls_skip_verify_insecure` for OAuth | ||
- This is due to using the ingress certificate. Once this is addressed, the setting can be removed. | ||
3) Larger memory requirement for the kind cluster | ||
- Because a more robust Loki deployment is used, the memory limits have been increased. 16 GB seems to work while leaving ample room in the cluster. | ||
|
||
## Components | ||
The observability stack is built upon: | ||
- Prometheus - metrics | ||
- Loki - logging | ||
- Promtail - log delivery | ||
- OpenCost - cost accounting | ||
- Grafana - visualization | ||
- Alertmanager - alerting | ||
|
||
## Installation | ||
Note: The stack is configured to use Keycloak for SSO; therefore, the ref-implementation is required for this to work. | ||
|
||
`idpbuilder create --use-path-routing --package-dir ./ref-implementation --package-dir ./observability` | ||
|
||
A `grafana-config` job is deployed into the `keycloak` namespace to create and patch some of the Keycloak components. If it is deployed at the same time as the `ref-implementation`, this job will fail until the `config` job succeeds. This is normal. |
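The wait described above can be scripted. This is a minimal sketch, assuming `kubectl` is on the `PATH` and pointed at the kind cluster. The helper name `wait_for_grafana_config`, the `KUBECTL` override, and the default timeout are hypothetical and not part of this PR. It blocks until the Job reports the `Complete` condition or the timeout elapses.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this PR: block until the grafana-config Job
# in the keycloak namespace reports the Complete condition.
# KUBECTL can be overridden (for example with a stub) when testing.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_grafana_config() {
  local timeout="${1:-600s}"   # assumed default; tune for your machine
  "$KUBECTL" wait --for=condition=complete job/grafana-config \
    --namespace keycloak --timeout="$timeout"
}
```

Calling `wait_for_grafana_config 900s` after `idpbuilder create` would then return once the Keycloak client for Grafana has been configured, or exit non-zero on timeout.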
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
apiVersion: argoproj.io/v1alpha1 | ||
kind: Application | ||
I wonder if we could move these to ApplicationSets to make it easier for folks to adopt this in production, now that idpbuilder supports appsets?

That is definitely something Manabu and I spoke about. I was waiting for the example to see how to conform to it.

I don't know if it was already discussed, but I would question Loki too. Loki is AGPL; I would prefer OpenSearch, which is Apache 2. A typical fully open source stack that I see for logs and traces is fluentbit + opensearch. For metrics I see opentelemetry-collector-daemonset + prometheus.

@csantanapr thanks for raising this. I am not a lawyer, so please correct me if I'm wrong: my understanding of AGPL is that we can't modify the source of the application without copyleft applying. I believe many have built observability stacks on top of the Grafana stack, which is AGPL for the core components (Loki, Grafana, Tempo, Mimir). As we are not modifying the source, we should be ok to use it. Again, please correct me if I'm wrong. This is a valid concern, and I believe the flexibility of working in stacks allows us to create another implementation that relies on other tooling. What I do believe is that we need to come to an agreement on what our opinionated stack is. If this is not it, I'm ok with that, but let's discuss this during the next community meeting so we can figure out how we want to proceed.

We discussed this, and yes, since Grafana is already AGPL, choosing Loki as another AGPL project is less of a concern. That said, I agree with the discussion above that we should also think about using OpenSearch given its popularity. Publishing it as an alternative observability stack sounds good.

I will say we do use the OTel Collector as a daemonset for logs and Prometheus metrics, and eventually for traces. I do understand that the community is much more invested in fluentbit for logging, so a standard of OpenSearch + FluentBit, with the OpenTelemetry Collector daemonset for Prometheus metrics and OTel traces, seems like a good pattern to me.

If it makes sense, I suggest:
We can have both live under
I'm ok working on all of this and would be happy to put the otel stack together as well. Does that make sense to do? |
||
metadata: | ||
name: loki | ||
namespace: argocd | ||
labels: | ||
env: dev | ||
finalizers: | ||
- resources-finalizer.argocd.argoproj.io | ||
spec: | ||
project: default | ||
sources: | ||
- repoURL: 'https://grafana.github.io/helm-charts' | ||
targetRevision: 6.6.3 | ||
helm: | ||
releaseName: loki | ||
values: | | ||
deploymentMode: SingleBinary | ||
loki: | ||
commonConfig: | ||
replication_factor: 1 | ||
storage: | ||
type: 'filesystem' | ||
schemaConfig: | ||
configs: | ||
- from: "2024-01-01" | ||
store: tsdb | ||
index: | ||
prefix: loki_index_ | ||
period: 24h | ||
object_store: filesystem # we're storing on filesystem so there's no real persistence here. | ||
schema: v13 | ||
singleBinary: | ||
replicas: 1 | ||
read: | ||
replicas: 0 | ||
backend: | ||
replicas: 0 | ||
write: | ||
replicas: 0 | ||
chart: loki | ||
destination: | ||
server: "https://kubernetes.default.svc" | ||
namespace: monitoring | ||
syncPolicy: | ||
syncOptions: | ||
- CreateNamespace=true | ||
- ServerSideApply=true | ||
automated: | ||
selfHeal: true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
apiVersion: argoproj.io/v1alpha1 | ||
kind: Application | ||
metadata: | ||
name: opencost | ||
namespace: argocd | ||
labels: | ||
env: dev | ||
finalizers: | ||
- resources-finalizer.argocd.argoproj.io | ||
spec: | ||
project: default | ||
sources: | ||
- repoURL: 'https://opencost.github.io/opencost-helm-chart' | ||
targetRevision: 1.38.1 | ||
helm: | ||
releaseName: opencost | ||
values: | | ||
opencost: | ||
prometheus: | ||
internal: | ||
serviceName: prometheus-kube-prometheus-prometheus | ||
namespaceName: monitoring | ||
port: 9090 | ||
chart: opencost | ||
destination: | ||
server: "https://kubernetes.default.svc" | ||
namespace: monitoring | ||
syncPolicy: | ||
syncOptions: | ||
- CreateNamespace=true | ||
- ServerSideApply=true | ||
automated: | ||
selfHeal: true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
apiVersion: argoproj.io/v1alpha1 | ||
kind: Application | ||
metadata: | ||
name: prometheus | ||
namespace: argocd | ||
labels: | ||
env: dev | ||
finalizers: | ||
- resources-finalizer.argocd.argoproj.io | ||
spec: | ||
project: default | ||
sources: | ||
- repoURL: 'https://prometheus-community.github.io/helm-charts' | ||
targetRevision: 57.2.0 | ||
helm: | ||
releaseName: prometheus | ||
values: | | ||
grafana: | ||
envFromSecret: grafana-oidc | ||
additionalDataSources: | ||
- name: loki | ||
access: proxy | ||
orgId: 1 | ||
type: loki | ||
url: http://loki-gateway | ||
jsonData: | ||
httpHeaderName1: X-Scope-OrgID | ||
secureJsonData: | ||
httpHeaderValue1: '1' | ||
grafana.ini: | ||
server: | ||
root_url: https://cnoe.localtest.me:8443/grafana | ||
serve_from_sub_path: true | ||
auth.generic_oauth: | ||
enabled: true | ||
name: grafana | ||
allow_sign_up: true | ||
auth_url: https://cnoe.localtest.me:8443/keycloak/realms/cnoe/protocol/openid-connect/auth | ||
token_url: https://cnoe.localtest.me:8443/keycloak/realms/cnoe/protocol/openid-connect/token | ||
api_url: https://cnoe.localtest.me:8443/keycloak/realms/cnoe/protocol/openid-connect/userinfo | ||
scopes: openid email profile offline_access roles | ||
role_attribute_path: contains(resource_access.grafana.roles[*], 'admin') && 'GrafanaAdmin' || contains(resource_access.grafana.roles[*], 'admin') && 'Admin' || contains(resource_access.grafana.roles[*], 'editor') && 'Editor' || 'Viewer' | ||
allow_assign_grafana_admin: true | ||
role_attribute_strict: true | ||
auto_login: true | ||
tls_skip_verify_insecure: true | ||
chart: kube-prometheus-stack | ||
- repoURL: cnoe://prometheus | ||
targetRevision: HEAD | ||
path: "manifests" | ||
destination: | ||
server: "https://kubernetes.default.svc" | ||
namespace: monitoring | ||
syncPolicy: | ||
syncOptions: | ||
- CreateNamespace=true | ||
- ServerSideApply=true | ||
automated: | ||
selfHeal: true |
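The `role_attribute_path` expression above is a left-to-right chain in which the first matching clause wins. The sketch below restates that logic in plain bash for illustration only. `grafana_role_for` is a hypothetical name, and Grafana itself evaluates the JMESPath expression against the token claims, not this function. As written in the diff, the `'Admin'` clause repeats the `admin` test that already yields `'GrafanaAdmin'`, so it appears to be unreachable; the sketch reflects that.

```shell
#!/usr/bin/env bash
# Illustrative restatement of the role_attribute_path expression.
# grafana_role_for is a hypothetical helper; its arguments stand in for the
# entries of resource_access.grafana.roles in the Keycloak token.
grafana_role_for() {
  local role
  # Clause 1: any "admin" client role maps to GrafanaAdmin.
  for role in "$@"; do
    if [ "$role" = "admin" ]; then echo "GrafanaAdmin"; return 0; fi
  done
  # Clause 2 in the diff tests "admin" again, so it can never match here.
  # Clause 3: any "editor" client role maps to Editor.
  for role in "$@"; do
    if [ "$role" = "editor" ]; then echo "Editor"; return 0; fi
  done
  # Fallback: everything else becomes Viewer.
  echo "Viewer"
}
```

For example, `grafana_role_for viewer editor` prints `Editor`, and a user with no Grafana client roles falls through to `Viewer`.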
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
--- | ||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: grafana-config-job | ||
namespace: keycloak | ||
data: | ||
client-role-admin-payload.json: | | ||
{"name": "admin"} | ||
client-role-editor-payload.json: | | ||
{"name": "editor"} | ||
client-role-viewer-payload.json: | | ||
{"name": "viewer"} | ||
admin-role-assignment-payload.json: | | ||
[ | ||
{ | ||
"id": "$ADMIN_ROLE_ID", | ||
"name": "admin" | ||
} | ||
] | ||
roles-mapper-payload.json: | | ||
{ | ||
"id":"$CLIENT_ROLES_MAPPER_ID", | ||
"name": "client roles", | ||
"protocol":"openid-connect", | ||
"protocolMapper":"oidc-usermodel-client-role-mapper", | ||
"config": { | ||
"access.token.claim":"true", | ||
"claim.name":"resource_access.${client_id}.roles", | ||
"jsonType.label":"String", | ||
"multivalued":"true", | ||
"id.token.claim": "true", | ||
"userinfo.token.claim": "true" | ||
} | ||
} | ||
grafana-client-payload.json: | | ||
{ | ||
"protocol": "openid-connect", | ||
"clientId": "grafana", | ||
"name": "Grafana Client", | ||
"description": "Used for Grafana SSO", | ||
"publicClient": false, | ||
"authorizationServicesEnabled": false, | ||
"serviceAccountsEnabled": false, | ||
"implicitFlowEnabled": false, | ||
"directAccessGrantsEnabled": true, | ||
"standardFlowEnabled": true, | ||
"frontchannelLogout": true, | ||
"attributes": { | ||
"saml_idp_initiated_sso_url_name": "", | ||
"oauth2.device.authorization.grant.enabled": false, | ||
"oidc.ciba.grant.enabled": false | ||
}, | ||
"alwaysDisplayInConsole": false, | ||
"rootUrl": "", | ||
"baseUrl": "", | ||
"redirectUris": [ | ||
"https://cnoe.localtest.me:8443/grafana/login/generic_oauth" | ||
], | ||
"webOrigins": [ | ||
"/*" | ||
] | ||
} | ||
|
||
--- | ||
apiVersion: batch/v1 | ||
kind: Job | ||
metadata: | ||
name: grafana-config | ||
namespace: keycloak | ||
spec: | ||
template: | ||
metadata: | ||
generateName: grafana-config | ||
spec: | ||
serviceAccountName: keycloak-config | ||
restartPolicy: Never | ||
volumes: | ||
- name: keycloak-config | ||
secret: | ||
secretName: keycloak-config | ||
- name: config-payloads | ||
configMap: | ||
name: grafana-config-job | ||
containers: | ||
- name: kubectl | ||
image: docker.io/library/ubuntu:22.04 | ||
volumeMounts: | ||
- name: keycloak-config | ||
readOnly: true | ||
mountPath: "/var/secrets/" | ||
- name: config-payloads | ||
readOnly: true | ||
mountPath: "/var/config/" | ||
command: ["/bin/bash", "-c"] | ||
args: | ||
- | | ||
#! /bin/bash | ||
set -ex -o pipefail | ||
apt-get -qq update && apt-get -qq install -y curl jq gettext-base | ||
|
||
curl -sS -LO "https://dl.k8s.io/release/v1.28.3/bin/linux/amd64/kubectl" | ||
chmod +x kubectl | ||
|
||
echo "checking if we're ready to start" | ||
# Fail until the keycloak-clients secret exists | ||
if ! ./kubectl get secret -n keycloak keycloak-clients &> /dev/null; then | ||
exit 1 | ||
fi | ||
|
||
ADMIN_PASSWORD=$(cat /var/secrets/KEYCLOAK_ADMIN_PASSWORD) | ||
|
||
KEYCLOAK_URL=http://keycloak.keycloak.svc.cluster.local:8080/keycloak | ||
|
||
KEYCLOAK_TOKEN=$(curl -sS --fail-with-body -X POST -H "Content-Type: application/x-www-form-urlencoded" \ | ||
--data-urlencode "username=cnoe-admin" \ | ||
--data-urlencode "password=${ADMIN_PASSWORD}" \ | ||
--data-urlencode "grant_type=password" \ | ||
--data-urlencode "client_id=admin-cli" \ | ||
${KEYCLOAK_URL}/realms/master/protocol/openid-connect/token | jq -e -r '.access_token') | ||
|
||
# Exit without error if the cnoe realm cannot be read with this token | ||
if ! curl --fail-with-body -H "Authorization: bearer ${KEYCLOAK_TOKEN}" "${KEYCLOAK_URL}/admin/realms/cnoe" &> /dev/null; then | ||
exit 0 | ||
fi | ||
|
||
echo "creating Grafana client" | ||
curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X POST --data @/var/config/grafana-client-payload.json \ | ||
${KEYCLOAK_URL}/admin/realms/cnoe/clients | ||
|
||
CLIENT_ID=$(curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X GET ${KEYCLOAK_URL}/admin/realms/cnoe/clients | jq -e -r '.[] | select(.clientId == "grafana") | .id') | ||
|
||
CLIENT_SCOPE_GROUPS_ID=$(curl -sS -H "Content-Type: application/json" -H "Authorization: bearer ${KEYCLOAK_TOKEN}" -X GET ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes | jq -e -r '.[] | select(.name == "groups") | .id') | ||
curl -sS -H "Content-Type: application/json" -H "Authorization: bearer ${KEYCLOAK_TOKEN}" -X PUT ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/default-client-scopes/${CLIENT_SCOPE_GROUPS_ID} | ||
|
||
GRAFANA_CLIENT_SECRET=$(curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X GET ${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID} | jq -e -r '.secret') | ||
|
||
# Add Grafana roles to client | ||
curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X POST --data @/var/config/client-role-admin-payload.json \ | ||
${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles | ||
|
||
curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X POST --data @/var/config/client-role-editor-payload.json \ | ||
${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles | ||
|
||
curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X POST --data @/var/config/client-role-viewer-payload.json \ | ||
${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles | ||
|
||
export ADMIN_ROLE_ID=$(curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" "${KEYCLOAK_URL}/admin/realms/cnoe/clients/${CLIENT_ID}/roles/admin" | jq -r '.id') | ||
|
||
# Assign admin role to user1 | ||
USER1_USERID=$(curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" "${KEYCLOAK_URL}/admin/realms/cnoe/users?lastName=one" | jq -r '.[0].id') | ||
|
||
envsubst < /var/config/admin-role-assignment-payload.json | curl -k -sS -H 'Content-Type: application/json' \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X POST --data @- \ | ||
${KEYCLOAK_URL}/admin/realms/cnoe/users/${USER1_USERID}/role-mappings/clients/${CLIENT_ID} | ||
|
||
# Add role to token | ||
CLIENT_SCOPE_ROLES_ID=$(curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X GET ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes | jq -e -r '.[] | select(.name == "roles") | .id') | ||
|
||
export CLIENT_ROLES_MAPPER_ID=$(curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X GET ${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes/${CLIENT_SCOPE_ROLES_ID}/protocol-mappers/models | jq -e -r '.[] | select(.name == "client roles") | .id') | ||
|
||
cat /var/config/roles-mapper-payload.json | envsubst '$CLIENT_ROLES_MAPPER_ID' | curl -sS -H "Content-Type: application/json" \ | ||
-H "Authorization: bearer ${KEYCLOAK_TOKEN}" \ | ||
-X PUT --data @- \ | ||
${KEYCLOAK_URL}/admin/realms/cnoe/client-scopes/${CLIENT_SCOPE_ROLES_ID}/protocol-mappers/models/${CLIENT_ROLES_MAPPER_ID} | ||
|
||
./kubectl patch secret -n keycloak keycloak-clients --type=json \ | ||
-p='[{ | ||
"op" : "add" , | ||
"path" : "/data/GRAFANA_CLIENT_SECRET" , | ||
"value" : "'$(echo -n "$GRAFANA_CLIENT_SECRET" | base64 -w 0)'" | ||
},{ | ||
"op" : "add" , | ||
"path" : "/data/GRAFANA_CLIENT_ID" , | ||
"value" : "'$(echo -n "grafana" | base64 -w 0)'" | ||
}]' |
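The final `kubectl patch` builds its JSON Patch by splicing base64 output into a quoted string. The sketch below builds the same payload in a small function, which can be easier to inspect before it is sent. `b64` and `build_grafana_secret_patch` are hypothetical names, not part of this PR. Because JSON Patch `add` replaces the value of an existing object member, re-running the patch should overwrite the keys rather than fail.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of this PR.

# Base64-encode a string without a trailing newline or line wrapping,
# as required for values under a Secret's .data field.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Print the JSON Patch that adds GRAFANA_CLIENT_SECRET and GRAFANA_CLIENT_ID
# to the keycloak-clients secret. The client ID "grafana" matches the clientId
# in grafana-client-payload.json.
build_grafana_secret_patch() {
  local client_secret="$1"
  printf '[{"op":"add","path":"/data/GRAFANA_CLIENT_SECRET","value":"%s"},' "$(b64 "$client_secret")"
  printf '{"op":"add","path":"/data/GRAFANA_CLIENT_ID","value":"%s"}]\n' "$(b64 grafana)"
}
```

It could be used as `./kubectl patch secret -n keycloak keycloak-clients --type=json -p "$(build_grafana_secret_patch "$GRAFANA_CLIENT_SECRET")"`.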
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
apiVersion: external-secrets.io/v1beta1 | ||
kind: ExternalSecret | ||
metadata: | ||
name: keycloak-oidc | ||
namespace: monitoring | ||
spec: | ||
secretStoreRef: | ||
name: keycloak | ||
kind: ClusterSecretStore | ||
target: | ||
name: grafana-oidc | ||
data: | ||
- secretKey: GF_AUTH_GENERIC_OAUTH_CLIENT_ID | ||
remoteRef: | ||
key: keycloak-clients | ||
property: GRAFANA_CLIENT_ID | ||
- secretKey: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET | ||
remoteRef: | ||
key: keycloak-clients | ||
property: GRAFANA_CLIENT_SECRET |
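The `secretKey` names in this ExternalSecret follow Grafana's environment-variable override convention, `GF_<SectionName>_<KeyName>`, where the section and key are upper-cased and `.` and `-` become `_`. That is how the `grafana-oidc` secret, loaded through `envFromSecret`, can supply `client_id` and `client_secret` for the `auth.generic_oauth` section. The sketch below derives the variable name; `grafana_env_name` is a hypothetical helper for illustration and is not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the environment variable that overrides a
# grafana.ini setting, given its section and key.
grafana_env_name() {
  local section="$1" key="$2"
  # Upper-case letters and map '.' and '-' to '_'.
  printf 'GF_%s_%s\n' "$section" "$key" | tr 'a-z.-' 'A-Z__'
}
```

For instance, `grafana_env_name auth.generic_oauth client_id` yields the first `secretKey` used above.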
I'm interested in why Promtail is used rather than something like Fluent Bit or the OpenTelemetry Collector.

Promtail is really easy to use if only Loki is being used. That being said, I may look at switching to Fluent Bit, as I've used it in another project recently.
Is there a compelling reason to move from Promtail to either? Grafana Agent could be another option as well.

Fluent Bit is more popular among EKS end users and is also supported when using Fargate.

OpenTelemetry for logs is very new, and not many end users have adopted it.

It should be a pretty simple swap to Fluent Bit. Let me take a look at it.