
Pin prometheus chart version to an older version #720

Merged
merged 1 commit into 2i2c-org:master on Oct 1, 2021

Conversation

GeorgianaElena
Member

Fixes #616

@GeorgianaElena
Member Author

GeorgianaElena commented Sep 29, 2021

I'm currently trying to deploy it locally with:

python3 deployer deploy-support 2i2c

to see if it works.

Update:
It failed with Error: UPGRADE FAILED: context deadline exceeded, and the new support-prometheus-server pod has been in ContainerCreating state for 38m.

Hmm, not sure it's normal for it to take this much time.

@GeorgianaElena
Member Author

Running kubectl describe on that pod showed this error:

Multi-Attach error for volume "pvc-dae5a17d-763f-4dd5-b73a-3006d436a7ed" Volume is already used by pod(s) support-prometheus-server-59b9b6f98c-9x462

So, I guess I need to manually delete the old pod?
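[Editor's note] The manual deletion discussed in this thread can be sketched as follows. The pod name is taken from the error message above; the `support` namespace is an assumption, so check where the prometheus release actually lives in your cluster:

```shell
# List the prometheus server pods to confirm which one is stuck
# ("support" namespace is an assumption here)
kubectl get pods -n support -l app=prometheus

# Delete the old pod so it releases the PVC; the Deployment's
# ReplicaSet will keep the replacement pod, which can now attach
# the persistent volume
kubectl delete pod -n support support-prometheus-server-59b9b6f98c-9x462
```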

@yuvipanda
Member

@GeorgianaElena yep, you should manually delete it. Prometheus should have its deployment strategy configured to prevent this from happening. When the strategy is set to Recreate, the first pod is fully deleted (thus releasing the PVC) before the new pod comes up. The default is RollingUpdate, which tries to bring up the new pod before the old pod is deleted, causing this issue with PVCs. This only happens sometimes, I think. I'm not sure why; perhaps only when the new pod is scheduled on a different node?
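[Editor's note] A minimal sketch of the strategy setting described above, expressed as Helm values for the prometheus chart. The `server.strategy` key follows the upstream prometheus chart's values layout, so treat the exact path as an assumption for other chart versions:

```yaml
# Helm values sketch (assumed prometheus chart layout):
# use Recreate so the old server pod is fully deleted, releasing
# its PVC, before the replacement pod is scheduled.
server:
  strategy:
    type: Recreate
```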

@GeorgianaElena
Member Author

Thanks @yuvipanda! The deployment went ok after deleting the pod 🎉

However, it doesn't look like anything's changed in https://grafana.pilot.2i2c.cloud. Is there anything else I should do, or should the dashboards just have magically appeared if this prometheus pin had fixed the issue?

@yuvipanda
Member

@GeorgianaElena I think github.com/jupyterhub/jupyterhub-grafana needs to be deployed there to see if it works. Can you give that a shot too?

@GeorgianaElena
Member Author

@GeorgianaElena I think github.com/jupyterhub/jupyterhub-grafana needs to be deployed there to see if it works. Can you give that a shot too?

I just did and I think they work 🎉
[screenshot: grafana dashboards]

@yuvipanda
Member

Yay awesome!

We also need to automate deployment of the grafana dashboards

@choldgraf
Member

yesssss dashboards!!! thanks @GeorgianaElena

@yuvipanda I assume you mean that automation should be tackled in the future, not as part of this PR?

@GeorgianaElena GeorgianaElena merged commit da5cd99 into 2i2c-org:master Oct 1, 2021
@yuvipanda
Member

@choldgraf yep!

@damianavila
Contributor

I assume you mean that automation should be tackled in the future, not as part of this PR?

I quickly searched about grafana and prometheus open issues and there are several of them.
Maybe it is time to group those together under the topic "infra to get info about our infra" 😉 .
I know this could be viewed as something belonging to the "Managed JupyterHubs Infrastructure" column in the deliverables board, but maybe we need more granularity there?

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Oct 5, 2021
During deployment of 2i2c-org#720,
sometimes CI would fail because the prometheus pod would be
stuck in 'ContainerCreating', as the old pod was holding on to
the persistent disk the new pod needs to start. This was temporarily
fixed by deleting the prometheus pod, but this tells kubernetes
to delete the old pod properly first before starting the new one.

Ref 2i2c-org#720 (comment)
@choldgraf
Member

@damianavila I don't think that I follow. Do you mean have a high-level issue about grafana/prometheus specifically?

@yuvipanda what are the things that we need to do in order to automatically deploy Grafana for our hubs? Is that something we should track in a new issue?

@yuvipanda
Member

@choldgraf i opened #739 to track that

@damianavila
Contributor

@damianavila I don't think that I follow. Do you mean have a high-level issue about grafana/prometheus specifically?

I was thinking more about a sort of "topic" label? The Managed JupyterHubs Infrastructure column is pretty big, so some filtering by label would help make related stuff more visible...

@choldgraf
Member

@damianavila ahhh - yes, I totally agree. That column has way more stuff in general than all of the other columns and I also find it hard to parse 😅. Another option is to create a label like

🏷️ reporting

Would that make sense?

@damianavila
Contributor

reporting? How would that work? It is not clear to me...

@choldgraf
Member

I was proposing that the topic we use to describe "infra to get info about our infra" is "reporting"

@damianavila
Contributor

Thanks for the clarification... reporting is too wide and prone to be confusing, IMHO, but I guess it is OK since I can't come up with a better word 😉.

@choldgraf
Member

choldgraf commented Oct 8, 2021

fair enough - what about 🏷️ ops reporting or 🏷️ infrastructure reporting?

@damianavila
Contributor

infrastructure reporting, I guess?

@yuvipanda
Member

yuvipanda commented Oct 10, 2021

I think 'monitoring' is a more common industry standard term for this. https://sre.google/sre-book/monitoring-distributed-systems/ is a nice read

@damianavila
Contributor

Yep, 💯 to monitoring.

@choldgraf
Member

ah I knew there was a better word haha, thanks - will go with that

Development

Successfully merging this pull request may close these issues.

Fix our Grafana / Prometheus data connection
4 participants