feat: store podname in nodestatus #12503
base: main
Conversation
@isubasinghe why do we need the pod name in node status? Can you provide more details about the use case / scenario?
@sarabala1979 the issue isn't created yet, but this PR addresses this comment by Alex, where we decided to store the podName in the node status: #10267 (comment)
```go
if node.PodName != nil {
	podName = *node.PodName
} else {
	// No pod name recorded on the node; reconstruct what it should have
	// been and fail loudly.
	expectedPodName := util.GeneratePodName(wfName, node.Name, templateName, node.ID, podNameVersion)
	panic("podName absent expected " + expectedPodName + " for " + node.Name)
}
```
This is probably not the desirable behaviour; there might be cases where the pod has not been created yet.
What should we do in this case?
I only took a quick glance at this, but doesn't this need to modify the UI code as well?
Also, for backward-compatibility, we'd still have to generate the pod names for Workflows created before this change.
We'll need to test this thoroughly this time
I thought I replied to this; I must have just not pressed the comment button. This was on purpose: I wanted to make the backend changes first and follow them up with the frontend changes, but I'm happy to make the UI changes here as well if you'd like me to.
Hmmm yeah you are right. What I might do is also keep the previous pod name generation code.
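A minimal sketch of that fallback, assuming the `util.GeneratePodName` call shown in the snippet under review; the helper name `podNameForNode`, the import paths, and the `util.PodNameVersion` type are illustrative assumptions, not the PR's actual diff:

```go
package controller

import (
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	"github.com/argoproj/argo-workflows/v3/workflow/util"
)

// podNameForNode prefers the pod name recorded on the node status (the
// field this PR adds) and only falls back to the legacy derivation --
// rather than panicking -- for nodes whose pod has not been created yet
// or that belong to Workflows created before this change.
func podNameForNode(wfName, templateName string, version util.PodNameVersion, node wfv1.NodeStatus) string {
	if node.PodName != nil {
		return *node.PodName
	}
	return util.GeneratePodName(wfName, node.Name, templateName, node.ID, version)
}
```

Keeping the derivation as a fallback rather than deleting it is also what makes the backward-compat concerns raised above workable.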
Bumping this. @isubasinghe, do you think you can rebase and resolve the conflicts? @terrytangyuan, this is something we use in Kubeflow Pipelines to retrieve pod names (we are currently planning to use a workaround you prescribed here). We are very much interested in seeing this merged so we can rely on Argo to provide this value; what can we do to get this some more traction?
We can discuss this in the upcoming contributors meeting to prioritize.
@HumairAK this is currently a feature, so it wouldn't land until the next minor (currently 3.6). If KFP is only bumping to 3.4, this PR wouldn't solve your problem.
@agilgur5 that is fine, hence why we plan to go with the workaround for now until we can upgrade to the version that (hopefully) has this change included, at which point we'll switch to relying on Argo for this.
So this certainly simplifies things moving forward (not during a backward-compat window though), but in recent months I had found older comments from Alex C about trying to store as little as possible in the `status`, due to our existing issues with large Workflows (`nodeStatusOffload` + general memory usage), such as #9193 (comment):

> Please don't add new fields to `NodeStatus`, as each new field needs `O(n)` extra storage, where `n` is the number of nodes. It is better to traverse `status` and build a new structure if needed. A one-time traverse would be `O(n)`.

I'm wondering if that was perhaps the original rationale behind some of the derivations in the codebase -- derive the field instead of storing it in `status`.

Effectively, this is a trade-off between storage and determinism. It actually reminds me quite heavily of the evolution of lockfiles in the JS ecosystem. There, though, determinism is simple to favor over storage, as hard disk/FS storage is cheap and plentiful (although VCS storage and diffs are a bit more complicated), compared to in-memory or etcd storage, where we have a hard limit.
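For contrast, a rough sketch of the derive-on-demand pattern that quote describes: one `O(n)` traversal building a lookup, instead of persisting a new field on every node. The function name is hypothetical and the template-name handling is simplified (it ignores `templateRef` resolution); imports are as in the earlier sketch:

```go
// buildPodNameIndex walks status.nodes exactly once and returns a
// nodeID -> generated pod name map, so nothing extra is stored in the
// Workflow object itself (and hence in etcd).
func buildPodNameIndex(wf *wfv1.Workflow, version util.PodNameVersion) map[string]string {
	index := make(map[string]string, len(wf.Status.Nodes))
	for id, node := range wf.Status.Nodes {
		if node.Type != wfv1.NodeTypePod {
			continue // only pod-type nodes have a backing pod
		}
		index[id] = util.GeneratePodName(wf.Name, node.Name, node.TemplateName, id, version)
	}
	return index
}
```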
@HumairAK ok. That does affect our prioritization though, as it means KFP won't be using this functionality in the short or even near term.
To be clear, I don't have a strong opinion on that per se; but I think we should be explicit about that -- larger workflows would be affected by this feature. I probably lean toward agreeing with this feature, as it would simplify areas of the codebase where we do have some bugs / edge-cases currently that are handled in different ways in different places.
The deprecation window is actually a bit complicated now that I think about it. If we ever want to remove the pod name generation code, we would break backward-compat with Workflows that were created in a version that doesn't contain the pod name in the `status`.

For instance, say 3.6 puts pod names into the `status`. Then say, in 3.7, we were to remove the backward-compat -- all Workflows created with Argo <3.6 would no longer be processable, e.g. for retries or Pod logs in the UI, and possibly other features.

With that in mind, the pod name derivation code is going to still have to be around for a while. Defaulting to this feature means that we could eliminate bugs for newer Workflows, but might still have them present in older Workflows. So the frequency of the old bugs would decrease, but we may want to still fix them.
One concern is that this will increase the object size.
Discussed in today's Contributor Meeting. The storage concern that I brought up earlier is still relevant, especially as we've had more users recently reporting etcd getting full (particularly on managed k8s providers, where you cannot configure your etcd space), such as #12802. But given that I was the one who brought it up, and I personally think the trade-off is worthwhile in this case -- determinism over storage, treating storage as a separate problem to optimize independently of any one field -- this has the green light to go. We do need to implement the backward-compat in this PR, as well as decide whether pod names will be on all nodes or just pod nodes.
This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.
This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.
Hmm, I would still be interested in seeing this change. @isubasinghe, do you have the bandwidth to continue work on this?
@HumairAK will continue on this when I get the chance
Fixes #12528
Motivation
Modifications
The changes are relatively simple: I reduced the pod name generation code as much as possible and stored the PodName in the node itself.
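A rough sketch of what that implies for the API type, assuming the new field is an optional pointer on `NodeStatus` (placement and naming here are assumptions based on the discussion above, not the PR's exact diff):

```go
package v1alpha1

// NodeStatus with the new field (all existing fields elided). Using a
// pointer plus omitempty means Workflows serialized before this change
// still deserialize cleanly, with PodName simply left nil.
type NodeStatus struct {
	// PodName records the name of the pod created for this node. It stays
	// nil until the controller has actually created the pod.
	PodName *string `json:"podName,omitempty"`
}
```

The controller would then set the field at pod-creation time (e.g. `name := pod.Name; node.PodName = &name`), which is why a podName only exists once the pod does -- see Verification below.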
Verification
Verified via testing. Note that this introduces a slight change to existing behaviour: a podName is only assigned once the pod has actually been created.