Apply default pod template to PytorchJob pods #297
Codecov Report
@@ Coverage Diff @@
## master #297 +/- ##
==========================================
+ Coverage 62.09% 62.29% +0.20%
==========================================
Files 145 145
Lines 11497 11564 +67
==========================================
+ Hits 7139 7204 +65
Misses 3815 3815
- Partials 543 545 +2
d25aade
to
2ec7da8
Compare
@hamersaw I compiled an image and ran it in our cluster; it works. However, how should I add tests for this? Ideally I would want to duplicate this test but apply a pod template. However, for this I would have to mock/patch This function, which is the equivalent for normal pods, is not tested in the case of an existing pod template either, from what I can see. Methods for mocking like those mentioned here wouldn't work without refactoring, and unfortunately I don't understand how you employ mockery. Do you see a way to mock
All great questions. TL;DR I think testing the
You could manually inject a
So mockery works by generating stubs for golang interfaces. Currently the
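To illustrate the idea of stubbing an interface, here is a minimal hand-written sketch. Everything in it is hypothetical: `TemplateGetter`, `stubGetter` and `describeTemplate` are names invented for this example, and `PodTemplate` is a simplified stand-in for the Kubernetes type, not the plugin's real code. A tool such as mockery would generate a test double of roughly this shape from the interface definition.

```go
package main

import "fmt"

// PodTemplate is a simplified stand-in for the Kubernetes PodTemplate type.
type PodTemplate struct {
	Name string
}

// TemplateGetter is a hypothetical interface; code that depends on it
// can be handed a stub in tests instead of a real template store.
type TemplateGetter interface {
	GetDefaultTemplate(namespace string) *PodTemplate
}

// stubGetter is a hand-written test double that serves templates from a map.
type stubGetter struct {
	templates map[string]*PodTemplate
}

// GetDefaultTemplate returns the template for the namespace, or nil if none is set.
func (s stubGetter) GetDefaultTemplate(namespace string) *PodTemplate {
	return s.templates[namespace]
}

// describeTemplate stands in for the code under test; it only sees the interface.
func describeTemplate(g TemplateGetter, namespace string) string {
	t := g.GetDefaultTemplate(namespace)
	if t == nil {
		return "no template"
	}
	return fmt.Sprintf("template %s", t.Name)
}

func main() {
	g := stubGetter{templates: map[string]*PodTemplate{"flyte": {Name: "flyte-template"}}}
	fmt.Println(describeTemplate(g, "flyte"))
	fmt.Println(describeTemplate(g, "other"))
}
```

Because the code under test only depends on the interface, both the "template present" and "template absent" paths can be exercised without touching a cluster.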
I think this looks great. A few nits to maybe look into.
Also I see the `PodTemplateSpec` here and here have an `ObjectMeta` field. In `BuildPodWithSpec` we maintain the metadata to bring things like labels and annotations over. Do you know whether, if we set `ObjectMeta` on the `PodTemplateSpec`, it will be maintained in the Pods launched from the operator?
No, the operator doesn't carry them over. I created a PyTorchJob with the following metadata:

    apiVersion: "kubeflow.org/v1"
    kind: PyTorchJob
    metadata:
      name: pytorch-simple
      namespace: test
      labels:
        foo: bar
      annotations:
        afoo: abar

It results in Pods that look like this:

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: "2022-11-30T10:07:42Z"
      labels:
        group-name: kubeflow.org
        job-name: pytorch-simple
        job-role: master
        replica-index: "0"
        replica-type: master
        training.kubeflow.org/job-name: pytorch-simple
        training.kubeflow.org/job-role: master
        training.kubeflow.org/operator-name: pytorchjob-controller
        training.kubeflow.org/replica-index: "0"
        training.kubeflow.org/replica-type: master
      name: pytorch-simple-master-0
      namespace: test
      ownerReferences:
      - apiVersion: kubeflow.org/v1
        blockOwnerDeletion: true
        controller: true
        kind: PyTorchJob
        name: pytorch-simple
        uid: 84b4e76d-de63-46a5-a829-c1595b7fd4bd
      resourceVersion: "30970195"
      uid: 7e5232d1-0deb-4013-bbba-cf350232c1db
@hamersaw can you please take a look at the new changes? I tried to make sensible commits that make the review easy, maybe look at them one by one.
About unit tests: The existing test
Having added these lines ...

    if podTemplate != nil {
        mergedPodSpec, err := flytek8s.MergePodSpecs( ...

... reduced the test coverage though. Do you have a suggestion for how we can make the build green without mocking anything here? And do we have to?
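One way to cover a guard like `if podTemplate != nil` without mocks is to keep the merge itself a pure function and test both branches directly. The sketch below is only an illustration under that assumption: `applyTemplate` and the reduced `PodSpec` type (volume names only) are hypothetical and do not reflect the actual signature or behavior of `flytek8s.MergePodSpecs`.

```go
package main

import (
	"errors"
	"fmt"
)

// PodSpec is a simplified stand-in for v1.PodSpec; only volume names are modelled.
type PodSpec struct {
	Volumes []string
}

// applyTemplate is a hypothetical helper, not the plugin's actual code.
// It returns a new spec holding the template's volumes followed by any
// volumes from spec that are not already present. A nil template is
// allowed and yields a copy of spec; a nil spec is an error.
func applyTemplate(spec, template *PodSpec) (*PodSpec, error) {
	if spec == nil {
		return nil, errors.New("pod spec is nil")
	}
	merged := &PodSpec{}
	seen := map[string]bool{}
	add := func(names []string) {
		for _, name := range names {
			if !seen[name] {
				merged.Volumes = append(merged.Volumes, name)
				seen[name] = true
			}
		}
	}
	if template != nil {
		add(template.Volumes)
	}
	add(spec.Volumes)
	return merged, nil
}

func main() {
	spec := &PodSpec{Volumes: []string{"data"}}
	withTemplate, _ := applyTemplate(spec, &PodSpec{Volumes: []string{"dshm"}})
	withoutTemplate, _ := applyTemplate(spec, nil)
	_, err := applyTemplate(nil, nil)
	fmt.Println(withTemplate.Volumes, withoutTemplate.Volumes, err)
}
```

With this shape, a table-driven test can pass a template, no template, and a nil spec, so every branch is reached without patching any global state.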
Oh sorry, I don't think I explained in-depth enough. I was thinking the
IIUC the updated default container names mean that we need to create a container in the default PodTemplate called
No, I'm currently using this pod template with a single container named `default`:

    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: flyte-template
      namespace: flyte
    template:
      spec:
        containers:
        - image: foo
          imagePullPolicy: Always
          name: default
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm

With the current implementation this translates into this pod:

    apiVersion: v1
    kind: Pod
    ...
    spec:
      containers:
      - name: pytorch
        ...
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
    ...
Ah OK, small nit then. If we are not using the
I just realised that in

    // merge template Containers
    var mergedContainers []v1.Container
    var defaultContainerTemplate, primaryContainerTemplate *v1.Container
    for i := 0; i < len(podTemplatePodSpecCopy.Containers); i++ {
        if podTemplatePodSpecCopy.Containers[i].Name == defaultContainerTemplateName {
            defaultContainerTemplate = &podTemplatePodSpecCopy.Containers[i]
        } else if podTemplatePodSpecCopy.Containers[i].Name == primaryContainerTemplateName {
            primaryContainerTemplate = &podTemplatePodSpecCopy.Containers[i]
        }
    }

That is why
Let me take another close look at this and also at setting the
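The lookup loop quoted above can be tried in isolation with a small, self-contained sketch. The `Container` type below is a simplified stand-in for `v1.Container`, and `findTemplates` and `templateFor` are hypothetical helpers. The template names `"default"` and `"primary"` and the selection rule (prefer the primary template for the primary container, otherwise fall back to the default template) are assumptions made for illustration, not a statement of what `MergePodSpecs` does.

```go
package main

import "fmt"

// Container is a simplified stand-in for v1.Container.
type Container struct {
	Name         string
	VolumeMounts []string
}

// findTemplates scans the template's containers once and returns pointers
// to the entries whose names match the default and primary template names.
// Indexing the slice (rather than taking the address of a loop variable)
// keeps the pointers tied to the slice elements, as in the quoted loop.
func findTemplates(containers []Container, defaultName, primaryName string) (def, primary *Container) {
	for i := range containers {
		switch containers[i].Name {
		case defaultName:
			def = &containers[i]
		case primaryName:
			primary = &containers[i]
		}
	}
	return def, primary
}

// templateFor picks the template for one task container. Assumption: the
// primary template applies only to the primary container and only if it
// exists; every other case falls back to the default template (may be nil).
func templateFor(c Container, primaryContainerName string, def, primary *Container) *Container {
	if c.Name == primaryContainerName && primary != nil {
		return primary
	}
	return def
}

func main() {
	templates := []Container{{Name: "default", VolumeMounts: []string{"/dev/shm"}}}
	def, primary := findTemplates(templates, "default", "primary")
	chosen := templateFor(Container{Name: "pytorch"}, "pytorch", def, primary)
	fmt.Println(chosen.Name, chosen.VolumeMounts, primary == nil)
}
```

Under these assumptions, a template that only defines a `default` container still reaches a container named `pytorch`, which matches the pod output shown earlier in the thread.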
I tested the Pytorch Job plugin with the following template:

    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: flyte-template
      namespace: flyte
    template:
      metadata:
        labels:
          foo: bar # <- new
      spec:
        containers:
        - image: foo
          imagePullPolicy: Always
          name: default
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm

The template leads to the following pod:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        foo: bar
        group-name: kubeflow.org
        ....
      name: avjfhs8nsrq4jqm46hzr-feihpdka-0-master-0
      namespace: development
      ...
    spec:
      affinity: {}
      containers:
      - args:
        - pyflyte-fast-execute
        ...
        name: pytorch
        ...
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      ...
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
    ...
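The pod above carries both the template's `foo: bar` label and the labels set by the operator. A minimal sketch of such a merge is shown below. `mergeLabels` is a hypothetical helper, and the precedence rule (operator labels win on a key conflict) is an assumption for illustration; the thread does not say how conflicts are actually resolved.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeLabels returns a new map containing the template labels overlaid by
// the operator labels. Assumption: on a key conflict the operator's value wins.
// Neither input map is modified.
func mergeLabels(template, operator map[string]string) map[string]string {
	merged := make(map[string]string, len(template)+len(operator))
	for k, v := range template {
		merged[k] = v
	}
	for k, v := range operator {
		merged[k] = v
	}
	return merged
}

func main() {
	merged := mergeLabels(
		map[string]string{"foo": "bar"},
		map[string]string{"group-name": "kubeflow.org", "replica-type": "master"},
	)
	// Sort keys so the printed order is stable.
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, merged[k])
	}
}
```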
    @@ -67,6 +67,21 @@ func (mpiOperatorResourceHandler) BuildResource(ctx context.Context, taskCtx plu
        return nil, flyteerr.Errorf(flyteerr.BadTaskSpecification, "Unable to create pod spec: [%v]", err.Error())
    }

    common.OverrideDefaultContainerName(taskCtx, podSpec, kubeflowv1.MPIJobDefaultContainerName)
This is interesting. I'm not seeing the GetDefaultContainerName function being used anywhere. I'm wondering whether it either doesn't care about the container name or the launcher pod automatically updates container names to reflect this.
cc @bimtauer is this something you know anything about? or can review this update?
@hamersaw I will try and have a look by the end of the week!
Signed-off-by: Fabio Grätz <[email protected]>
af9d869
to
ffbe158
Compare
Looks great! Clean up tests and get another pair of eyes on it and let's merge!
cc @bimtauer - this PR adds the default PodTemplate to the MPI task type. IIUC this should address this issue on IAM roles with annotations for the MPI tasks. Could you take a look through this to sanity check the MPI updates? It would be VERY helpful!
Signed-off-by: Fabio Grätz <[email protected]>
2db21b3
to
b5ed78a
Compare
Renamed my test and deleted the old one 👍
* Merge pod template spec with pod spec in separate func
  Signed-off-by: Fabio Grätz <[email protected]>
* Merge pod template spec with pod spec in separate func
  Signed-off-by: Fabio Grätz <[email protected]>
* Apply pod template to pytorch job pod spec
  Signed-off-by: Fabio Grätz <[email protected]>
* Handle nil podspecs before merging
  Signed-off-by: Fabio Grätz <[email protected]>
* Pass both default and primary container name to MergePodSpecs
  Signed-off-by: Fabio Grätz <[email protected]>
* Move podSpec.DeepCopy into MergePodSpecs
  Signed-off-by: Fabio Grätz <[email protected]>
* Add tests
  Signed-off-by: Fabio Grätz <[email protected]>
* Merge pod template into tfjob and mpijob
  Signed-off-by: Fabio Grätz <[email protected]>
* Lint
  Signed-off-by: Fabio Grätz <[email protected]>
* Correct usage of default and primary container (template) name
  Signed-off-by: Fabio Grätz <[email protected]>
* Override mpi default container name
  Signed-off-by: Fabio Grätz <[email protected]>
* Carry over ObjectMeta from pod template
  Signed-off-by: Fabio Grätz <[email protected]>
* Remove old `TestBuildPodWithSpec` test
  Signed-off-by: Fabio Grätz <[email protected]>

Signed-off-by: Fabio Grätz <[email protected]>
Co-authored-by: Fabio Grätz <[email protected]>
TL;DR
Currently, the default pod template is applied to Python tasks but not to kf-operator tasks such as the Pytorch task.
This is a problem because, for example, pytorch dataloaders often need more shared memory, which on K8s is provided by mounting an `emptyDir` volume. Currently there is no mechanism to do that for Pytorch tasks.
In this PR, the `v1.PodSpec` of the default pod template is applied to the `PodSpec` of the `PytorchJob`.
.Type
Are all requirements met?
Complete description
Tracking Issue
fixes https://github.com/flyteorg/flyte/issues/