-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add katib studyjob launcher #754
Conversation
/retest kubeflow-pipeline-e2e-test |
f361365
to
d682625
Compare
/assign @richardsliu |
/assign @Ark-kun @gaoning777 |
d682625
to
d1daf0b
Compare
Hello @hougangliu, Do you think it would make sense to implement something more generic that could launch and wait for any CRD? On top of this generic functionality, some convenience higher level methods could be provided (as the Katib one that you are proposing here). WDYT? |
d1daf0b
to
cf28873
Compare
It makes sense |
/cc @hongye-sun Adding Hongye to this thread as he is looking into some related problem. |
cf28873
to
d4e49fc
Compare
/retest continuous-integration |
d4e49fc
to
a9108ed
Compare
@gaoning777 @Ark-kun @hongye-sun any comment? |
a9108ed
to
46fee76
Compare
/retest |
/assign @IronPan |
if wait_response.get("status", {}).get("condition") == "Completed": | ||
succ = True | ||
trial = get_best_trial(wait_response["status"]["bestTrialId"]) | ||
with open('/output.txt', 'w') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please get this path from the command line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done! thanks!
metricsnames, parameterconfigs, nasConfig, workertemplatepath, mcollectortemplatepath, suggestionspec): | ||
"""_generate_studyjob_yaml generates studyjob yaml file based on hp.template.yaml""" | ||
with open(src_filename, 'r') as f: | ||
content = yaml.load(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can just fully build the spec dict here in code instead of loading it from file and replacing some values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just follow how tfjob-launcher fills spec dict. I will submit another PR to fix the both two components. Thanks!
.cloudbuild.yaml
Outdated
@@ -97,8 +97,12 @@ steps: | |||
id: 'buildDeployer' | |||
- name: 'gcr.io/cloud-builders/docker' | |||
entrypoint: '/bin/bash' | |||
args: ['-c', 'cd /workspace/components/kubeflow/launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we not build the container automatically for now? It requires a complicated review process on our side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior here will keep unchanged. I just rename launcher (in fact, it only launches tfjob) to tf-launcher since we can add more launchers (saying pytorch job launcher, studyjob launcher and so on)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hougangliu. Could we do the renaming in a separate PR? There are several dependencies on this tfjob component so I would like to separate the renaming into another PR in case we need to rollback.
.cloudbuild.yaml
Outdated
@@ -97,8 +97,12 @@ steps: | |||
id: 'buildDeployer' | |||
- name: 'gcr.io/cloud-builders/docker' | |||
entrypoint: '/bin/bash' | |||
args: ['-c', 'cd /workspace/components/kubeflow/launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA'] | |||
args: ['-c', 'cd /workspace/components/kubeflow/tf-launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the change from launcher to tf-launcher necessary? It seems unrelated to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just rename launcher (in fact, it only launches tfjob) to tf-launcher since we can add more launchers (saying pytorch job launcher, studyjob launcher and so on)
@hongye-sun, could you please have a look at this one? I think you could have some good feedback as you are working on many components. |
.cloudbuild.yaml
Outdated
@@ -97,8 +97,12 @@ steps: | |||
id: 'buildDeployer' | |||
- name: 'gcr.io/cloud-builders/docker' | |||
entrypoint: '/bin/bash' | |||
args: ['-c', 'cd /workspace/components/kubeflow/launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @hougangliu. Could we do the renaming in a separate PR? There are several dependencies on this tfjob component so I would like to separate the renaming into another PR in case we need to rollback.
.cloudbuild.yaml
Outdated
id: 'buildLauncher' | ||
- name: 'gcr.io/cloud-builders/docker' | ||
entrypoint: '/bin/bash' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @hougangliu,
This entry is building the new container automatically. Could you please remove this entry? It requires a complicated review process on our side before we can release new containers automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, Done
6bfd44b
to
3eb4b1a
Compare
undo tf-launcher
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vicaire The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* add katib studyjob launcher * delete tmp file * fix link error to tf-laucher * import studyjob client from katib project * specify output file with a parameter undo tf-launcher
This change is