This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Cancel queued AzureML jobs when starting a PR build #640

Merged 26 commits on Jan 25, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -16,6 +16,7 @@ loss.

### Added
- ([#594](https://github.com/microsoft/InnerEye-DeepLearning/pull/594)) When supplying a "--tag" argument, the AzureML jobs use that value as the display name, to more easily distinguish runs.
- ([#640](https://github.com/microsoft/InnerEye-DeepLearning/pull/640)) Cancel AzureML jobs from previous runs of the PR build in the same branch to reduce AML load
- ([#577](https://github.com/microsoft/InnerEye-DeepLearning/pull/577)) Commandline switch `monitor_gpu` to monitor
GPU utilization via Lightning's `GpuStatsMonitor`, switch `monitor_loading` to check batch loading times via
`BatchTimeCallback`, and `pl_profiler` to turn on the Lightning profiler (`simple`, `advanced`, or `pytorch`)
8 changes: 8 additions & 0 deletions azure-pipelines/azureml-conda-environment.yml
@@ -0,0 +1,8 @@
name: AzureML_SDK
channels:
  - defaults
dependencies:
  - pip=20.1.1
  - python=3.7.3
  - pip:
    - azureml-sdk==1.36.0
14 changes: 14 additions & 0 deletions azure-pipelines/build-pr.yml
@@ -17,6 +17,12 @@ variables:
  disable.coverage.autogenerate: 'true'

jobs:
  - job: CancelPreviousJobs
    pool:
      vmImage: 'ubuntu-18.04'
    steps:
      - template: cancel_aml_jobs.yml

  - job: Windows
    pool:
      vmImage: 'windows-2019'
@@ -30,6 +36,7 @@ jobs:
      - template: build.yaml

  - job: TrainInAzureML
    dependsOn: CancelPreviousJobs
    variables:
      - name: tag
        value: 'TrainBasicModel'
@@ -48,6 +55,7 @@ jobs:
          test_run_title: tests_after_training_single_run

  - job: RunGpuTestsInAzureML
    dependsOn: CancelPreviousJobs
    variables:
      - name: tag
        value: 'RunGpuTests'
@@ -70,6 +78,7 @@ jobs:
  # is trained, because we use this build to also check the "submit_for_inference" code, that
  # presently only handles single channel models.
  - job: TrainInAzureMLViaSubmodule
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'BasicModel2Epochs1Channel'
@@ -90,6 +99,7 @@ jobs:

  # Train a 2-element ensemble model
  - job: TrainEnsemble
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'BasicModelForEnsembleTest'
@@ -114,6 +124,7 @@ jobs:

  # Train a model on 2 nodes
  - job: Train2Nodes
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'BasicModel2EpochsMoreData'
@@ -135,6 +146,7 @@ jobs:
          test_run_title: tests_after_training_2node_run

  - job: TrainHelloWorld
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'HelloWorld'
@@ -152,6 +164,7 @@ jobs:
  # Run HelloContainer on 2 nodes. HelloContainer uses native Lightning test set inference, which can get
  # confused after doing multi-node training in the same script.
  - job: TrainHelloContainer
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'HelloContainer'
@@ -176,6 +189,7 @@ jobs:
  # regressions in AML when requesting more than the default amount of memory. This needs to run with all subjects to
  # trigger the bug, total runtime 10min
  - job: TrainLung
    dependsOn: CancelPreviousJobs
    variables:
      - name: model
        value: 'Lung'
2 changes: 2 additions & 0 deletions azure-pipelines/build_data_quality.yaml
@@ -1,6 +1,8 @@
steps:
  - template: checkout.yml

  - template: prepare_conda.yml

  - bash: |
      conda env create --file InnerEye-DataQuality/environment.yml --name InnerEyeDataQuality
      source activate InnerEyeDataQuality
46 changes: 46 additions & 0 deletions azure-pipelines/cancel_aml_jobs.py
@@ -0,0 +1,46 @@
# ------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# ------------------------------------------------------------------------------------------
import os

from azureml._restclient.constants import RunStatus
from azureml.core import Experiment, Run, Workspace
from azureml.core.authentication import ServicePrincipalAuthentication


def cancel_running_and_queued_jobs() -> None:
    """Cancel all AzureML runs that are still active in the experiment that belongs to the current PR branch."""
    environ = os.environ
    print("Authenticating")
    auth = ServicePrincipalAuthentication(
        tenant_id='72f988bf-86f1-41af-91ab-2d7cd011db47',
        service_principal_id=environ["APPLICATION_ID"],
        service_principal_password=environ["APPLICATION_KEY"])
    print("Getting AML workspace")
    workspace = Workspace.get(
        name="InnerEye-DeepLearning",
        auth=auth,
        subscription_id=environ["SUBSCRIPTION_ID"],
        resource_group="InnerEye-DeepLearning")
    branch = environ["BRANCH"]
    print(f"Branch: {branch}")
    if not branch.startswith("refs/pull/"):
        print("This branch is not a PR branch, hence not cancelling anything.")
        # Returning ends the script with exit code 0 without relying on the interactive `exit` helper.
        return
    # The experiment is named after the PR ref, with slashes replaced by underscores.
    experiment_name = branch.replace("/", "_")
    print(f"Experiment: {experiment_name}")
    experiment = Experiment(workspace, name=experiment_name)
    print(f"Retrieved experiment {experiment.name}")
    for run in experiment.get_runs(include_children=True, properties={}):
        assert isinstance(run, Run)
        status_suffix = f"'{run.status}' run {run.id} ({run.display_name})"
        # Runs that have finished, or are already being cancelled, are left alone.
        if run.status in (RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.FINALIZING, RunStatus.CANCELED,
                          RunStatus.CANCEL_REQUESTED):
            print(f"Skipping {status_suffix}")
        else:
            print(f"Cancelling {status_suffix}")
            run.cancel()


if __name__ == "__main__":
    cancel_running_and_queued_jobs()
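
The two decisions the script makes (which experiment to look at, and which runs to cancel) can be checked without an AzureML workspace. The following is a minimal sketch, not part of the PR: the helper names `experiment_name_for_branch` and `runs_to_cancel` are hypothetical, and the status strings are assumed to match the values behind the `RunStatus` constants used above.

```python
from typing import Iterable, List, Optional, Tuple

# Assumption: these strings mirror RunStatus.COMPLETED, FAILED, FINALIZING,
# CANCELED and CANCEL_REQUESTED in the AzureML SDK.
STATUSES_TO_SKIP = frozenset({"Completed", "Failed", "Finalizing", "Canceled", "CancelRequested"})


def experiment_name_for_branch(branch: str) -> Optional[str]:
    """Return the experiment name for a PR ref, or None if the ref is not a PR branch."""
    if not branch.startswith("refs/pull/"):
        return None
    return branch.replace("/", "_")


def runs_to_cancel(runs: Iterable[Tuple[str, str]]) -> List[str]:
    """Given (run_id, status) pairs, return the IDs of the runs that would be cancelled."""
    return [run_id for run_id, status in runs if status not in STATUSES_TO_SKIP]


if __name__ == "__main__":
    # Illustrative ref and run list, not taken from a real build.
    print(experiment_name_for_branch("refs/pull/640/merge"))
    print(runs_to_cancel([("a", "Running"), ("b", "Completed"), ("c", "Queued")]))
```

Because the filter is a deny-list, any status outside the skipped set (for example a queued or preparing run) is cancelled, which matches the `else` branch of the script.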
27 changes: 27 additions & 0 deletions azure-pipelines/cancel_aml_jobs.yml
@@ -0,0 +1,27 @@
steps:
  - checkout: self

  - template: prepare_conda.yml

  # https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#pythonanaconda
  - task: Cache@2
    displayName: Use cached Conda environment AzureML_SDK
    inputs:
      # Beware of changing the cache key or path independently, safest to change in sync
      key: 'usr_share_miniconda_azureml_conda | "$(Agent.OS)" | azure-pipelines/azureml-conda-environment.yml'
      path: /usr/share/miniconda/envs
      cacheHitVar: CONDA_CACHE_RESTORED

  - script: conda env create --file azure-pipelines/azureml-conda-environment.yml
    displayName: Create Conda environment AzureML_SDK
    condition: eq(variables.CONDA_CACHE_RESTORED, 'false')

  - bash: |
      source activate AzureML_SDK
      python azure-pipelines/cancel_aml_jobs.py
    displayName: Cancel jobs from previous run
    env:
      SUBSCRIPTION_ID: $(InnerEyeDevSubscriptionID)
      APPLICATION_ID: $(InnerEyeDeepLearningServicePrincipalID)
      APPLICATION_KEY: $(InnerEyeDeepLearningServicePrincipalKey)
      BRANCH: $(Build.SourceBranch)
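
The last step passes four environment variables to the Python script, which reads them with `environ[...]` and would fail with a bare `KeyError` if one were absent. A possible pre-flight check is sketched below. It is not part of the PR: `missing_variables` is a hypothetical helper, and treating a literal `$(...)` value as missing rests on the assumption that Azure Pipelines leaves an undefined macro unexpanded.

```python
import os
from typing import List, Mapping

# The variable names set in the `env:` block of the pipeline step.
REQUIRED_VARIABLES = ("SUBSCRIPTION_ID", "APPLICATION_ID", "APPLICATION_KEY", "BRANCH")


def missing_variables(environ: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset, empty, or still an unexpanded $(...) macro."""
    missing = []
    for name in REQUIRED_VARIABLES:
        value = environ.get(name, "").strip()
        if not value or (value.startswith("$(") and value.endswith(")")):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_variables(os.environ)
    if problems:
        raise SystemExit(f"Missing pipeline variables: {', '.join(problems)}")
    print("All required variables are set.")
```

Failing early with the variable names gives a clearer build log than a stack trace from inside the authentication call.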
18 changes: 0 additions & 18 deletions azure-pipelines/checkout.yml
@@ -2,21 +2,3 @@ steps:
  - checkout: self
    lfs: true
    submodules: true

  - bash: |
      subdir=bin
      echo "Adding this directory to PATH: $CONDA/$subdir"
      echo "##vso[task.prependpath]$CONDA/$subdir"
    displayName: Add conda to PATH
    condition: succeeded()

  - bash: |
      conda install conda=4.8.3 -y
      conda --version
      conda list
    displayName: Print conda version and initial package list

  - bash: |
      sudo chown -R $USER /usr/share/miniconda
    condition: and(succeeded(), eq( variables['Agent.OS'], 'Linux' ))
    displayName: Take ownership of conda installation
2 changes: 2 additions & 0 deletions azure-pipelines/inner_eye_env.yml
@@ -3,6 +3,8 @@ steps:

  - template: store_settings.yml

  - template: prepare_conda.yml

  # https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#pythonanaconda
  - task: Cache@2
    displayName: Use cached Conda environment
12 changes: 12 additions & 0 deletions azure-pipelines/prepare_conda.yml
@@ -0,0 +1,12 @@
steps:
  - bash: |
      subdir=bin
      echo "Adding this directory to PATH: $CONDA/$subdir"
      echo "##vso[task.prependpath]$CONDA/$subdir"
    displayName: Add conda to PATH
    condition: succeeded()

  - bash: |
      sudo chown -R $USER /usr/share/miniconda
    condition: and(succeeded(), eq( variables['Agent.OS'], 'Linux' ))
    displayName: Take ownership of conda installation
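
The first step works by echoing an Azure Pipelines logging command: the agent reads the `##vso[task.prependpath]` line from standard output and prepends the given directory to `PATH` for subsequent steps. As a rough illustration only, the same line could be built in Python. The function name `prepend_path_command` is hypothetical, and the default Conda root shown is simply the path used elsewhere in these templates.

```python
import posixpath


def prepend_path_command(conda_root: str, subdir: str = "bin") -> str:
    """Build the logging command that asks the agent to prepend <conda_root>/<subdir> to PATH."""
    return f"##vso[task.prependpath]{posixpath.join(conda_root, subdir)}"


if __name__ == "__main__":
    # On a hosted agent, $CONDA would supply the root; here it is hard-coded for illustration.
    print(prepend_path_command("/usr/share/miniconda"))
```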