This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Migrate to Pytorch Lightning #323

Merged 214 commits on Jan 28, 2021
Commits
0762a2f
work in progress
ant0nsc Nov 17, 2020
3c5cf48
training loop running
ant0nsc Nov 18, 2020
e1dfbf0
more changes, but training now broken
ant0nsc Nov 19, 2020
90b8db9
fix training
ant0nsc Nov 19, 2020
4b0f5be
HelloWorld running end to end
ant0nsc Nov 19, 2020
c1ab41d
small cleanup
ant0nsc Nov 19, 2020
e69b56d
Set seeds correctly
Shruthi42 Nov 20, 2020
60e6509
Switch to using our set_random_seed function
Shruthi42 Nov 20, 2020
a071cd6
work in progress: scalar models
ant0nsc Nov 20, 2020
90ea22b
Merge remote-tracking branch 'origin/shbannur/pl-patches' into antons…
ant0nsc Nov 20, 2020
12f02a7
Scalar models are training, first regression tests are passing
ant0nsc Nov 20, 2020
aa3a049
making seg run again
ant0nsc Nov 23, 2020
5130fbe
hello world is running
ant0nsc Nov 23, 2020
fd3b996
scalar inference passing
ant0nsc Nov 23, 2020
b9abb50
test_train_2d_classification_model passes
ant0nsc Nov 25, 2020
5a7afdb
Merge remote-tracking branch 'origin/master' into antonsc/pl
ant0nsc Nov 25, 2020
271ea9b
more test fixes
ant0nsc Nov 25, 2020
5b9a25a
trainer error messages
ant0nsc Nov 26, 2020
594e497
changes to enable ddp
ant0nsc Nov 27, 2020
e265762
error message when missing store
ant0nsc Nov 27, 2020
7509f7b
enable DDP via script
ant0nsc Nov 27, 2020
72be461
test on GPU
ant0nsc Nov 27, 2020
7113d09
move loggers out of config
ant0nsc Nov 30, 2020
d6bf552
Log to MLFlow, sync_dist
ant0nsc Nov 30, 2020
c06cb99
avoid blobxfer
ant0nsc Nov 30, 2020
bd6e641
setting run ID
ant0nsc Nov 30, 2020
2d34b87
Fix tests
javier-alvarez Dec 1, 2020
26600df
Writing epoch metrics works
ant0nsc Dec 2, 2020
bb066e9
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 2, 2020
fe38c5c
create output dir
ant0nsc Dec 2, 2020
f7109be
dtype fix
ant0nsc Dec 2, 2020
26f0c41
Fix temperature_scaling.py
javier-alvarez Dec 2, 2020
fa5c210
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
Dec 2, 2020
e7297c7
writing epoch metrics re-done
ant0nsc Dec 2, 2020
e1c9e24
Fix major flake8 issues
Dec 2, 2020
54b7658
clean up diagnostics
ant0nsc Dec 2, 2020
ead16d9
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 2, 2020
33b660f
Remove blobxfer
javier-alvarez Dec 2, 2020
68a67ae
Update CHANGELOG.md
javier-alvarez Dec 2, 2020
13d6b8a
Remove configs that are not required
javier-alvarez Dec 2, 2020
3e07584
Remove from environment.yml
javier-alvarez Dec 2, 2020
2b36fdc
Fix numba issue
javier-alvarez Dec 2, 2020
73d80eb
Improve CHANGELOG.md
javier-alvarez Dec 2, 2020
480320c
Fix tests
javier-alvarez Dec 2, 2020
7029269
training test is green
ant0nsc Dec 2, 2020
04ce7d0
fix for typo
ant0nsc Dec 2, 2020
741323a
Merge remote-tracking branch 'origin/jaalvare/remove_blobxfer' into a…
ant0nsc Dec 2, 2020
30dfe30
more import fixes
ant0nsc Dec 2, 2020
6bdb33c
more test fixes
ant0nsc Dec 2, 2020
a1fa505
enable sequence models
ant0nsc Dec 2, 2020
43eeae3
delete legacy model steps
ant0nsc Dec 2, 2020
6ffdcd4
print pythonpath
ant0nsc Dec 3, 2020
fe7456d
print cwd
ant0nsc Dec 3, 2020
779ec98
Merge remote-tracking branch 'origin/master' into antonsc/pl
Dec 3, 2020
59f1f75
Working around pickling problem
ant0nsc Dec 3, 2020
66c62cf
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 3, 2020
efe13ba
cleanup of args file
ant0nsc Dec 3, 2020
4fcd1ed
sys.path hack
ant0nsc Dec 3, 2020
ef98103
comments
ant0nsc Dec 3, 2020
b8aa84a
absolute path
ant0nsc Dec 3, 2020
9e4889f
reduce workers
ant0nsc Dec 3, 2020
ec7c98e
reduce number of GPUs for tests
ant0nsc Dec 3, 2020
a31c045
restore environment
ant0nsc Dec 4, 2020
683ceb5
import error
ant0nsc Dec 4, 2020
fb88437
wider format
ant0nsc Dec 4, 2020
5715048
adding time as metric
ant0nsc Dec 9, 2020
955f39f
docu
ant0nsc Dec 16, 2020
39eb2c6
Add PL metrics for scalar models (#340)
melanibe Dec 16, 2020
86c2de0
cleanup and tests
ant0nsc Dec 16, 2020
6d1f4e2
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 16, 2020
29afa1f
Checkpoint handling in Pytorch Lightning (#337)
Shruthi42 Dec 17, 2020
35295be
Update, but not working yet
ant0nsc Dec 17, 2020
5010021
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 17, 2020
199428d
training of segmentation model works, test_valid_model_train passes
ant0nsc Dec 17, 2020
7e5263a
logs the loss in both train and val, but not the other metrics
ant0nsc Dec 17, 2020
31279e1
metrics are working correctly
ant0nsc Dec 17, 2020
0665e9d
timing works for validation, but not for training
ant0nsc Dec 18, 2020
f5e4fec
Remove optimizer from ModelAndInfo for move to Pytorch Lightning (#341)
Shruthi42 Dec 18, 2020
33ac4df
more tests working
ant0nsc Dec 19, 2020
803bcb0
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 19, 2020
000f91d
avoid rank zero writing stats
ant0nsc Dec 19, 2020
8a7ddb3
test fix
ant0nsc Dec 19, 2020
47817c6
rename. fix voxel count test
ant0nsc Dec 21, 2020
7b353d0
more test fixes
ant0nsc Dec 21, 2020
6d0f98d
Merge remote-tracking branch 'origin/master' into antonsc/pl
ant0nsc Dec 21, 2020
f10beef
r2score test failure
ant0nsc Dec 21, 2020
fec4691
RNN tests working now
ant0nsc Dec 21, 2020
e39224e
fix blocked test
ant0nsc Dec 22, 2020
5a9c473
fix gradcam test
ant0nsc Dec 22, 2020
f2523c5
removing outdated tests
ant0nsc Dec 22, 2020
a27927c
fixing more tests
ant0nsc Dec 22, 2020
dc1cc67
Remove dependence on hardcoded run IDs in tests (#342)
Shruthi42 Dec 22, 2020
c618a62
fixing more tests
ant0nsc Dec 22, 2020
62a61d9
Pin PyJWT package, it causes auth issues
ant0nsc Dec 22, 2020
251da9e
update mlflow and other related packages to resolve hanging job
ant0nsc Dec 22, 2020
49a80ea
redoing checkpoint loading
ant0nsc Dec 22, 2020
2fac5e1
Specific AzureML logger
ant0nsc Dec 22, 2020
22de090
removing blobxfer
ant0nsc Dec 22, 2020
256cbbf
Do not refer to specific epochs in inference code, create random chec…
Shruthi42 Dec 23, 2020
163876e
improving tests
ant0nsc Dec 23, 2020
f0433ae
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Dec 23, 2020
6eb3c71
fix more tests
ant0nsc Dec 23, 2020
bc5a3c2
fix more tests
ant0nsc Dec 23, 2020
976d3ed
fix output_to issue for test suite
ant0nsc Dec 23, 2020
830a3d0
removing weird duplication of code
ant0nsc Dec 23, 2020
98ce4ce
Avoid file upload when using single GPU training, as we do in test suite
ant0nsc Dec 23, 2020
ff7aba3
diag
ant0nsc Dec 23, 2020
90f41f9
Further cleanup
ant0nsc Dec 23, 2020
a89fb0c
Fix LR Scheduler loading from state dict, change save_step_epochs nam…
Shruthi42 Dec 24, 2020
e7d82fc
Merge branch 'master' into antonsc/pl
javier-alvarez Jan 4, 2021
7764c47
Migrate to Pytorch Lightning: Flake8 and mypy fixes (#353)
Shruthi42 Jan 6, 2021
9496107
Migrate to Pytorch Lightning - Remove ModelAndInfo and fix tests (#352)
Shruthi42 Jan 7, 2021
cfb2949
reduce batch size and workers
ant0nsc Jan 8, 2021
e5b380e
ddp
ant0nsc Jan 8, 2021
9eff3c5
remove manual dataloader initialization
ant0nsc Jan 11, 2021
c337055
run either pytest or training
ant0nsc Jan 11, 2021
b5fe93e
Merge remote-tracking branch 'origin/master' into antonsc/pl
ant0nsc Jan 11, 2021
1234466
fix mark
ant0nsc Jan 11, 2021
5e3d755
fix import error
ant0nsc Jan 11, 2021
d16a7fa
cleanup PR build, add tags
ant0nsc Jan 11, 2021
25845fb
simplify code
ant0nsc Jan 11, 2021
be16d71
fix metrics off-by-one bug
ant0nsc Jan 13, 2021
f5ae7a4
Metrics now get correctly averaged, but not yet batch weighted
ant0nsc Jan 14, 2021
051049a
fix sync issues
ant0nsc Jan 14, 2021
61f6455
fix subject count aggregation
ant0nsc Jan 14, 2021
160ca7e
fix sync issue
ant0nsc Jan 15, 2021
e27df44
remove diag
ant0nsc Jan 15, 2021
f9fa352
Merge branch 'master' into antonsc/pl
Shruthi42 Jan 15, 2021
ba05f87
Refactoring to use custom Dice computer
ant0nsc Jan 15, 2021
8adb101
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Jan 15, 2021
a485cad
Test for TrackedMetrics
ant0nsc Jan 18, 2021
b4f128a
partitioning model
ant0nsc Jan 18, 2021
4bf304d
Refactoring of checkpoint loading
ant0nsc Jan 18, 2021
6a3af2f
adjust batch size to PL
ant0nsc Jan 18, 2021
341f933
cleanup, but still fails with OOM
ant0nsc Jan 18, 2021
88dcf85
import fix
ant0nsc Jan 19, 2021
fa7ee38
adding no_grad
ant0nsc Jan 19, 2021
03104be
Merge remote-tracking branch 'origin/master' into antonsc/pl
ant0nsc Jan 19, 2021
84663fb
fix import errors
ant0nsc Jan 19, 2021
b74ed27
test fixes
ant0nsc Jan 19, 2021
9e8b997
test fixes
ant0nsc Jan 19, 2021
d0d279d
test and flake8 fixes
ant0nsc Jan 19, 2021
dfcc49e
mypy
ant0nsc Jan 19, 2021
632026f
mypy (#364)
Shruthi42 Jan 19, 2021
605fb2d
clean up inference tests
ant0nsc Jan 19, 2021
d87ee11
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Jan 19, 2021
5f5b84a
test fix
ant0nsc Jan 19, 2021
4b7337f
test fix
ant0nsc Jan 19, 2021
73c0b1c
mypy
ant0nsc Jan 19, 2021
d5d1ab5
mypy
ant0nsc Jan 19, 2021
0712e1a
flake
ant0nsc Jan 19, 2021
34b02b3
test fixes
ant0nsc Jan 20, 2021
a3ec586
reformatting
ant0nsc Jan 20, 2021
5b3206f
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Jan 20, 2021
60ee117
Merge branch 'antonsc/pl' of https://github.com/microsoft/InnerEye-De…
ant0nsc Jan 20, 2021
57ee1ac
Avoid complex dependencies for TrackedMetrics
ant0nsc Jan 20, 2021
21bde3c
old ref
ant0nsc Jan 20, 2021
2e4ed42
mypy fixes
ant0nsc Jan 20, 2021
568416f
model_id fixes
ant0nsc Jan 20, 2021
19549e5
test fix and cleanup
ant0nsc Jan 20, 2021
6820f84
Merge remote-tracking branch 'origin/master' into antonsc/pl
ant0nsc Jan 20, 2021
359d746
import fix
ant0nsc Jan 20, 2021
d034597
flake
ant0nsc Jan 20, 2021
69c90f4
DRY
ant0nsc Jan 20, 2021
84daa5a
more test and mypy fixes
ant0nsc Jan 20, 2021
b01ed5e
test fixes
ant0nsc Jan 21, 2021
9e3e4ac
ensemble refactoring
ant0nsc Jan 21, 2021
5639865
crossval fixes
ant0nsc Jan 21, 2021
52211d3
better length check
ant0nsc Jan 21, 2021
9b4d099
build to nc12
ant0nsc Jan 21, 2021
0f82a7c
Removing dead code
ant0nsc Jan 22, 2021
534b6b9
Removing dead code
ant0nsc Jan 22, 2021
73b5a2f
Removing dead code
ant0nsc Jan 22, 2021
df683ee
Removing dead code
ant0nsc Jan 22, 2021
e7aca79
remove dead code
ant0nsc Jan 22, 2021
f2c13c9
test fixes: Don't run notebooks on sequence models
ant0nsc Jan 22, 2021
1335946
printing epoch diagnostics when loading
ant0nsc Jan 25, 2021
d298f14
Using last epoch checkpoint as best
ant0nsc Jan 25, 2021
c833b97
flake
ant0nsc Jan 25, 2021
6f05002
logging cleanup
ant0nsc Jan 25, 2021
7885767
Merge remote-tracking branch 'origin/master' into antonsc/pl
ant0nsc Jan 25, 2021
4cc668a
test fixes
ant0nsc Jan 25, 2021
6133747
docu
ant0nsc Jan 25, 2021
2968ec8
only 1 recovery checkpoint
ant0nsc Jan 26, 2021
36075ca
fix failing tests, refactor hardcoded run IDs
ant0nsc Jan 26, 2021
a5f5cf7
Fix recovery paths
ant0nsc Jan 26, 2021
745b224
sleep to avoid test failures
ant0nsc Jan 26, 2021
d724dd1
diagnostics
ant0nsc Jan 26, 2021
970a979
docu
ant0nsc Jan 26, 2021
bf97b4b
upload file path fix
ant0nsc Jan 26, 2021
81fd40f
syntax fix
ant0nsc Jan 26, 2021
96d7d19
flake
ant0nsc Jan 26, 2021
32a71b9
cleaning up IO logging
ant0nsc Jan 26, 2021
9bee3df
Test fixes
ant0nsc Jan 26, 2021
fe4679d
update to latest runs
ant0nsc Jan 26, 2021
6244b90
fix tests for checkpoint handling
ant0nsc Jan 26, 2021
f81ba49
fix rest of the tests
ant0nsc Jan 26, 2021
0e58589
iml file
ant0nsc Jan 26, 2021
3b3e9df
cleanup and changelog
ant0nsc Jan 26, 2021
65588c9
increase timeout
ant0nsc Jan 26, 2021
37eeaae
lightning 1.1.6
ant0nsc Jan 26, 2021
2ab12f8
Refactoring to include all time columns
ant0nsc Jan 27, 2021
a40f462
PL 1.1.6 -> 1.0.6 again
ant0nsc Jan 27, 2021
7dc62e7
mixed precision
ant0nsc Jan 28, 2021
b4da62c
remove TODOs and dead code
ant0nsc Jan 28, 2021
1776d5a
more PR comments
ant0nsc Jan 28, 2021
f38c1f3
more PR comments
ant0nsc Jan 28, 2021
7f1c257
more PR comments
ant0nsc Jan 28, 2021
1309792
more PR comments
ant0nsc Jan 28, 2021
b6a2c2b
move
ant0nsc Jan 28, 2021
8f40358
split lightning_models.py into pieces
ant0nsc Jan 28, 2021
85693a2
docu update
ant0nsc Jan 28, 2021
06a2c8b
PR updates
ant0nsc Jan 28, 2021
10eaa66
avoid 16bit
ant0nsc Jan 28, 2021
fix training
ant0nsc committed Nov 19, 2020

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
commit 90b8db9f4b3986d30713395ff106eab30977e38b
4 changes: 1 addition & 3 deletions InnerEye/ML/model_training_steps.py
@@ -475,9 +475,7 @@ def __init__(self, config: SegmentationModelBase, *args, **kwargs) -> None:
        # Metrics for all epochs
        self.train_metrics_per_epoch: List[MetricsDict] = []
        self.validation_metrics_per_epoch: List[MetricsDict] = []
        # Initialize these fields to dummy values, that are only populated correctly in prepare_for_training
        self.config = DeepLearningConfig(should_validate=False)
        self.loss_fn = torch.nn.Module()
        # This will be initialized correctly in epoch_start
        self.metrics = MetricsDict()

    def configure_optimizers(self):
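
The hunk above touches a pattern where fields are set to dummy values in `__init__` and only populated correctly in a later `prepare_for_training` call. A minimal, dependency-free sketch of that pattern could look like the following. Note that `PlaceholderConfig`, `TrainingSteps`, `is_prepared`, and `training_step` are hypothetical names chosen for illustration; they are not the InnerEye or PyTorch Lightning classes, and the squared-error loss is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlaceholderConfig:
    """Hypothetical stand-in for a training config; not the InnerEye class."""
    should_validate: bool = False
    learning_rate: float = 0.0


class TrainingSteps:
    """Hypothetical container that mirrors the dummy-value pattern."""

    def __init__(self) -> None:
        # Metrics for all epochs
        self.train_metrics_per_epoch: List[Dict[str, float]] = []
        # Dummy values so the attributes exist with the right types;
        # they are only populated correctly in prepare_for_training.
        self.config = PlaceholderConfig(should_validate=False)
        self.loss_fn: Callable[[float, float], float] = lambda pred, target: 0.0
        self.is_prepared = False

    def prepare_for_training(self, config: PlaceholderConfig) -> None:
        # Replace the dummy values with the real objects.
        self.config = config
        # Assumption for illustration: a plain squared-error loss.
        self.loss_fn = lambda pred, target: (pred - target) ** 2
        self.is_prepared = True

    def training_step(self, pred: float, target: float) -> float:
        # Guard against silently training with the dummy loss.
        if not self.is_prepared:
            raise RuntimeError("call prepare_for_training before training_step")
        loss = self.loss_fn(pred, target)
        self.train_metrics_per_epoch.append({"loss": loss})
        return loss
```

The explicit `is_prepared` guard is one way to make accidental use of the dummy values fail loudly; the original code may handle this differently.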