training with MPS - ARM64 Mac #93
Comments
Having found where a
Traceback (most recent call last):
  File "/Users/tks32/research/mace-tmp/Al2O3/../scripts/run_train.py", line 563, in <module>
    main()
  File "/Users/tks32/research/mace-tmp/Al2O3/../scripts/run_train.py", line 429, in main
    model=AveragedModel(model),
  File "/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/torch/optim/swa_utils.py", line 104, in __init__
    self.module = deepcopy(model)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/copy.py", line 272, in _reconstruct
    y.__setstate__(state)
  File "/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/e3nn/util/codegen/_mixin.py", line 109, in __setstate__
    smod = torch.jit.load(buffer)
  File "/Users/tks32/research/mace-tmp/venv/lib/python3.9/site-packages/torch/jit/_serialization.py", line 164, in load
    cpp_module = torch._C.import_ir_module_from_buffer(
RuntimeError: supported devices include CPU, CUDA and HPU, however got MPS
The above tries to make a copy of the model, which then breaks some lower-level loading function. Oddly, if you construct a dummy
This runs with no issues:
import numpy as np
import torch
from e3nn import o3
from torch.optim.swa_utils import AveragedModel
from mace.modules import ScaleShiftMACE, RealAgnosticInteractionBlock
mps_device = torch.device("mps")
def main_93():
    # check that the MPS device is available
    if not torch.backends.mps.is_available():
        raise RuntimeError("MPS device is not available")
    mps_device = torch.device("mps")
    # construct model
    model = ScaleShiftMACE(
        r_max=3.0,
        num_bessel=10,
        num_polynomial_cutoff=10,
        max_ell=4,
        interaction_cls=RealAgnosticInteractionBlock,
        interaction_cls_first=RealAgnosticInteractionBlock,
        num_interactions=2,
        num_elements=2,
        hidden_irreps=o3.Irreps("16x0e"),
        MLP_irreps=o3.Irreps("16x0e"),
        atomic_energies=np.zeros(2),
        avg_num_neighbors=10.0,
        atomic_numbers=[1, 2],
        correlation=2,
        atomic_inter_scale=1.0,
        atomic_inter_shift=1.0,
        gate=None,
    )
    # move to device
    model.to(mps_device)
    # try AveragedModel
    average_model = AveragedModel(model)


if __name__ == "__main__":
    main_93()
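Since the failure in the traceback comes from `deepcopy` running while the model lives on MPS, one possible workaround is to copy on the CPU and only then move the copy to the target device. The sketch below is an assumption, not the fix that was actually merged, and has not been checked on MPS hardware; `deepcopy_via_cpu` is a hypothetical helper name.

```python
import copy

import torch


def deepcopy_via_cpu(module: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    """Deep-copy a module while it sits on the CPU, then place both on `device`.

    Hypothetical helper: it avoids running copy.deepcopy on device-resident
    state, which is where the traceback above fails.
    """
    module.to("cpu")               # nn.Module.to moves parameters and buffers in place
    clone = copy.deepcopy(module)  # copy happens with everything on the CPU
    module.to(device)              # put the original on the requested device
    clone.to(device)               # and the copy as well
    return clone


# Usage with a toy module; pass torch.device("mps") on an Apple GPU instead.
toy = torch.nn.Linear(4, 2)
toy_copy = deepcopy_via_cpu(toy, torch.device("cpu"))
```

The copy could then be handed to whatever needs an independent model instance; whether `AveragedModel` can be made to use such a copy without patching is not addressed here.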
Digging into the traceback, here is the object it is complaining about. This happens here in the traceback:
File "[...]/e3nn/util/codegen/_mixin.py", line 109, in __setstate__
smod = torch.jit.load(buffer)
{'_backward_hooks': OrderedDict(),
'_backward_pre_hooks': OrderedDict(),
'_buffers': OrderedDict([('weight', tensor([], device='mps:0')),
('output_mask',
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='mps:0'))]),
'_forward_hooks': OrderedDict(),
'_forward_hooks_with_kwargs': OrderedDict(),
'_forward_pre_hooks': OrderedDict(),
'_forward_pre_hooks_with_kwargs': OrderedDict(),
'_in1_dim': 64,
'_in2_dim': 16,
'_is_full_backward_hook': None,
'_load_state_dict_post_hooks': OrderedDict(),
'_load_state_dict_pre_hooks': OrderedDict(),
'_modules': OrderedDict(),
'_non_persistent_buffers_set': set(),
'_optimize_einsums': True,
'_parameters': OrderedDict(),
'_profiling_str': 'TensorProduct(16x0e+16x1o x 1x0e+1x1o+1x2e+1x3o -> '
'32x0e+48x1o+48x2e+32x3o | 160 paths | 160 weights)',
'_specialized_code': True,
'_state_dict_hooks': OrderedDict(),
'_state_dict_pre_hooks': OrderedDict(),
'instructions': [Instruction(i_in1=0, i_in2=0, i_out=0, connection_mode='uvu', has_weight=True, path_weight=1.0, path_shape=(16, 1)),
Instruction(i_in1=1, i_in2=1, i_out=1, connection_mode='uvu', has_weight=True, path_weight=1.0, path_shape=(16, 1)),
Instruction(i_in1=0, i_in2=1, i_out=2, connection_mode='uvu', has_weight=True, path_weight=1.7320508075688772, path_shape=(16, 1)),
Instruction(i_in1=1, i_in2=0, i_out=3, connection_mode='uvu', has_weight=True, path_weight=1.7320508075688772, path_shape=(16, 1)),
Instruction(i_in1=1, i_in2=2, i_out=4, connection_mode='uvu', has_weight=True, path_weight=1.7320508075688772, path_shape=(16, 1)),
Instruction(i_in1=0, i_in2=2, i_out=5, connection_mode='uvu', has_weight=True, path_weight=2.23606797749979, path_shape=(16, 1)),
Instruction(i_in1=1, i_in2=1, i_out=6, connection_mode='uvu', has_weight=True, path_weight=2.23606797749979, path_shape=(16, 1)),
Instruction(i_in1=1, i_in2=3, i_out=7, connection_mode='uvu', has_weight=True, path_weight=2.23606797749979, path_shape=(16, 1)),
Instruction(i_in1=0, i_in2=3, i_out=8, connection_mode='uvu', has_weight=True, path_weight=2.6457513110645907, path_shape=(16, 1)),
Instruction(i_in1=1, i_in2=2, i_out=9, connection_mode='uvu', has_weight=True, path_weight=2.6457513110645907, path_shape=(16, 1))],
'internal_weights': False,
'irreps_in1': 16x0e+16x1o,
'irreps_in2': 1x0e+1x1o+1x2e+1x3o,
'irreps_out': 16x0e+16x0e+16x1o+16x1o+16x1o+16x2e+16x2e+16x2e+16x3o+16x3o,
'shared_weights': False,
'training': True,
'weight_numel': 160}
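The `torch.jit.load(buffer)` call named in the traceback restores a TorchScript module from an in-memory buffer. As a rough illustration of that round trip, the sketch below saves a toy traced module to a `BytesIO` buffer and loads it back with `map_location="cpu"`, so the load targets a device the loader accepts. `Scale` is a hypothetical stand-in, and whether e3nn's `__setstate__` should pass `map_location` is an assumption, not something stated in this thread.

```python
import io

import torch


class Scale(torch.nn.Module):
    """Toy module with a buffer, standing in for a compiled submodule."""

    def __init__(self):
        super().__init__()
        self.register_buffer("factor", torch.tensor(2.0))

    def forward(self, x):
        return x * self.factor


# Trace the module to TorchScript and serialise it into memory.
traced = torch.jit.trace(Scale(), torch.ones(3))
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)

# Load on the CPU explicitly; the result can be moved to another device afterwards.
restored = torch.jit.load(buffer, map_location="cpu")
out = restored(torch.ones(3))
```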
Closed as merged in PR #95.
Describe the bug
Training cannot be initialised on an M2 Mac using MPS acceleration. Apple GPUs don't support 64-bit floats, so one needs to set default_dtype=float32, which is likely the issue.
To Reproduce
Steps to reproduce the behavior:
--device=mps --default_dtype=float32
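For readers unfamiliar with the dtype constraint, here is a minimal sketch of choosing the default dtype from the device in plain PyTorch. It is not taken from `run_train.py`; `pick_device` and `pick_dtype` are hypothetical helpers, and falling back to float64 on other devices is an assumption made for illustration.

```python
import torch


def pick_device() -> torch.device:
    """Use MPS when PyTorch reports it as available, otherwise the CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def pick_dtype(device_type: str) -> torch.dtype:
    """float32 on MPS (no 64-bit float support there), float64 elsewhere."""
    return torch.float32 if device_type == "mps" else torch.float64


device = pick_device()
torch.set_default_dtype(pick_dtype(device.type))

# New floating-point tensors now use the chosen default dtype.
x = torch.zeros(3, device=device)
```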
Expected behavior
The training should "just work" like elsewhere or on CPU.
Desktop (please complete the following information):
Additional context
training args used:
output: