Describe the bug
Training a bfloat16-compressed model fails during PowerSGD gradient averaging: every all-reduce attempt aborts with a dtype mismatch, so the optimizer step never completes.
Jan 04 22:30:14.302 [INFO] test-run-1112b accumulated 10 samples for epoch #0 from 2 peers. ETA 0.00 sec (refresh in 0.50 sec)
Jan 04 22:30:14.476 [INFO] Beginning optimizer step #0
Jan 04 22:31:26.924 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_P' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'\xc4\xaf\xafv&\xcd\xd6q\xa3\xea\x9d-\x13\x0f\xa4hNQ\xf6>PHASE_Q' did not finish.
Jan 04 22:31:26.925 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:35:47.094 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/hivemind/optim/power_sgd_averager.py", line 159, in _aggregate_with_group
torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_P' did not finish.
Jan 04 22:35:47.094 [WARN] [hivemind.optim.power_sgd_averager._register_allreduce_group:129] All-reduce group b'Q\x84\x8d\xa9\xf3\x90\xd4\xdf\xcc]\x153\x0c+\x9e\x90|\xed|\x8ePHASE_Q' did not finish.
Jan 04 22:35:47.095 [WARN] [hivemind.averaging.averager._step:482] PowerSGDGradientAverager caught MatchmakingException('Unable to run All-Reduce: Expected out tensor to have dtype c10::BFloat16, but got float instead'), retrying
Jan 04 22:40:07.221 [ERROR] [hivemind.optim.power_sgd_averager._aggregate_with_group:187] Expected out tensor to have dtype c10::BFloat16, but got float instead
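For reference, the RuntimeError above boils down to torch.matmul refusing to write into an out= buffer whose dtype differs from the dtype of the result computed from its inputs. A minimal standalone snippet (with made-up shapes, not hivemind's actual buffers) that should raise the same error:

```python
import torch

# If the gradients being averaged are bfloat16 but the output buffer was
# allocated as float32 (or vice versa), the out= dtype check fails before
# any math happens. Shapes here are purely illustrative.
m = torch.randn(64, 16, dtype=torch.bfloat16)   # flattened gradient matrix
q = torch.randn(16, 4, dtype=torch.bfloat16)    # low-rank projection factor
p = torch.empty(64, 4, dtype=torch.float32)     # output buffer in the "wrong" dtype

torch.matmul(m.reshape(-1, q.size(0)), q, out=p)
# RuntimeError: Expected out tensor to have dtype c10::BFloat16, but got float instead
```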
To Reproduce
git clone https://github.com/the-beee/naifu-diffusion
cd naifu-diffusion
pip install -r requirements.txt
python trainer.py
Please update config/distributed.yaml to include the peer's address in the hivemind section before starting the second peer.
Environment
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.8.6
[pip3] torch==1.13.1
[pip3] torch-ema==0.3
[pip3] torchmetrics==0.11.0
[pip3] torchvision==0.14.1
[pip3] hivemind==1.1.4
[conda] Could not collect
Hi! Thanks for the detailed report! It is indeed a bug, and we'll fix it in the next release.
In the meantime, I'm afraid the only workaround is to keep float32 parameters in hivemind.Optimizer while the on-device model is in bfloat16.
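A rough sketch of that workaround, assuming the usual master-weights pattern (make_master_params and the bare SGD below are illustrative stand-ins, not hivemind API): the optimizer, and therefore whatever hivemind.Optimizer averages, holds float32 copies of the parameters, while the forward/backward pass stays in bfloat16.

```python
import torch

def make_master_params(model):
    # Hypothetical helper: float32 "master" copies of a bfloat16 model's parameters.
    return [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]

model = torch.nn.Linear(16, 16).to(torch.bfloat16)   # on-device model stays bfloat16
master_params = make_master_params(model)

# In the real setup this would be a torch optimizer wrapped by hivemind.Optimizer;
# a plain SGD over the float32 copies is enough to show the dtype split.
opt = torch.optim.SGD(master_params, lr=1e-3)

def training_step(batch):
    loss = model(batch).float().mean()               # toy loss, illustration only
    loss.backward()
    with torch.no_grad():
        # hand the bfloat16 grads to the float32 master params, step, then sync back
        for p, mp in zip(model.parameters(), master_params):
            mp.grad = p.grad.float()
        opt.step()
        opt.zero_grad(set_to_none=True)
        for p, mp in zip(model.parameters(), master_params):
            p.copy_(mp.to(torch.bfloat16))
            p.grad = None
    return loss

# e.g. training_step(torch.randn(8, 16, dtype=torch.bfloat16))
```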
blurry-mood changed the title "[BUG] Enable to train a bloat16-compressed model" → "[BUG] Unable to train a bloat16-compressed model" on Jan 10, 2023