RuntimeError: CUDA error: invalid device ordinal #3
Trying this with `model = galai.load_model("base")`, it looks like there is a device map that expects 8 GPUs, if I'm seeing this right:

```
{'decoder.embed_tokens': 0,
 'decoder.embed_positions': 0,
 'decoder.layer_norm': 0,
 'decoder.layers.0': 0,
 'decoder.layers.1': 0,
 'decoder.layers.2': 0,
 'decoder.layers.3': 1,
 'decoder.layers.4': 1,
 'decoder.layers.5': 1,
 'decoder.layers.6': 2,
 'decoder.layers.7': 2,
 'decoder.layers.8': 2,
 'decoder.layers.9': 3,
 'decoder.layers.10': 3,
 'decoder.layers.11': 3,
 'decoder.layers.12': 4,
 'decoder.layers.13': 4,
 'decoder.layers.14': 4,
 'decoder.layers.15': 5,
 'decoder.layers.16': 5,
 'decoder.layers.17': 5,
 'decoder.layers.18': 6,
 'decoder.layers.19': 6,
 'decoder.layers.20': 6,
 'decoder.layers.21': 7,
 'decoder.layers.22': 7,
 'decoder.layers.23': 7}
```
If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try: `model = galai.load_model("base", num_gpus=1)`
Thanks @dcruiz01, that worked like a charm.
Confirmed. Had the same error and `num_gpus=1` resolved it.
Please mention that in your documentation / readme. |
A model size between base and standard would be nice. Standard just barely doesn't fit on my RTX 3090, I think.
Do you offer 8-bit versions/compatibility, like BLOOM?
I see, `dtype='float16'` does the job, sorry. Please mention this in the readme. Many folks will want to try it on a local GPU as well.
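For reference, a minimal sketch of the half-precision load on a single GPU, assuming `load_model` accepts the `num_gpus` and `dtype` arguments mentioned in this thread (the prompt and `generate` call are illustrative assumptions, not confirmed here):

```python
import galai

# Load the base model on one GPU in float16 to roughly halve memory use.
model = galai.load_model("base", num_gpus=1, dtype="float16")

# Illustrative generation call; the prompt is arbitrary.
print(model.generate("The Transformer architecture"))
```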
Hmm... 8-bit would still be handy to play with larger models. Is that possible?
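As a hedged sketch (not via galai itself): the Galactica checkpoints are also published on the Hugging Face Hub, where 8-bit loading typically works the same way as for BLOOM, through `transformers` with `bitsandbytes` installed. The `facebook/galactica-1.3b` checkpoint name and the 8-bit flag below are assumptions, not something confirmed in this issue:

```python
# Sketch only: requires `pip install transformers accelerate bitsandbytes`.
# The checkpoint name and 8-bit flag are assumptions, not confirmed here.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/galactica-1.3b",
    device_map="auto",   # let accelerate place layers on the available GPU(s)
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
)

inputs = tokenizer("The Transformer architecture", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```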
`num_gpus` defaults to `None`.
Who has a default number of 8 GPUs? |
people that work at Meta AI, probably XD |
Why isn't this written on the main page?
galai 1.1.0 uses all available GPUs by default, which should fix the issue. One can still manually specify the number of GPUs using the `num_gpus` argument of `load_model`.
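For example, a sketch of the explicit override (the model name and GPU count below are placeholders):

```python
import galai

# galai >= 1.1.0 spreads the model across all visible GPUs by default;
# pass num_gpus to restrict the device map, e.g. to two GPUs here.
model = galai.load_model("standard", num_gpus=2)
```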
When I load the model I have this error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test/env/lib/python3.9/site-packages/galai/__init__.py", line 39, in load_model
    model._load_checkpoint(checkpoint_path=get_checkpoint_path(name))
  File "test/env/lib/python3.9/site-packages/galai/model.py", line 63, in _load_checkpoint
    load_checkpoint_and_dispatch(
  File "test/env/lib/python3.9/site-packages/accelerate/big_modeling.py", line 366, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 701, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param)
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 124, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
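A minimal sketch of the debugging hint from the traceback: the variable has to be set before CUDA is initialized, i.e. before the first CUDA call.

```python
import os

# Force synchronous CUDA calls so the failing call surfaces at its real
# location in the stack trace; must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import galai

model = galai.load_model("base", num_gpus=1)
```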