
ValueError: Default process group has not been initialized, please make sure to call init_process_group. #602

Open
lidisi8520 opened this issue Dec 25, 2024 · 4 comments

Comments

@lidisi8520

This is my first time using this tool. Using the beginner training mode, I selected the image folder and the base model, left every other parameter at its default, and started training. It fails with: ValueError: Default process group has not been initialized, please make sure to call init_process_group.
How can I fix this?

@lidisi8520
Author

  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
    self.state = AcceleratorState(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
    PartialState(cpu, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 270, in __init__
    self.num_processes = torch.distributed.get_world_size()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1832, in get_world_size
    return _get_group_size(group)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 864, in _get_group_size
    default_pg = _get_default_group()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1025, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Here is the full traceback.
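For context, this error means `Accelerator` called `torch.distributed.get_world_size()` before any default process group existed. A minimal sketch (my own illustration, not the lora-scripts fix, and it assumes PyTorch is installed): the default `env://` rendezvous used by `init_process_group` needs `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, so a bare single-process init has to supply them.

```python
# Illustrative sketch: initialize the default process group for a single
# CPU process so that torch.distributed.get_world_size() stops raising.
# The env:// rendezvous (the default) reads these four variables; without
# RANK it fails exactly as in the second traceback below.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

if not dist.is_initialized():
    dist.init_process_group(backend="gloo")  # gloo works on CPU and Windows

world = dist.get_world_size()
print(world)  # 1
dist.destroy_process_group()
```

In the normal workflow these variables are set by the launcher (e.g. `accelerate launch` or `torchrun`), not by hand.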

@lidisi8520
Author

After adding the code you suggested and training again, it still fails. Here is the error output:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\anyio\streams\memory.py", line 94, in receive
    return self.receive_nowait()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\anyio\streams\memory.py", line 89, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 78, in call_next
    message = await recv_stream.receive()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\anyio\streams\memory.py", line 114, in receive
    raise EndOfStream
anyio.EndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\mikazuki\app\application.py", line 74, in add_cache_control_header
    response = await call_next(request)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 84, in call_next
    raise app_exc
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 21, in __call__
    raise e
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\mikazuki\app\api.py", line 122, in create_toml_file
    dist.init_process_group(backend='gloo')
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 231, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
2024-12-25 10:19:56 INFO     Loading settings from D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20241225-101943.toml...    train_util.py:3745
2024-12-25 10:20:14 INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20241225-101943    train_util.py:3764
2024-12-25 10:20:14 INFO     prepare tokenizer                                                        train_util.py:4228
2024-12-25 10:20:15 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
                    INFO     prepare images.                                                          train_util.py:1573
                    INFO     found directory D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\10_people\1_zkz contains 169 image files    train_util.py:1520
                    INFO     169 train images with repeating.                                         train_util.py:1614
                    INFO     0 reg images.                                                            train_util.py:1617
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 512)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: False

                               [Subset 0 of Dataset 0]
                                 image_dir: "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\10_people\1_zkz"
                                 image_count: 169
                                 num_repeats: 1
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator:
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: zkz
                                 caption_extension: .txt


                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     loading image sizes.                                                      train_util.py:854
100%|█████████████████████████████████████████████████████████████████████████████| 169/169 [00:00<00:00, 11267.30it/s]
                    INFO     make buckets                                                              train_util.py:860
                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     bucket 0: resolution (320, 704), count: 8                                 train_util.py:911
                    INFO     bucket 1: resolution (320, 768), count: 1                                 train_util.py:911
                    INFO     bucket 2: resolution (384, 640), count: 71                                train_util.py:911
                    INFO     bucket 3: resolution (448, 576), count: 68                                train_util.py:911
                    INFO     bucket 4: resolution (512, 512), count: 17                                train_util.py:911
                    INFO     bucket 5: resolution (576, 448), count: 2                                 train_util.py:911
                    INFO     bucket 6: resolution (640, 384), count: 2                                 train_util.py:911
                    INFO     mean ar error (without repeats): 0.043133719759432164                     train_util.py:916
                    INFO     preparing accelerator                                                  train_network.py:225
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
    self.state = AcceleratorState(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
    PartialState(cpu, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 270, in __init__
    self.num_processes = torch.distributed.get_world_size()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1832, in get_world_size
    return _get_group_size(group)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 864, in _get_group_size
    default_pg = _get_default_group()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1025, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

How should I fix this?
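The second traceback shows `dist.init_process_group(backend='gloo')` being called inside the web server process (`mikazuki/app/api.py`), where no launcher has set the rendezvous environment. A small dependency-free sketch (the helper name is my own, not lora-scripts code) of the variables the `env://` rendezvous reads, which can be used to diagnose which one is missing:

```python
# Illustrative helper: report which env:// rendezvous variables are unset.
# These are the four variables torch.distributed's _env_rendezvous_handler
# looks up; "environment variable RANK expected, but not set" means RANK
# was the first missing one.
import os

REQUIRED = ["MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"]

def missing_rendezvous_vars(env=os.environ):
    """Return the env:// rendezvous variables absent from `env`."""
    return [name for name in REQUIRED if name not in env]

# A bare environment is missing all four:
print(missing_rendezvous_vars({}))
# With only RANK set, three are still missing:
print(missing_rendezvous_vars({"RANK": "0"}))
```

A launcher such as `torchrun` or `accelerate launch` sets all four for every worker it spawns, which is why calling `init_process_group` manually in an unlaunched process fails.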

@lidisi8520
Author

I think I found the cause: I had not selected a GPU to train on. After selecting one, training runs normally. However, I have four GPUs, and if I select all of them it fails again. I cannot select all four GPUs at once and I don't know why; I can only train on a single GPU.
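Regarding the four-GPU case: a multi-GPU run needs one worker process per GPU, each with the same `MASTER_ADDR`/`MASTER_PORT`/`WORLD_SIZE` but its own `RANK` (and typically `LOCAL_RANK` to pick the device). A sketch of what a launcher would have to provide; `make_worker_envs` is an illustrative helper, not part of lora-scripts:

```python
# Illustrative sketch: the per-worker environment a launcher (torchrun /
# accelerate launch) sets up for a 4-GPU single-machine run. If any worker
# is missing these, initialization fails the way the tracebacks above show.

def make_worker_envs(num_gpus, addr="127.0.0.1", port=29500):
    """Build one env dict per worker for a single-node multi-GPU run."""
    return [
        {
            "MASTER_ADDR": addr,          # shared by all workers
            "MASTER_PORT": str(port),     # shared by all workers
            "WORLD_SIZE": str(num_gpus),  # total number of workers
            "RANK": str(rank),            # unique per worker
            "LOCAL_RANK": str(rank),      # which local GPU to use
        }
        for rank in range(num_gpus)
    ]

for env in make_worker_envs(4):
    print(env["RANK"], "/", env["WORLD_SIZE"])
```

If the GUI's 4-GPU option errors out, it suggests the selected workers are not being launched with distinct ranks; running through `accelerate launch --num_processes 4 ...` from a terminal would be one way to test whether the trainer itself handles multi-GPU.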
