Constrained system/cpu RAM prohibits loading even with enough GPU Memory #413

Closed
2 of 4 tasks
ptschandl opened this issue Jun 5, 2023 · 3 comments

@ptschandl
System Info

The issue occurred with OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 on a machine with 32 GB system RAM and a single RTX 6000 Ada (48 GB VRAM): text-generation-inference aborts while loading the shards, yet loading the same model with plain Hugging Face `from_pretrained` works, even without 8-bit quantization:

Does not work

$ model=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
$ num_shard=1
$ volume=$PWD/data
$ docker run --gpus all --shm-size 2g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env

Works

Python 3.8.11 (default, Aug  3 2021, 15:09:35)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoModelForCausalLM
>>> model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
Loading checkpoint shards: 100%|████████████████| 3/3 [00:59<00:00, 19.76s/it]
>>> model
GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
   ...
$ nvidia-smi
Mon Jun  5 18:07:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   45C    P8    26W / 300W |  47325MiB / 49140MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2680      G   /usr/lib/xorg/Xorg                 69MiB |
|    0   N/A  N/A      5034      C   python                          47252MiB |
+-----------------------------------------------------------------------------+
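A rough back-of-the-envelope estimate shows why the weights alone already exceed the 32 GB of host RAM if they are fully materialized on the CPU first (assuming ~12e9 parameters for Pythia-12B; this ignores activations, KV cache, and framework overhead):

```python
GIB = 1024 ** 3

def weight_bytes(n_params: float, bytes_per_param: int) -> float:
    """Lower bound on the memory needed just to hold the weights."""
    return n_params * bytes_per_param

# Pythia-12B: roughly 12e9 parameters.
fp16_gib = weight_bytes(12e9, 2) / GIB  # ~22.4 GiB
fp32_gib = weight_bytes(12e9, 4) / GIB  # ~44.7 GiB
print(f"fp16: {fp16_gib:.1f} GiB, fp32: {fp32_gib:.1f} GiB")
```

With `device_map="auto"`, `from_pretrained` can stream shards to the GPU, which is presumably why the raw-transformers path fits in 32 GB while the launcher path does not.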

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

For reproducibility, the following commands with an artificially constrained container memory limit (--memory) demonstrate the situation more simply with bigscience/bloom-560m:

Works

$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data

$ docker run --memory=16g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:09:59.218696Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun  5 16:09:59 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
   | 30%   44C    P8    26W / 300W |     70MiB / 49140MiB |      3%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-06-05T16:09:59.218719Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:09:59.218845Z  INFO text_generation_launcher: Starting download process.
2023-06-05T16:10:02.259356Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-06-05T16:10:02.824149Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:10:02.824265Z  INFO text_generation_launcher: Starting shard 0
2023-06-05T16:10:12.835066Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:22.844677Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:32.853804Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:42.864204Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:52.874863Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:55.290781Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-06-05T16:10:55.377836Z  INFO text_generation_launcher: Shard 0 ready in 52.553095127s
2023-06-05T16:10:55.471066Z  INFO text_generation_launcher: Starting Webserver
2023-06-05T16:10:56.788354Z  INFO text_generation_router: router/src/main.rs:178: Connected
$ docker stats
CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT   MEM %     NET I/O         BLOCK I/O   PIDS
de4cafd33d62   distracted_banach   0.00%     2.652GiB / 16GiB    16.58%    15MB / 81.2kB   0B / 0B     34

Does not work

$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data

$ docker run --memory=1g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:13:41.108681Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun  5 16:13:40 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
   | 30%   46C    P8    26W / 300W |     70MiB / 49140MiB |      3%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-06-05T16:13:41.108699Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:13:41.108774Z  INFO text_generation_launcher: Starting download process.
2023-06-05T16:13:44.665356Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-06-05T16:13:44.913616Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:13:44.913968Z  INFO text_generation_launcher: Starting shard 0
2023-06-05T16:13:54.925231Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:04.936122Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:14.945694Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:24.957642Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:31.054288Z ERROR text_generation_launcher: Shard 0 failed to start:

2023-06-05T16:14:31.054824Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
$ docker stats
CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O   PIDS
7f0768e9f43a   modest_vaughan   0.00%     1023MiB / 1GiB      99.91%    20.8kB / 3.74kB   0B / 0B     6
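When scripting experiments like these, it helps to convert docker's --memory values to bytes so they can be compared with docker stats output; a small helper (hypothetical, stdlib only, assuming docker's 1024-based b/k/m/g suffixes):

```python
def parse_docker_mem(limit: str) -> int:
    """Convert a docker --memory value like '16g' or '512m' to bytes.
    Docker interprets the b/k/m/g suffixes as 1024-based units."""
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    limit = limit.strip().lower()
    if limit and limit[-1] in units:
        return int(float(limit[:-1]) * units[limit[-1]])
    return int(limit)  # bare value: plain bytes

print(parse_docker_mem("16g"))  # 17179869184
print(parse_docker_mem("1g"))   # 1073741824
```

The failing run's 1g limit (~1.07 GB) is well below the ~2.65 GiB the successful 16g run actually used, which matches the 99.91% MEM % right before the shard died.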

Expected behavior

The model should load directly onto the GPU, which is large enough, even when system/CPU RAM is constrained.

@OlivierDehaene (Member)

Hello!
We are aware of this issue and are trying to move away from `from_pretrained` to avoid this category of problem in its entirety.

This will be fixed by #344.
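The effect of loading weights tensor-by-tensor straight onto the GPU (instead of materializing the whole checkpoint on the CPU first, as `from_pretrained` historically did) can be sketched with a stdlib-only simulation; the chunk sizes below are made up for illustration:

```python
def peak_host_gib(chunk_sizes_gib, streaming: bool) -> float:
    """Peak host-RAM footprint while loading a checkpoint.

    Eager loading materializes every chunk on the CPU before the
    copy to the GPU, so the peak is the sum of all chunks.
    Streaming loads one chunk at a time and frees it after the
    GPU copy, so the peak is only the largest single chunk.
    """
    return max(chunk_sizes_gib) if streaming else sum(chunk_sizes_gib)

sizes = [9.0, 9.0, 6.0]  # hypothetical three-shard checkpoint, GiB
print("eager peak: ", peak_host_gib(sizes, streaming=False), "GiB")  # 24.0
print("stream peak:", peak_host_gib(sizes, streaming=True), "GiB")   # 9.0
```

With safetensors the chunks can even be memory-mapped, so the streaming peak drops further, to roughly one tensor at a time.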

@ptschandl (Author)

Fantastic - thanks!


This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

github-actions bot added the Stale label on Jul 31, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 6, 2024