The issue occurred with OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 on a machine with 32 GB RAM and a single RTX 6000 Ada (48 GB): shard loading aborts there, while loading the same model with plain Hugging Face Transformers (no 8-bit quantization) works:
Python 3.8.11 (default, Aug 3 2021, 15:09:35)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoModelForCausalLM
>>> model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:59<00:00, 19.76s/it]
>>> model
GPTNeoXForCausalLM(
(gpt_neox): GPTNeoXModel(
...
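As a possible workaround on RAM-constrained hosts, Transformers can reduce peak host memory during `from_pretrained` via `low_cpu_mem_usage=True` (stream weights into the model instead of materializing a full state dict) and a half-precision `torch_dtype`. A minimal sketch; the helper function name is an illustration, not part of any API:

```python
# Hypothetical helper: collect from_pretrained kwargs that keep host-RAM
# usage low while placing weights on the GPU.
def low_ram_load_kwargs(dtype="float16"):
    return {
        "device_map": "auto",          # let accelerate place weights on the GPU
        "low_cpu_mem_usage": True,     # avoid buffering the full state dict in CPU RAM
        "torch_dtype": dtype,          # halve weight size vs. fp32 during load
    }

# Usage (needs transformers, accelerate, and a GPU; not run here):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", **low_ram_load_kwargs()
# )
```

Whether the TGI launcher exposes an equivalent loading path is exactly what this issue is about.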
$ nvidia-smi
Mon Jun 5 18:07:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   45C    P8    26W / 300W |  47325MiB / 49140MiB |      15%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2680      G   /usr/lib/xorg/Xorg                 69MiB |
|    0   N/A  N/A      5034      C   python                          47252MiB |
+-----------------------------------------------------------------------------+
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
For reproducibility, the following commands with an artificially constrained container --memory limit demonstrate the problem in a simpler setting with bigscience/bloom-560m:
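A back-of-envelope estimate suggests why a 1 GiB limit is too tight while 16 GiB is comfortable: if the loader buffers the full weight tensor set in host RAM before moving it to the GPU, bloom-560m's weights alone roughly fill 1 GiB (assumptions: ~560M parameters, fp16 weights; the real peak is higher due to Python, CUDA, and framework overhead):

```python
# Rough host-RAM estimate for buffering bloom-560m weights during load.
# Assumption: ~560M parameters stored as fp16 (2 bytes each).
n_params = 560e6
bytes_fp16 = n_params * 2          # 2 bytes per fp16 parameter
gib = bytes_fp16 / 2**30           # convert bytes to GiB
print(f"~{gib:.2f} GiB of weights")  # ≈ 1.04 GiB, already above a 1 GiB limit
```

This is consistent with the `docker stats` readings below: ~2.65 GiB used in the working 16 GiB container, and the 1 GiB container pinned at 99.91% before the shard dies.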
Works
$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data
$ docker run --memory=16g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:09:59.218696Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun 5 16:09:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   44C    P8    26W / 300W |     70MiB / 49140MiB |       3%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-06-05T16:09:59.218719Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:09:59.218845Z INFO text_generation_launcher: Starting download process.
2023-06-05T16:10:02.259356Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-05T16:10:02.824149Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:10:02.824265Z INFO text_generation_launcher: Starting shard 0
2023-06-05T16:10:12.835066Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:22.844677Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:32.853804Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:42.864204Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:52.874863Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:55.290781Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
rank=0
2023-06-05T16:10:55.377836Z INFO text_generation_launcher: Shard 0 ready in 52.553095127s
2023-06-05T16:10:55.471066Z INFO text_generation_launcher: Starting Webserver
2023-06-05T16:10:56.788354Z INFO text_generation_router: router/src/main.rs:178: Connected
$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
de4cafd33d62 distracted_banach 0.00% 2.652GiB / 16GiB 16.58% 15MB / 81.2kB 0B / 0B 34
Does not work
$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data
$ docker run --memory=1g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:13:41.108681Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun 5 16:13:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   46C    P8    26W / 300W |     70MiB / 49140MiB |       3%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-06-05T16:13:41.108699Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:13:41.108774Z INFO text_generation_launcher: Starting download process.
2023-06-05T16:13:44.665356Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-05T16:13:44.913616Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:13:44.913968Z INFO text_generation_launcher: Starting shard 0
2023-06-05T16:13:54.925231Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:04.936122Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:14.945694Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:24.957642Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:31.054288Z ERROR text_generation_launcher: Shard 0 failed to start:
2023-06-05T16:14:31.054824Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7f0768e9f43a modest_vaughan 0.00% 1023MiB / 1GiB 99.91% 20.8kB / 3.74kB 0B / 0B 6
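The container sitting at 99.91% of its limit suggests the shard process was killed by the cgroup OOM killer rather than failing on its own. One way to confirm is `docker inspect -f '{{.State.OOMKilled}}' <container>`. The sketch below parses the same JSON that `docker inspect <container>` prints (the `State.OOMKilled` field is part of Docker's inspect output; the sample payload here is illustrative, not captured from this run):

```python
import json

# Check whether a container was OOM-killed, given the JSON emitted by
# `docker inspect <container>` (a list with one object per container).
def was_oom_killed(inspect_json: str) -> bool:
    state = json.loads(inspect_json)[0]["State"]
    return bool(state.get("OOMKilled"))

# Illustrative payload shaped like docker inspect output; exit code 137
# corresponds to SIGKILL, which is what the OOM killer sends.
sample = '[{"State": {"Status": "exited", "OOMKilled": true, "ExitCode": 137}}]'
print(was_oom_killed(sample))  # → True
```

If `OOMKilled` is true, the `ShardCannotStart` error is a symptom of the memory limit, not of the model or GPU.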
Expected behavior
The model should load directly onto the GPU, which has more than enough memory for it, even when system/CPU RAM is constrained.