The issue occurred with OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 on a machine with 32 GB RAM and a single RTX 6000 Ada (48 GB): shard loading aborts there, while loading the same model with plain Hugging Face Transformers (no 8-bit quantization) works:
Python 3.8.11 (default, Aug 3 2021, 15:09:35)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoModelForCausalLM
>>> model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:59<00:00, 19.76s/it]
>>> model
GPTNeoXForCausalLM(
(gpt_neox): GPTNeoXModel(
...
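As a possible workaround on RAM-constrained hosts, Transformers can reduce peak host memory during `from_pretrained` via `low_cpu_mem_usage=True` (stream weights into the model instead of materializing a full state dict) and a half-precision `torch_dtype`. A minimal sketch; the helper function name is an illustration, not part of any API:

```python
# Hypothetical helper: collect from_pretrained kwargs that keep host-RAM
# usage low while placing weights on the GPU.
def low_ram_load_kwargs(dtype="float16"):
    return {
        "device_map": "auto",          # let accelerate place weights on the GPU
        "low_cpu_mem_usage": True,     # avoid buffering the full state dict in CPU RAM
        "torch_dtype": dtype,          # halve weight size vs. fp32 during load
    }

# Usage (needs transformers, accelerate, and a GPU; not run here):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", **low_ram_load_kwargs()
# )
```

Whether the TGI launcher exposes an equivalent loading path is exactly what this issue is about.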
$ nvidia-smi
Mon Jun 5 18:07:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   45C    P8    26W / 300W |  47325MiB / 49140MiB |      15%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2680      G   /usr/lib/xorg/Xorg                 69MiB |
|    0   N/A  N/A      5034      C   python                          47252MiB |
+-----------------------------------------------------------------------------+
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
For reproducibility, the following commands with an artificially constrained container --memory limit demonstrate the problem in a simpler setting with bigscience/bloom-560m:
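A back-of-envelope estimate suggests why a 1 GiB limit is too tight while 16 GiB is comfortable: if the loader buffers the full weight tensor set in host RAM before moving it to the GPU, bloom-560m's weights alone roughly fill 1 GiB (assumptions: ~560M parameters, fp16 weights; the real peak is higher due to Python, CUDA, and framework overhead):

```python
# Rough host-RAM estimate for buffering bloom-560m weights during load.
# Assumption: ~560M parameters stored as fp16 (2 bytes each).
n_params = 560e6
bytes_fp16 = n_params * 2          # 2 bytes per fp16 parameter
gib = bytes_fp16 / 2**30           # convert bytes to GiB
print(f"~{gib:.2f} GiB of weights")  # ≈ 1.04 GiB, already above a 1 GiB limit
```

This is consistent with the `docker stats` readings below: ~2.65 GiB used in the working 16 GiB container, and the 1 GiB container pinned at 99.91% before the shard dies.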
Works
$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data
$ docker run --memory=16g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:09:59.218696Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun 5 16:09:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   44C    P8    26W / 300W |     70MiB / 49140MiB |       3%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-06-05T16:09:59.218719Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:09:59.218845Z INFO text_generation_launcher: Starting download process.
2023-06-05T16:10:02.259356Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-05T16:10:02.824149Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:10:02.824265Z INFO text_generation_launcher: Starting shard 0
2023-06-05T16:10:12.835066Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:22.844677Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:32.853804Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:42.864204Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:52.874863Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:55.290781Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
rank=0
2023-06-05T16:10:55.377836Z INFO text_generation_launcher: Shard 0 ready in 52.553095127s
2023-06-05T16:10:55.471066Z INFO text_generation_launcher: Starting Webserver
2023-06-05T16:10:56.788354Z INFO text_generation_router: router/src/main.rs:178: Connected
$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
de4cafd33d62 distracted_banach 0.00% 2.652GiB / 16GiB 16.58% 15MB / 81.2kB 0B / 0B 34
Does not work
$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data
$ docker run --memory=1g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:13:41.108681Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun 5 16:13:40 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   46C    P8    26W / 300W |     70MiB / 49140MiB |       3%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-06-05T16:13:41.108699Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:13:41.108774Z INFO text_generation_launcher: Starting download process.
2023-06-05T16:13:44.665356Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-05T16:13:44.913616Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:13:44.913968Z INFO text_generation_launcher: Starting shard 0
2023-06-05T16:13:54.925231Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:04.936122Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:14.945694Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:24.957642Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:31.054288Z ERROR text_generation_launcher: Shard 0 failed to start:
2023-06-05T16:14:31.054824Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7f0768e9f43a modest_vaughan 0.00% 1023MiB / 1GiB 99.91% 20.8kB / 3.74kB 0B / 0B 6
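The container sitting at 99.91% of its limit suggests the shard process was killed by the cgroup OOM killer rather than failing on its own. One way to confirm is `docker inspect -f '{{.State.OOMKilled}}' <container>`. The sketch below parses the same JSON that `docker inspect <container>` prints (the `State.OOMKilled` field is part of Docker's inspect output; the sample payload here is illustrative, not captured from this run):

```python
import json

# Check whether a container was OOM-killed, given the JSON emitted by
# `docker inspect <container>` (a list with one object per container).
def was_oom_killed(inspect_json: str) -> bool:
    state = json.loads(inspect_json)[0]["State"]
    return bool(state.get("OOMKilled"))

# Illustrative payload shaped like docker inspect output; exit code 137
# corresponds to SIGKILL, which is what the OOM killer sends.
sample = '[{"State": {"Status": "exited", "OOMKilled": true, "ExitCode": 137}}]'
print(was_oom_killed(sample))  # → True
```

If `OOMKilled` is true, the `ShardCannotStart` error is a symptom of the memory limit, not of the model or GPU.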
Expected behavior
The model should load directly onto the GPU, which has more than enough memory for it, even when system/CPU RAM is constrained.