[Feature Request] IPEX-LLM + Axolotl Docker Image #10821
Comments
Hi @kwaa, back to your Dockerfile: can you share the error messages from the Docker build? Maybe we can fix this problem first. :) |
Okay. The problem is that the installation of the dependency fails:
# Install requirements
RUN pip install -e . && \
    pip install transformers==4.36.0
podman -v # podman version 5.0.2
sudo podman compose build
Downloading smmap-5.0.1-py3-none-any.whl (24 kB)
Downloading svgwrite-1.4.3-py3-none-any.whl (67 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.1/67.1 kB 7.1 MB/s eta 0:00:00
Building wheels for collected packages: optimum, rouge-score, fire, ffmpy, wavedrom
Building wheel for optimum (pyproject.toml): started
Building wheel for optimum (pyproject.toml): finished with status 'done'
Created wheel for optimum: filename=optimum-1.13.2-py3-none-any.whl size=395599 sha256=38896e176613a1c92028a9c2383cfe66ab8aef2f86d9540096930717ef731afb
Stored in directory: /root/.cache/pip/wheels/c7/36/5c/712f2d963d6d312afee816293b58610a3442d1a1de2182e651
Building wheel for rouge-score (setup.py): started
Building wheel for rouge-score (setup.py): finished with status 'done'
Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=2f55fb6e5a68b745b71c26dd809038bc0d897f941d83d80cbaa0d0f031a4ec11
Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Building wheel for fire (setup.py): started
Building wheel for fire (setup.py): finished with status 'done'
Created wheel for fire: filename=fire-0.6.0-py2.py3-none-any.whl size=117047 sha256=f2fc59bf03786a40ae6b203d36fd6cd26335e2987e9598f6b8a1a1f4cc368d49
Stored in directory: /root/.cache/pip/wheels/6a/f3/0c/fa347dfa663f573462c6533d259c2c859e97e103d1ce21538f
Building wheel for ffmpy (setup.py): started
Building wheel for ffmpy (setup.py): finished with status 'done'
Created wheel for ffmpy: filename=ffmpy-0.3.2-py3-none-any.whl size=5600 sha256=9f04bfa5e3cc1ac776f4482a61ec953af63db0bb0ded8f0dd7bc7457e2301df6
Stored in directory: /root/.cache/pip/wheels/55/3c/f2/f6e34046bac0d57c13c7d08123b85872423b89c8f59bafda51
Building wheel for wavedrom (setup.py): started
Building wheel for wavedrom (setup.py): finished with status 'done'
Created wheel for wavedrom: filename=wavedrom-2.0.3.post3-py2.py3-none-any.whl size=30071 sha256=dadea55437d57655f9f7d85627c29ddc43c7629a4bb5eb9330129dd0f36b71df
Stored in directory: /root/.cache/pip/wheels/23/cf/3b/4dcf6b22fa41c5ece715fa5f4e05afd683e7b0ce0f2fcc7bb6
Successfully built optimum rouge-score fire ffmpy wavedrom
Installing collected packages: wcwidth, pytz, pydub, nh3, ffmpy, appdirs, aniso8601, addict, xxhash, wrapt, werkzeug, websockets, tzdata, toolz, tomlkit, threadpoolctl, termcolor, tensorboard-data-server, svgwrite, sqlparse, smmap, shtab, shortuuid, shellingham, setproctitle, sentry-sdk, semantic-version, scipy, ruff, rpds-py, querystring-parser, python-multipart, python-dateutil, pynvml, pygments, pyasn1, pyarrow-hotfix, pyarrow, protobuf, prompt-toolkit, packaging, orjson, multidict, mdurl, markdown2, markdown, Mako, llvmlite, kiwisolver, joblib, jmespath, itsdangerous, importlib-resources, humanfriendly, httpcore, hf_transfer, grpcio, greenlet, graphql-core, google-crc32c, frozenlist, fonttools, entrypoints, docstring-parser, docker-pycreds, dill, decorator, cycler, contourpy, cloudpickle, cachetools, blinker, attrs, art, aioitertools, aiofiles, absl-py, yarl, wavedrom, tensorboard, sqlalchemy, scikit-learn, rsa, responses, requests-oauthlib, referencing, pyasn1-modules, proto-plus, pandas, numba, nltk, multiprocess, matplotlib, markdown-it-py, httpx, gunicorn, graphql-relay, googleapis-common-protos, google-resumable-media, gitdb, Flask, fire, docker, coloredlogs, botocore, aiosignal, rouge-score, rich, jsonschema-specifications, graphene, gradio-client, google-auth, gitpython, bitsandbytes, alembic, aiohttp, accelerate, wandb, tyro, typer, mlflow, jsonschema, google-auth-oauthlib, google-api-core, fschat, aiobotocore, s3fs, peft, google-cloud-core, datasets, bert-score, altair, trl, optimum, gradio, google-cloud-storage, evaluate, gcsfs, axolotl
Attempting uninstall: websockets
Found existing installation: websockets 12.0
Uninstalling websockets-12.0:
Successfully uninstalled websockets-12.0
Attempting uninstall: protobuf
Found existing installation: protobuf 5.27.0rc1
Uninstalling protobuf-5.27.0rc1:
Successfully uninstalled protobuf-5.27.0rc1
Attempting uninstall: packaging
Found existing installation: packaging 24.0
Uninstalling packaging-24.0:
Successfully uninstalled packaging-24.0
Attempting uninstall: blinker
Found existing installation: blinker 1.4
ERROR: Cannot uninstall 'blinker'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
Error: building at STEP "RUN pip install -e . && pip install transformers==4.36.0": while running runtime: exit status 1 |
This error is caused by pip trying to uninstall Python libraries that were installed by the OS package manager (e.g., apt). In your example, it's the package named in the error above. A simple workaround is to add the flag shown in this command:
pip install transformers==4.36.0 --ignore-installed blinker
If this command doesn't work, please uninstall the OS-installed package first:
apt remove python3-blinker
pip install transformers==4.36.0 |
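To spot packages that might hit this error before running the install, one option is to look for installed distributions that ship no file manifest. The following is a rough sketch under stated assumptions, not pip's actual check: it uses the standard-library `importlib.metadata` module and treats a distribution whose `files` attribute is `None` as one pip may not be able to uninstall cleanly. The helper name `distributions_without_manifest` is hypothetical.

```python
from importlib import metadata


def distributions_without_manifest(dists=None):
    """Return sorted names of distributions that expose no file manifest.

    Heuristic only (an assumption of this sketch): `files` is None when the
    distribution has no RECORD-style listing, which is typical of packages
    installed by an OS package manager through distutils.
    """
    if dists is None:
        # Default: inspect everything visible to the current interpreter.
        dists = metadata.distributions()
    names = set()
    for dist in dists:
        name = dist.metadata.get("Name")
        if name and dist.files is None:
            names.add(name)
    return sorted(names)


if __name__ == "__main__":
    for name in distributions_without_manifest():
        print(name)
```

Running this inside the base image, before the `pip install` step, should list candidates for `--ignore-installed` or for removal with the OS package manager; an empty list does not guarantee that the install will succeed.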
Update: The image builds successfully after adding |
If I try to enable DeepSpeed via
Do you want to enable dynamic shape tracing? [yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: yes
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/config.py", line 67, in config_command
config = get_user_input()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/config.py", line 40, in get_user_input
config = get_cluster_input()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/config/cluster.py", line 192, in get_cluster_input
is_deepspeed_available()
AssertionError: DeepSpeed is not installed => run `pip3 install deepspeed` or build it from source
exit code: 1
I skipped DeepSpeed and generated this
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: IPEX
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false |
Yes. We omit deepspeed in requirements (so does axolotl). You can answer |
Currently axolotl reports an error when saving the prepared dataset:
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-22 16:47:55,676 - INFO - intel_extension_for_pytorch auto imported
2024-04-22 16:47:55,683 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-22 16:47:56,906] [INFO] [datasets.<module>:58] [PID:46] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-22 16:47:57,920] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:46] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-22 16:47:57,922] [WARNING] [axolotl.validate_config:263] [PID:46] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-22 16:47:57,923] [INFO] [axolotl.normalize_config:169] [PID:46] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-22 16:47:57,924] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:46] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-22 16:47:57,924] [INFO] [axolotl.scripts.check_user_token:371] [PID:46] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:216] [PID:46] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:217] [PID:46] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:218] [PID:46] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-22 16:47:58,135] [DEBUG] [axolotl.load_tokenizer:219] [PID:46] [RANK:0] UNK: None / None
[2024-04-22 16:47:58,135] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:46] [RANK:0] Unable to find prepared dataset in /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375
[2024-04-22 16:47:58,135] [INFO] [axolotl.load_tokenized_prepared_datasets:182] [PID:46] [RANK:0] Loading raw datasets...
[2024-04-22 16:47:58,135] [WARNING] [axolotl.load_tokenized_prepared_datasets:184] [PID:46] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-04-22 16:47:58,135] [INFO] [axolotl.load_tokenized_prepared_datasets:191] [PID:46] [RANK:0] No seed provided, using default seed of 42
[2024-04-22 16:48:18,281] [INFO] [axolotl.load_tokenized_prepared_datasets:394] [PID:46] [RANK:0] merging datasets
[2024-04-22 16:48:18,310] [INFO] [axolotl.load_tokenized_prepared_datasets:404] [PID:46] [RANK:0] Saving merged prepared dataset to disk... /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375
Traceback (most recent call last):
File "/workspace/axolotl/finetune.py", line 86, in <module>
fire.Fire(do_cli)
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/finetune.py", line 81, in do_cli
dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 315, in load_datasets
train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/utils/data.py", line 78, in prepare_dataset
train_dataset, eval_dataset, prompters = load_prepare_datasets(
^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/utils/data.py", line 441, in load_prepare_datasets
dataset, prompters = load_tokenized_prepared_datasets(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/axolotl/src/axolotl/utils/data.py", line 405, in load_tokenized_prepared_datasets
dataset.save_to_disk(prepared_ds_path)
File "/usr/local/lib/python3.11/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fsspec/core.py", line 383, in url_to_fs
chain = _un_chain(url, kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/fsspec/core.py", line 323, in _un_chain
if "::" in path
^^^^^^^^^^^^
TypeError: argument of type 'PosixPath' is not iterable
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'finetune.py', 'lora.yml']' returned non-zero exit status 1. |
This looks to be related to axolotl-ai-cloud/axolotl#1544. |
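The `TypeError` at the bottom of the traceback comes from the membership test `"::" in path` being applied to a `pathlib` path object rather than a string. The snippet below is a minimal reproduction, assuming that converting the path to `str` before it reaches `save_to_disk` is an acceptable workaround; it is not confirmed here as the upstream fix. The function name `has_chain_separator` and the path value are hypothetical.

```python
from pathlib import PurePosixPath


def has_chain_separator(path):
    # Same membership test as the failing line in the traceback: `"::" in path`.
    return "::" in path


# Hypothetical stand-in for the prepared dataset directory.
prepared_ds_path = PurePosixPath("/workspace/last_run_prepared") / "example_hash"

try:
    has_chain_separator(prepared_ds_path)
    raised_type_error = False
except TypeError:
    # Path objects define no __contains__ or __iter__, so `in` cannot be used on them.
    raised_type_error = True

# After converting to str, the same check is an ordinary substring test.
found_after_conversion = has_chain_separator(str(prepared_ds_path))

print(raised_type_error, found_after_conversion)
```

The same idea applies at the call site: passing `str(prepared_ds_path)` instead of the path object keeps the substring check inside `fsspec` working.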
I fixed this by patching the code (moeru-ai/Moeru-Llama-3-8B@c65e3b1#diff-b135d17426f077f767e0ec29114d24b182dcaa3f6dadaee03d8ff424adcdff0bR407). The problem now is that it will segfault in the
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-22 18:13:19,839 - INFO - intel_extension_for_pytorch auto imported
2024-04-22 18:13:19,861 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-22 18:13:21,510] [INFO] [datasets.<module>:58] [PID:46] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-22 18:13:23,047] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:46] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-22 18:13:23,050] [WARNING] [axolotl.validate_config:263] [PID:46] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-22 18:13:23,051] [INFO] [axolotl.normalize_config:169] [PID:46] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-22 18:13:23,051] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:46] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-22 18:13:23,051] [INFO] [axolotl.scripts.check_user_token:371] [PID:46] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:216] [PID:46] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:217] [PID:46] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:218] [PID:46] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-22 18:13:23,379] [DEBUG] [axolotl.load_tokenizer:219] [PID:46] [RANK:0] UNK: None / None
[2024-04-22 18:13:23,380] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:46] [RANK:0] Loading prepared dataset from disk at /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375...
[2024-04-22 18:13:23,383] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:46] [RANK:0] Prepared dataset loaded from disk...
[2024-04-22 18:13:23,387] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_tokens: 18727
[2024-04-22 18:13:23,388] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] `total_supervised_tokens: 14240`
[2024-04-22 18:13:26,569] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 18727
[2024-04-22 18:13:26,569] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] data_loader_len: 3
[2024-04-22 18:13:26,569] [INFO] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est across ranks: [0.914404296875]
[2024-04-22 18:13:26,569] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est: None
[2024-04-22 18:13:26,569] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_steps: 12
[2024-04-22 18:13:26,571] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_tokens: 338809
[2024-04-22 18:13:26,580] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] `total_supervised_tokens: 249975`
[2024-04-22 18:13:26,582] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 338809
[2024-04-22 18:13:26,583] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] data_loader_len: 80
[2024-04-22 18:13:26,583] [INFO] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est across ranks: [0.9618260583212209]
[2024-04-22 18:13:26,583] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] sample_packing_eff_est: 0.97
[2024-04-22 18:13:26,583] [DEBUG] [axolotl.log:60] [PID:46] [RANK:0] total_num_steps: 320
[2024-04-22 18:13:26,598] [DEBUG] [axolotl.train.log:60] [PID:46] [RANK:0] loading tokenizer... /workspace/models/llama-3-8b
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:216] [PID:46] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:217] [PID:46] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:218] [PID:46] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.load_tokenizer:219] [PID:46] [RANK:0] UNK: None / None
[2024-04-22 18:13:26,815] [DEBUG] [axolotl.train.log:60] [PID:46] [RANK:0] loading model and peft_config...
[2024-04-22 18:13:26,816] [INFO] [axolotl.load_model:366] [PID:46] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.79it/s]
[2024-04-22 18:14:33,304] [INFO] [axolotl.load_model:677] [PID:46] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-22 18:14:34,232] [INFO] [axolotl.load_lora:789] [PID:46] [RANK:0] found linear modules: ['up_proj', 'k_proj', 'gate_proj', 'down_proj', 'v_proj', 'o_proj', 'q_proj']
trainable params: 2,143,322,112 || all params: 5,851,353,088 || trainable%: 36.629512520711515
[2024-04-22 18:15:12,793] [INFO] [axolotl.load_model:714] [PID:46] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-22 18:15:12,838] [INFO] [axolotl.train.log:60] [PID:46] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-22 18:15:12,947] [INFO] [axolotl.train.log:60] [PID:46] [RANK:0] Starting trainer...
[2024-04-22 18:15:13,108] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-22 18:15:13,109] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-22 18:15:13,197] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:46] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
0%| | 0/80 [00:00<?, ?it/s]
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [12th Gen Intel(R) Core(TM) i3-12100]
Registry and code: 13 MB
Command: /usr/bin/python3 finetune.py lora.yml
Uptime: 115.136140 s
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'finetune.py', 'lora.yml']' died with <Signals.SIGSEGV: 11>. |
Hi @kwaa, I will add the dataset preparation failure to the troubleshooting guide. Thank you for your solution! :) This segmentation fault may be related to intel-extension-for-pytorch and oneAPI. Can you provide your oneAPI version, GPU driver version, and pip list? We will try to reproduce this issue. |
I installed requirements.txt was copied from the axolotl example and has not been modified. My GPUs are currently A770 16G and UHD730.
sudo intel_gpu_top -L
# card2 Intel Dg2 (Gen12) pci:vendor=8086,device=56A0,card=0
# └─renderD129
# card1 Intel Alderlake_s (Gen12) pci:vendor=8086,device=4692,card=0
# └─renderD128 |
Thank you for providing the detailed env! :) Your Dockerfile seems fine. It is based on our XPU inference image, which means you are using intel/oneapi-basekit:2024.0.1 with the level-zero 1.16.14 driver. According to the previous error message and env settings, I think the main problem is that this fine-tuning program is running on card 1, i.e., the iGPU. This may lead to a segmentation fault and OOM. Please refer to this doc and select the Arc A770 as the main GPU. In most cases, you can choose the GPU with an env variable:
export ONEAPI_DEVICE_SELECTOR=level_zero:1 |
BTW, instead of patching Axolotl, we can downgrade datasets:
pip install datasets==2.15.0 |
* Downgrade datasets to 2.15.0 to address axolotl prepare issue axolotl-ai-cloud/axolotl#1544 Tks to @kwaa for providing the solution in #10821 (comment)
Hmm... Something went wrong, I tried to run
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i3-12100 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix] |
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i3-12100 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.42]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 730 1.3 [1.3.27191]
This is because older versions of |
The Arc A770 is device 0. Please change the env variable:
export ONEAPI_DEVICE_SELECTOR=level_zero:0 |
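Since the Level Zero index depends on how the devices are enumerated, it can help to derive the selector value from the device listing instead of hard-coding it. The following is a small sketch, assuming the listing format shown earlier in this thread; the helper `find_level_zero_index` and the constant names are hypothetical, and the two sample lines are copied from the output above.

```python
import os
import re

# Two lines copied from the device listing earlier in this thread.
DEVICE_LISTING = """\
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 730 1.3 [1.3.27191]
"""

# Captures the GPU index and the rest of the description on a Level Zero line.
LEVEL_ZERO_LINE = re.compile(r"^\[ext_oneapi_level_zero:gpu:(\d+)\]\s+(.*)$")


def find_level_zero_index(listing, name_fragment):
    """Return the Level Zero GPU index whose description contains name_fragment, or None."""
    for line in listing.splitlines():
        match = LEVEL_ZERO_LINE.match(line.strip())
        if match and name_fragment in match.group(2):
            return int(match.group(1))
    return None


index = find_level_zero_index(DEVICE_LISTING, "A770")
if index is not None:
    # Assumption: the variable has to be set before the GPU runtime is
    # initialised, so do this before importing torch or IPEX.
    os.environ["ONEAPI_DEVICE_SELECTOR"] = f"level_zero:{index}"
```

With the sample listing this selects the Arc A770 entry rather than the UHD Graphics 730 entry; on another machine the listing text would need to come from that machine's own device enumeration.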
I think it might be useful to provide an
Also, I fixed this (#10821 (comment)) by setting environments (intel/compute-runtime#710 (comment)), but at the moment the Trainer doesn't seem to be running correctly: it runs for a second and then never logs again, and I don't get any usable files outside of the json in the
Maybe there is something wrong with my axolotl config?
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-04-23 18:33:37,631 - INFO - intel_extension_for_pytorch auto imported
2024-04-23 18:33:37,650 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-04-23 18:33:38,905] [INFO] [datasets.<module>:58] [PID:53] PyTorch version 2.1.0a0+cxx11.abi available.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-04-23 18:33:39,850] [WARNING] [axolotl.scripts.finetune.do_cli:60] [PID:53] [RANK:0] scripts/finetune.py will be replaced with calling axolotl.cli.train
[2024-04-23 18:33:39,853] [WARNING] [axolotl.validate_config:263] [PID:53] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-04-23 18:33:39,854] [INFO] [axolotl.normalize_config:169] [PID:53] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-04-23 18:33:39,854] [WARNING] [axolotl.scripts.check_accelerate_default_config:363] [PID:53] [RANK:0] accelerate config file found at /root/.cache/huggingface/accelerate/default_config.yaml. This can lead to unexpected errors
[2024-04-23 18:33:39,854] [INFO] [axolotl.scripts.check_user_token:371] [PID:53] [RANK:0] Skipping HuggingFace token verification because HF_HUB_OFFLINE is set to True. Only local files will be used.
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-23 18:33:40,124] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: None / None
[2024-04-23 18:33:40,126] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:53] [RANK:0] Loading prepared dataset from disk at /workspace/last_run_prepared/0468ae86c6bad72780d77c7d538dd375...
[2024-04-23 18:33:40,134] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:53] [RANK:0] Prepared dataset loaded from disk...
[2024-04-23 18:33:40,143] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_tokens: 18727
[2024-04-23 18:33:40,146] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] `total_supervised_tokens: 14240`
[2024-04-23 18:33:43,050] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 18727
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] data_loader_len: 3
[2024-04-23 18:33:43,051] [INFO] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est across ranks: [0.914404296875]
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est: None
[2024-04-23 18:33:43,051] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_steps: 12
[2024-04-23 18:33:43,062] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_tokens: 338809
[2024-04-23 18:33:43,071] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] `total_supervised_tokens: 249975`
[2024-04-23 18:33:43,072] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 338809
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] data_loader_len: 80
[2024-04-23 18:33:43,073] [INFO] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est across ranks: [0.9618260583212209]
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] sample_packing_eff_est: 0.97
[2024-04-23 18:33:43,073] [DEBUG] [axolotl.log:60] [PID:53] [RANK:0] total_num_steps: 320
[2024-04-23 18:33:43,085] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading tokenizer... /workspace/models/llama-3-8b
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 128256 / <|im_end|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: None / None
[2024-04-23 18:33:43,286] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading model and peft_config...
[2024-04-23 18:33:43,287] [INFO] [axolotl.load_model:366] [PID:53] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 2.76it/s]
[2024-04-23 18:34:38,518] [INFO] [axolotl.load_model:677] [PID:53] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-23 18:34:39,488] [INFO] [axolotl.load_lora:789] [PID:53] [RANK:0] found linear modules: ['k_proj', 'v_proj', 'gate_proj', 'up_proj', 'down_proj', 'o_proj', 'q_proj']
trainable params: 2,143,322,112 || all params: 5,851,353,088 || trainable%: 36.629512520711515
[2024-04-23 18:35:15,919] [INFO] [axolotl.load_model:714] [PID:53] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-23 18:35:17,927] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-23 18:35:18,034] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Starting trainer...
[2024-04-23 18:35:18,204] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-23 18:35:18,205] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809
[2024-04-23 18:35:18,273] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 338809 |
We will consider the default
Just checked your axolotl config. There are several problems.
To support Llama 3 fine-tuning on Arc, we need to change to the main branch and upgrade several key libraries (e.g., peft, transformers). Meanwhile, we need to change some source code in IPEX-LLM. The good news is that we are already working on the peft upgrade and Llama 3 axolotl support. Will let you know when it's ready. :) |
It looks like I need to download |
I tried
[2024-04-24 17:20:25,931] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading tokenizer... /workspace/models/llama-2-7b
[2024-04-24 17:20:25,986] [DEBUG] [axolotl.load_tokenizer:216] [PID:53] [RANK:0] EOS: 2 / </s>
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.load_tokenizer:217] [PID:53] [RANK:0] BOS: 1 / <s>
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.load_tokenizer:218] [PID:53] [RANK:0] PAD: 0 / <unk>
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.load_tokenizer:219] [PID:53] [RANK:0] UNK: 0 / <unk>
[2024-04-24 17:20:25,987] [INFO] [axolotl.load_tokenizer:224] [PID:53] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-04-24 17:20:25,987] [DEBUG] [axolotl.train.log:60] [PID:53] [RANK:0] loading model and peft_config...
[2024-04-24 17:20:25,990] [INFO] [axolotl.load_model:366] [PID:53] [RANK:0] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.88s/it]
[2024-04-24 17:20:56,250] [INFO] [axolotl.load_model:677] [PID:53] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-24 17:20:56,501] [INFO] [axolotl.load_lora:789] [PID:53] [RANK:0] found linear modules: ['q_proj', 'up_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj']
trainable params: 39,976,960 || all params: 3,742,765,056 || trainable%: 1.0681129967245264
[2024-04-24 17:21:33,106] [INFO] [axolotl.load_model:714] [PID:53] [RANK:0] GPU memory usage after adapters: 0.000GB ()
[2024-04-24 17:21:35,274] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Pre-saving adapter config to /workspace/out
[2024-04-24 17:21:35,278] [INFO] [axolotl.train.log:60] [PID:53] [RANK:0] Starting trainer...
[2024-04-24 17:21:35,516] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 414041
[2024-04-24 17:21:35,517] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 414041
[2024-04-24 17:21:35,612] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:53] [RANK:0] packing_efficiency_estimate: 0.97 total_num_tokens per device: 414041 |
It seems axolotl loaded the checkpoint but didn't start training. Can you dump GPU usage with the following command?
sudo xpu-smi dump -d 0 -m 0,1,2,5,18
|
Hi @kwaa, we built our example with Meta llama-2-7b. Not sure if it works on other models. Can you give us the version and link of the model you are using for fine-tuning? We can try to reproduce this error. |
Oh, sorry, I missed the message before. I'm using unsloth/llama-2-7b. I now suspect it may have something to do with the container running in the background. If I run the container in the foreground, the interface gets stuck, so I started trying to fix my system yesterday to make it display with the iGPU... I'll keep updating if there are changes. |
Update: I fixed the iGPU display issue with
I also tried running
crun: executable file `xpu-smi` not found in $PATH: No such file or directory: OCI runtime attempted to invoke a command that was not found |
Hi @kwaa If If you are using our image as base image, you can follow this command. https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/finetune/qlora/xpu/docker/start-qlora-finetuning-on-xpu.sh#L5 |
Update:
|
As the title suggests.
I tried to make my own Dockerfile, but it always fails to build.
https://github.com/moeru-ai/Moeru-Llama-3-8B/blob/main/Dockerfile
(It would be great to have IPEX-LLM + ollama/llama.cpp images too)