
[IPEX][XPU][Windows 11] It takes forever to run the first pass #399

Open
whchan05 opened this issue Aug 5, 2023 · 28 comments
Labels
Performance Windows XPU/GPU XPU/GPU specific issues

Comments

@whchan05

whchan05 commented Aug 5, 2023

Describe the issue

As the title says: the first pass takes an extremely long time. Once the first pass is complete, subsequent passes run much faster. This seems to happen only when the device is set to XPU.

XPU: [screenshot: "before loading time xpu"]

CPU: [screenshot: "after loading time cpu"]

Environment:

CPU: 5700X
GPU: A770LE
RAM: 96GB
OS: Win 11

oneAPI:

DPC++ Compiler: 2023.2.0
MKL: 2023.2.0
Torch: 2.0.0a0
IPEX: 2.0.110+gitba7f6c1

Driver:

31.0.101.4577 WHQL

Others:

Miniconda: 23.5.2
Python: 3.10.12
Stable Diffusion WebUI: https://github.com/jbaboval/stable-diffusion-webui Commit: 197dedd

@Mindset-Official

It takes about 5-10 minutes to start generating the first image in SD.Next for me on native Windows, and startup takes about 257.9 s. After the first image it starts generating pretty quickly; loading a new model sometimes takes a few minutes, but not as long as the first time, and after that it's actually as fast as it should be. This may have something to do with the lack of a supported torchvision, but I'm not sure.

@jingxu10 jingxu10 added XPU/GPU XPU/GPU specific issues Performance Windows labels Aug 6, 2023
@jingxu10
Contributor

jingxu10 commented Aug 7, 2023

The wheel files were not compiled with AOT enabled, so the first iteration takes a longer time.
@min-jean-cho FYI.

@whchan05
Author

whchan05 commented Aug 7, 2023

Just now I failed to compile from source again; it would be great if prebuilt wheel files with the AOT device specified were uploaded.

Error Log

2 warnings generated.
[1044/1049] Linking CXX static library csrc\gpu\oneDNN\src\dnnl.lib
ignoring unknown argument: -fsycl
ignoring unknown argument: -Wno-unknown-argument
ignoring unknown argument: -Qoption,link,/machine:x64
[1048/1049] Linking CXX shared library csrc\gpu\intel-ext-pt-gpu.dll
FAILED: csrc/gpu/intel-ext-pt-gpu.dll csrc/gpu/intel-ext-pt-gpu.lib
cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_dll --intdir=csrc\gpu\CMakeFiles\intel-ext-pt-gpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100220~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100220~1.0\x64\mt.exe --manifests -- C:\oneAPI\compiler\2023.2.0\windows\bin\icx.exe /nologo @CMakeFiles\intel-ext-pt-gpu.rsp -LD /Qoption,link,/machine:x64 -rdynamic -Wl,-Bsymbolic-functions /INCREMENTAL:NO -fsycl /EHsc -fsycl-max-parallel-link-jobs=16 -fsycl-targets=spir64_gen,spir64 -fsycl-link-huge-device-code /Xs "-device ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels'" -link /out:csrc\gpu\intel-ext-pt-gpu.dll /implib:csrc\gpu\intel-ext-pt-gpu.lib /pdb:csrc\gpu\intel-ext-pt-gpu.pdb /version:0.0 && cd ."
LINK: command "C:\oneAPI\compiler\2023.2.0\windows\bin\icx.exe /nologo @CMakeFiles\intel-ext-pt-gpu.rsp -LD /Qoption,link,/machine:x64 -rdynamic -Wl,-Bsymbolic-functions /INCREMENTAL:NO -fsycl /EHsc -fsycl-max-parallel-link-jobs=16 -fsycl-targets=spir64_gen,spir64 -fsycl-link-huge-device-code /Xs -device ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels' -link /out:csrc\gpu\intel-ext-pt-gpu.dll /implib:csrc\gpu\intel-ext-pt-gpu.lib /pdb:csrc\gpu\intel-ext-pt-gpu.pdb /version:0.0 /MANIFEST /MANIFESTFILE:csrc\gpu\intel-ext-pt-gpu.dll.manifest" failed (exit code 1120) with the following output:
icx: warning: unknown argument ignored in clang-cl: '-rdynamic' [-Wunknown-argument]
icx: warning: unknown argument ignored in clang-cl: '-fsycl-link-huge-device-code' [-Wunknown-argument]
icx: warning: argument unused during compilation: '-EHsc' [-Wunused-command-line-argument]
Creating library csrc\gpu\intel-ext-pt-gpu.lib and object csrc\gpu\intel-ext-pt-gpu.exp
reducer-86aa6b.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: virtual void __cdecl c10d::Timer::record(enum c10d::Timer::Event)" (__imp_?record@Timer@c10d@@UEAAXW4Event@12@@Z) referenced in function "public: virtual void __cdecl c10d::`anonymous namespace'::XPUTimer::record(enum A0xBD354422::Timer::Event)" (?record@XPUTimer@?A0xBD354422@c10d@@UEAAXW4Event@Timer@2@@Z)
Hint on symbols that are defined and could potentially match:
"__declspec(dllimport) public: virtual void __cdecl c10::impl::DeviceGuardImplInterface::record(void * *,class c10::Stream const &,signed char,enum c10::EventFlag)const " (__imp_?record@DeviceGuardImplInterface@impl@c10@@UEBAXPEAPEAXAEBVStream@3@CW4EventFlag@3@@Z)
reducer-86aa6b.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) class c10::Registry<enum c10::DeviceType,class std::unique_ptr<class c10d::Timer,struct std::default_delete<class c10d::Timer> >,struct c10::Device> * __cdecl c10d::TimerRegistry(void)" (__imp_?TimerRegistry@c10d@@YAPEAV?$Registry@W4DeviceType@c10@@V?$unique_ptr@VTimer@c10d@@U?$default_delete@VTimer@c10d@@@std@@@std@@UDevice@2@@c10@@XZ) referenced in function _GLOBAL__sub_I_reducer_e9db2c.cpp
reducer-86aa6b.obj : error LNK2001: unresolved external symbol "public: virtual void __cdecl c10d::Timer::record(enum c10d::Timer::Event)" (?record@Timer@c10d@@UEAAXW4Event@12@@Z)
C:\Users\playe\AppData\Local\Temp\icx-71bffa\fusion_pass-b19d20.out : fatal error LNK1120: 3 unresolved externals
icx: error: linker command failed with exit code 1120 (use -v to see invocation)
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 1103, in <module>
    setup(
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\__init__.py", line 107, in setup
    return distutils.core.setup(**attrs)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
    return run_commands(dist)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
    dist.run_commands()
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\miniconda3\lib\site-packages\wheel\bdist_wheel.py", line 325, in run
    self.run_command("build")
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\command\build.py", line 131, in run
    self.run_command(cmd_name)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 1072, in run
    self.run_command("build_clib")
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 783, in run
    _build_project(build_args, ipex_xpu_build_dir, my_env, use_ninja)
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 554, in _build_project
    check_call(["ninja"] + build_args, cwd=build_dir, env=build_env)
  File "C:\Users\playe\miniconda3\lib\subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ninja', '-j', '16', 'install']' returned non-zero exit status 1.

@min-jean-cho
Contributor

Thanks @whchan05, @Mindset-Official. The first kernel run without AOT compilation is expected to take a little longer. AOT compilation will significantly reduce the first kernel run time. In the meantime, feel free to try building from source with AOT as described here.
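For later readers, a rough sketch of what such an AOT source build could look like. Everything here is an assumption for illustration: the setvars path, the device name, and the build invocation should all be checked against the AOT documentation and your own oneAPI install.

```shell
# Illustrative sketch only -- paths and device name are assumptions, not verified values.
# 1. Activate the oneAPI environment (on Windows cmd, run `call setvars.bat`
#    from the oneAPI install directory instead of sourcing a script).
source /opt/intel/oneapi/setvars.sh

# 2. Choose the AOT target(s) for your GPU before building; "ats-m150" is the
#    example device string that appears in this thread's own build log --
#    verify the right value for your card in the IPEX AOT docs or via
#    `ocloc compile --help`.
export USE_AOT_DEVLIST="ats-m150"

# 3. Build the wheel from the intel-extension-for-pytorch source tree.
python setup.py bdist_wheel
```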

@min-jean-cho
Contributor

> Just now I failed to compile from source again, would be great if prebuilt wheel files with AOT device specified is uploaded

Thanks @whchan05, we will shortly investigate the issue.

@min-jean-cho
Contributor

Docker is not required for IPEX on native windows.

@Mindset-Official

Mindset-Official commented Aug 8, 2023

> Thanks @whchan05, @Mindset-Official. The first kernel run without AOT compilation is expected to take a little longer. AOT compilation will significantly reduce the first kernel run time. In the meantime, feel free to try building from source with AOT as described here.

I have attempted to build from source with AOT, but it takes a long time to compile (about 6 hours) once it reaches this point:

[1046/1049] Linking CXX static library csrc\gpu\oneDNN\src\dnnl.lib
ignoring unknown argument: -fsycl
ignoring unknown argument: -Wno-unknown-argument
ignoring unknown argument: -Qoption,link,/machine:x64
[1047/1049] Linking CXX shared library csrc\gpu\intel-ext-pt-gpu.dll

I was able to build the wheel files for IPEX, but then I run into this error:

Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import intel_extension_for_pytorch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\mymin\miniconda3\lib\site-packages\intel_extension_for_pytorch\__init__.py", line 100, in <module>
    from . import _inductor
  File "C:\Users\mymin\miniconda3\lib\site-packages\intel_extension_for_pytorch\_inductor\__init__.py", line 1, in <module>
    from torch._inductor.codegen.common import register_backend_for_device
ImportError: cannot import name 'register_backend_for_device' from 'torch._inductor.codegen.common' (C:\Users\mymin\miniconda3\lib\site-packages\torch\_inductor\codegen\common.py)

I am unsure whether I am missing something or there is an issue with AOT on Windows. Thanks for the help.

My system is:

Win 11
AMD 5600
Arc A750
running on a SATA SSD

@Vipitis

Vipitis commented Aug 8, 2023

Consider compiling the prebuilt wheels with AOT enabled, especially the native Windows variant. It should be a much better experience, and you only need to support Arc dGPUs, as PVC is not available as a workstation card and I suspect will not run on Windows (unless both of those change in the near future).
Calling the first model inference takes slightly over 11 minutes on my old system.

@min-jean-cho
Contributor

> I have attempted to build from source with AOT but it takes a long time during the compile (about 6 hours) when reaching this point

Thanks @Mindset-Official, we are aware of the long AOT build time. Even though an AOT build is expected to take longer than a non-AOT (i.e., JIT) build, this build time (~6 hrs) on Windows is unreasonable. This is under investigation.

@min-jean-cho
Contributor

Thanks @Vipitis for the feedback, we will take it into account.

@Vipitis

Vipitis commented Aug 16, 2023

> [...] building from source with AOT as described here.

Found an error in the linked doc: the first sentence claims the prebuilt wheels have AOT enabled for both GPU devices. Could you confirm what the wheels hosted on https://developer.intel.com/ipex-whl-stable-xpu are built with?

@jingxu10
Contributor

jingxu10 commented Aug 16, 2023

Hi, that sentence applies to wheel files for Linux. Wheel files for Windows don't have AOT enabled yet.
Thanks for pointing this out. We will amend the doc.

@Nuullll

Nuullll commented Aug 17, 2023

Is there any ETA for Windows AOT wheels (and also the compatible torchvision wheel)? JIT compilation takes too long and makes IPEX almost unusable on native windows :-/

@ereish64

Bump. Would really like to see this issue fixed in the near future.

@Vipitis

Vipitis commented Sep 11, 2023

I compiled the wheels for ipex myself. It took around 5.5 hours, 3.5 hours of which were the final two steps.

With AOT there still is a short delay on first inference, but it's in the order of 10 seconds, not 10 minutes.
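A simple way to quantify this warm-up effect is to time the first call separately from later calls. The sketch below is a generic, hypothetical harness and does not depend on IPEX at all: the `fn` argument stands in for a real model call (e.g. `lambda: model(sample_input)` on an XPU device), and the stand-in workload at the bottom is only there so the snippet runs anywhere.

```python
import time

def time_first_vs_rest(fn, *args, timed_runs=3):
    """Time the first call of fn (which pays any JIT/warm-up cost)
    separately from the average of subsequent calls, in seconds."""
    start = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(timed_runs):
        fn(*args)
    rest_avg = (time.perf_counter() - start) / timed_runs
    return first, rest_avg

# Stand-in workload; with IPEX you would pass the model call instead,
# e.g. time_first_vs_rest(lambda: model(sample_input)).
first, rest = time_first_vs_rest(lambda: sum(i * i for i in range(100_000)))
print(f"first call: {first:.3f}s, subsequent avg: {rest:.3f}s")
```

With a JIT-compiled IPEX wheel the gap between `first` and `rest` would be large (minutes vs. seconds in the reports above); with AOT it should shrink to a short delay.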

@min-jean-cho
Contributor

Thanks @Vipitis for sharing. Yes, the majority of the time is taken by the final linking step.

@ereish64

What device did you use for the AOT option when compiling? Using "xpu" results in "Could not determine device target: xpu".

@min-jean-cho
Contributor

@ereish64, please specify the target device in the USE_AOT_DEVLIST build option. Please see here for a list of USE_AOT_DEVLIST settings for the Intel® Data Center GPU Flex Series and Intel® Arc™ A-Series GPUs. You may also find it helpful to run ocloc compile --help with the standalone ocloc compiler from your oneAPI Base Toolkit; it will show a list of valid target devices under the -device option.
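To make the advice above concrete, here is a small illustrative lookup. The GPU-family-to-device-string mapping is an assumption written from memory, not an authoritative table; always confirm the exact strings against the IPEX AOT documentation and `ocloc compile --help` before building.

```python
# Hypothetical mapping of GPU families to candidate USE_AOT_DEVLIST values.
# These strings are illustrative assumptions -- verify against the IPEX AOT
# docs and the output of `ocloc compile --help` before use.
AOT_DEVLIST = {
    "Intel Data Center GPU Flex Series": "ats-m150",
    "Intel Arc A-Series (Alchemist)": "acm-g10",
    "Intel Data Center GPU Max (PVC)": "pvc",
}

def suggest_aot_devlist(gpu_family: str) -> str:
    """Return a candidate USE_AOT_DEVLIST value, or raise with a hint."""
    try:
        return AOT_DEVLIST[gpu_family]
    except KeyError:
        raise ValueError(
            f"Unknown GPU family {gpu_family!r}; run `ocloc compile --help` "
            "to list valid -device targets."
        )

print(suggest_aot_devlist("Intel Arc A-Series (Alchemist)"))
```

Note that "xpu" is a PyTorch device name, not an ocloc hardware target, which is why passing it to the AOT option fails.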

@ereish64

Thanks! I think that would be a good link to have in the compiling-from-source section of the Windows GPU installation guide.

I would do it myself, but I don't think I can make a pull request for that document.

@min-jean-cho
Contributor

Thanks @ereish64 for the recommendation, we will add the link to the Windows guide.

@Vipitis

Vipitis commented Sep 11, 2023

One trick: you don't need to compile the patched torch from source. The wheels are available, which saves some time and space, since torch pulls in a lot of dependencies if you build everything from source.

@Nuullll

Nuullll commented Sep 18, 2023

@Vipitis Have you seen the following error with your AOT IPEX wheel? I got this error when I tried to generate a 1024x1024 image (SD.Next original backend) for the second time (yes, the first 1024x1024 image can be generated).

RuntimeError: Native API failed. Native API returns: -997 (Command failed to enqueue/execute)

It seems to be an unexpected OOM issue (only 8GB/16GB was occupied when the error happened).
And the version of my AOT wheel was xpu-2.0.110. Should I try xpu-master instead?

@ereish64

@Nuullll I can confirm that is the error I usually see when I run out of memory. I only have the 8GB GPU though.

Wondering if there's a hardcoded cap in the AOT compile somewhere...

@Mindset-Official

> @Vipitis Have you seen the following error with your AOT IPEX wheel? I got this error when I tried to generate a 1024x1024 image (SD.Next original backend) for the second time (yes, the first 1024x1024 image can be generated).
>
> RuntimeError: Native API failed. Native API returns: -997 (Command failed to enqueue/execute)
>
> It seems to be an unexpected OOM issue (only 8GB/16GB was occupied when the error happened). And the version of my AOT wheel was xpu-2.0.110. Should I try xpu-master instead?

Try updating SD.Next; I believe this was an issue with a recent update that may be fixed now.

@Nuullll

Nuullll commented Sep 19, 2023

@ereish64 @Mindset-Official Thanks! I built the IPEX xpu-master AOT wheel and tried it with a fresh sd.next install (hacked a few LOC though, will submit the change to sd.next). Everything seems to work fine right now!
IPEX (AOT) on native Windows is ~20% slower than under WSL.

@ereish64

ereish64 commented Nov 3, 2023

For anyone having this issue in the future, here's a compiled wheel file so that you don't have to compile it yourself
Make sure you're running:

  • Python 3.9 (64 Bit)

Will install:

  • Torch 2.0.110

@sykimm

sykimm commented Mar 4, 2024

I faced the same issue, so I built the wheel file from source (v2.1.10+xpu) and installed it.
But the problem still exists.

Environment

  • Intel Core i7-1260P (Intel Iris Xe Graphics for its integrated graphics)
  • Ubuntu 22.04
  • python3.10
  • torch2.1.0a0
  • I set AOT=gen12lp

Have I set the wrong AOT target?

@min-jean-cho
Contributor

@sykimm, please use USE_AOT_DEVLIST to specify the targeted hardware. Please see https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/technical_details/AOT.md#use-case for a table of supported hardware and USE_AOT_DEVLIST settings.

8 participants