[IPEX][XPU][Windows 11] It takes forever to run the first pass #399
Takes about 5-10 minutes to start generating the first image in SD.Next for me on native Windows. It also takes about 257.9s to start up. After the first image I can start generating pretty quickly; when loading a new model it sometimes takes a few minutes, but not as long as the first time, and over time it's actually fast like it should be. This may have something to do with the lack of a supported torchvision, but I'm not sure. |
The wheel files were not compiled with AOT enabled, so the first iteration will take longer. |
Just now I failed to compile from source again; it would be great if prebuilt wheel files with an AOT device specified were uploaded.
Error log:
2 warnings generated.
[1044/1049] Linking CXX static library csrc\gpu\oneDNN\src\dnnl.lib
ignoring unknown argument: -fsycl
ignoring unknown argument: -Wno-unknown-argument
ignoring unknown argument: -Qoption,link,/machine:x64
[1048/1049] Linking CXX shared library csrc\gpu\intel-ext-pt-gpu.dll
FAILED: csrc/gpu/intel-ext-pt-gpu.dll csrc/gpu/intel-ext-pt-gpu.lib
cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_dll --intdir=csrc\gpu\CMakeFiles\intel-ext-pt-gpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100220~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100220~1.0\x64\mt.exe --manifests -- C:\oneAPI\compiler\2023.2.0\windows\bin\icx.exe /nologo @CMakeFiles\intel-ext-pt-gpu.rsp -LD /Qoption,link,/machine:x64 -rdynamic -Wl,-Bsymbolic-functions /INCREMENTAL:NO -fsycl /EHsc -fsycl-max-parallel-link-jobs=16 -fsycl-targets=spir64_gen,spir64 -fsycl-link-huge-device-code /Xs "-device ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels'" -link /out:csrc\gpu\intel-ext-pt-gpu.dll /implib:csrc\gpu\intel-ext-pt-gpu.lib /pdb:csrc\gpu\intel-ext-pt-gpu.pdb /version:0.0 && cd ."
LINK: command "C:\oneAPI\compiler\2023.2.0\windows\bin\icx.exe /nologo @CMakeFiles\intel-ext-pt-gpu.rsp -LD /Qoption,link,/machine:x64 -rdynamic -Wl,-Bsymbolic-functions /INCREMENTAL:NO -fsycl /EHsc -fsycl-max-parallel-link-jobs=16 -fsycl-targets=spir64_gen,spir64 -fsycl-link-huge-device-code /Xs -device ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels' -link /out:csrc\gpu\intel-ext-pt-gpu.dll /implib:csrc\gpu\intel-ext-pt-gpu.lib /pdb:csrc\gpu\intel-ext-pt-gpu.pdb /version:0.0 /MANIFEST /MANIFESTFILE:csrc\gpu\intel-ext-pt-gpu.dll.manifest" failed (exit code 1120) with the following output:
icx: warning: unknown argument ignored in clang-cl: '-rdynamic' [-Wunknown-argument]
icx: warning: unknown argument ignored in clang-cl: '-fsycl-link-huge-device-code' [-Wunknown-argument]
icx: warning: argument unused during compilation: '-EHsc' [-Wunused-command-line-argument]
Creating library csrc\gpu\intel-ext-pt-gpu.lib and object csrc\gpu\intel-ext-pt-gpu.exp
reducer-86aa6b.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: virtual void __cdecl c10d::Timer::record(enum c10d::Timer::Event)" (__imp_?record@Timer@c10d@@UEAAXW4Event@12@@Z) referenced in function "public: virtual void __cdecl c10d::`anonymous namespace'::XPUTimer::record(enum A0xBD354422::Timer::Event)" (?record@XPUTimer@?A0xBD354422@c10d@@UEAAXW4Event@Timer@2@@Z)
Hint on symbols that are defined and could potentially match:
"__declspec(dllimport) public: virtual void __cdecl c10::impl::DeviceGuardImplInterface::record(void * *,class c10::Stream const &,signed char,enum c10::EventFlag)const " (__imp_?record@DeviceGuardImplInterface@impl@c10@@UEBAXPEAPEAXAEBVStream@3@CW4EventFlag@3@@Z)
reducer-86aa6b.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) class c10::Registry >,struct c10::Device> * __cdecl c10d::TimerRegistry(void)" (__imp_?TimerRegistry@c10d@@YAPEAV?$Registry@W4DeviceType@c10@@V?$unique_ptr@VTimer@c10d@@U?$default_delete@VTimer@c10d@@@std@@@std@@UDevice@2@@c10@@XZ) referenced in function _GLOBAL__sub_I_reducer_e9db2c.cpp
reducer-86aa6b.obj : error LNK2001: unresolved external symbol "public: virtual void __cdecl c10d::Timer::record(enum c10d::Timer::Event)" (?record@Timer@c10d@@UEAAXW4Event@12@@Z)
C:\Users\playe\AppData\Local\Temp\icx-71bffa\fusion_pass-b19d20.out : fatal error LNK1120: 3 unresolved externals
icx: error: linker command failed with exit code 1120 (use -v to see invocation)
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 1103, in
setup(
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
return run_commands(dist)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\playe\miniconda3\lib\site-packages\wheel\bdist_wheel.py", line 325, in run
self.run_command("build")
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\command\build.py", line 131, in run
self.run_command(cmd_name)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 1072, in run
self.run_command("build_clib")
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 783, in run
_build_project(build_args, ipex_xpu_build_dir, my_env, use_ninja)
File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 554, in _build_project
check_call(["ninja"] + build_args, cwd=build_dir, env=build_env)
File "C:\Users\playe\miniconda3\lib\subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ninja', '-j', '16', 'install']' returned non-zero exit status 1.
|
Thanks @whchan05, @Mindset-Official. The first kernel run without AOT compilation is expected to take a little longer. AOT compilation will significantly reduce the first kernel run time. In the meantime, feel free to try building from source with AOT as described here. |
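For anyone attempting that build, an AOT-enabled invocation on Windows looks roughly like the sketch below. This is an unverified outline, not the official recipe: the oneAPI path is an assumption, and `ats-m150` is simply the device name that appears in the build log above — substitute the AOT target that matches your GPU. If `USE_AOT_DEVLIST` is not picked up by your checkout, check the AOT section of the IPEX installation docs for the exact variable name.

```shell
:: Windows cmd sketch (assumptions: oneAPI install path, device name).
call "C:\oneAPI\setvars.bat"          :: activate the oneAPI compiler/MKL env
set USE_AOT_DEVLIST=ats-m150          :: AOT target device list (from the log)
python setup.py bdist_wheel           :: build the IPEX wheel with AOT kernels
```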
Thanks @whchan05, we will shortly investigate the issue. |
Docker is not required for IPEX on native Windows. |
I have attempted to build from source with AOT, but it takes a long time during the compile (about 6 hours) when reaching this point: [1046/1049] Linking CXX static library csrc\gpu\oneDNN\src\dnnl.lib. I was able to build the wheel files for IPEX, but then I run into this error: Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
I am unsure if I am missing something or if there is an issue with AOT on Windows? Thanks for the help. My system is Win 11 |
Consider compiling the prebuilt wheels with AOT enabled, especially the Windows native variant. It should be a much better experience, and you only need to support Arc dGPUs, as PVC is not available as a workstation card and I suspect will not run on Windows (unless both of those change in the near future). |
Thanks @Mindset-Official, we are aware of the long AOT build time. Even though an AOT build is expected to take longer than a non-AOT (i.e., JIT) build, this build time (~6 hrs) on Windows is unreasonable. This is under investigation. |
Thanks @Vipitis for the feedback, we will take it into account. |
Found an error in the linked doc: the first sentence claims the prebuilt wheels have AOT enabled for both GPU devices. Could you confirm what the wheels hosted on https://developer.intel.com/ipex-whl-stable-xpu are built with?
Hi, that sentence applies to wheel files for Linux. Wheel files for Windows don't have AOT enabled yet. |
Is there any ETA for Windows AOT wheels (and also the compatible torchvision wheel)? JIT compilation takes too long and makes IPEX almost unusable on native windows :-/ |
Bump. Would really like to see this issue fixed in the near future. |
I compiled the wheels for IPEX myself. It took around 5.5 hours, 3.5 hours of which were the final two steps. With AOT there is still a short delay on first inference, but it's on the order of 10 seconds, not 10 minutes. |
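To put numbers on that first-inference delay, the call can be wrapped in a small timing decorator like the generic sketch below; `infer` here is a hypothetical stand-in, not SD.Next or IPEX API:

```python
import time
from functools import wraps

def log_call_latency(fn):
    """Report per-call wall time so the warm-up (first) call stands out."""
    count = {"n": 0}
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        dt = time.perf_counter() - t0
        count["n"] += 1
        label = "warm-up" if count["n"] == 1 else "warm"
        print(f"call {count['n']} ({label}): {dt:.3f}s")
        return out
    return wrapper

@log_call_latency
def infer(prompt):
    # hypothetical stand-in for a real pipeline call
    return f"image for {prompt!r}"

first = infer("a cat")   # on a real JIT build, this call is the slow one
second = infer("a cat")
```

Comparing the first and second reported times makes it easy to tell a JIT wheel (minutes of warm-up) from an AOT wheel (seconds).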
Thanks @Vipitis for sharing; yes, the majority of the time is taken by the final linking step. |
What device did you use for the AOT option when compiling? Using "xpu" results in |
@ereish64, please specify the target device in |
Thanks! I think that would be a good link to have in the compiling-from-source section of the Windows GPU installation guide. I would do it myself, but I don't think I can make a pull request for that document. |
Thanks @ereish64 for the recommendation, we will add the link to the Windows guide. |
one trick: you don't need to compile torch with the patches from source. The wheels are available. That will save some time and space, since torch has a lot of dependencies if you get everything from source. |
@Vipitis Have you seen the following error with your AOT IPEX wheel? I got this error when I tried to generate a 1024x1024 image (SD.Next original backend) for the second time (yes, the first 1024x1024 image can be generated).
It seems to be an unexpected OOM issue (only 8GB/16GB was occupied when the error happened). |
@Nuullll I can confirm that is the error I usually see when I run out of memory. I only have the 8GB GPU though. Wondering if there's a hardcoded cap in the AOT compile somewhere... |
Try updating SD.Next; I believe this was an issue with a recent update that may be fixed now. |
@ereish64 @Mindset-Official Thanks! I built the IPEX xpu-master AOT wheel and tried it with a fresh sd.next install (hacked a few LOC though, will submit the change to sd.next). Everything seems to work fine right now! |
For anyone having this issue in the future, here's a compiled wheel file so that you don't have to compile it yourself
Will install:
|
I faced the same issue, so I built the wheel file from source (v2.1.10+xpu) and installed it. Environment:
Have I set the wrong AOT? |
@sykimm, please use |
Describe the issue
Title. Once the first pass is complete, subsequent passes run faster. This seems to only happen when the device is set to XPU.
XPU:
CPU:
Environment:
CPU: 5700X
GPU: A770LE
RAM: 96GB
OS: Win 11
oneAPI:
DPC++ Compiler: 2023.2.0
MKL: 2023.2.0
Torch: 2.0.0a0
IPEX: 2.0.110+gitba7f6c1
Driver:
31.0.101.4577 WHQL
Others:
Miniconda: 23.5.2
Python: 3.10.12
Stable Diffusion WebUI: https://github.com/jbaboval/stable-diffusion-webui Commit: 197dedd