
[IPEX][XPU][Windows 11] It takes forever to run the first pass #399

Open
whchan05 opened this issue Aug 5, 2023 · 28 comments
Labels
Performance Windows XPU/GPU XPU/GPU specific issues

Comments

@whchan05

whchan05 commented Aug 5, 2023

Describe the issue

As the title says: the first pass takes an extremely long time. Once the first pass is complete, subsequent passes run much faster. This seems to happen only when the device is set to XPU.

XPU: [screenshot: "before loading time xpu"]

CPU: [screenshot: "after loading time cpu"]

Environment:

CPU: 5700X
GPU: A770LE
RAM: 96GB
OS: Win 11

oneAPI:

DPC++ Compiler: 2023.2.0
MKL: 2023.2.0
Torch: 2.0.0a0
IPEX: 2.0.110+gitba7f6c1

Driver:

31.0.101.4577 WHQL

Others:

Miniconda: 23.5.2
Python: 3.10.12
Stable Diffusion WebUI: https://github.com/jbaboval/stable-diffusion-webui Commit: 197dedd

@Mindset-Official

It takes about 5-10 minutes to start generating the first image in SD.Next for me on native Windows, and startup takes about 257.9 s. After the first image it starts generating pretty quickly; loading a new model sometimes takes a few minutes, but not as long as the first time, and after that it's actually as fast as it should be. This may have something to do with the lack of a supported torchvision, but I'm not sure.

@jingxu10 jingxu10 added XPU/GPU XPU/GPU specific issues Performance Windows labels Aug 6, 2023
@jingxu10
Contributor

jingxu10 commented Aug 7, 2023

The wheel files were not compiled with AOT enabled, so the first iteration takes a longer time.
@min-jean-cho FYI.

@whchan05
Author

whchan05 commented Aug 7, 2023

Just now I failed to compile from source again; it would be great if prebuilt wheel files with the AOT device specified were uploaded.

Error Log

2 warnings generated.
[1044/1049] Linking CXX static library csrc\gpu\oneDNN\src\dnnl.lib
ignoring unknown argument: -fsycl
ignoring unknown argument: -Wno-unknown-argument
ignoring unknown argument: -Qoption,link,/machine:x64
[1048/1049] Linking CXX shared library csrc\gpu\intel-ext-pt-gpu.dll
FAILED: csrc/gpu/intel-ext-pt-gpu.dll csrc/gpu/intel-ext-pt-gpu.lib
cmd.exe /C "cd . && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_dll --intdir=csrc\gpu\CMakeFiles\intel-ext-pt-gpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100220~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100220~1.0\x64\mt.exe --manifests -- C:\oneAPI\compiler\2023.2.0\windows\bin\icx.exe /nologo @CMakeFiles\intel-ext-pt-gpu.rsp -LD /Qoption,link,/machine:x64 -rdynamic -Wl,-Bsymbolic-functions /INCREMENTAL:NO -fsycl /EHsc -fsycl-max-parallel-link-jobs=16 -fsycl-targets=spir64_gen,spir64 -fsycl-link-huge-device-code /Xs "-device ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels'" -link /out:csrc\gpu\intel-ext-pt-gpu.dll /implib:csrc\gpu\intel-ext-pt-gpu.lib /pdb:csrc\gpu\intel-ext-pt-gpu.pdb /version:0.0 && cd ."
LINK: command "C:\oneAPI\compiler\2023.2.0\windows\bin\icx.exe /nologo @CMakeFiles\intel-ext-pt-gpu.rsp -LD /Qoption,link,/machine:x64 -rdynamic -Wl,-Bsymbolic-functions /INCREMENTAL:NO -fsycl /EHsc -fsycl-max-parallel-link-jobs=16 -fsycl-targets=spir64_gen,spir64 -fsycl-link-huge-device-code /Xs -device ats-m150 -options ' -cl-intel-enable-auto-large-GRF-mode -cl-poison-unsupported-fp64-kernels' -link /out:csrc\gpu\intel-ext-pt-gpu.dll /implib:csrc\gpu\intel-ext-pt-gpu.lib /pdb:csrc\gpu\intel-ext-pt-gpu.pdb /version:0.0 /MANIFEST /MANIFESTFILE:csrc\gpu\intel-ext-pt-gpu.dll.manifest" failed (exit code 1120) with the following output:
icx: warning: unknown argument ignored in clang-cl: '-rdynamic' [-Wunknown-argument]
icx: warning: unknown argument ignored in clang-cl: '-fsycl-link-huge-device-code' [-Wunknown-argument]
icx: warning: argument unused during compilation: '-EHsc' [-Wunused-command-line-argument]
Creating library csrc\gpu\intel-ext-pt-gpu.lib and object csrc\gpu\intel-ext-pt-gpu.exp
reducer-86aa6b.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: virtual void __cdecl c10d::Timer::record(enum c10d::Timer::Event)" (__imp_?record@Timer@c10d@@UEAAXW4Event@12@@Z) referenced in function "public: virtual void __cdecl c10d::`anonymous namespace'::XPUTimer::record(enum A0xBD354422::Timer::Event)" (?record@XPUTimer@?A0xBD354422@c10d@@UEAAXW4Event@Timer@2@@Z)
Hint on symbols that are defined and could potentially match:
"__declspec(dllimport) public: virtual void __cdecl c10::impl::DeviceGuardImplInterface::record(void * *,class c10::Stream const &,signed char,enum c10::EventFlag)const " (__imp_?record@DeviceGuardImplInterface@impl@c10@@UEBAXPEAPEAXAEBVStream@3@CW4EventFlag@3@@Z)
reducer-86aa6b.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) class c10::Registry<enum c10::DeviceType,class std::unique_ptr<class c10d::Timer,struct std::default_delete<class c10d::Timer> >,struct c10::Device> * __cdecl c10d::TimerRegistry(void)" (__imp_?TimerRegistry@c10d@@YAPEAV?$Registry@W4DeviceType@c10@@V?$unique_ptr@VTimer@c10d@@U?$default_delete@VTimer@c10d@@@std@@@std@@UDevice@2@@c10@@XZ) referenced in function _GLOBAL__sub_I_reducer_e9db2c.cpp
reducer-86aa6b.obj : error LNK2001: unresolved external symbol "public: virtual void __cdecl c10d::Timer::record(enum c10d::Timer::Event)" (?record@Timer@c10d@@UEAAXW4Event@12@@Z)
C:\Users\playe\AppData\Local\Temp\icx-71bffa\fusion_pass-b19d20.out : fatal error LNK1120: 3 unresolved externals
icx: error: linker command failed with exit code 1120 (use -v to see invocation)
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 1103, in <module>
    setup(
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\__init__.py", line 107, in setup
    return distutils.core.setup(**attrs)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
    return run_commands(dist)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
    dist.run_commands()
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\miniconda3\lib\site-packages\wheel\bdist_wheel.py", line 325, in run
    self.run_command("build")
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\command\build.py", line 131, in run
    self.run_command(cmd_name)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 1072, in run
    self.run_command("build_clib")
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\dist.py", line 1244, in run_command
    super().run_command(command)
  File "C:\Users\playe\miniconda3\lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 783, in run
    _build_project(build_args, ipex_xpu_build_dir, my_env, use_ninja)
  File "C:\Users\playe\bundle\intel-extension-for-pytorch\setup.py", line 554, in _build_project
    check_call(["ninja"] + build_args, cwd=build_dir, env=build_env)
  File "C:\Users\playe\miniconda3\lib\subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ninja', '-j', '16', 'install']' returned non-zero exit status 1.

@min-jean-cho
Contributor

Thanks @whchan05, @Mindset-Official. The first kernel run without AOT compilation is expected to take a little longer. AOT compilation will significantly reduce the first kernel run time. In the meantime, feel free to try building from source with AOT as described here.
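For later readers, a rough sketch of what such an AOT source build could look like. Everything here is an assumption for illustration: the setvars path, the device name, and the build invocation should all be checked against the AOT documentation and your own oneAPI install.

```shell
# Illustrative sketch only -- paths and device name are assumptions, not verified values.
# 1. Activate the oneAPI environment (on Windows cmd, run `call setvars.bat`
#    from the oneAPI install directory instead of sourcing a script).
source /opt/intel/oneapi/setvars.sh

# 2. Choose the AOT target(s) for your GPU before building; "ats-m150" is the
#    example device string that appears in this thread's own build log --
#    verify the right value for your card in the IPEX AOT docs or via
#    `ocloc compile --help`.
export USE_AOT_DEVLIST="ats-m150"

# 3. Build the wheel from the intel-extension-for-pytorch source tree.
python setup.py bdist_wheel
```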

@min-jean-cho
Contributor

> Just now I failed to compile from source again, would be great if prebuilt wheel files with AOT device specified is uploaded

Thanks @whchan05, we will shortly investigate the issue.

@min-jean-cho
Contributor

Docker is not required for IPEX on native windows.

@Mindset-Official

Mindset-Official commented Aug 8, 2023

> Thanks @whchan05, @Mindset-Official. The first kernel run without AOT compilation is expected to take a little longer. AOT compilation will significantly reduce the first kernel run time. In the meantime, feel free to try building from source with AOT as described here.

I have attempted to build from source with AOT, but it takes a long time to compile (about 6 hours) once it reaches this point:

[1046/1049] Linking CXX static library csrc\gpu\oneDNN\src\dnnl.lib
ignoring unknown argument: -fsycl
ignoring unknown argument: -Wno-unknown-argument
ignoring unknown argument: -Qoption,link,/machine:x64
[1047/1049] Linking CXX shared library csrc\gpu\intel-ext-pt-gpu.dll

I was able to build the wheel files for IPEX, but then I run into this error:

Python 3.10.12 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 19:01:18) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import intel_extension_for_pytorch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\mymin\miniconda3\lib\site-packages\intel_extension_for_pytorch\__init__.py", line 100, in <module>
    from . import _inductor
  File "C:\Users\mymin\miniconda3\lib\site-packages\intel_extension_for_pytorch\_inductor\__init__.py", line 1, in <module>
    from torch._inductor.codegen.common import register_backend_for_device
ImportError: cannot import name 'register_backend_for_device' from 'torch._inductor.codegen.common' (C:\Users\mymin\miniconda3\lib\site-packages\torch\_inductor\codegen\common.py)

I am unsure whether I am missing something or there is an issue with AOT on Windows. Thanks for the help.

My system is:

Win 11
AMD 5600
Arc A750
running on a SATA SSD

@Vipitis

Vipitis commented Aug 8, 2023

Consider compiling the prebuilt wheels with AOT enabled, especially the native Windows variant. It should be a much better experience, and you only need to support Arc dGPUs, as PVC is not available as a workstation card and I suspect will not run on Windows (unless both of those change in the near future).
Calling the first model inference takes slightly over 11 minutes on my old system.

@min-jean-cho
Contributor

> I have attempted to build from source with AOT but it takes a long time during the compile (about 6 hours) when reaching this point

Thanks @Mindset-Official, we are aware of the long AOT build time. Even though an AOT build is expected to take longer than a non-AOT (i.e., JIT) build, this build time (~6 hrs) on Windows is unreasonable. This is under investigation.

@min-jean-cho
Contributor

Thanks @Vipitis for the feedback, we will take it into account.

@Vipitis

Vipitis commented Aug 16, 2023

> [...] building from source with AOT as described here.

Found an error in the linked doc: the first sentence claims the prebuilt wheels have AOT enabled for both GPU devices. Could you confirm what the wheels hosted on https://developer.intel.com/ipex-whl-stable-xpu are built with?

@jingxu10
Contributor

jingxu10 commented Aug 16, 2023

Hi, that sentence applies to wheel files for Linux. Wheel files for Windows don't have AOT enabled yet.
Thanks for pointing this out. We will amend the doc.

@Nuullll

Nuullll commented Aug 17, 2023

Is there any ETA for Windows AOT wheels (and also the compatible torchvision wheel)? JIT compilation takes too long and makes IPEX almost unusable on native windows :-/

@ereish64

Bump. Would really like to see this issue fixed in the near future.

@Vipitis

Vipitis commented Sep 11, 2023

I compiled the wheels for ipex myself. It took around 5.5 hours, 3.5 hours of which were the final two steps.

With AOT there still is a short delay on first inference, but it's in the order of 10 seconds, not 10 minutes.
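A simple way to quantify this warm-up effect is to time the first call separately from later calls. The sketch below is a generic, hypothetical harness and does not depend on IPEX at all: the `fn` argument stands in for a real model call (e.g. `lambda: model(sample_input)` on an XPU device), and the stand-in workload at the bottom is only there so the snippet runs anywhere.

```python
import time

def time_first_vs_rest(fn, *args, timed_runs=3):
    """Time the first call of fn (which pays any JIT/warm-up cost)
    separately from the average of subsequent calls, in seconds."""
    start = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(timed_runs):
        fn(*args)
    rest_avg = (time.perf_counter() - start) / timed_runs
    return first, rest_avg

# Stand-in workload; with IPEX you would pass the model call instead,
# e.g. time_first_vs_rest(lambda: model(sample_input)).
first, rest = time_first_vs_rest(lambda: sum(i * i for i in range(100_000)))
print(f"first call: {first:.3f}s, subsequent avg: {rest:.3f}s")
```

With a JIT-compiled IPEX wheel the gap between `first` and `rest` would be large (minutes vs. seconds in the reports above); with AOT it should shrink to a short delay.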

@min-jean-cho
Contributor

Thanks @Vipitis for sharing. Yes, the majority of the time is taken by the final linking step.

@ereish64

What device did you use for the AOT option when compiling? Using "xpu" results in "Could not determine device target: xpu".

@min-jean-cho
Contributor

@ereish64, please specify the target device in the USE_AOT_DEVLIST build option. Please see here for a list of USE_AOT_DEVLIST settings for the Intel® Data Center GPU Flex Series and Intel® Arc™ A-Series GPUs. You may also find it helpful to run ocloc compile --help with the standalone ocloc compiler from your oneAPI Base Toolkit; it will show a list of valid target devices under the -device option.
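To make the advice above concrete, here is a small illustrative lookup. The GPU-family-to-device-string mapping is an assumption written from memory, not an authoritative table; always confirm the exact strings against the IPEX AOT documentation and `ocloc compile --help` before building.

```python
# Hypothetical mapping of GPU families to candidate USE_AOT_DEVLIST values.
# These strings are illustrative assumptions -- verify against the IPEX AOT
# docs and the output of `ocloc compile --help` before use.
AOT_DEVLIST = {
    "Intel Data Center GPU Flex Series": "ats-m150",
    "Intel Arc A-Series (Alchemist)": "acm-g10",
    "Intel Data Center GPU Max (PVC)": "pvc",
}

def suggest_aot_devlist(gpu_family: str) -> str:
    """Return a candidate USE_AOT_DEVLIST value, or raise with a hint."""
    try:
        return AOT_DEVLIST[gpu_family]
    except KeyError:
        raise ValueError(
            f"Unknown GPU family {gpu_family!r}; run `ocloc compile --help` "
            "to list valid -device targets."
        )

print(suggest_aot_devlist("Intel Arc A-Series (Alchemist)"))
```

Note that "xpu" is a PyTorch device name, not an ocloc hardware target, which is why passing it to the AOT option fails.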

@ereish64

Thanks! I think that would be a good link to have in the compiling-from-source section of the Windows GPU installation guide.

I would do it myself, but I don't think I can make a pull request for that document.

@min-jean-cho
Contributor

Thanks @ereish64 for the recommendation, we will add the link to the Windows guide.

@Vipitis

Vipitis commented Sep 11, 2023

One trick: you don't need to compile the patched torch from source. The wheels are available, which saves some time and space, since torch pulls in a lot of dependencies if you build everything from source.

@Nuullll

Nuullll commented Sep 18, 2023

@Vipitis Have you seen the following error with your AOT IPEX wheel? I got this error when I tried to generate a 1024x1024 image (SD.Next original backend) for the second time (yes, the first 1024x1024 image can be generated).

RuntimeError: Native API failed. Native API returns: -997 (Command failed to enqueue/execute)

It seems to be an unexpected OOM issue (only 8GB/16GB was occupied when the error happened).
And the version of my AOT wheel was xpu-2.0.110. Should I try xpu-master instead?

@ereish64

@Nuullll I can confirm that is the error I usually see when I run out of memory. I only have the 8GB GPU though.

Wondering if there's a hardcoded cap in the AOT compile somewhere...

@Mindset-Official

> @Vipitis Have you seen the following error with your AOT IPEX wheel? I got this error when I tried to generate a 1024x1024 image (SD.Next original backend) for the second time (yes, the first 1024x1024 image can be generated).
>
> RuntimeError: Native API failed. Native API returns: -997 (Command failed to enqueue/execute)
>
> It seems to be an unexpected OOM issue (only 8GB/16GB was occupied when the error happened). And the version of my AOT wheel was xpu-2.0.110. Should I try xpu-master instead?

Try updating SD.Next; I believe this was an issue with a recent update that may be fixed now.

@Nuullll

Nuullll commented Sep 19, 2023

@ereish64 @Mindset-Official Thanks! I built the IPEX xpu-master AOT wheel and tried it with a fresh sd.next install (hacked a few LOC though, will submit the change to sd.next). Everything seems to work fine right now!
IPEX (AOT) on native Windows is ~20% slower than under WSL.

@ereish64

ereish64 commented Nov 3, 2023

For anyone having this issue in the future, here's a compiled wheel file so that you don't have to compile it yourself
Make sure you're running:

  • Python 3.9 (64 Bit)

Will install:

  • Torch 2.0.110

@sykimm

sykimm commented Mar 4, 2024

I faced the same issue, so I built the wheel file from source (v2.1.10+xpu) and installed it.
But the problem still exists.

Environment

  • Intel Core i7-1260P (Intel Iris Xe Graphics for its integrated graphics)
  • Ubuntu 22.04
  • python3.10
  • torch2.1.0a0
  • I set AOT=gen12lp

Have I set the wrong AOT target?

@min-jean-cho
Contributor

@sykimm, please use USE_AOT_DEVLIST to specify the targeted hardware. Please see https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/technical_details/AOT.md#use-case for a table of supported hardware and USE_AOT_DEVLIST settings.

8 participants