kfd not supported on this ASIC for vega frontier edition #57

Closed
akostadinov opened this issue Oct 5, 2018 · 12 comments

akostadinov commented Oct 5, 2018

This is on a very clean, just-installed RHEL 7.5 with a Core 2 Duo 2GHz, Gigabyte GA-P35-DS3L, air-cooled Vega Frontier Edition, and rocm-dkms-1.9.211-1.x86_64. I just did a clean install of RHEL, ran yum update, and followed the ROCm installation document.

But things are not working well. I see in dmesg.txt:

[    2.131266] amdgpu 0000:03:00.0: kfd not supported on this ASIC

To get module loaded I had to add the following modprobe option and rebuild initrd (dracut -f):

# cat /etc/modprobe.d/00local.conf 
options amdgpu exp_hw_support=1
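After rebuilding the initrd and rebooting, it can be useful to confirm that the option actually reached the loaded module. The following is a minimal sketch, not an official ROCm tool: `module_param` is a hypothetical helper, and it assumes the driver exposes the parameter as a readable file under `/sys/module/<module>/parameters/`, which depends on how the module declares it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ROCm): print the value of a module
# parameter as reported by sysfs, or "unavailable" if it is not exposed.
# The third argument overrides the sysfs root and exists only for testing.
module_param() {
    local module="$1" param="$2" root="${3:-/sys/module}"
    local f="${root}/${module}/parameters/${param}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
        return 1
    fi
}

# Live usage:
#   module_param amdgpu exp_hw_support
```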

Some other output as advised in ROCm/ROCm#415:

# /opt/rocm/bin/rocm-smi 
====================    ROCm System Management Interface    ====================
================================================================================
 GPU  Temp    AvgPwr   SCLK     MCLK     Fan      Perf    SCLK OD    MCLK OD
  0   33c     22.0W    852Mhz   167Mhz   15.69%   auto      0%         0%       
================================================================================
====================           End of ROCm SMI Log          ====================
# /opt/rocm/bin/rocminfo 
hsa api call failure at line 900, file: /home/1019/git/rocm-rel-1.9-211/rocminfo/rocminfo.cc. Call returned 4104
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XTX [Radeon Vega Frontier Edition] (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 6b76
	Flags: bus master, fast devsel, latency 0, IRQ 30
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at e0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at b000 [size=256]
	Memory at f5000000 (32-bit, non-prefetchable) [size=512K]
	[virtual] Expansion ROM at f4000000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] #15
	Capabilities: [270] #19
	Capabilities: [2a0] Access Control Services
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
# uname -r
3.10.0-862.14.4.el7.x86_64
# dkms status
amdgpu, 1.9-211.el7, 3.10.0-862.el7.x86_64, x86_64: installed (original_module exists)

# modinfo amdgpu
filename:       /lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz
license:        GPL and additional rights
description:    AMD GPU
author:         AMD linux driver team
firmware:       amdgpu/raven_gpu_info.bin
firmware:       amdgpu/vega10_gpu_info.bin
firmware:       amdgpu/topaz_mc.bin
firmware:       radeon/hawaii_mc.bin
firmware:       radeon/bonaire_mc.bin
firmware:       amdgpu/polaris12_mc.bin
firmware:       amdgpu/polaris10_mc.bin
firmware:       amdgpu/polaris11_mc.bin
firmware:       amdgpu/tonga_mc.bin
firmware:       amdgpu/vega10_asd.bin
firmware:       amdgpu/vega10_sos.bin
firmware:       amdgpu/polaris12_rlc.bin
firmware:       amdgpu/polaris12_mec2.bin
firmware:       amdgpu/polaris12_mec.bin
firmware:       amdgpu/polaris12_me.bin
firmware:       amdgpu/polaris12_pfp.bin
firmware:       amdgpu/polaris12_ce.bin
firmware:       amdgpu/polaris10_rlc.bin
firmware:       amdgpu/polaris10_mec2.bin
firmware:       amdgpu/polaris10_mec.bin
firmware:       amdgpu/polaris10_me.bin
firmware:       amdgpu/polaris10_pfp.bin
firmware:       amdgpu/polaris10_ce.bin
firmware:       amdgpu/polaris11_rlc.bin
firmware:       amdgpu/polaris11_mec2.bin
firmware:       amdgpu/polaris11_mec.bin
firmware:       amdgpu/polaris11_me.bin
firmware:       amdgpu/polaris11_pfp.bin
firmware:       amdgpu/polaris11_ce.bin
firmware:       amdgpu/fiji_rlc.bin
firmware:       amdgpu/fiji_mec2.bin
firmware:       amdgpu/fiji_mec.bin
firmware:       amdgpu/fiji_me.bin
firmware:       amdgpu/fiji_pfp.bin
firmware:       amdgpu/fiji_ce.bin
firmware:       amdgpu/topaz_rlc.bin
firmware:       amdgpu/topaz_mec.bin
firmware:       amdgpu/topaz_me.bin
firmware:       amdgpu/topaz_pfp.bin
firmware:       amdgpu/topaz_ce.bin
firmware:       amdgpu/tonga_rlc.bin
firmware:       amdgpu/tonga_mec2.bin
firmware:       amdgpu/tonga_mec.bin
firmware:       amdgpu/tonga_me.bin
firmware:       amdgpu/tonga_pfp.bin
firmware:       amdgpu/tonga_ce.bin
firmware:       amdgpu/stoney_rlc.bin
firmware:       amdgpu/stoney_mec.bin
firmware:       amdgpu/stoney_me.bin
firmware:       amdgpu/stoney_pfp.bin
firmware:       amdgpu/stoney_ce.bin
firmware:       amdgpu/carrizo_rlc.bin
firmware:       amdgpu/carrizo_mec2.bin
firmware:       amdgpu/carrizo_mec.bin
firmware:       amdgpu/carrizo_me.bin
firmware:       amdgpu/carrizo_pfp.bin
firmware:       amdgpu/carrizo_ce.bin
firmware:       amdgpu/raven_rlc.bin
firmware:       amdgpu/raven_mec2.bin
firmware:       amdgpu/raven_mec.bin
firmware:       amdgpu/raven_me.bin
firmware:       amdgpu/raven_pfp.bin
firmware:       amdgpu/raven_ce.bin
firmware:       amdgpu/vega10_rlc.bin
firmware:       amdgpu/vega10_mec2.bin
firmware:       amdgpu/vega10_mec.bin
firmware:       amdgpu/vega10_me.bin
firmware:       amdgpu/vega10_pfp.bin
firmware:       amdgpu/vega10_ce.bin
firmware:       amdgpu/topaz_sdma1.bin
firmware:       amdgpu/topaz_sdma.bin
firmware:       amdgpu/polaris12_sdma1.bin
firmware:       amdgpu/polaris12_sdma.bin
firmware:       amdgpu/polaris11_sdma1.bin
firmware:       amdgpu/polaris11_sdma.bin
firmware:       amdgpu/polaris10_sdma1.bin
firmware:       amdgpu/polaris10_sdma.bin
firmware:       amdgpu/stoney_sdma.bin
firmware:       amdgpu/fiji_sdma1.bin
firmware:       amdgpu/fiji_sdma.bin
firmware:       amdgpu/carrizo_sdma1.bin
firmware:       amdgpu/carrizo_sdma.bin
firmware:       amdgpu/tonga_sdma1.bin
firmware:       amdgpu/tonga_sdma.bin
firmware:       amdgpu/raven_sdma.bin
firmware:       amdgpu/vega10_sdma1.bin
firmware:       amdgpu/vega10_sdma.bin
firmware:       amdgpu/vega10_uvd.bin
firmware:       amdgpu/polaris12_uvd.bin
firmware:       amdgpu/polaris11_uvd.bin
firmware:       amdgpu/polaris10_uvd.bin
firmware:       amdgpu/stoney_uvd.bin
firmware:       amdgpu/fiji_uvd.bin
firmware:       amdgpu/carrizo_uvd.bin
firmware:       amdgpu/tonga_uvd.bin
firmware:       amdgpu/vega10_vce.bin
firmware:       amdgpu/polaris12_vce.bin
firmware:       amdgpu/polaris11_vce.bin
firmware:       amdgpu/polaris10_vce.bin
firmware:       amdgpu/stoney_vce.bin
firmware:       amdgpu/fiji_vce.bin
firmware:       amdgpu/carrizo_vce.bin
firmware:       amdgpu/tonga_vce.bin
firmware:       amdgpu/raven_vcn.bin
firmware:       amdgpu/vega10_acg_smc.bin
firmware:       amdgpu/vega10_smc.bin
firmware:       amdgpu/polaris12_smc.bin
firmware:       amdgpu/polaris11_k_smc.bin
firmware:       amdgpu/polaris11_smc_sk.bin
firmware:       amdgpu/polaris11_smc.bin
firmware:       amdgpu/polaris10_k_smc.bin
firmware:       amdgpu/polaris10_smc_sk.bin
firmware:       amdgpu/polaris10_smc.bin
firmware:       amdgpu/fiji_smc.bin
firmware:       amdgpu/tonga_k_smc.bin
firmware:       amdgpu/tonga_smc.bin
firmware:       amdgpu/topaz_k_smc.bin
firmware:       amdgpu/topaz_smc.bin
retpoline:      Y
rhelversion:    7.5
srcversion:     DF6776A72A14F9F4C37BB34
alias:          pci:v00001002d000015DDsv*sd*bc*sc*i*
alias:          pci:v00001002d0000687Fsv*sd*bc*sc*i*
alias:          pci:v00001002d0000686Csv*sd*bc*sc*i*
alias:          pci:v00001002d00006868sv*sd*bc*sc*i*
alias:          pci:v00001002d00006867sv*sd*bc*sc*i*
alias:          pci:v00001002d00006864sv*sd*bc*sc*i*
alias:          pci:v00001002d00006863sv*sd*bc*sc*i*
alias:          pci:v00001002d00006862sv*sd*bc*sc*i*
alias:          pci:v00001002d00006861sv*sd*bc*sc*i*
alias:          pci:v00001002d00006860sv*sd*bc*sc*i*
alias:          pci:v00001002d0000699Fsv*sd*bc*sc*i*
alias:          pci:v00001002d00006997sv*sd*bc*sc*i*
alias:          pci:v00001002d00006995sv*sd*bc*sc*i*
alias:          pci:v00001002d00006987sv*sd*bc*sc*i*
alias:          pci:v00001002d00006986sv*sd*bc*sc*i*
alias:          pci:v00001002d00006985sv*sd*bc*sc*i*
alias:          pci:v00001002d00006981sv*sd*bc*sc*i*
alias:          pci:v00001002d00006980sv*sd*bc*sc*i*
alias:          pci:v00001002d000067CFsv*sd*bc*sc*i*
alias:          pci:v00001002d000067CCsv*sd*bc*sc*i*
alias:          pci:v00001002d000067CAsv*sd*bc*sc*i*
alias:          pci:v00001002d000067C9sv*sd*bc*sc*i*
alias:          pci:v00001002d000067C8sv*sd*bc*sc*i*
alias:          pci:v00001002d000067DFsv*sd*bc*sc*i*
alias:          pci:v00001002d000067D0sv*sd*bc*sc*i*
alias:          pci:v00001002d000067C7sv*sd*bc*sc*i*
alias:          pci:v00001002d000067C4sv*sd*bc*sc*i*
alias:          pci:v00001002d000067C2sv*sd*bc*sc*i*
alias:          pci:v00001002d000067C1sv*sd*bc*sc*i*
alias:          pci:v00001002d000067C0sv*sd*bc*sc*i*
alias:          pci:v00001002d000067E9sv*sd*bc*sc*i*
alias:          pci:v00001002d000067E7sv*sd*bc*sc*i*
alias:          pci:v00001002d000067E1sv*sd*bc*sc*i*
alias:          pci:v00001002d000067FFsv*sd*bc*sc*i*
alias:          pci:v00001002d000067EFsv*sd*bc*sc*i*
alias:          pci:v00001002d000067EBsv*sd*bc*sc*i*
alias:          pci:v00001002d000067E8sv*sd*bc*sc*i*
alias:          pci:v00001002d000067E3sv*sd*bc*sc*i*
alias:          pci:v00001002d000067E0sv*sd*bc*sc*i*
alias:          pci:v00001002d000098E4sv*sd*bc*sc*i*
alias:          pci:v00001002d00009877sv*sd*bc*sc*i*
alias:          pci:v00001002d00009876sv*sd*bc*sc*i*
alias:          pci:v00001002d00009875sv*sd*bc*sc*i*
alias:          pci:v00001002d00009874sv*sd*bc*sc*i*
alias:          pci:v00001002d00009870sv*sd*bc*sc*i*
alias:          pci:v00001002d0000730Fsv*sd*bc*sc*i*
alias:          pci:v00001002d00007300sv*sd*bc*sc*i*
alias:          pci:v00001002d00006939sv*sd*bc*sc*i*
alias:          pci:v00001002d00006938sv*sd*bc*sc*i*
alias:          pci:v00001002d00006930sv*sd*bc*sc*i*
alias:          pci:v00001002d0000692Fsv*sd*bc*sc*i*
alias:          pci:v00001002d0000692Bsv*sd*bc*sc*i*
alias:          pci:v00001002d00006929sv*sd*bc*sc*i*
alias:          pci:v00001002d00006928sv*sd*bc*sc*i*
alias:          pci:v00001002d00006921sv*sd*bc*sc*i*
alias:          pci:v00001002d00006920sv*sd*bc*sc*i*
alias:          pci:v00001002d00006907sv*sd*bc*sc*i*
alias:          pci:v00001002d00006903sv*sd*bc*sc*i*
alias:          pci:v00001002d00006902sv*sd*bc*sc*i*
alias:          pci:v00001002d00006901sv*sd*bc*sc*i*
alias:          pci:v00001002d00006900sv*sd*bc*sc*i*
depends:        drm,drm_kms_helper,ttm,i2c-core,i2c-algo-bit
intree:         Y
vermagic:       3.10.0-862.14.4.el7.x86_64 SMP mod_unload modversions 
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        76:90:84:49:F9:08:40:6C:BF:55:67:B9:55:4D:78:FC:18:76:E5:74
sig_hashalgo:   sha256
parm:           vramlimit:Restrict VRAM for testing, in megabytes (int)
parm:           vis_vramlimit:Restrict visible VRAM for testing, in megabytes (int)
parm:           gartsize:Size of GART to setup in megabytes (32, 64, etc., -1=auto) (uint)
parm:           gttsize:Size of the GTT domain in megabytes (-1 = auto) (int)
parm:           moverate:Maximum buffer migration rate in MB/s. (32, 64, etc., -1=auto, 0=1=disabled) (int)
parm:           benchmark:Run benchmark (int)
parm:           test:Run tests (int)
parm:           audio:Audio enable (-1 = auto, 0 = disable, 1 = enable) (int)
parm:           disp_priority:Display Priority (0 = auto, 1 = normal, 2 = high) (int)
parm:           hw_i2c:hw i2c engine enable (0 = disable) (int)
parm:           pcie_gen2:PCIE Gen2 mode (-1 = auto, 0 = disable, 1 = enable) (int)
parm:           msi:MSI support (1 = enable, 0 = disable, -1 = auto) (int)
parm:           lockup_timeout:GPU lockup timeout in ms (default 0 = disable) (int)
parm:           dpm:DPM support (1 = enable, 0 = disable, -1 = auto) (int)
parm:           fw_load_type:firmware loading type (0 = direct, 1 = SMU, 2 = PSP, -1 = auto) (int)
parm:           aspm:ASPM support (1 = enable, 0 = disable, -1 = auto) (int)
parm:           runpm:PX runtime pm (1 = force enable, 0 = disable, -1 = PX only default) (int)
parm:           ip_block_mask:IP Block Mask (all blocks enabled (default)) (uint)
parm:           bapm:BAPM support (1 = enable, 0 = disable, -1 = auto) (int)
parm:           deep_color:Deep Color support (1 = enable, 0 = disable (default)) (int)
parm:           vm_size:VM address space size in gigabytes (default 64GB) (int)
parm:           vm_fragment_size:VM fragment size in bits (4, 5, etc. 4 = 64K (default), Max 9 = 2M) (int)
parm:           vm_block_size:VM page table size in bits (default depending on vm_size) (int)
parm:           vm_fault_stop:Stop on VM fault (0 = never (default), 1 = print first, 2 = always) (int)
parm:           vm_debug:Debug VM handling (0 = disabled (default), 1 = enabled) (int)
parm:           vm_update_mode:VM update using CPU (0 = never (default except for large BAR(LB)), 1 = Graphics only, 2 = Compute only (default for LB), 3 = Both (int)
parm:           vram_page_split:Number of pages after we split VRAM allocations (default 512, -1 = disable) (int)
parm:           exp_hw_support:experimental hw support (1 = enable, 0 = disable (default)) (int)
parm:           sched_jobs:the max number of jobs supported in the sw queue (default 32) (int)
parm:           sched_hw_submission:the max number of HW submissions (default 2) (int)
parm:           ppfeaturemask:all power features enabled (default)) (uint)
parm:           no_evict:Support pinning request from user space (1 = enable, 0 = disable (default)) (int)
parm:           direct_gma_size:Direct GMA size in megabytes (max 96MB) (int)
parm:           pcie_gen_cap:PCIE Gen Caps (0: autodetect (default)) (uint)
parm:           pcie_lane_cap:PCIE Lane Caps (0: autodetect (default)) (uint)
parm:           cg_mask:Clockgating flags mask (0 = disable clock gating) (uint)
parm:           pg_mask:Powergating flags mask (0 = disable power gating) (uint)
parm:           sdma_phase_quantum:SDMA context switch phase quantum (x 1K GPU clock cycles, 0 = no change (default 32)) (uint)
parm:           disable_cu:Disable CUs (se.sh.cu,...) (charp)
parm:           virtual_display:Enable virtual display feature (the virtual_display will be set like xxxx:xx:xx.x,x;xxxx:xx:xx.x,x) (charp)
parm:           ngg:Next Generation Graphics (1 = enable, 0 = disable(default depending on gfx)) (int)
parm:           prim_buf_per_se:the size of Primitive Buffer per Shader Engine (default depending on gfx) (int)
parm:           pos_buf_per_se:the size of Position Buffer per Shader Engine (default depending on gfx) (int)
parm:           cntl_sb_buf_per_se:the size of Control Sideband per Shader Engine (default depending on gfx) (int)
parm:           param_buf_per_se:the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx) (int)
parm:           job_hang_limit:how much time allow a job hang and not drop it (default 0) (int)
parm:           lbpw:Load Balancing Per Watt (LBPW) support (1 = enable, 0 = disable, -1 = auto) (int)

# modinfo amdkfd
filename:       /lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/gpu/drm/amd/amdkfd/amdkfd.ko.xz
version:        0.7.2
license:        GPL and additional rights
description:    Standalone HSA driver for AMD's GPUs
author:         AMD Inc. and others
retpoline:      Y
rhelversion:    7.5
srcversion:     BE4FDC5CFB9735D5DA0516E
depends:        amd_iommu_v2
intree:         Y
vermagic:       3.10.0-862.14.4.el7.x86_64 SMP mod_unload modversions 
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        76:90:84:49:F9:08:40:6C:BF:55:67:B9:55:4D:78:FC:18:76:E5:74
sig_hashalgo:   sha256
parm:           sched_policy:Scheduling policy (0 = HWS (Default), 1 = HWS without over-subscription, 2 = Non-HWS (Used for debugging only) (int)
parm:           max_num_of_queues_per_device:Maximum number of supported queues per device (1 = Minimum, 4096 = default) (int)
parm:           send_sigterm:Send sigterm to HSA process on unhandled exception (0 = disable, 1 = enable) (int)

moved from ROCm/ROCm#572

@fxkamd
Contributor

fxkamd commented Oct 5, 2018

Modinfo is showing the original amdgpu module, not the one installed by dkms. The kernel log shows that it's running the original driver as well (based on the amdgpu driver version).

The kernel version reported by dkms status and uname is also slightly off (3.10.0-862.el7.x86_64 vs. 3.10.0-862.14.4.el7.x86_64). Do you have multiple kernels installed?
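Both mismatches described above can be checked mechanically. This is a minimal sketch under stated assumptions: `dkms_built_for` and `module_origin` are hypothetical helper names, `dkms status` is assumed to print comma-separated lines in the form shown earlier in this thread, and a DKMS-installed module is assumed to live outside the `kernel/drivers/` subtree of `/lib/modules/<release>/` (for example under `extra/` or `updates/`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not part of ROCm or DKMS.

# Succeeds if the `dkms status` text on stdin lists module $1 as
# installed for exactly kernel release $2.
dkms_built_for() {
    local module="$1" kernel="$2"
    grep -E "^${module}[,/]" | grep -F ", ${kernel}, " | grep -q "installed"
}

# Classifies a module path (as printed by `modinfo -n <module>`).
module_origin() {
    case "$1" in
        */kernel/drivers/*) echo "in-tree" ;;
        *)                  echo "out-of-tree" ;;
    esac
}

# Live usage:
#   dkms status | dkms_built_for amdgpu "$(uname -r)" \
#       || echo "no DKMS build installed for the running kernel"
#   module_origin "$(modinfo -n amdgpu)"
```

With the output posted in this issue, the first check fails (DKMS built for 3.10.0-862.el7.x86_64 while 3.10.0-862.14.4.el7.x86_64 is running) and the second reports the in-tree module.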

@jlgreathouse
Contributor

If I had to take a guess: you may have run yum update and then installed rocm-dkms without a reboot between these two steps. Or installed rocm-dkms then ran yum update and our DKMS system didn't rebuild the driver for your new kernel.

Could you run sudo dkms add amdgpu/1.9-211.el7, sudo dkms build amdgpu/1.9-211.el7, and sudo dkms install amdgpu/1.9-211.el7? Finally, if all of that completes without problem, sudo update-initramfs -u -k all
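The rebuild sequence suggested above could be scripted roughly as follows. This is a sketch, not the ROCm installer: `run` and `rebuild_amdgpu_dkms` are hypothetical helpers, the explicit `-k` kernel argument is an addition to make the target kernel unambiguous, and the fallback to `dracut` is an assumption for RHEL/CentOS systems where `update-initramfs` is not available. Run it as root, or set `DRY_RUN=1` to only print the commands.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of ROCm or DKMS.

# Runs a command, or just prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_amdgpu_dkms() {
    local modver="$1"                 # e.g. amdgpu/1.9-211.el7
    local kver="${2:-$(uname -r)}"    # default: the running kernel

    # `dkms add` fails if the module is already registered; that is harmless here.
    run dkms add "$modver" || true
    run dkms build "$modver" -k "$kver" || return 1
    run dkms install "$modver" -k "$kver" || return 1

    # Regenerate the initramfs with whichever tool the distribution provides.
    if command -v update-initramfs >/dev/null 2>&1; then
        run update-initramfs -u -k "$kver"
    else
        run dracut -f "/boot/initramfs-${kver}.img" "$kver"
    fi
}

# Live usage:
#   DRY_RUN=1 rebuild_amdgpu_dkms amdgpu/1.9-211.el7
```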

@akostadinov
Author

akostadinov commented Oct 7, 2018

you may have run yum update and then installed rocm-dkms without a reboot between these two steps

@jlgreathouse , shamefully I didn't realize that there is an existing amdgpu module that needed to be overridden by dkms. I thought that without rocm the card will not be recognized at all so your guess is absolutely correct

Now rocminfo works fine. Also I could compile and run all HIP-Examples.

There are a few issues though.

  • When doing dkms install, it automatically updates initrd. But there are a lot of errors like Possible missing firmware "amdgpu/vega12_sdma.bin" for kernel module "amdgpu.ko". You can see the attached full log above. The same thing happens when I do dracut -f -v. FYI update-initramfs does not exist as a command on my RHEL system. The issue is that module cannot be loaded on boot, I need to rmmod/modprobe once system is booted.
  • Another thing is that I see a trace in dmesg. I can't say what it should mean or its effect but decided to upload it anyway.
  • Finally I'm worried about performance. Initially I tried to use the card on Windows due to lack of PCIe 3 and this is the first time I get it running under Linux. And now Windows is gone because I found no way to compile tensorflow on it. But what I managed to run for a while was ethminer with Athlon 64 x2 (unfortunately switched to Core 2 Duo because the other motherboard was not stable). I got some 37-39 MH/s when the weather was cold. Now with ROCm I'm getting 3.5 MH/s OOB without any tuning. For how I compiled ethminer see Specify OpenCL dir/lib when compiling from source ethereum-mining/ethminer#1324 (comment)
  • Also related to performance is that I see GPU fan capped at 40% using rocm-smi. Maybe I need to change the pp tables. But presently performance I see is so little that it doesn't matter. When I have more time I'd rather focus on other benchmarks first.

When I have some more time I'm gonna try running some tensorflow benchmarks and try to compare with other results I find on the internet. It is a hobby project for me and I would prefer to wait a little bit if ROCm can be fixed instead of messing my system with proprietary drivers. But maybe I can clone the system to another HDD for a try of amdgpu pro to see if there is any difference.

Any help with the above issues is appreciated. Let me know if I can provide more debug info or if you like me to file issues separately.

Thanks a lot!

@jlgreathouse
Contributor

jlgreathouse commented Oct 8, 2018

shamefully I didn't realize that there is an existing amdgpu module that needed to be overridden by dkms. I thought that without rocm the card will not be recognized at all so your guess is absolutely correct

Nothing shameful about this. The ROCm stack has a lot of moving pieces, and it's not like we've written documentation for everything. But yes, most distributions come with the upstream version of amdgpu and amdkfd by default. However, unless these are sufficiently new (4.17, 4.18, etc.), these upstream drivers will not work with the user-land ROCm software stack. As such, the rocm-dkms package installs our customized ROCm drivers using DKMS. However, our DKMS scripts sometimes do not properly build or install the package if a kernel is installed but not yet running when you install the DKMS module.

  • When doing dkms install, it automatically updates initrd. But there are a lot of errors like Possible missing firmware "amdgpu/vega12_sdma.bin" for kernel module "amdgpu.ko". You can see the attached full log above. The same thing happens when I do dracut -f -v. FYI update-initramfs does not exist as a command on my RHEL system. The issue is that module cannot be loaded on boot, I need to rmmod/modprobe once system is booted.

I just attempted this on a CentOS 7.5 installation. Mind you, I'm working with the ROCm 1.9.1 release that we put out late last week:

amdkfd.ko.xz:
Running module version sanity check.
 - Original module
   - Found /lib/modules/3.10.0-862.14.4.el7.x86_64/updates/amdkfd.ko.xz
   - Storing in /var/lib/dkms/amdgpu/original_module/3.10.0-862.14.4.el7.x86_64/x86_64/
   - Archiving for uninstallation purposes
 - Installation
   - Installing to /lib/modules/3.10.0-862.14.4.el7.x86_64/extra/
Adding any weak-modules
Possible missing firmware "amdgpu/vega12_gpu_info.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_asd.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_sos.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_rlc.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_mec2.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_mec.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_me.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_pfp.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_ce.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_rlc.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_mec2.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_mec.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_me.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_pfp.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_ce.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_sdma1.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_sdma.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_sdma1.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_sdma.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_uvd.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_uvd.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_vce.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_vce.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vega12_smc.bin" for kernel module "amdgpu.ko"
Possible missing firmware "amdgpu/vegam_smc.bin" for kernel module "amdgpu.ko"

depmod....

Backing up initramfs-3.10.0-862.14.4.el7.x86_64.img to /boot/initramfs-3.10.0-862.14.4.el7.x86_64.img.old-dkms
Making new initramfs-3.10.0-862.14.4.el7.x86_64.img
(If next boot fails, revert to initramfs-3.10.0-862.14.4.el7.x86_64.img.old-dkms image)
dracut..............

DKMS: install completed.

So some of those missing firmware images are expected, in particular those for our yet-to-be-released "Vega 12" GPUs and the not-yet-supported-in-ROCm "Vega M" GPUs. That said, your installation seems to be missing many more firmware images. Could you show me what files (if any) exist in /usr/src/amdgpu-1.9-211.el7/firmware/amdgpu/? The /usr/src/amdgpu-1.9-211.el7/ directory contains all of the source code that DKMS uses to build the driver, including the firmware blobs that are used to initialize a lot of the IP blocks on the GPU.

The fact that your DKMS install is missing so many firmware files implies to me that something went wrong during the initial download/installation.
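To separate the expected warnings from the ones that matter, the DKMS or dracut log can be filtered. The sketch below is illustrative only: `unexpected_missing_firmware` is a hypothetical helper, it assumes the warning format shown in the log above, and it treats only the Vega 12 and Vega M prefixes as expected, based on the explanation in this comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a DKMS/dracut log on stdin and print the
# firmware files reported missing, excluding the prefixes that are
# expected to be absent (assumption: vega12_* and vegam_* only).
unexpected_missing_firmware() {
    sed -n 's/^.*Possible missing firmware "\([^"]*\)".*$/\1/p' \
        | grep -Ev '^amdgpu/(vega12|vegam)_' \
        | sort -u
}

# Live usage:
#   dracut -f -v 2>&1 | unexpected_missing_firmware
```

An empty result means every warning belongs to the expected set; anything printed (for example a `vega10_*` file) points at a real problem with the firmware search path or the installation.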

If you do yum autoremove rocm-dkms, yum clean all, and yum install rocm-dkms (to get ROCm 1.9.1) then reboot, do you see the same problems?

  • Another thing is that I see a trace in dmesg. I can't say what it should mean or its effect but decided to upload it anyway.

I suspect that this issue is because you are trying to load the amdgpu module after boot. You should really be loading amdgpu, amdkfd, etc. at boot time as far as I am aware.

  • Finally I'm worried about performance. Initially I tried to use the card on Windows due to lack of PCIe 3 and this is the first time I get it running under Linux. And now Windows is gone because I found no way to compile tensorflow on it. But what I managed to run for a while was ethminer with Athlon 64 x2 (unfortunately switched to Core 2 Duo because the other motherboard was not stable). I got some 37-39 MH/s when the weather was cold. Now with ROCm I'm getting 3.5 MH/s OOB without any tuning. For how I compiled ethminer see ethereum-mining/ethminer#1324 (comment)

I'm unable to offer help with individual applications. If the developer or users of this application can point out where in the ROCm software stack they believe a problem is happening, we're happy to investigate those potential issues. However, "my application is not performing as well as I would like" is a bit too general. There are thousands of apps out there, and we can't promise to personally help optimize every one of them.

  • Also related to performance is that I see GPU fan capped at 40% using rocm-smi. Maybe I need to change the pp tables. But presently performance I see is so little that it doesn't matter. When I have more time I'd rather focus on other benchmarks first.

Likely a dupe of this issue. Keep an eye on that one.

@jlgreathouse
Contributor

Hi @akostadinov

Regarding the GPU fan being capped at 40% on your GPU: I have given a relatively long description of this effect, why it happens, and some potential steps you can take to bypass it in this response.

@kentrussell
Contributor

The error in the dmesg is a display issue so it shouldn't affect the performance of a compute application, if that helps.

@akostadinov
Author

akostadinov commented Dec 24, 2018

Thank you all for chiming in! Here is log after clean 2.0-89.el7 installation and reboot (as suggested by @jlgreathouse): dkms_install.log

List of firmware: firmware_list.txt

In dmesg I still see

[    3.726038] amdgpu 0000:03:00.0: Failed to load gpu_info firmware "amdgpu/vega10_gpu_info.bin"
[    3.726090] amdgpu 0000:03:00.0: Fatal error during GPU init
[    3.726136] [drm] amdgpu: finishing device.
[    3.727125] amdgpu: probe of 0000:03:00.0 failed with error -2

Seems like firmwares are there but not found for some reason. Any advice?
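One way to narrow this down is to check whether the firmware made it into the initramfs, since a driver loaded from the initramfs can only see files packed into that image. The following is a minimal sketch: `initramfs_has` is a hypothetical helper that reads an image listing on stdin, and it assumes `lsinitrd` (shipped with dracut) is used to produce that listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the initramfs listing on stdin
# mentions the file name given as $1 (fixed-string match).
initramfs_has() {
    grep -qF -- "$1"
}

# Live usage:
#   lsinitrd "/boot/initramfs-$(uname -r).img" \
#       | initramfs_has amdgpu/vega10_gpu_info.bin \
#       || echo "vega10_gpu_info.bin is not in the initramfs"
```

If the file exists on disk but not in the image, the problem lies in how the initramfs is built rather than in the driver package contents.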

@jlgreathouse
Contributor

Could you please give me the exact list of directions you're using to do this installation? I cannot reproduce this problem on CentOS 7.5 or CentOS 7.6 in either ROCm 1.9.1 or ROCm 2.0. How are you getting the rock-dkms package? Are you trying to rebuild from source, or are you getting the package from repo.radeon.com?

On my CentOS box, I do see some firmware images missing (primarily for not-yet-supported GPUs like Vega 12 and Vega M, and for older GPUs that we do not support, like Kabini and Bonaire). I can, however, find the Vega10 firmware even though your log shows that your build did not.

I see in your log that you are manually using sudo dkms install amdgpu/2.0-89.el7. Is this because the initial yum install rock-dkms (or yum install rocm-dkms) failed to install the package?

One thing may be worth trying if you're installing on a fresh system. Our Experimental ROC project has scripts for installing ROCm from scratch on various Linux distributions. I've tested all of these on various system configurations, and each has yielded a working ROCm installation. For instance, on your setup (as described earlier), you can run the scripts in distro_install_scripts/CentOS/CentOS_7.6/rpm_install/ to configure your system to support ROCm (00_prepare_system_centos_7.6.sh), install base ROCm (01_install_rocm_centos_7.6.sh), configure your users to allow GPU access (02_setup_rocm_users.sh), and install ROCm software libraries (03_install_rocm_libraries_centos_7.6.sh). Since these are shell scripts, it should also be possible to follow these directions manually.

@akostadinov
Author

@jlgreathouse, hi, I did a clean install as you suggested. In case I haven't pasted it earlier, here is the repo file that I use: rocm.repo.

sudo yum autoremove rocm-dkms
sudo reboot
sudo yum install  rocm-dkms
sudo reboot

What I see in dmesg is [ 3.745226] amdgpu 0000:03:00.0: Failed to load gpu_info firmware "amdgpu/vega10_gpu_info.bin".

So, as explained in my previous comment, I played with dkms remove/add/uninstall/build/install as you proposed earlier. In all cases it appears the firmware files are not found although they are on the file system (see my previous comment). Maybe the issue is only with dracut finding them, because after a full boot the driver seems to load properly when I remove the module and load it again:

sudo rmmod amdgpu
sudo modprobe amdgpu

@akostadinov
Author

akostadinov commented Dec 31, 2018

I figured out what the issue was. Some earlier package version installed /etc/dracut.conf.d/amdgpu-3.10.0-957.1.3.el7.x86_64.conf and /etc/dracut.conf.d/amdgpu-3.10.0-957.el7.x86_64.conf, and inside these files there was the fw_dir directive. These directives actually remove the default firmware dirs, thus causing issues when building the initrd.

These files do not exist in latest package versions thus it shouldn't be an issue any more.

Apparently yum remove did not get rid of the files. A similar report and explanation of dracut behaviour can be found in SUSE bugzilla.
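For anyone hitting the same symptom, leftover drop-ins like these can be located with a short scan. This is a sketch under stated assumptions: `find_fw_dir_overrides` is a hypothetical helper, it looks only at `*.conf` files in the given directory (default `/etc/dracut.conf.d`), and it reports both `fw_dir=` and `fw_dir+=` assignments, leaving it to the reader to judge which ones are stale.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list dracut drop-in files that assign fw_dir.
# $1 optionally overrides the directory to scan.
find_fw_dir_overrides() {
    local dir="${1:-/etc/dracut.conf.d}"
    # -l prints only file names; commented-out lines are ignored.
    grep -lE '^[[:space:]]*fw_dir\+?=' "$dir"/*.conf 2>/dev/null
}

# Live usage: flag files that no installed package owns any more.
#   for f in $(find_fw_dir_overrides); do
#       rpm -qf "$f" >/dev/null 2>&1 || echo "$f is not owned by any package"
#   done
```

After removing a stale file, the initramfs has to be regenerated (for example with `dracut -f`) for the change to take effect.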

I'll close this issue now and create new issues if I hit anything else. Thank you all!

FYI a quick ethminer 0.18 alpha-3 check gets me almost 36 Mhash/s, which is reasonable without any optimizations. The fan cap at 40% gets me down to 32 after a few minutes. Now trying to run the tensorflow container image but can't yet figure out how to run the benchmarks from it. Will also check the fan limit thing later.

@akostadinov
Author

I can't help but thank you for the amazing work that went into this 2.0-89.el7 release. It has been rock solid the whole afternoon and evening. Overdrive controls work so nicely with the CLI and sysfs interfaces. This is really a solid base for doing serious work. I'm sure it is only a matter of time before this platform becomes the most popular. A high quality open source platform, the feeling of actually owning your gear instead of having to play by somebody else's rules...

@jlgreathouse
Contributor

@akostadinov thank you very much for your feedback, and for your work in tracking down the firmware problem you were describing. Having this information here will definitely be helpful in the future if this problem pops up again!
