test failures and crashes on 580 #92

Closed
Krastanov opened this issue Feb 2, 2021 · 17 comments
Labels
bug Something isn't working

Comments

@Krastanov

My understanding is that the 580 is going out of support, but for what it's worth, here is a test run and a console session with failures.

Is there any expectation that these tests will ever pass on the 580?

Let me know how I can help fix these issues (if possible). I have zero knowledge of the low-level implementation of the GPU support.

A failed attempt at matrix-vector multiplication

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-beta1 (2021-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU; using LinearAlgebra

julia> N = 100;

julia> m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N); 

julia> m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);

julia> versioninfo()
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 1700 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, znver1)

julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
Memory access fault by GPU node-1 (Agent handle: 0x19b4290) on address 0x640000. Reason: Page not present or supervisor privilege.

signal (6): Aborted
in expression starting at REPL[4]:1
Allocations: 34952292 (Pool: 34939863; Big: 12429); GC: 39
fish: “~/localcompiles/julia-1.6.0-bet…” terminated by signal SIGABRT (Abort)
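
Because the memory access fault aborts the whole Julia process, it may help to run the failing call in a child process while narrowing the problem down. The following is a minimal sketch that relies only on Base functionality; the helper name `survives` and its `project` keyword are hypothetical and are not part of AMDGPU.jl:

```julia
# Hypothetical helper (not part of AMDGPU.jl): run `code` in a fresh Julia
# process and report whether it exited cleanly. A SIGABRT in the child, like
# the one in the session above, yields `false` instead of killing the REPL.
function survives(code::AbstractString; project::Union{Nothing,AbstractString} = nothing)
    # Forward a project to the child only when the caller names one.
    flags = project === nothing ? String[] : ["--project=$project"]
    cmd = `$(Base.julia_cmd()) --startup-file=no $flags -e $code`
    # `success` runs the command and returns false on a nonzero exit code
    # or when the process is terminated by a signal.
    return success(pipeline(cmd; stdout = devnull, stderr = devnull))
end

# The reproducer from this report, written as one string.
reproducer = """
using AMDGPU, LinearAlgebra
N = 100
m_g = ROCArray(rand(Float64, N, N))
a_g = ROCArray(rand(Float64, N))
b_g = ROCArray(rand(Float64, N))
mul!(b_g, m_g, a_g)
"""

# On a machine with AMDGPU.jl installed in the current project:
# survives(reproducer; project = ".")
```

The sketch discards the child's output to keep the parent session quiet; removing the `pipeline` redirection shows the fault message again.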

The test summary


Test Summary:                                 | Pass  Error  Broken  Total
AMDGPU                                        |  932     15      81   1028
  Core                                        |                   1      1
  HSA                                         |   16      6             22
    HSA Status Error                          |    1                     1
    Agent                                     |    5                     5
    Memory                                    |   10      6             16
      Pointer-based                           |    3                     3
      Array-based                             |    2                     2
      Type-based                              |    1                     1
      Pointer information                     |           1              1
      Page-locked memory (OS allocations)     |           5              5
      Exceptions                              |    3                     3
      Mutable structs                         |    1                     1
  Codegen                                     |    3                     3
  Device Functions                            |  175             77    252
  ROCArray                                    |  737      9       3    749
    GPUArrays test suite                      |  737      9            746
      math                                    |    8                     8
      indexing scalar                         |  249                   249
      input output                            |    5                     5
      value constructors                      |   36                    36
      indexing multidimensional               |   25      9             34
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        empty array                           |    8      7             15
          1D                                  |    1      1              2
          2D with other index Colon()         |    2      2              4
          2D with other index 1:5             |    2      2              4
          2D with other index 5               |    2      2              4
        GPU source                            |    2      1              3
        CPU source                            |    2      1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                     1
      interface                               |    7                     7
      conversions                             |   72                    72
      constructors                            |  335                   335
    ROCm External Libraries                   |                   3      3
ERROR: LoadError: Some tests did not pass: 932 passed, 0 failed, 15 errored, 81 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/UpYiP/test/runtests.jl:29
ERROR: Package AMDGPU errored during testing


rocminfo

~> /opt/rocm/bin/rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 1700 Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 1700 Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32878744(0x1f5b098) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32878744(0x1f5b098) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26591(0x67df)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1366                               
  BDFID:                   2304                               
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8388608(0x800000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***    

clinfo

~> /opt/rocm/opencl/bin/clinfo
Number of platforms:				 2
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 1.1 Mesa 20.3.4 - kisak-mesa PPA
  Platform Name:				 Clover
  Platform Vendor:				 Mesa
  Platform Extensions:				 cl_khr_icd
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.0 AMD-APP (3212.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback 


  Platform Name:				 Clover
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Max compute units:				 36
  Max work items dimensions:			 3
    Max work items[0]:				 256
    Max work items[1]:				 256
    Max work items[2]:				 256
  Max work group size:				 256
  Preferred vector width char:			 16
  Preferred vector width short:			 8
  Preferred vector width int:			 4
  Preferred vector width long:			 2
  Preferred vector width float:			 4
  Preferred vector width double:		 2
  Native vector width char:			 16
  Native vector width short:			 8
  Native vector width int:			 4
  Native vector width long:			 2
  Native vector width float:			 4
  Native vector width double:			 2
  Max clock frequency:				 1366Mhz
  Address bits:					 64
  Max memory allocation:			 6871947673
  Image support:				 No
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 32768
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 No
    Round to +ve and infinity:			 No
    IEEE754-2008 fused multiply-add:		 No
  Cache type:					 None
  Cache line size:				 0
  Cache size:					 0
  Global memory size:				 27487790692
  Constant buffer size:				 67108864
  Max number of constant args:			 16
  Local memory type:				 Scratchpad
  Local memory size:				 32768
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 0
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Platform ID:					 0x7f589bdbab60
  Name:						 Radeon RX 580 Series (POLARIS10, DRM 3.40.0, 5.4.0-65-generic, LLVM 11.0.1)
  Vendor:					 AMD
  Device OpenCL C version:			 OpenCL C 1.1 
  Driver version:				 20.3.4 - kisak-mesa PPA
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.1 Mesa 20.3.4 - kisak-mesa PPA
  Extensions:					 cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64


  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
  Device Topology:				 PCI[ B#9, D#0, F#0 ]
  Max compute units:				 36
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 1366Mhz
  Address bits:					 64
  Max memory allocation:			 7301444400
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 16384
  Max image 3D height:				 16384
  Max image 3D depth:				 8192
  Max samplers within kernel:			 26591
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 8589934592
  Constant buffer size:				 7301444400
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 3006477104
  Max global variable size:			 7301444400
  Max global variable preferred total size:	 8589934592
  Max read/write image args:			 64
  Max on device events:				 1024
  Queue on device max size:			 8388608
  Max on device queues:				 1
  Queue on device preferred size:		 262144
  SVM capabilities:				 
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 Yes
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Platform ID:					 0x7f589388acf0
  Name:						 gfx803
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0 
  Driver version:				 3212.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 
@Krastanov
Author

Running on master still produces failing tests, but far fewer:

Test Summary:                                 | Pass  Error  Broken  Total
AMDGPU                                        | 1040      2      79   1121
  Core                                        |                   1      1
  HSA                                         |   16                    16
  Codegen                                     |    3                     3
  Device Functions                            |  179             75    254
  ROCArray                                    |  744      2       3    749
    GPUArrays test suite                      |  744      2            746
      math                                    |    8                     8
      indexing scalar                         |  249                   249
      input output                            |    5                     5
      value constructors                      |   36                    36
      indexing multidimensional               |   32      2             34
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex                       |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        sliced setindex, CPU source           |    1                     1
        empty array                           |   15                    15
        GPU source                            |    2      1              3
        CPU source                            |    2      1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                     1
      interface                               |    7                     7
      conversions                             |   72                    72
      constructors                            |  335                   335
    ROCm External Libraries                   |                   3      3
  External Packages                           |   97                    97
ERROR: LoadError: Some tests did not pass: 1040 passed, 0 failed, 2 errored, 79 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/TAdgr/test/runtests.jl:27
ERROR: Package AMDGPU errored during testing

The matrix multiplication still crashes

julia> using AMDGPU; using LinearAlgebra

julia> N = 100; m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N); m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);

julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
Memory access fault by GPU node-1 (Agent handle: 0x2177160) on address 0x640000. Reason: Page not present or supervisor privilege.

signal (6): Aborted
in expression starting at REPL[3]:1
Allocations: 31842465 (Pool: 31831316; Big: 11149); GC: 37
fish: “~/localcompiles/julia-1.6.0-bet…” terminated by signal SIGABRT (Abort)

Here is the manifest:

pkg> st --manifest
Status `~/Documents/ScratchSpace/julia_gpu/Manifest.toml`
  [21141c5a] AMDGPU v0.2.2 `https://github.com/JuliaGPU/AMDGPU.jl.git#master`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v3.1.1
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [34da2185] Compat v3.25.0
  [187b0558] ConstructionBase v1.0.0
  [864edb3b] DataStructures v0.18.9
  [0c68f7d7] GPUArrays v6.2.0
  [61eb1bfa] GPUCompiler v0.9.2
  [929cbde3] LLVM v3.6.0
  [1914dd2f] MacroTools v0.5.6
  [bac558e1] OrderedCollections v1.3.3
  [ae029012] Requires v1.1.2
  [6c6a2e73] Scratch v1.0.3
  [efcf1570] Setfield v0.7.0
  [a759f4b9] TimerOutputs v0.5.7
  [0dad84c5] ArgTools
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [f43a241f] Downloads
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [b27032c2] LibCURL
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML
  [a4e569a6] Tar
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [deac9b47] LibCURL_jll
  [29816b5a] LibSSH2_jll
  [c8ffd9c3] MbedTLS_jll
  [14a3606d] MozillaCACerts_jll
  [83775a58] Zlib_jll
  [8e850ede] nghttp2_jll

@jpsamaroo
Member

Seems like it might be a crash in rocBLAS, but I'm not sure since I don't regularly run AMDGPU with it enabled (because it sucks to build). Do you have rocBLAS installed?

@Krastanov
Author

I do not think so. I checked with apt-get and rocblas was not installed. Then, just to check, I also ran sudo apt-get install rocblas, which reported a successful (and brand-new) install. However, the problem persists even after installing rocblas, so I think it is independent of it.

@Krastanov
Author

I checked a couple of times with and without rocblas (by running sudo apt-get install/purge rocblas and then running ] build AMDGPU), but the crash in the matrix multiplication persists.
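
One way to double-check whether a rocBLAS shared library is loadable at all from Julia is the standard `Libdl.find_library` call. This is only a sketch: the helper name `locate_library` is hypothetical, the two directories are assumed install locations, and nothing here reflects how the AMDGPU.jl build step itself searches for the library.

```julia
using Libdl

# Hypothetical helper: return the name or path of the first library that can
# be opened, or `nothing` if none of the candidates load.
# Note that `find_library` actually tries to dlopen each candidate.
function locate_library(names::Vector{String}, dirs::Vector{String})
    found = Libdl.find_library(names, dirs)
    return isempty(found) ? nothing : found
end

# Assumed locations; adjust to wherever the ROCm packages were installed.
candidate_dirs = ["/opt/rocm/lib", "/opt/rocm/rocblas/lib"]
path = locate_library(["librocblas"], candidate_dirs)
println(path === nothing ? "librocblas not found in the assumed locations" : "found: $path")
```

Running this once after apt-get install and once after apt-get purge would show whether the library's presence changes between the two builds.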

@Krastanov
Author

I attempted various debug and serialization flags, as suggested in ROCm/tensorflow-upstream#302 and in https://rocmdocs.amd.com/en/latest/Other_Solutions/Other-Solutions.html , but I did not get any debug info out to stderr!? Is AMDGPU.jl capturing and redirecting stderr? Any other suggestions to try to track what exactly causes the memory fault?

Here is my attempt with the entirety of its console output:

$> export HCC_SERIALIZE_KERNEL=0x3; export HCC_SERIALIZE_COPY=0x3;
$> export HIP_TRACE_API=0x2; export MIOPEN_ENABLE_LOGGING_CMD=1;
$> ~/localcompiles/julia-1.6.0-beta1/bin/julia --project=.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0-beta1 (2021-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU; using LinearAlgebra

julia> N = 100; m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N); m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);

julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
Memory access fault by GPU node-1 (Agent handle: 0x13812e0) on address 0x640000. Reason: Page not present or supervisor privilege.

@jpsamaroo
Member

I've also noticed that HSAKMT environment variables don't work with AMDGPU.jl. We don't do any stderr capture to my knowledge. Do note that those variables apply to HCC, HIP, and MIOpen, none of which we use in any significant capacity (except for HIP, for device sync, which is not done automatically).
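
To rule out the simpler explanation that the exported variables never reached the Julia process, they can be listed from inside the session. A small sketch using only `ENV`; the helper name `debug_vars` is hypothetical, and the prefix list (including `HSAKMT_`) is an assumption based on the variable names mentioned in this thread:

```julia
# Prefixes of the debug variables discussed in this thread (assumed list).
const DEBUG_PREFIXES = ("HCC_", "HIP_", "MIOPEN_", "HSAKMT_")

# Hypothetical helper: collect the matching variables from an environment-like
# collection (defaults to the real process environment), sorted by name.
function debug_vars(env = ENV)
    matches = [k => v for (k, v) in env if any(p -> startswith(k, p), DEBUG_PREFIXES)]
    return sort(matches; by = first)
end

for (name, value) in debug_vars()
    println(name, " = ", value)
end
```

An empty listing would mean the exports did not reach the process; a populated one only shows that the variables are visible, not that any library acts on them.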

@Krastanov
Author

All of this was on ROCm 4. I also tried installing tensorflow-rocm, but that additionally required running apt install rocm-libs rccl. TensorFlow seemed to work fine, but after adding these extra libraries AMDGPU.jl stopped building!? ] build AMDGPU started reporting the error described in ROCm/ROCm#1269

I ended up downgrading to ROCm 3.5.1. Now AMDGPU.jl seems to work. TensorFlow 2.4 does not work anymore, but I can downgrade TensorFlow too.

There are test failures for the current release of AMDGPU:

Test Summary:                                 | Pass  Fail  Error  Broken  Total
AMDGPU                                        | 1198    12     15      90   1315
  Core                                        |                         1      1
  HSA                                         |   16            6             22
    HSA Status Error                          |    1                           1
    Agent                                     |    5                           5
    Memory                                    |   10            6             16
      Pointer-based                           |    3                           3
      Array-based                             |    2                           2
      Type-based                              |    1                           1
      Pointer information                     |                 1              1
      Page-locked memory (OS allocations)     |                 5              5
      Exceptions                              |    3                           3
      Mutable structs                         |    1                           1
  Codegen                                     |    3                           3
  Device Functions                            |  175                   77    252
  ROCArray                                    | 1003    12      9      12   1036
    GPUArrays test suite                      |  737            9            746
      math                                    |    8                           8
      indexing scalar                         |  249                         249
      input output                            |    5                           5
      value constructors                      |   36                          36
      indexing multidimensional               |   25            9             34
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        empty array                           |    8            7             15
          1D                                  |    1            1              2
          2D with other index Colon()         |    2            2              4
          2D with other index 1:5             |    2            2              4
          2D with other index 5               |    2            2              4
        GPU source                            |    2            1              3
        CPU source                            |    2            1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                           1
      interface                               |    7                           7
      conversions                             |   72                          72
      constructors                            |  335                         335
    ROCm External Libraries                   |  266    12             12    290
      BLAS                                    |   17                          17
      FFT                                     |  106    12             12    130
        T = ComplexF64                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = ComplexF32                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = Float32                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
        T = Float64                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
      rand                                    |  143                         143
ERROR: LoadError: Some tests did not pass: 1198 passed, 12 failed, 15 errored, 90 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/UpYiP/test/runtests.jl:29
ERROR: Package AMDGPU errored during testing

And here are the tests on the current master branch, doing a bit better, but still having errors:

Test Summary:                                 | Pass  Fail  Error  Broken  Total
AMDGPU                                        | 1306    12      2      88   1408
  Core                                        |                         1      1
  HSA                                         |   16                          16
  Codegen                                     |    3                           3
  Device Functions                            |  179                   75    254
  ROCArray                                    | 1010    12      2      12   1036
    GPUArrays test suite                      |  744            2            746
      math                                    |    8                           8
      indexing scalar                         |  249                         249
      input output                            |    5                           5
      value constructors                      |   36                          36
      indexing multidimensional               |   32            2             34
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex                       |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        sliced setindex, CPU source           |    1                           1
        empty array                           |   15                          15
        GPU source                            |    2            1              3
        CPU source                            |    2            1              3
        JuliaGPU/CUDA.jl#461: sliced setindex |    1                           1
      interface                               |    7                           7
      conversions                             |   72                          72
      constructors                            |  335                         335
    ROCm External Libraries                   |  266    12             12    290
      BLAS                                    |   17                          17
      FFT                                     |  106    12             12    130
        T = ComplexF64                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = ComplexF32                        |   33     4                    37
          1D                                  |    3                           3
          1D inplace                          |    2                           2
          2D                                  |    3                           3
          2D inplace                          |    2                           2
          Batch 1D                            |    6                           6
          3D                                  |    3                           3
          3D inplace                          |    2                           2
          Batch 2D (in 3D)                    |    5     2                     7
          Batch 2D (in 4D)                    |    7     2                     9
        T = Float32                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
        T = Float64                           |   20     2              6     28
          1D                                  |    4                           4
          2D                                  |    4                           4
          Batch 1D                            |    4     2                     6
          3D                                  |    4                           4
          Batch 2D (in 3D)                    |    1                    3      4
          Batch 2D (in 4D)                    |    3                    3      6
      rand                                    |  143                         143
  External Packages                           |   97                          97
ERROR: LoadError: Some tests did not pass: 1306 passed, 12 failed, 2 errored, 88 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/AKLQk/test/runtests.jl:27
ERROR: Package AMDGPU errored during testing

@Krastanov
Author

Am I correct in assuming that if I want to use the 580 with AMDGPU.jl, I have to freeze ROCm at version 3.5.1 and just hope for "best effort", without any guarantees, given that the device seems to be going out of support in ROCm?

Should I freeze the AMDGPU.jl version too? Should I expect future versions of AMDGPU.jl to lower the level of support for 580?

Is there a more "official" table of support, giving hardware versions, rocm versions, and AMDGPU.jl versions that are tested/supported?

@Krastanov
Author

Sigh... now there is a separate problem (on rocm 3.5.1 and AMDGPU#master) that simply gives wrong answers (no crash, just incorrect answers) when I do matrix multiplication:

julia> N = 10; T = Float64; a,b,c = cpus = [rand(T, N, N) for i in 1:3]; ag,bg,cg = [ROCArray(i) for i in cpus];

julia> mul!(ag,bg,cg)
10×10 ROCMatrix{Float64}:
 0.169517  0.666133   0.787853   0.952216  0.52438    0.226889  0.895567  0.563802   0.603744  0.0810141
 0.774994  0.0350809  0.705357   0.544661  0.775764   0.966118  0.965179  0.351198   0.25837   0.0632102
 0.947915  0.0939128  0.711592   0.964582  0.484883   0.503159  0.618847  0.199      0.598743  0.913767
 0.166383  0.24303    0.0343327  0.954652  0.952374   0.911542  0.216517  0.144033   0.601291  0.205171
 0.349153  0.223039   0.129581   0.442686  0.766986   0.551424  0.292206  0.0795419  0.43372   0.655484
 0.173297  0.241994   0.915943   0.191715  0.202254   0.305148  0.221799  0.78068    0.75416   0.900042
 0.137884  0.25165    0.342389   0.159862  0.355102   0.836764  0.989629  0.935794   0.526686  0.762097
 0.116692  0.244034   0.724202   0.794337  0.168172   0.497086  0.937436  0.592061   0.813417  0.351207
 0.33148   0.346618   0.96186    0.436207  0.430171   0.623167  0.823441  0.63495    0.477421  0.497221
 0.411855  0.231901   0.578217   0.623853  0.0970518  0.633137  0.945868  0.616912   0.731479  0.731409

julia> mul!(a,b,c)
10×10 Matrix{Float64}:
 2.87753  2.07307  3.1106   3.34475  2.74262  3.61348  3.07164  2.82941  2.95761  1.97778
 3.03134  1.83036  3.24821  3.72434  3.07734  3.92818  3.27126  3.92044  3.94197  2.58042
 2.89656  2.19109  2.64611  3.02358  2.89144  3.69149  2.87703  3.7068   3.77624  2.60822
 1.46729  1.08032  1.38129  1.41364  1.54596  1.69974  1.36592  1.88397  1.82102  0.655919
 1.90435  1.22412  1.73246  1.82557  1.94339  2.39507  1.9207   1.88028  2.36406  1.84471
 1.96599  1.63776  2.00905  2.08006  1.7166   2.25551  1.6797   1.89926  1.67244  1.063
 2.09604  1.65736  1.91219  2.21415  1.86685  2.60482  2.10144  2.70042  2.52462  1.43959
 2.56943  1.31602  1.94323  2.61937  3.13482  2.81117  2.08695  2.95018  2.91306  1.97668
 2.74569  2.02152  3.04165  3.21203  2.86864  3.48828  2.51794  2.95315  2.98953  2.64708
 3.01357  2.14793  2.52376  2.93145  3.03869  3.55187  3.10702  3.39474  3.41577  2.14719

If you guys have any suggestions where to look for the source of these issues (or whether I should downgrade/upgrade to other versions), let me know. Either way, thanks for your effort in putting this library together!

Some community-sourced table of "this hardware ran successfully for me" would be really useful.
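
One way to turn the mul! comparison above into a repeatable check is sketched below. The helper name check_mul and its to_device argument are hypothetical and not part of AMDGPU.jl; the sketch only assumes that the device array type can be constructed from a CPU array and converted back with Array, as ROCArray is used in the session above. It writes into a fresh output array so the inputs are not overwritten before the comparison.

```julia
using LinearAlgebra

# Hypothetical helper (not part of AMDGPU.jl): compute B*C in place on arrays
# produced by `to_device` and compare the result against the CPU reference.
function check_mul(to_device; N = 10, T = Float64)
    b, c = rand(T, N, N), rand(T, N, N)
    expected = b * c                     # CPU reference result
    out = to_device(zeros(T, N, N))      # fresh output, not aliased with the inputs
    mul!(out, to_device(b), to_device(c))
    # Copy back to the host and compare with a tolerance suited to T.
    return isapprox(Array(out), expected; rtol = sqrt(eps(T)))
end

# CPU sanity check: `identity` leaves the arrays on the host.
println(check_mul(identity))
```

With AMDGPU loaded, the same check would be check_mul(ROCArray); a result of false would reproduce the wrong-answer behaviour without having to compare printed matrices by eye.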

@jpsamaroo added the bug (Something isn't working) label on Feb 4, 2021
@jpsamaroo
Member

I tested this on my Vega system, and I also get a memory access fault. I'll run this under my newly-working debugger in the next day or two.

Btw, our CI was running on an RX480 for the longest time, but I had to remove the card because HIP started killing the build process due to not being able to find code for the GPU (stupid problem, I should reproduce it and patch it upstream). I'll probably put the RX480 in another machine and add it to the CI queue so that we ensure that we still have working support.

@Krastanov
Author

Is there a way to donate to the CI effort? (Money or compute time, especially if I can get my 580 to do CI for you; I am a competent enough sysadmin to run a Docker container on this computer that is accessible to your CI jobs.) It is in my selfish interest to have a 580 with a configuration similar to mine (Ubuntu with the same drivers and ROCm version) tested ;)

By the way, as a new user I was definitely very confused about which ROCm version I should be using. What version of ROCm is used by the CI?

@jpsamaroo
Member

We currently use Buildkite to host CI, which runs under docker-compose, so it's pretty nicely isolated. I'll talk to the JuliaGPU devs and see what they think.

Also, the ROCm config is not fixed to a particular version, which is something I would like to fix by providing ROCm libraries as JLLs, but that's complicated by such a config not working on my musl system 😄 It's on the roadmap, though.

@jpsamaroo
Member

While I wait for a response on the CI question, I found that the issue does not turn into a regular device error when running with -g2 --check-bounds=yes on AMDGPU master (-g2 is for outputting a full device stacktrace on error). Since array accesses are bounds-checked, this indicates to me that it is either a miscompile, or a bug somewhere where unsafe_load/unsafe_store! is being called manually.
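
When reproducing this, it can help to confirm that the session really was started with those flags. The sketch below reads them back from Base.JLOptions(); this is an internal struct rather than a documented API, so treat the field names as an assumption that may change between Julia versions.

```julia
# Confirm which debug and bounds-check settings the current session was started with.
# Base.JLOptions() is internal; field names are assumed from current Julia releases.
opts = Base.JLOptions()

# debug_level is 2 when Julia was started with -g2.
println("debug level: ", opts.debug_level)

# check_bounds: 1 means --check-bounds=yes, 2 means --check-bounds=no, 0 means the default.
println("bounds checks forced on: ", opts.check_bounds == 1)
```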

@jpsamaroo
Member

Regarding CI: because adding Buildkite agents requires sharing our global secret key with the agent's owner, we can't reasonably accept outside CI. However, I plan to set up an RX480 runner and run it for all PRs, so that older cards keep working as much as possible. We'll also potentially be getting access to a lot of newer (but still Vega arch) AMD GPUs soon, so hopefully we can use some of them for CI.

@jpsamaroo
Member

In terms of donations from the community, I would appreciate any bug reports, code contributions, or ideas for improvements you and others might have. That's more valuable to me than CI by a long shot 🙂

@Krastanov
Author

Sounds great! If this starts working I would certainly be active in giving feedback. I do have a bunch of projects that would use bitwise operations on integer types, so hopefully I will be able to stress-test that side of the project.

@jpsamaroo
Member

I'm closing this in favor of #103, since the failing tests you reported are known to fail (see #91), or just generally unreliable (in my experience).
