virtual method of a class cannot be used in other class' method on gpu hip #145

dreamer2368 · 2022-06-28T00:48:13Z

This is demonstrated in DGNonLinearForm::evalFaceFlux_gpu() as a minimal example on the branch gpu_dryair_transport.

  // get state and gradient
  for (int eq = 0; eq < num_equation; eq++) {
    u1[eq] = d_uk_el1[eq + k * num_equation + iface * maxIntPoints * num_equation];
    u2[eq] = d_uk_el2[eq + k * num_equation + iface * maxIntPoints * num_equation];
  }

  // evaluate flux
  // d_rsolver->Eval_LF(u1, u2, nor, Rflux);

  // For some unknown reason, the hip build is very sensitive to
  // how Eval_LF is called.  If we copy the state to u1, u2 and
  // normal to nor and then call as above (which behaves correctly
  // for CUDA), the resulting flux is nonsense.  But, if we just
  // point into d_uk_el1 at the right spots, all is well.  This
  // problem seems to have something to do with virtual functions
  // and pointer arguments, but I don't understand it.  However,
  // this approach is working.
  d_rsolver->Eval_LF(d_uk_el1 + k * num_equation + iface * maxIntPoints * num_equation,
                     d_uk_el2 + k * num_equation + iface * maxIntPoints * num_equation,
                     d_shapeWnor1 + offsetShape1 + maxDofs + 1 + k * (maxDofs + 1 + dim),
                     Rflux);

A known problem in this approach is that it is not applicable to other places where data array indexing order is different.

While this is working at least for d_rsolver->Eval_LF, following the same approach for d_flux->ComputeViscousFluxes does not work (see line 371 in src/dgNonlinearForm.cpp). Through a small test on the branch issue145, it turns out that the problem is the calling of other objects' virtual methods, such as mixture->computeSpeciesEnthalpies or transport->ComputeFluxTransportProperties within d_flux->ComputeViscousFluxes.

When all these methods are hard-coded within the function, d_flux->ComputeViscousFluxes works.
If these virtual methods are changed to non-virtual, they work properly on gpu hip. On the branch issue145 (see line 95 of src/equation_of_state.cpp), GasMixture::computeSpeciesEnthalpies is defined as non-virtual, which can pass cyl3d.gpu.test. (For demonstration purpose, I hard-coded transport->ComputeFluxTransportProperties within d_flux->ComputeViscousFluxes, on line 283 of src/fluxs.cpp. So other tests will fail except cyl3d.gpu.test.)
Defining these methods to be const did not make a difference.
Both computeSpeciesEnthalpies and ComputeFluxTransportProperties are defined to be MFEM_HOST_DEVICE.
For an unknown reason, mixture->ComputePressure works even within other objects' method. Since this is defined inline, I also tried inline for other methods, but it didn't work.
Just as const RiemannSolver *d_rsolver = rsolver_;, I attempted to define a pointer GasMixture *d_mix = mixture; within ComputeViscousFluxes routine, although this did not work. Perhaps the pointer must be defined outside MFEM_FORALL loop, but this is not attempted yet.

It seems that the last point is the fundamental reason for the failure. @trevilo , why should these objects be defined outside MFEM_FORALL loop? Do you think there would be any workaround for this?

The text was updated successfully, but these errors were encountered:

dreamer2368 · 2022-06-28T19:00:39Z

On the commit 4ff9390cfeaa1a876c0bda712ee5337f72cb0afe, GasMixture object pointer is created outside the MFEM_FORALL loop, and taken as an input argument of ComputeViscousFluxes. In this branch, TransportProperties::ComputeFluxTransportProperties is hard-coded within ComputeViscousFluxes.

This returns a memory access fault on gpu hip path:

------------------------------------
  _______ _____   _____
 |__   __|  __ \ / ____|
    | |  | |__) | (___  
    | |  |  ___/ \___ \ 
    | |  | |     ____) | 
    |_|  |_|    |_____/ 

TPS Version:  1.1 (dev)
Git Version:  e798d61
MFEM Version: MFEM v4.3 (release)
------------------------------------

Options used:
   --no-version
   --runFile inputs/input.dtconst.cyl.ini
   --no-debug

Caching input file -> inputs/input.dtconst.cyl.ini

---------------------------------
MFEM Device configuration:
Device configuration: hip,cpu
Memory configuration: host-std,hip
---------------------------------


TPS is using non-GPU-aware MPI.
Fluid is set to the preset 'dry_air'. Input options in [plasma_models] will not be used.
number of elements on rank 0 = 6153
min elements/partition       = 6153
max elements/partition       = 6153
restart: 0
Initial time-step: 2.53434e-06s

[INLET]: Patch number                      = 1 
[INLET]: Total Surface Area                = 9.78349e-03
[INLET]: # of boundary faces               = 133 
[INLET]: # of participating MPI partitions = 1 

[OUTLET]: Patch number                      = 2 
[OUTLET]: Total Surface Area                = 0.00978
[OUTLET]: # of boundary faces               = 131 
[OUTLET]: # of participating MPI partitions = 1 
Iteration = 0: wall clock time/iter = 0.000 (secs)
Memory access fault by GPU node-1 (Agent handle: 0x1c6afc0) on address 0x1d9d000. Reason: Page not present or supervisor privilege.

It seems that there are similar issues (ROCm/tensorflow-upstream#302) with gfx803 architecture.

trevilo · 2023-02-03T03:18:50Z

This issue appears to be hardware specific. We were seeing the problems on our gfx803 card, but recent testing on a gfx90a (after unifying the _CUDA_ and _HIP_ paths in #192) doesn't show any problems. Since gfx803 is an old card that isn't fully supported anymore, I don't think we should worry about this unless we start seeing issues on newer hardware.

dreamer2368 changed the title ~~virtual method of a class cannot take an array as a pointer input on gpu hip~~ virtual method of a class cannot be used in other class' method on gpu hip Jun 28, 2022

dreamer2368 added the bug Something isn't working label Jun 28, 2022

dreamer2368 mentioned this issue Jun 28, 2022

Gpu dryair transport #146

Merged

dreamer2368 mentioned this issue Jun 30, 2022

Refactoring DryAir from _CUDA_ path #147

Merged

trevilo mentioned this issue Jul 5, 2022

Upgrade _HIP_ code path #151

Closed

trevilo closed this as completed Feb 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

virtual method of a class cannot be used in other class' method on gpu hip #145

virtual method of a class cannot be used in other class' method on gpu hip #145

dreamer2368 commented Jun 28, 2022 •

edited

Loading

dreamer2368 commented Jun 28, 2022 •

edited

Loading

trevilo commented Feb 3, 2023

virtual method of a class cannot be used in other class' method on gpu hip #145

virtual method of a class cannot be used in other class' method on gpu hip #145

Comments

dreamer2368 commented Jun 28, 2022 • edited Loading

dreamer2368 commented Jun 28, 2022 • edited Loading

trevilo commented Feb 3, 2023

dreamer2368 commented Jun 28, 2022 •

edited

Loading

dreamer2368 commented Jun 28, 2022 •

edited

Loading