Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

virtual method of a class cannot be used in other class' method on gpu hip #145

Closed
dreamer2368 opened this issue Jun 28, 2022 · 2 comments
Closed
Labels
bug Something isn't working

Comments

@dreamer2368
Copy link
Contributor

dreamer2368 commented Jun 28, 2022

This is demonstrated in DGNonLinearForm::evalFaceFlux_gpu() as a minimal example on the branch gpu_dryair_transport.

  // get state and gradient
  for (int eq = 0; eq < num_equation; eq++) {
    u1[eq] = d_uk_el1[eq + k * num_equation + iface * maxIntPoints * num_equation];
    u2[eq] = d_uk_el2[eq + k * num_equation + iface * maxIntPoints * num_equation];
  }

  // evaluate flux
  // d_rsolver->Eval_LF(u1, u2, nor, Rflux);

  // For some unknown reason, the hip build is very sensitive to
  // how Eval_LF is called.  If we copy the state to u1, u2 and
  // normal to nor and then call as above (which behaves correctly
  // for CUDA), the resulting flux is nonsense.  But, if we just
  // point into d_uk_el1 at the right spots, all is well.  This
  // problem seems to have something to do with virtual functions
  // and pointer arguments, but I don't understand it.  However,
  // this approach is working.
  d_rsolver->Eval_LF(d_uk_el1 + k * num_equation + iface * maxIntPoints * num_equation,
                     d_uk_el2 + k * num_equation + iface * maxIntPoints * num_equation,
                     d_shapeWnor1 + offsetShape1 + maxDofs + 1 + k * (maxDofs + 1 + dim),
                     Rflux);

A known problem in this approach is that it is not applicable to other places where data array indexing order is different.

While this is working at least for d_rsolver->Eval_LF, following the same approach for d_flux->ComputeViscousFluxes does not work (see line 371 in src/dgNonlinearForm.cpp). Through a small test on the branch issue145, it turns out that the problem is the calling of other objects' virtual methods, such as mixture->computeSpeciesEnthalpies or transport->ComputeFluxTransportProperties within d_flux->ComputeViscousFluxes.

  • When all these methods are hard-coded within the function, d_flux->ComputeViscousFluxes works.

  • If these virtual methods are changed to non-virtual, they work properly on gpu hip. On the branch issue145 (see line 95 of src/equation_of_state.cpp), GasMixture::computeSpeciesEnthalpies is defined as non-virtual, which can pass cyl3d.gpu.test. (For demonstration purpose, I hard-coded transport->ComputeFluxTransportProperties within d_flux->ComputeViscousFluxes, on line 283 of src/fluxs.cpp. So other tests will fail except cyl3d.gpu.test.)

  • Defining these methods to be const did not make a difference.

  • Both computeSpeciesEnthalpies and ComputeFluxTransportProperties are defined to be MFEM_HOST_DEVICE.

  • For an unknown reason, mixture->ComputePressure works even within other objects' method. Since this is defined inline, I also tried inline for other methods, but it didn't work.

  • Just as const RiemannSolver *d_rsolver = rsolver_;, I attempted to define a pointer GasMixture *d_mix = mixture; within ComputeViscousFluxes routine, although this did not work. Perhaps the pointer must be defined outside MFEM_FORALL loop, but this is not attempted yet.

It seems that the last point is the fundamental reason for the failure. @trevilo , why should these objects be defined outside MFEM_FORALL loop? Do you think there would be any workaround for this?

@dreamer2368 dreamer2368 changed the title virtual method of a class cannot take an array as a pointer input on gpu hip virtual method of a class cannot be used in other class' method on gpu hip Jun 28, 2022
@dreamer2368 dreamer2368 added the bug Something isn't working label Jun 28, 2022
@dreamer2368
Copy link
Contributor Author

dreamer2368 commented Jun 28, 2022

On the commit 4ff9390cfeaa1a876c0bda712ee5337f72cb0afe, GasMixture object pointer is created outside the MFEM_FORALL loop, and taken as an input argument of ComputeViscousFluxes. In this branch, TransportProperties::ComputeFluxTransportProperties is hard-coded within ComputeViscousFluxes.

This returns a memory access fault on gpu hip path:

------------------------------------
  _______ _____   _____
 |__   __|  __ \ / ____|
    | |  | |__) | (___  
    | |  |  ___/ \___ \ 
    | |  | |     ____) | 
    |_|  |_|    |_____/ 

TPS Version:  1.1 (dev)
Git Version:  e798d61
MFEM Version: MFEM v4.3 (release)
------------------------------------

Options used:
   --no-version
   --runFile inputs/input.dtconst.cyl.ini
   --no-debug

Caching input file -> inputs/input.dtconst.cyl.ini

---------------------------------
MFEM Device configuration:
Device configuration: hip,cpu
Memory configuration: host-std,hip
---------------------------------


TPS is using non-GPU-aware MPI.
Fluid is set to the preset 'dry_air'. Input options in [plasma_models] will not be used.
number of elements on rank 0 = 6153
min elements/partition       = 6153
max elements/partition       = 6153
restart: 0
Initial time-step: 2.53434e-06s

[INLET]: Patch number                      = 1 
[INLET]: Total Surface Area                = 9.78349e-03
[INLET]: # of boundary faces               = 133 
[INLET]: # of participating MPI partitions = 1 

[OUTLET]: Patch number                      = 2 
[OUTLET]: Total Surface Area                = 0.00978
[OUTLET]: # of boundary faces               = 131 
[OUTLET]: # of participating MPI partitions = 1 
Iteration = 0: wall clock time/iter = 0.000 (secs)
Memory access fault by GPU node-1 (Agent handle: 0x1c6afc0) on address 0x1d9d000. Reason: Page not present or supervisor privilege.

It seems that there are similar issues (ROCm/tensorflow-upstream#302) with gfx803 architecture.

@trevilo
Copy link
Contributor

trevilo commented Feb 3, 2023

This issue appears to be hardware specific. We were seeing the problems on our gfx803 card, but recent testing on a gfx90a (after unifying the _CUDA_ and _HIP_ paths in #192) doesn't show any problems. Since gfx803 is an old card that isn't fully supported anymore, I don't think we should worry about this unless we start seeing issues on newer hardware.

@trevilo trevilo closed this as completed Feb 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants