DYAMOND configuration GPU run + metadata size error #2216

Closed
sriharshakandala opened this issue Oct 9, 2023 · 11 comments

@sriharshakandala
Member

sriharshakandala commented Oct 9, 2023

The DYAMOND configuration GPU run crashes with a metadata size error.
The full error message can be found here: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/106#018b1570-62bc-400f-a454-04a9a2595aa3
The error seems to be triggered in the atmos_state function in src/initial_conditions.

@sriharshakandala
Member Author

cc: @simonbyrne

@simonbyrne
Member

We may have to refactor how we initialize things?

@sriharshakandala
Member Author

Sure. I can take a look!

@simonbyrne
Member

failure here:
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/106#018b1570-62bc-400f-a454-04a9a2595aa3/138-231
in particular it is

c = atmos_center_variables.(
local_state.(Fields.local_geometry_field(center_space)),
atmos_model,

@sriharshakandala sriharshakandala self-assigned this Oct 9, 2023
@charleskawczynski
Member

> We may have to refactor how we initialize things?

Can we first try to do this particular operation on the CPU?
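For illustration, the CPU-first approach suggested here could look roughly like the sketch below: evaluate the initial condition on the host, then copy the result to the device in one transfer. Everything in it is an assumption made for the example: `init_value` is a hypothetical stand-in for the real initial-condition function, and `DeviceArray` is a placeholder alias that is set to `Array` so the snippet runs without a GPU (with CUDA.jl loaded it could be `CuArray`).

```julia
# Placeholder for the device array type. With CUDA.jl one could use `CuArray`;
# `Array` is used here so the sketch runs on any machine.
const DeviceArray = Array

# Hypothetical initial condition: a linear temperature profile in height z.
init_value(z) = 300 - z / 100

# Evaluate the initial condition on the CPU, where no GPU kernel
# (and therefore no kernel-argument size limit) is involved.
z_host = collect(range(0.0, 1000.0; length = 5))
state_host = init_value.(z_host)

# Copy the finished result to the device.
state_device = DeviceArray(state_host)

println(state_device)
```

The trade-off is an extra host-to-device copy at startup, which is usually cheap compared with the time integration that follows.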

@sriharshakandala
Member Author

sriharshakandala commented Oct 18, 2023

The simulation runs fine with Float32 but crashes with Float64. The metadata size in the Float64 case seems to be approximately double what it is with Float32.

Dry Baroclinic wave with Float32
-----------------------------------------
thermodynamics_params => 100 bytes
rrtmgp_params => 28 bytes
insolation_params => 36 bytes
microphysics_params => 976 bytes
surface_fluxes_params => 124 bytes
turbconv_params => 1192 bytes
Omega => 4 bytes
f_plane_coriolis_frequency => 4 bytes
planet_radius => 4 bytes
astro_unit => 4 bytes
entr_tau => 4 bytes
detr_tau => 4 bytes
entr_coeff => 4 bytes
detr_coeff => 4 bytes
C_E => 4 bytes
C_H => 4 bytes
ΔT_y_dry => 4 bytes
ΔT_y_wet => 4 bytes
σ_b => 4 bytes
T_equator_dry => 4 bytes
T_equator_wet => 4 bytes
T_min_hs => 4 bytes
Δθ_z => 4 bytes
alpha_rayleigh_w => 4 bytes
alpha_rayleigh_uh => 4 bytes
zd_viscous => 4 bytes
zd_rayleigh => 4 bytes
kappa_2_sponge => 4 bytes
-----------------------------------------
total = 2544 bytes
-----------------------------------------
Aquaplanet DYAMOND configuration with Float64
-----------------------------------------
thermodynamics_params => 200 bytes
rrtmgp_params => 56 bytes
insolation_params => 72 bytes
microphysics_params => 1952 bytes
surface_fluxes_params => 248 bytes
turbconv_params => 2360 bytes
Omega => 8 bytes
f_plane_coriolis_frequency => 8 bytes
planet_radius => 8 bytes
astro_unit => 8 bytes
entr_tau => 8 bytes
detr_tau => 8 bytes
entr_coeff => 8 bytes
detr_coeff => 8 bytes
C_E => 8 bytes
C_H => 8 bytes
ΔT_y_dry => 8 bytes
ΔT_y_wet => 8 bytes
σ_b => 8 bytes
T_equator_dry => 8 bytes
T_equator_wet => 8 bytes
T_min_hs => 8 bytes
Δθ_z => 8 bytes
alpha_rayleigh_w => 8 bytes
alpha_rayleigh_uh => 8 bytes
zd_viscous => 8 bytes
zd_rayleigh => 8 bytes
kappa_2_sponge => 8 bytes
-----------------------------------------
total = 5064 bytes
-----------------------------------------
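A per-field listing like the one above can be produced with Julia's `sizeof` on isbits values. The sketch below is only an illustration: `ToyParams` and `field_sizes` are hypothetical names, not part of ClimaAtmos, and the struct is far smaller than the real parameter sets. It shows why purely floating-point parameters double in size when going from Float32 to Float64.

```julia
# Hypothetical parameter struct; the field names are borrowed from the listing
# above, but the real parameter sets are much larger.
struct ToyParams{FT}
    Omega::FT
    planet_radius::FT
    entr_coeff::FT
    detr_coeff::FT
end

# Return a NamedTuple mapping each field name to its size in bytes.
function field_sizes(p)
    names = fieldnames(typeof(p))
    return NamedTuple{names}(map(n -> sizeof(getfield(p, n)), names))
end

p32 = ToyParams{Float32}(7.29f-5, 6.371f6, 0.1f0, 0.1f0)
p64 = ToyParams{Float64}(7.29e-5, 6.371e6, 0.1, 0.1)

for (label, p) in (("Float32", p32), ("Float64", p64))
    println(label)
    for (name, nbytes) in pairs(field_sizes(p))
        println("  ", name, " => ", nbytes, " bytes")
    end
    println("  total = ", sizeof(p), " bytes")
end
```

Fields whose type does not depend on the float type (integers, for example) would not scale, which is one possible reason the doubling in the listing is only approximate.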

@sriharshakandala
Member Author

I think we need to refactor the initialization code!

@simonbyrne
Member

yeah, we probably shouldn't use a single broadcast.
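As a rough sketch of what splitting the broadcast could mean: every argument of a fused GPU broadcast becomes part of the kernel's argument list, and CUDA has historically capped kernel parameters at 4096 bytes, which would be consistent with the 2544-byte case running and the 5064-byte case failing. The example below uses plain CPU arrays and hypothetical names (`BigParams`, `ThermoParams`, `init_density`, `init_temperature`, `init_all`), none of which come from ClimaAtmos. It compares one broadcast that carries the whole parameter set with separate broadcasts that each carry only the parameters they need.

```julia
# Hypothetical stand-ins for the real parameter sets.
struct ThermoParams{FT}
    R_d::FT
    T_0::FT
end

struct BigParams{FT}
    thermo::ThermoParams{FT}
    unused::NTuple{64, FT}   # padding that mimics a large parameter set
end

init_density(z, tp::ThermoParams) = exp(-z / (tp.R_d * tp.T_0))
init_temperature(z, tp::ThermoParams) = tp.T_0 - z / 100

# Single broadcast: the entire `BigParams` object is an argument of the call.
init_all(z, p::BigParams) =
    (; ρ = init_density(z, p.thermo), T = init_temperature(z, p.thermo))

FT = Float64
params = BigParams{FT}(ThermoParams{FT}(287.0, 300.0), ntuple(_ -> zero(FT), 64))
z = collect(range(FT(0), FT(1000); length = 5))

fused = init_all.(z, Ref(params))

# Split version: one broadcast per variable, each passing only what it needs.
tp = params.thermo
ρ = init_density.(z, Ref(tp))
T = init_temperature.(z, Ref(tp))

println("argument size, single broadcast: ", sizeof(params), " bytes")
println("argument size, split broadcasts: ", sizeof(tp), " bytes")
```

Both versions compute the same values; the split version just keeps each call's arguments small, at the cost of more kernel launches on a GPU.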

@sriharshakandala
Member Author

sriharshakandala commented Oct 19, 2023

Upgrading to the current CUDA.jl main branch might solve this issue temporarily on A100 GPUs!
I opened a new issue for refactoring initialization code here: #2258

@simonbyrne
Member

@sriharshakandala Is this resolved?

@sriharshakandala
Member Author

Yes. The final piece, involving P100s, is resolved with a test branch, here: https://github.com/orgs/CliMA/projects/55/views/1?pane=issue&itemId=37383219
