Feature/mpi pio read2ndfile #145

Merged

ShervanGharari
Collaborator

This breaks the work in pull 129 into two steps. The first step (this pull) introduces the data structure changes needed to read a second nc file, which provides data for the water management component, such as fluxes from/to river segments and target volumes for lakes. Most of the changes in this pull are in the standalone part.

@ShervanGharari
Collaborator Author

Standalone, get basin runoff, and model setup are changed so that another nc file can be read and its data sorted based on nSeg for abstraction/injection or target volume.
The general changes are as follows:
1- A new data structure is added in DataType.
2- Read runoff and read metadata are generalized so that they do not rely on the runoff data structure and can read more general input/output.
3- If the second nc file is provided, a data structure infield_info is populated.
4- The start and end of the files for the second file are corrected based on the first time step of the runoff input nc files, so the second file can be read based on iTime and iTime_local_wm (the local iTime for the second file).
5- Two more variables, flux and target volume, are added in get basin runoff.
The code compiles but needs to be tested.

@ShervanGharari ShervanGharari marked this pull request as ready for review August 12, 2020 17:53
@ShervanGharari
Collaborator Author

I get a segmentation fault when reading the runoff data from the global HDMA case with CLM input.

[gra807:20491:0:20491] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000005abb41 read_runoff_mp_read_2d_runoff_()  /home/shg096/mizuRoute/route/build/src/standalone/read_runoff.f90:406
 2 0x00000000005a94d0 read_runoff_mp_read_runoff_data_()  /home/shg096/mizuRoute/route/build/src/standalone/read_runoff.f90:303
 3 0x00000000005ac471 get_runoff_mp_get_hru_runoff_()  /home/shg096/mizuRoute/route/build/src/standalone/get_basin_runoff.f90:67
 4 0x0000000000775666 MAIN__()  /home/shg096/mizuRoute/route/build/src/standalone/route_runoff.f90:88
 5 0x0000000000411b8e main()  ???:0
 6 0x00000000000202e0 __libc_start_main()  ???:0
 7 0x0000000000411aaa _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================

It comes from the line that passes the read and populated dummy array to the sim or sim2D variable:

https://github.com/ShervanGharari/mizuRoute/blob/feature/mpi-pio-read2ndfile/route/build/src/standalone/read_runoff.f90#L406

To generalize the reading for multiple data structures, rather than runoff only, I have generalized this part to pass a 2D array out of the read function.

@ShervanGharari
Collaborator Author

The segmentation fault occurred because sim or sim2d was not allocated. I have changed the argument to inout so that the already allocated runoff_data%sim or runoff_data%sim2d can be passed to the subroutines and assigned values.
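
A minimal sketch of the pattern described above, not the actual read_runoff.f90 code. It assumes the dummy argument is allocatable, in which case declaring it intent(out) would deallocate the caller's array on entry, whereas intent(inout) keeps the existing allocation. The names inout_demo and fill_values are hypothetical.

```fortran
program inout_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), allocatable :: sim(:)

  ! The caller owns the allocation, as with runoff_data%sim in the thread.
  allocate(sim(3))
  call fill_values(sim)
  print *, sim

contains

  ! Hypothetical stand-in for the generalized read routine.
  subroutine fill_values(array)
    ! intent(inout) preserves the caller's allocation; an allocatable
    ! dummy declared intent(out) would be deallocated on entry.
    real(dp), allocatable, intent(inout) :: array(:)
    integer :: i

    if (.not. allocated(array)) error stop 'array must be allocated by the caller'
    do i = 1, size(array)
      array(i) = real(i, dp)   ! placeholder for values read from the nc file
    end do
  end subroutine fill_values

end program inout_demo
```

The explicit allocated() check turns a missing allocation into a readable error instead of a segmentation fault.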

@ShervanGharari
Collaborator Author

ShervanGharari commented Aug 13, 2020

The code results in streamflow simulations identical to those of the NCAR/feature/mpi-pio branch for the HDMA, CLM case on 16 CPUs.
The code is able to read the flux passed to it. It reads the correct abstraction/injection nc files and their local time (local_iTime_wm)...
A more rigorous test is needed once the values read for the river segments are distributed across CPUs. This will be done in a new pull in mpi_process after this pull is merged.

route/build/src/standalone/model_setup.f90 (outdated, resolved)
route/build/src/public_var.f90 (outdated, resolved)
else
infileinfo_data(iFile)%unit = trim(time_units)
end if
call get_var_attr(trim(dir_name)//trim(inputfileinfo(iFile)%infilename), &
Collaborator

Users are allowed to specify the time units (e.g., days since 1980-01-01 00:00:00) and calendar in the control file, overriding the ones from the netcdf. This is because we need to enforce the time unit format (some formats are not recognized in mizuRoute) and the calendar name (only standard, noleap, gregorian, etc. are allowed). If the netcdf has a different time unit format or calendar name, it will cause problems later (maybe a run-time error). To handle the override, I put a check for the time unit here: https://github.com/NCAR/mizuRoute/blob/62837b893cf9831d955875fc61c9471d8ac5661a/route/build/src/standalone/model_setup.f90#L218.

The question is that we now have multiple input files. I am not sure whether the code needs to check that the calendar and time unit are the same for both input files. This is a usability issue (would users be annoyed by being forced to use the same time unit and calendar?).

Collaborator Author

I have added the calendar and time step from the control file, assuming that they should be the same as the water management calendar and time step.

! private subroutine: get the two infiledata and convert the iTimebound of
! the input_info_wm to match the input_info
! *********************************************************************
SUBROUTINE inFile_corr_time(inputfileinfo, & ! input: the structure of simulated runoff, evapo and
Collaborator

Hmm... I am not sure what this subroutine is actually doing. Does it just compare the time bounds between the two file streams (runoff vs. reach flux take)? Hasn't inputfileinfo_wm(:)%iTimebound(:) already been computed when inFile_pop is called (in init_inFile_pop)?

Collaborator Author

init_infile_pop populates iTimebound(:) for both the runoff netcdf and the abstraction/injection netcdf.
The second file might have a different start and end point in time (relative to the first, it might start earlier, later, or at the same time, and it might end earlier or later). To have a common reference for iTime and iTimelocal, I decided to correct this before going into the model, which makes infile_name simpler.
Examples:
1) If the first of the second files starts earlier than the runoff files, its initial iTimebound will be negative (e.g., -10000 days, meaning the file starts about 30 years earlier).
2) If the first of the second files starts later than the runoff files, then its iTimebound will be positive (e.g., it starts 1000 days after the runoff file).
In the first example the model should not have an issue: iTime_local_wm starts from 10000 while iTime and iTime_local are actually 1. In the second example iTime_local_wm will be -1000 if the model starts from the first iTime (the first runoff time step), and the model stops. However, if the starting point is set to some time after 1000 days, then the model works, as it can read the second file's values...
Alternatively, we can add this step into infile_name if that would be cleaner...
What are your suggestions on that?
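
The two examples can be sketched as a simple index shift. This is only an illustration under stated assumptions: the function names, the record count n_wm, and the exact off-by-one convention are hypothetical and may differ from what inFile_corr_time does in the PR.

```fortran
program local_time_index
  implicit none
  integer, parameter :: n_wm = 20000   ! made-up number of records in the second file
  integer :: iTime, iLocal

  iTime = 1   ! first runoff time step

  ! Example 1: the second file starts 10000 steps before the runoff file.
  iLocal = wm_local_index(iTime, 10000)
  print *, 'starts earlier: local index', iLocal, ' readable:', in_range(iLocal, n_wm)

  ! Example 2: the second file starts 1000 steps after the runoff file.
  iLocal = wm_local_index(iTime, -1000)
  print *, 'starts later:   local index', iLocal, ' readable:', in_range(iLocal, n_wm)

contains

  ! shift > 0: second file starts earlier than runoff; shift < 0: later.
  pure integer function wm_local_index(iStep, shift) result(ix)
    integer, intent(in) :: iStep   ! time index relative to the runoff files
    integer, intent(in) :: shift   ! offset between the two file streams
    ix = iStep + shift
  end function wm_local_index

  ! A local index can only be read if it falls inside the second file.
  pure logical function in_range(ix, nRecords)
    integer, intent(in) :: ix, nRecords
    in_range = (ix >= 1 .and. ix <= nRecords)
  end function in_range

end program local_time_index
```

In the second case the local index is negative at the first runoff step, which corresponds to the situation where the model has to stop unless the simulation starts late enough.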

@ShervanGharari
Collaborator Author

ShervanGharari commented Aug 25, 2020

I have added checks for the timing of the second netcdf files. The checks compare the start and end of the simulation to the start and end of the second netcdf files. There is no need to check the start and end of the runoff netcdf files against the second files, as the start and end of the simulation are updated if they are earlier or later than the runoff file.
Consider the following example:
runoff files 1980-01-01 to 1984-12-31
water management files 1975-01-01 to 1979-12-31
start_sim 1975-01-01
end_sim 1985-12-31
First, init_time will reset the start from 1975-01-01 to the start of the runoff file, which is 1980-01-01. Then the code checks start_sim against the water management file and, as it is past the last water management time step, the simulation stops. Similarly, if the water management file starts after the runoff file ends, the simulation stops because the first time step of the water management file will be past the last time step of the runoff file (which is the actual or updated sim_end).
The code compiles but needs further checks.
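
The example above can be traced with a short sketch. It assumes dates are stored as yyyymmdd integers (so plain integer comparison orders them) and that the check is a simple overlap test; the program and variable names are hypothetical and are not the ones used in model_setup.f90.

```fortran
program wm_time_check
  implicit none
  ! Dates from the example above, encoded as yyyymmdd integers.
  integer, parameter :: runoff_start = 19800101, runoff_end = 19841231
  integer, parameter :: wm_start     = 19750101, wm_end     = 19791231
  integer :: sim_start, sim_end

  sim_start = 19750101
  sim_end   = 19851231

  ! Step 1: the simulation period is first clipped to the runoff files.
  sim_start = max(sim_start, runoff_start)
  sim_end   = min(sim_end, runoff_end)
  print *, 'updated simulation period:', sim_start, sim_end

  ! Step 2: the clipped period is compared with the water management files.
  if (sim_start > wm_end .or. sim_end < wm_start) then
    print *, 'water management files do not cover the simulation: stop'
  else
    print *, 'water management files overlap the simulation: continue'
  end if
end program wm_time_check
```

With these dates the updated start (1980-01-01) is later than the last water management time step (1979-12-31), so the sketch takes the stop branch, matching the behaviour described in the comment.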

@ShervanGharari
Collaborator Author

The scatter_wm subroutine has been added to mpi_process.f90. The subroutine is not called yet. The code compiles.
The next step is to call scatter_wm and pass the distributed target volume and fluxes for main and tributary reaches to main_route.

@ShervanGharari
Collaborator Author

ShervanGharari commented Aug 26, 2020

More checks are added after scatter runoff for evaporation and precipitation in case the is_lake_sim flag is true. The code compiles.

@ShervanGharari
Collaborator Author

The variables of the second file are passed all the way down to main_route and are distributed to RCHFLX_OUT based on seg order (needs checking in main_route). The code compiles but needs to be tested more generally.
One test would be to feed the model simulation output back to the model as the second file. If the printed RCHFLX_OUT%REACH_WM_FLUX for a reach is identical to the simulated discharge in that river segment, then the scattering and ordering can be assumed to be implemented correctly.

@ShervanGharari
Collaborator Author

ShervanGharari commented Sep 1, 2020

I have performed the proposed test:
1- Simulated the runoff (HDMA + CLM)
2- Fed the runoff (which is in time*seg dimensions) to the model as the second nc file
3- Checked whether the passed RCHFLX_OUT%REACH_WM_FLUX is the same as RCHFLX_OUT%REACH_Q_IRF
To ensure that scatter_wm in mpi_process.f90 is working as it should, I ran the first step with 16 CPUs and the second step with 10.
The results show that the simulated reach runoff and the reach runoff read from the second file are very similar but not identical... A snippet can be seen here... Is this the result of the precision of reading and writing the files? Or of the single-precision format used to write the output files?

  SIM                          READ
  7.558259815608805E-005       7.558259676443413E-005
  9.010879396723562E-007       9.010879580273468E-007
  7.421825730101835E-007       7.421825785058900E-007
  9.062744859441921E-002       9.062744677066803E-002
  7.998593531490339E-002       7.998593896627426E-002
  5.284206748944350E-002       5.284206569194794E-002
  0.258958966906890            0.258958965539932     
  0.222698873564474            0.222698867321014     
  0.318773752833959            0.318773746490479     
  5.824489644434995E-005       5.824489562655799E-005
  4.921001208789678E-005       4.921001163893379E-005
  2.38470290873135             2.38470292091370     
  1.42609475347486             1.42609477043152     
  2.31175813896434             2.31175804138184     
  2.65326062034029             2.65326070785522     
  2.63590238006958             2.63590240478516     
  2.04380923184174             2.04380917549133     
  2.06412614986773             2.06412625312805     
  0.253412337054753            0.253412336111069     
  0.166440462818452            0.166440457105637     
  0.148173535820387            0.148173540830612     

@nmizukami
Collaborator

Yes, the output (write_simoutput_pio.f90) is in single precision (https://github.com/NCAR/mizuRoute/blob/62837b893cf9831d955875fc61c9471d8ac5661a/route/build/src/write_simoutput_pio.f90#L426). Variables in the output netcdf lose some precision. If you change ncd_float -> ncd_double, the output is written in double precision, and when you read it back in, it should keep the precision (I think).
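
The effect can be reproduced without any netcdf I/O. The sketch below is an illustration only: it takes the first SIM value from the table above, rounds it to single precision (standing in for a write as ncd_float), and converts it back to double precision (standing in for the read). The program name is hypothetical.

```fortran
program precision_roundtrip
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  real(dp) :: simulated, read_back, rel_diff

  simulated = 7.558259815608805e-5_dp          ! first SIM value in the table
  read_back = real(real(simulated, sp), dp)    ! double -> single -> double
  rel_diff  = abs(read_back - simulated) / abs(simulated)

  print '(a, es23.15)', 'simulated (double)     : ', simulated
  print '(a, es23.15)', 'after single round trip: ', read_back
  print '(a, es10.2)',  'relative difference    : ', rel_diff

  ! Rounding to single precision changes a value by at most about
  ! epsilon(1.0_sp) in relative terms.
  if (rel_diff > epsilon(1.0_sp)) error stop 'larger than single-precision rounding'
end program precision_roundtrip
```

Single precision carries roughly seven significant decimal digits, which is consistent with the SIM and READ columns agreeing in their leading digits and diverging only after that.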

@ShervanGharari
Collaborator Author

Thank you Naoki. The next test I will run is the case where only a handful of segs are available in the second nc file. We can check later what the result is with double precision instead. For now I think it is pretty clear where the difference comes from.

@ShervanGharari
Collaborator Author

ShervanGharari commented Sep 1, 2020

I have created new wm input data that has values for only two segments (sample files are attached in wm.zip). The simulation terminates shortly after it starts, with the error message below. When the model output that includes all the seg values is given as the second netcdf, the code runs without any issues; the problem arises when it is given the wm file that has only the two segments identified. Interestingly, the code crashes in a place where I cannot really understand why... It also never gets to get_basin_runoff.f90. What can it be?

srun: error: gra823: task 0: Floating point exception (core dumped)
srun: Terminating job step 37847655.0
slurmstepd: error: *** STEP 37847655.0 ON gra823 CANCELLED AT 2020-09-01T14:57:24 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
route_runoff.mpi-  0000000000932B4E  Unknown               Unknown  Unknown
libpthread-2.24.s  00002AE574946E90  Unknown               Unknown  Unknown
hmca_bcol_basesmu  00002AE58333CBB0  hmca_bcol_basesmu     Unknown  Unknown
libhcoll.so.1      00002AE5847222B5  hmca_coll_ml_barr     Unknown  Unknown
mca_coll_hcoll.so  00002AE5842BDF1A  mca_coll_hcoll_ba     Unknown  Unknown
libmpi.so.40.10.2  00002AE573E25C21  PMPI_Barrier          Unknown  Unknown
libmpi_mpifh.so.4  00002AE573B7E553  MPI_Barrier_f08       Unknown  Unknown
route_runoff.mpi-  0000000000465B89  mpi_mod_mp_shr_mp         748  mpi_utils.f90
route_runoff.mpi-  00000000006FA65C  mpi_routine_mp_sc        1172  mpi_process.f90
route_runoff.mpi-  00000000006F0CD4  mpi_routine_mp_mp         775  mpi_process.f90
route_runoff.mpi-  0000000000780F1F  MAIN__                     94  route_runoff.f90
route_runoff.mpi-  0000000000411B8E  Unknown               Unknown  Unknown
libc-2.24.so       00002AE574D772E0  __libc_start_main     Unknown  Unknown
route_runoff.mpi-  0000000000411AAA  Unknown               Unknown  Unknown

I am just checking: can it be the result of the allocation of wm_data starting from line 869 in model_setup.f90? There the size is only 2 instead of the total number of reachID. We take care of the missing values in sort; however, I am wondering whether this causes the problem for MPI.

   ! allocate the hru_ix based on number of hru_id presented in the
   allocate(wm_data_in%seg_ix(size(wm_data_in%seg_id)), stat=ierr)
   if(ierr/=0)then; message=trim(message)//'problem allocating runoff_data_in%hru_ix'; return; endif

   ! get indices of the seg ids in the input file in the routing layer
   call get_qix(wm_data_in%seg_id,  &    ! input: vector of ids in mapping file
                reachID,            &    ! input: vector of ids in the routing layer
                wm_data_in%seg_ix,  &    ! output: indices of hru ids in routing layer
                ierr, cmessage)          ! output: error control
   if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif

@ShervanGharari
Collaborator Author

It seems that the previous issue can be solved if the second input is given for the full set of segments that exist in the river network topology. The input should be given as such:

    seg  1    2         3    4
time  
1        3.0  missing   5    missing
2        3.2  missing   5.2  missing
.
.
.
5        6.1  missing   5.9  missing

instead of

    seg  1     3    
time
1        3     5    
2        3.2   5.2  
.
.
.
5        6.1   5.9  

I will still need to prepare a case for this and check...
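
As a rough sketch of how a sparse input could be expanded to the full layout shown above (first time step only), assuming a sentinel for missing values: the value -9999 and all names below are hypothetical, and this is not the get_qix routine from the PR.

```fortran
program expand_wm_input
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: missing = -9999.0_dp      ! hypothetical missing-value sentinel
  integer, parameter :: reach_id(4) = [1, 2, 3, 4] ! all segments in the network
  integer, parameter :: seg_id(2)   = [1, 3]       ! segments present in the sparse file
  real(dp), parameter :: flux_in(2) = [3.0_dp, 5.0_dp]
  real(dp) :: flux_full(size(reach_id))
  integer :: i, j

  ! Start with every segment marked as missing.
  flux_full = missing

  ! Copy each available value to the position of its segment id.
  do i = 1, size(seg_id)
    do j = 1, size(reach_id)
      if (reach_id(j) == seg_id(i)) then
        flux_full(j) = flux_in(i)
        exit
      end if
    end do
  end do

  print '(4f10.1)', flux_full

  ! Segments 2 and 4 had no input and must still be flagged as missing.
  if (flux_full(2) /= missing .or. flux_full(4) /= missing) error stop 'unexpected fill'
end program expand_wm_input
```

Writing the second file with one column per network segment keeps the array size equal to the number of reachID values, which is the condition suggested above for avoiding the crash.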

… this moment there is no abstraction or injection or target volume to the lake as there is no lake module
@ShervanGharari
Collaborator Author

I have added the abstraction/injection to the irf. The idea is as follows:
if reach streamflow - abstraction > 0 then
actual_abs = abstraction and reach streamflow = reach streamflow - abstraction
else
actual_abs = reach streamflow and reach streamflow = 0;
I have compiled the code, but I ran into an issue with srun on Graham in the submitted jobs, so I have not yet tested the code.
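
The rule above can be written as a small self-contained sketch. The subroutine and variable names are hypothetical and this is not the code added to irf_route.f90; it only mirrors the two branches as stated and checks that the initial streamflow equals the remaining streamflow plus the actual abstraction.

```fortran
program abstraction_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: q, q_initial, actual_abs

  ! Case 1: enough water in the reach to satisfy the demand.
  q_initial = 2.5_dp
  q = q_initial
  call apply_abstraction(q, 1.0_dp, actual_abs)
  print *, 'remaining flow:', q, ' actual abstraction:', actual_abs
  if (abs(q_initial - (q + actual_abs)) > 0.0_dp) error stop 'water balance violated'

  ! Case 2: demand exceeds the flow, so the reach is drained to zero.
  q_initial = 0.4_dp
  q = q_initial
  call apply_abstraction(q, 1.0_dp, actual_abs)
  print *, 'remaining flow:', q, ' actual abstraction:', actual_abs
  if (abs(q_initial - (q + actual_abs)) > 0.0_dp) error stop 'water balance violated'

contains

  subroutine apply_abstraction(streamflow, abstraction, actual_abs)
    real(dp), intent(inout) :: streamflow   ! reach streamflow, updated in place
    real(dp), intent(in)    :: abstraction  ! requested abstraction
    real(dp), intent(out)   :: actual_abs   ! abstraction actually taken

    if (streamflow - abstraction > 0.0_dp) then
      actual_abs = abstraction
      streamflow = streamflow - abstraction
    else
      actual_abs = streamflow
      streamflow = 0.0_dp
    end if
  end subroutine apply_abstraction

end program abstraction_demo
```

In this sketch a negative abstraction value would pass through the first branch and increase the streamflow, i.e. act as an injection; whether the PR uses that sign convention is an assumption.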

… the streamflow after abstraction and actual abstraction are compared to give a sum of zero
@ShervanGharari
Collaborator Author

The code compiles. The water balance difference between the initial streamflow, the streamflow after abstraction, and the actual abstraction is checked to be zero. It needs more rigorous testing.
For now, pull 145 can be merged into the main code. Additionally, before the merge, I can make the actual abstraction be written to the output file.

route/build/src/irf_route.f90 (outdated, resolved)
route/build/src/irf_route.f90 (resolved)
@ShervanGharari
Collaborator Author

code compiles successfully...!

@nmizukami nmizukami merged commit 06655b2 into ESCOMP:feature/mpi-pio Oct 27, 2020