diff --git a/docs/sphinx/source/developer.rst b/docs/sphinx/source/developer.rst
index c164a6667..818e273a8 100644
--- a/docs/sphinx/source/developer.rst
+++ b/docs/sphinx/source/developer.rst
@@ -7,5 +7,6 @@ This page contains documentation intended for developers.
    :glob:
 
    developer/programmer_notes
+   developer/mpi_comms
    developer/cuda
    developer/tests
diff --git a/docs/sphinx/source/developer/mpi_comms.rst b/docs/sphinx/source/developer/mpi_comms.rst
new file mode 100644
index 000000000..4fcf4ac9d
--- /dev/null
+++ b/docs/sphinx/source/developer/mpi_comms.rst
@@ -0,0 +1,155 @@
+MPI Communication
+#################
+
+SIROCCO is parallelised using the Message Passing Interface (MPI). This page contains information on how data is shared
+between ranks and should serve as a basic set of instructions for extending or modifying the data communication
+routines.
+
+In general, all calls to MPI are isolated from the rest of SIROCCO. Most, if not all, of the MPI code is contained
+within five source files, which deal entirely with parallelisation or communication. Currently these files are:
+
+- :code:`communicate_macro.c`
+- :code:`communicate_plasma.c`
+- :code:`communicate_spectra.c`
+- :code:`communicate_wind.c`
+- :code:`para_update.c`
+
+The file names indicate the sort of code each one contains. If you need to extend or implement a new function for MPI,
+please place it either in one of the above files or create a new file with an appropriately similar name. Any parallel
+code should be wrapped by :code:`#ifdef MPI_ON` and :code:`#endif`, as shown in the code example below:
+
+.. code:: c
+
+    void communication_function(void)
+    {
+    #ifdef MPI_ON
+      /* MPI communication code should go between the #ifdef's here */
+    #endif
+    }
+
+Don't forget to update the Makefile and :code:`templates.h` if you add a new file or function.
+
+Communication pattern: broadcasting data to all ranks
+=====================================================
+
+By far the most common communication pattern in SIROCCO (and possibly the only pattern) is to broadcast data from one
+rank to all other ranks. This is done, for example, to update and synchronise the plasma or macro atom grids in each
+rank. As the data structures in SIROCCO are fairly complex and use pointers/dynamic memory allocation, we are forced to
+manually pack and unpack a contiguous communication buffer, which results in a fairly manual (and error-prone) process
+for communicating data.
+
+Calculating the size of the communication buffer
+------------------------------------------------
+
+The size of the communication buffer has to be calculated manually, by counting the number of variables being copied
+into it and converting this to the appropriate number of bytes. This is done by the :code:`calculate_comm_buffer_size`
+function, which takes two arguments: 1) the number of :code:`int` variables and 2) the number of :code:`double`
+variables. We have to *manually* count the number of :code:`int` and :code:`double` variables being communicated. Due
+to the manual nature of this, great care has to be taken to ensure the correct number is counted, otherwise MPI will
+crash during communication.
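+
+The conversion from a count of variables to a number of bytes can be done portably with :code:`MPI_Pack_size`, which
+returns an upper bound on the space needed to pack a given number of elements of a given datatype. The snippet below
+is a minimal sketch of how such a helper could be written; it is illustrative only and is not necessarily identical to
+the implementation in the SIROCCO source.
+
+.. code:: c
+
+    #include <mpi.h>
+
+    /* Illustrative sketch: convert counts of ints and doubles into a buffer size in bytes.
+       MPI_Pack_size includes any packing overhead added by the MPI implementation */
+    int
+    calculate_comm_buffer_size_sketch (const int num_ints, const int num_doubles)
+    {
+      int int_bytes = 0;
+      int double_bytes = 0;
+
+      MPI_Pack_size (num_ints, MPI_INT, MPI_COMM_WORLD, &int_bytes);
+      MPI_Pack_size (num_doubles, MPI_DOUBLE, MPI_COMM_WORLD, &double_bytes);
+
+      return int_bytes + double_bytes;
+    }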
+
+When counting variables, one needs to count the number of *single* variables of a given type as well as the number of
+elements in any arrays of that same type. Consider the example below:
+
+.. code:: c
+
+    int my_int;
+    int *my_int_arr = malloc(10 * sizeof(int));
+    int num_ints = 11;
+
+In this case there are 11 :code:`int` values to be communicated. In practice, calculating the size of the
+communication buffer is usually done as in the code example below:
+
+.. code:: c
+
+    /* We need to ensure the buffer is large enough, as some ranks may be sending a smaller
+       communication buffer. When communicating the plasma grid, for example, some ranks may send
+       10 cells whilst others may send 9. Therefore we need the buffer to be big enough to receive
+       10 cells of data */
+    int n_cells_max = get_max_cells_per_rank(NDIM2);
+
+    /* Count the number of integers which will be copied to the communication buffer. In this
+       example (20 + 2 * nphot_total + 1) is the number of ints being sent PER CELL:
+       20 corresponds to 20 single ints, 2 * nphot_total corresponds to 2 arrays with nphot_total
+       elements each, and the + 1 is an extra int to send the cell number. The extra + 1 at the
+       end is used to communicate the number of cells in the buffer */
+    int num_ints = n_cells_max * (20 + 2 * nphot_total + 1) + 1;
+
+    /* Count the number of doubles to send, following the same arguments as above */
+    int num_doubles = n_cells_max * (71 + 2 * NXBANDS + 6 * nphot_total);
+
+    /* Using the counts above, we can calculate the buffer size in bytes and then allocate memory */
+    int comm_buffer_size = calculate_comm_buffer_size(num_ints, num_doubles);
+    char *comm_buffer = malloc(comm_buffer_size);
+
+Communication implementation
+----------------------------
+
+The general pattern for packing data into a communication buffer and then sharing it between ranks is as follows:
+
+- Loop over all of the MPI ranks (in MPI_COMM_WORLD).
+- If the loop variable is equal to a rank's ID, that rank will broadcast its subset of data to the other ranks. This
+  rank uses :code:`MPI_Pack` to copy its data into the communication buffer.
+- All ranks call :code:`MPI_Bcast`, which sends data from the root rank (the rank which has just put its data into the
+  communication buffer) and receives it into all non-root ranks.
+- Non-root ranks use :code:`MPI_Unpack` to copy data from the communication buffer into the appropriate location.
+- This is repeated until every MPI rank has acted as the root rank, so each rank has received data from all other
+  ranks.
+
+In code, this looks something like this:
+
+.. code:: c
+
+    char *comm_buffer = malloc(comm_buffer_size);
+
+    /* loop over all mpi ranks */
+    for (int rank = 0; rank < np_mpi_global; ++rank)
+    {
+      /* if rank == your rank id, then pack data into comm_buffer. This is the root rank */
+      if (rank_global == rank)
+      {
+        /* communicate the number of cells the other ranks will have to unpack. n_cells_rank
+           is usually provided via a function argument */
+        MPI_Pack(&n_cells_rank, 1, MPI_INT, comm_buffer, ...);
+        /* start and stop refer to the first and last cell of the subset of cells which this
+           rank has updated or is broadcasting. start and stop are usually provided via
+           function arguments */
+        for (int n_plasma = start; n_plasma < stop; ++n_plasma)
+        {
+          MPI_Pack(&plasmamain[n_plasma].nwind, 1, MPI_INT, comm_buffer, ...);
+        }
+      }
+
+      /* every rank calls MPI_Bcast: the root rank will send data and non-root ranks
+         will receive data */
+      MPI_Bcast(comm_buffer, comm_buffer_size, ...);
+
+      /* if you aren't the root rank, then unpack data from the comm buffer */
+      if (rank_global != rank)
+      {
+        /* unpack the number of cells communicated, so we know how many cells of data
+           we need to unpack */
+        MPI_Unpack(comm_buffer, ..., &n_cells_communicated, 1, MPI_INT, ...);
+        /* now we can unpack back into the appropriate data structure */
+        for (int n_plasma = 0; n_plasma < n_cells_communicated; ++n_plasma)
+        {
+          MPI_Unpack(comm_buffer, ..., &plasmamain[n_plasma].nwind, 1, MPI_INT, ...);
+        }
+      }
+    }
+
+This is likely the best method for communicating data in SIROCCO, given the complexity of the data structures.
+Unfortunately, there are few structures or situations where a derived data type could be used to simplify the code,
+because none of the structures are contiguous in memory.
+
+Adding a new variable to an existing communication
+--------------------------------------------------
+
+To add a new variable to an existing broadcast routine (a sketch of these steps is given after the list):
+
+- Increment the appropriate count, or the relevant argument to :code:`calculate_comm_buffer_size`, to allocate
+  additional space in the communication buffer. For example, if the new variable is an :code:`int` in the plasma grid,
+  then update :code:`n_cells_max * (20 + 2 * nphot_total + 1)` to :code:`n_cells_max * (21 + 2 * nphot_total + 1)`.
+- In the block where :code:`rank == rank_global`, add a new call to :code:`MPI_Pack`, using the code which is already
+  there as an example.
+- In the block where :code:`rank != rank_global`, add a new call to :code:`MPI_Unpack`, using the code which is already
+  there as an example.
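+
+The example below sketches these steps for a hypothetical new integer field :code:`my_new_int` in the plasma
+structure. The field name is illustrative only, and the surrounding variables (:code:`position`, :code:`comm_buffer`
+and so on) follow the simplified broadcast loop shown above.
+
+.. code:: c
+
+    /* 1. Enlarge the buffer: one extra int per cell, so the per-cell count goes from 20 to 21 */
+    int num_ints = n_cells_max * (21 + 2 * nphot_total + 1) + 1;
+
+    /* 2. In the rank_global == rank block, pack the new (hypothetical) field for each cell */
+    MPI_Pack(&plasmamain[n_plasma].my_new_int, 1, MPI_INT, comm_buffer, comm_buffer_size,
+             &position, MPI_COMM_WORLD);
+
+    /* 3. In the rank_global != rank block, unpack the field again for each cell */
+    MPI_Unpack(comm_buffer, comm_buffer_size, &position, &plasmamain[n_plasma].my_new_int, 1,
+               MPI_INT, MPI_COMM_WORLD);
+
+Note that every :code:`MPI_Unpack` call must appear in exactly the same order as the corresponding :code:`MPI_Pack`
+call, otherwise the data will be unpacked into the wrong variables.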
diff --git a/source/communicate_macro.c b/source/communicate_macro.c
index 433097ee7..573404861 100644
--- a/source/communicate_macro.c
+++ b/source/communicate_macro.c
@@ -5,8 +5,6 @@
  *
  * @brief Functions for communicating macro atom properties
  *
- * @TODO: as much as this as possible should use non-blocking communication
- *
  ***********************************************************/
 
 #include
@@ -27,7 +25,12 @@
  *
  * @details
  *
- * The communication pattern is as outlined in broadcast_updated_plasma_properties.
+ * The communication pattern and how the size of the communication buffer is
+ * determined is documented in
+ * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst.
+ * To communicate a new variable, the communication buffer needs to be made
+ * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the
+ * developer documentation for more details.
* **********************************************************/ @@ -44,7 +47,13 @@ broadcast_macro_atom_emissivities (const int n_start, const int n_stop, const in d_xsignal (files.root, "%-20s Begin macro atom emissivity communication\n", "NOK"); const int n_cells_max = get_max_cells_per_rank (NPLASMA); const int comm_buffer_size = calculate_comm_buffer_size (1 + n_cells_max, n_cells_max * (1 + nlevels_macro)); + char *comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_macro_atom_emissivities: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (current_rank = 0; current_rank < np_mpi_global; current_rank++) { @@ -92,7 +101,12 @@ broadcast_macro_atom_emissivities (const int n_start, const int n_stop, const in * * @details * - * The communication pattern is as outlined in broadcast_updated_plasma_properties. + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. * **********************************************************/ @@ -109,7 +123,13 @@ broadcast_macro_atom_recomb (const int n_start, const int n_stop, const int n_ce d_xsignal (files.root, "%-20s Begin macro atom recombination communication\n", "NOK"); const int n_cells_max = get_max_cells_per_rank (NPLASMA); const int comm_buffer_size = calculate_comm_buffer_size (1 + n_cells_max, n_cells_max * (2 * size_alpha_est + 2 * nphot_total)); + char *const comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_macro_atom_recomb: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (current_rank = 0; current_rank < np_mpi_global; ++current_rank) { @@ -178,7 +198,12 @@ broadcast_macro_atom_recomb (const int n_start, const int n_stop, const int n_ce * * @details * - * The communication pattern is as outlined in broadcast_updated_plasma_properties. + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. 
* **********************************************************/ @@ -195,7 +220,13 @@ broadcast_updated_macro_atom_properties (const int n_start, const int n_stop, co d_xsignal (files.root, "%-20s Begin macro atom updated properties communication\n", "NOK"); const int n_cells_max = get_max_cells_per_rank (NPLASMA); const int comm_buffer_size = calculate_comm_buffer_size (1 + 3 * n_cells_max, n_cells_max * (6 * size_gamma_est + 2 * size_Jbar_est)); + char *const comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_updated_macro_atom_properties: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (current_rank = 0; current_rank < np_mpi_global; ++current_rank) { @@ -260,6 +291,13 @@ broadcast_updated_macro_atom_properties (const int n_start, const int n_stop, co * and should only be called if geo.rt_mode == RT_MODE_MACRO * and nlevels_macro > 0 * + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. + * **********************************************************/ int @@ -273,7 +311,13 @@ broadcast_macro_atom_state_matrix (int n_start, int n_stop, int n_cells_rank) const int matrix_size = nlevels_macro + 1; const int n_cells_max = get_max_cells_per_rank (NPLASMA); const int comm_buffer_size = calculate_comm_buffer_size (1 + n_cells_max, n_cells_max * (matrix_size * matrix_size)); + char *comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_macro_atom_state_matrix: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (n_mpi = 0; n_mpi < np_mpi_global; n_mpi++) { diff --git a/source/communicate_plasma.c b/source/communicate_plasma.c index 9babec05a..648ca7eda 100644 --- a/source/communicate_plasma.c +++ b/source/communicate_plasma.c @@ -5,8 +5,6 @@ * * @brief Functions for communicating plasma properties * - * @TODO: as much as this as possible should use non-blocking communication - * ***********************************************************/ #include @@ -19,7 +17,7 @@ /**********************************************************/ /** - * @brief + * @brief Broadcast the (initialised) plasma grid to all ranks * * @param [in] int n_start The index of the first cell updated by this rank * @param [in] int n_stop The index of the last cell updated by this rank @@ -27,7 +25,14 @@ * * @details * - * The communication pattern is as outlined in dated_plasma_properties. + * This should only be called once, after grid initialisation. + * + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. 
* **********************************************************/ @@ -50,6 +55,12 @@ broadcast_plasma_grid (const int n_start, const int n_stop, const int n_cells_ra 11 * NXBANDS + NBINS_IN_CELL_SPEC + 6 * NFLUX_ANGLES + N_DMO_DT_DIRECTIONS + 12 * NFORCE_DIRECTIONS)); char *comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_plasma_grid: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } + for (current_rank = 0; current_rank < np_mpi_global; current_rank++) { position = 0; @@ -376,12 +387,17 @@ broadcast_plasma_grid (const int n_start, const int n_stop, const int n_cells_ra * * @details * - * The communication pattern is as outlined in dated_plasma_properties. + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. * * ### Notes ### * * When this is called in wind update, there is redundant information being - * communicated in `dated_plasma_properties` which communicates the exact (but + * communicated in `broadcast_updated_plasma_properties` which communicates the exact (but * probably incorrect) data this function does. A refactor to clean this up could * be done in the future to avoid the extra communication latency from * communicating the data twice. @@ -401,6 +417,11 @@ broadcast_wind_luminosity (const int n_start, const int n_stop, const int n_cell const int n_cells_max = get_max_cells_per_rank (NPLASMA); const int comm_buffer_size = calculate_comm_buffer_size (1 + n_cells_max, 4 * n_cells_max); char *const comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_wind_luminosity: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (current_rank = 0; current_rank < np_mpi_global; ++current_rank) { @@ -454,12 +475,17 @@ broadcast_wind_luminosity (const int n_start, const int n_stop, const int n_cell * * @details * - * The communication pattern is as outlined in dated_plasma_properties. + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. * * ### Notes ### * * When this is called in wind update, there is redundant information being - * communicated in `dated_plasma_properties` which communicates the exact (but + * communicated in `broadcast_updated_plasma_properties` which communicates the exact (but * probably incorrect) data this function does. A refactor to clean this up could * be done in the future to avoid the extra communication latency from * communicating the data twice. 
@@ -479,6 +505,11 @@ broadcast_wind_cooling (const int n_start, const int n_stop, const int n_cells_r const int n_cells_max = get_max_cells_per_rank (NPLASMA); const int comm_buffer_size = calculate_comm_buffer_size (1 + n_cells_max, 9 * n_cells_max); char *const comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("broadcast_wind_cooling: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (current_rank = 0; current_rank < np_mpi_global; ++current_rank) { @@ -535,26 +566,18 @@ broadcast_wind_cooling (const int n_start, const int n_stop, const int n_cells_r /** * @brief Communicate changing properties in the plasma cells between ranks. * - * @param [in] int n_start the first cell this rank will communicate - * @param [in] int n_stop the last cell this rank will communicate + * @param [in] int n_start the first cell this rank will communicate + * @param [in] int n_stop the last cell this rank will communicate + * @param [in] int n_cells_rank the number of cells this rank will communicate * * @details * - * This makes sure each rank has an updated plasma grid. The way the - * communication is setup is as follows: - * - * - We create a loop over each MPI rank in the MPI_COMM_WORLD communicator - * - If the loop variable is equal to the current rank, the subset of cells that - * rank worked on are packed into `comm_buffer` which is broadcast to all - * ranks. - * - All other ranks unpack that data into the plasma cell. - * - * As well as the properties of the plasma cells, the number of cells - * communicated and the cell numbers are also communicated. The size of the - * comm buffer is currently the minimum size required. To communicate more data - * you need to increase the size of the comm buffer. - * - * @TODO: we need to find out what data is not required + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. 
* **********************************************************/ @@ -576,6 +599,11 @@ broadcast_updated_plasma_properties (const int n_start_rank, const int n_stop_ra 1 * n_inner_tot + 9 * NXBANDS + 1 * NBINS_IN_CELL_SPEC); const int size_of_comm_buffer = calculate_comm_buffer_size (num_ints, num_doubles); char *const comm_buffer = malloc (size_of_comm_buffer); + if (comm_buffer == NULL) + { + Error ("broadcast_updated_plasma_properties: Error in allocating memory for comm_buffer\n"); + Exit (EXIT_FAILURE); + } for (n_mpi = 0; n_mpi < np_mpi_global; n_mpi++) { diff --git a/source/communicate_spectra.c b/source/communicate_spectra.c index 2837a0103..df0d6361f 100644 --- a/source/communicate_spectra.c +++ b/source/communicate_spectra.c @@ -5,8 +5,6 @@ * * @brief Functions for communicating wind properties * - * @TODO: as much as this as possible should use non-blocking communication - * ***********************************************************/ #include diff --git a/source/communicate_wind.c b/source/communicate_wind.c index bf406d71d..9b96840b2 100644 --- a/source/communicate_wind.c +++ b/source/communicate_wind.c @@ -5,8 +5,6 @@ * * @brief Functions for communicating wind properties * - * @TODO: as much as this as possible should use non-blocking communication - * ***********************************************************/ #include @@ -19,7 +17,7 @@ /**********************************************************/ /** - * @brief + * @brief Broadcast the wind grid, `wmain`, to all ranks * * @param [in] int n_start The index of the first cell updated by this rank * @param [in] int n_stop The index of the last cell updated by this rank @@ -27,11 +25,14 @@ * * @details * - * The communication pattern is as outlined in broadcast_updated_plasma_properties. + * This function should only be called once, after grid initialisation. * - * We do not communicate the Wind_Paths_Ptr fields as the code which initialises - * and works entirely in serial, so there is no benefit to communicating between - * ranks. + * The communication pattern and how the size of the communication buffer is + * determined is documented in + * $SIROCCO/docs/sphinx/source/developer/mpi_comms.rst. + * To communicate a new variable, the communication buffer needs to be made + * bigger and a new `MPI_Pack` and `MPI_Unpack` call need to be added. See the + * developer documentation for more details. * **********************************************************/ @@ -72,11 +73,19 @@ broadcast_wind_grid (const int n_start, const int n_stop, const int n_cells_rank MPI_Type_create_struct (count, block_lengths, block_offsets, block_types, &wcone_derived_type); MPI_Type_commit (&wcone_derived_type); - /* We also have to also account for some derived types */ + /* Calculate the size of the communication buffer */ const int n_cells_max = get_max_cells_per_rank (NDIM2); + const int num_ints = 5 * n_cells_max + 1; + const int num_doubles = n_cells_max * (13 + 3 * 3 + 1 * 9); // *3 for x, xcen... 
*9 for v_grad MPI_Pack_size (n_cells_max, wcone_derived_type, MPI_COMM_WORLD, &bytes_wcone); const int comm_buffer_size = calculate_comm_buffer_size (1 + 5 * n_cells_max, n_cells_max * (13 + 3 * 3 + 1 * 9)) + bytes_wcone; + char *comm_buffer = malloc (comm_buffer_size); + if (comm_buffer == NULL) + { + Error ("Unable to allocate memory for communication buffer in broadcast_wind_grid\n"); + Exit (EXIT_FAILURE); + } for (current_rank = 0; current_rank < np_mpi_global; current_rank++) { @@ -151,8 +160,8 @@ broadcast_wind_grid (const int n_start, const int n_stop, const int n_cells_rank } } - MPI_Type_free (&wcone_derived_type); free (comm_buffer); + MPI_Type_free (&wcone_derived_type); d_xsignal (files.root, "%-20s Finished communication of wind grid\n", "NOK"); #endif } diff --git a/source/para_update.c b/source/para_update.c index 94b46236c..87d8a05d0 100644 --- a/source/para_update.c +++ b/source/para_update.c @@ -1,12 +1,10 @@ /***********************************************************/ /** @file para_update.c - * @author ksl, jm + * @author ksl, jm, ejp * @date January, 2018 * - * @brief routines for communicating MC estimators and spectra between MPI ranks. - * - * @TODO: as much as this as possible should use non-blocking communication + * @brief Utility functions for MPI parallelisation * ***********************************************************/