Parallel ID Issues
Greg Sjaardema edited this page Jan 25, 2022 · 4 revisions
[Saving an email response to a customer who is seeing confusing output from serial/parallel runs. Will try to expand this later into a better explanation]
The map issue with the decomposed/recomposed files is as follows:
- Assume input mesh has node and/or element map
- On decomposition, the maps are hijacked in the spread files and used instead to relate the local implicit (1..N) nodes/elements in each spread file back to the implicit serial nodes/elements
- For example, if we decompose a 5-node serial mesh with node map 10…20…30…40…50 to two processors:
- P0 has first three nodes, so its map will be 1…2…3
- P1 has last three nodes, so its map will be 3…4…5
- Note that the original node map in the serial file is gone at this point. This problem has existed in nem_spread since the beginning of its existence.
- To try to retain the original serial mesh node map, additional node/element maps named original_global_id_map were added to nem_spread several years ago:
- P0 will have the entries “10, 20, 30” in this map
- P1 will have the entries “30, 40, 50” in this map.
- If an application reads the original_global_id_map, then it can provide the user with the same node/element ids in parallel as in serial. If the application does not read the original_global_id_map, then the node/element ids will be the serial mesh implicit node ids.
- In Sierra, in the “pre-original_global_id_map” times, there was an option to ignore the node/element maps in serial runs so that the node/element ids would match in serial and parallel runs.
- I then added the capability to output the original_global_id_map to nem_spread and to read the maps to IOSS, and life was good again…
- [Fixed in EPU-5.0] EPU does not read the original_global_id_map, so if you do a decomp followed by an immediate epu, the resulting mesh will only have the 1…N map and the original node/element id map will be lost.
- An exodiff at this point will give an error since the original mesh had 10,20,30,40,50 and the epu’d mesh has 1,2,3,4,5.
- IOSS combines the global->serial implicit node map with the original_global_id_map and presents the client application with the doubly mapped ids (10,20,30,40,50 in the example above).
- On output, the files will have the 10,20,30 and 30,40,50 maps and epu will create an output file with the 10,20,30,40,50 map.
- Exodiff at this point will work since original and epu’d mesh have same 10,20,30,40,50 map
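The 5-node example above can be sketched in a few lines of Python. This is only a toy model of the bookkeeping, not the nem_spread, IOSS, or EPU code: the names `spread_map`, `compose`, and `rejoin` are made up for illustration, and `rejoin` assumes that the rejoined ids are simply the sorted union of the per-processor ids.

```python
# Toy model of the 5-node example; no Exodus/IOSS calls are made.
serial_node_map = [10, 20, 30, 40, 50]  # ids stored in the serial file

# Spread-file maps: local node -> implicit (1-based) serial node number.
spread_map = {
    "P0": [1, 2, 3],
    "P1": [3, 4, 5],
}


def compose(local_to_serial, serial_map):
    """Double mapping: local -> implicit serial -> original id."""
    return [serial_map[i - 1] for i in local_to_serial]


def rejoin(per_proc_ids):
    """Assumed rejoin model: sorted union of ids (node 3 is shared)."""
    return sorted(set().union(*per_proc_ids))


# Hypothetical contents of original_global_id_map on each processor.
original_global_id_map = {
    proc: compose(ids, serial_node_map) for proc, ids in spread_map.items()
}

# Map seen after rejoining, without and with original_global_id_map.
ids_without = rejoin(spread_map.values())
ids_with = rejoin(original_global_id_map.values())

print(original_global_id_map)
print(ids_without)
print(ids_with)
```

Without the extra map the rejoined ids are the implicit 1..5, which no longer match the serial file; with it, the 10..50 ids are recovered.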
If the original file has a node map that is a permutation of 1..#nodes, then there will probably be confusion following an epu since the original and epu'd file will look similar, but the ids will be scrambled.
- For simplicity we will reduce this down to a 5-node mesh. Assume the node map is 1,5,2,4,3
- Decompose to 2 processors.
- P0 has first three nodes, so its map will be 1…2…3
- P1 has last three nodes, so its map will be 3…4…5
- Note that the original node map in the serial file is gone at this point.
- The original_global_id_map will contain:
- P0 will have the entries “1,5,2”
- P1 will have the entries “2,4,3”
- If the application does not read the original_global_id_map, on output the maps are:
- P0 has 1,2,3
- P1 has 3,4,5
- EPU rejoins the files. It finds that all of the nodes 1..5 exist in the output file, so it puts the nodes in order and doesn’t output the 1,2,3,4,5 map since it can be regenerated implicitly.
- EXODIFF finds original file has nodes 1,5,2,4,3 and new file has 1,2,3,4,5.
- It maps 1..1, 2..2, 3..3, 4..4, 5..5 and finds that they have different coordinates
- Outputs a “metadata mismatch error/warning”. [I need to check why it says “metadata” in this error message]
- If I do an “exodiff -match_file_order old.g new.g”, I should get a valid “no difference” output.
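The permuted-map case can be illustrated the same way. The sketch below is a simplified model of the two comparison strategies, not exodiff's actual algorithm; the coordinate values and the helper names `mismatches_by_id` and `mismatches_by_file_order` are hypothetical. It assumes both files store the nodes in the same order and that only the id maps differ.

```python
# Toy model: both files hold the same nodes in the same file order.
coords = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical x-coordinates by file position

old_ids = [1, 5, 2, 4, 3]  # node map in the original serial file
new_ids = [1, 2, 3, 4, 5]  # implicit map in the epu'd file


def mismatches_by_id(ids_a, xyz_a, ids_b, xyz_b):
    """Pair nodes that carry the same id, then compare coordinates."""
    pos_b = {nid: i for i, nid in enumerate(ids_b)}
    return [nid for i, nid in enumerate(ids_a) if xyz_a[i] != xyz_b[pos_b[nid]]]


def mismatches_by_file_order(xyz_a, xyz_b):
    """Pair nodes by position in the file, ignoring the id maps."""
    return [i + 1 for i, (a, b) in enumerate(zip(xyz_a, xyz_b)) if a != b]


by_id = mismatches_by_id(old_ids, coords, new_ids, coords)
by_order = mismatches_by_file_order(coords, coords)

print(by_id)
print(by_order)
```

Pairing by id reports differing coordinates for every node whose id moved, while pairing by file position reports none, which is the behavior the file-order comparison is expected to give here.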