
Integration branch (Uniform refine earlier) #4638

Merged: 4 commits merged into devel on Feb 18, 2015

Conversation

@permcody
Member

permcody commented Feb 3, 2015

This PR moves the uniform refinement steps up into the mesh setup stage. The original code path remains for the case where restart and uniform refinements are both needed.

Important note: the definition of uniform_refinement has changed for oversampled meshes. Previously that number meant how many refinements you wanted relative to the original mesh; now it means the number of refinements to apply after the mesh has been set up, which already includes the initial uniform refinement steps. For example, if the mesh setup applies two uniform refinements and the oversampled output asks for one more, the oversampled mesh is now three levels finer than the original mesh rather than one.

Includes a small bug fix for setting _current_task when executing Actions individually.

@moosebuild
Contributor

Results of testing 6f28272 using moose_PR_pre_check recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11350

@moosebuild
Contributor

Results of testing 6f28272 using moose_PR_test recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11351

@moosebuild
Contributor

Results of testing 6f28272 using moose_PR_test_dbg recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11352

@@ -312,6 +309,9 @@ ActionWarehouse::executeAllActions()
void
ActionWarehouse::executeActionsWithAction(const std::string & task)
{
// Set the current task name
_current_task = task;
Contributor

This is a good change. Was it actually required to make all of this work?

Note: I'm not criticizing, just curious...

Member Author

Well, it was needed in order to clean up MooseApp.C, where I added the new "uniform_refine_mesh" task and had it fire an existing Action manually. There's no place in the code where we were hitting this potential bug yet, but now it's fixed.
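
To illustrate the failure mode the fix guards against, here is a minimal mock of the pattern (hypothetical classes, not MOOSE's actual MooseApp or ActionWarehouse code): when one task is fired by hand, an Action that branches on the current task name would otherwise see a stale value.

#include <iostream>
#include <string>

// Minimal mock, not MOOSE's classes: an "action" that branches on the task it
// was invoked for, and a "warehouse" that runs it for one task at a time.
struct MockAction
{
  const std::string & _current_task;

  void act()
  {
    if (_current_task == "uniform_refine_mesh")
      std::cout << "applying uniform refinement during mesh setup\n";
  }
};

struct MockActionWarehouse
{
  std::string _current_task;
  MockAction _action{_current_task};

  void executeActionsWithAction(const std::string & task)
  {
    // The fix in this PR is the equivalent of this line: without it, an
    // individually executed Action would see a stale (or empty) task name.
    _current_task = task;
    _action.act();
  }
};

int main()
{
  MockActionWarehouse warehouse;
  warehouse.executeActionsWithAction("uniform_refine_mesh");
  return 0;
}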

Contributor

Cool: that was definitely a bug waiting to happen

@permcody
Member Author

permcody commented Feb 4, 2015

Well I'm currently stumped. This PR simply does not work with the stateful adaptivity system in parallel and I don't fully understand why. When running on multiple processors I receive this assert after the first time step. Yes, the simulation starts and runs just fine through the first time step and dies during adaptivity.

Assertion `!node_touched_by_anyone[nodeid] || node_touched_by_me[nodeid]' failed.
[0] src/mesh/mesh_tools.C, line 1337, compiled nodate at notime

Is there a problem with doing the refinement up front BEFORE adding the equation systems? Is there information being lost by adding the equation system to a refined mesh that already has active children?

Stack traces from the two processes:

Stack frames: 16
0: 0   libmesh_dbg.0.dylib                 0x0000000111c07b32 libMesh::print_trace(std::__1::basic_ostream<char, std::__1::char_traits<char> >&) + 82
1: 1   libmesh_dbg.0.dylib                 0x0000000111c0a130 libMesh::write_traceout() + 2656
2: 2   libmesh_dbg.0.dylib                 0x0000000111c01051 libMesh::MacroFunctions::report_error(char const*, int, char const*, char const*) + 689
3: 3   libmesh_dbg.0.dylib                 0x00000001122bb3b3 void libMesh::MeshTools::libmesh_assert_valid_procids<libMesh::Node>(libMesh::MeshBase const&) + 5123
4: 4   libmesh_dbg.0.dylib                 0x0000000111abf3e0 libMesh::DofMap::distribute_local_dofs_var_major(unsigned int&, libMesh::MeshBase&) + 2544
5: 5   libmesh_dbg.0.dylib                 0x0000000111abbdc4 libMesh::DofMap::distribute_dofs(libMesh::MeshBase&) + 2644
6: 6   libmesh_dbg.0.dylib                 0x00000001126f895d libMesh::EquationSystems::reinit() + 3485
7: 7   libmoose-dbg.0.dylib                0x0000000110400ac4 FEProblem::meshChanged() + 196
8: 8   libmoose-dbg.0.dylib                0x00000001104009c4 FEProblem::adaptMesh() + 196
9: 9   libmoose-dbg.0.dylib                0x00000001107b7b70 Transient::incrementStepOrReject() + 96
10: 10  libmoose-dbg.0.dylib                0x00000001107b7a1d Transient::execute() + 125
11: 11  libmoose-dbg.0.dylib                0x0000000110607dce MooseApp::executeExecutioner() + 270
12: 12  libmoose-dbg.0.dylib                0x000000011060951c MooseApp::run() + 60
13: 13  moose_test-dbg                      0x000000010faab344 main + 292
14: 14  libdyld.dylib                       0x00007fff90f925c9 start + 1
15: 15  ???                                 0x0000000000000003 0x0 + 3
Stack frames: 18
0: 0   libmesh_dbg.0.dylib                 0x0000000111c07b32 libMesh::print_trace(std::__1::basic_ostream<char, std::__1::char_traits<char> >&) + 82
1: 1   libmesh_dbg.0.dylib                 0x0000000111c0a130 libMesh::write_traceout() + 2656
2: 2   libmesh_dbg.0.dylib                 0x0000000111bb9ced libMesh::libmesh_terminate_handler() + 13
3: 3   libc++abi.dylib                     0x000000011728c628 _ZSt11__terminatePFvvE + 8
4: 4   libc++abi.dylib                     0x000000011728bc5b __cxa_throw + 171
5: 5   libmesh_dbg.0.dylib                 0x00000001122bb3fa void libMesh::MeshTools::libmesh_assert_valid_procids<libMesh::Node>(libMesh::MeshBase const&) + 5194
6: 6   libmesh_dbg.0.dylib                 0x0000000111abf3e0 libMesh::DofMap::distribute_local_dofs_var_major(unsigned int&, libMesh::MeshBase&) + 2544
7: 7   libmesh_dbg.0.dylib                 0x0000000111abbdc4 libMesh::DofMap::distribute_dofs(libMesh::MeshBase&) + 2644
8: 8   libmesh_dbg.0.dylib                 0x00000001126f895d libMesh::EquationSystems::reinit() + 3485
9: 9   libmoose-dbg.0.dylib                0x0000000110400ac4 FEProblem::meshChanged() + 196
10: 10  libmoose-dbg.0.dylib                0x00000001104009c4 FEProblem::adaptMesh() + 196
11: 11  libmoose-dbg.0.dylib                0x00000001107b7b70 Transient::incrementStepOrReject() + 96
12: 12  libmoose-dbg.0.dylib                0x00000001107b7a1d Transient::execute() + 125
13: 13  libmoose-dbg.0.dylib                0x0000000110607dce MooseApp::executeExecutioner() + 270
14: 14  libmoose-dbg.0.dylib                0x000000011060951c MooseApp::run() + 60
15: 15  moose_test-dbg                      0x000000010faab344 main + 292
16: 16  libdyld.dylib                       0x00007fff90f925c9 start + 1
17: 17  ???                                 0x0000000000000003 0x0 + 3

@friedmud, @roystgnr - Can you think of any call I'm missing?

@permcody
Member Author

permcody commented Feb 4, 2015

Actually, even that doesn't make sense. This PR passes the other 1000+ tests, many of which have uniform refinement turned on. Maybe there's a problem with the way we build our Refinement and Coarsening maps that's confusing the EquationSystems object. I don't think we trigger that code AND the early refinement in the same test. Hmm...

@YaqiWang
Contributor

YaqiWang commented Feb 4, 2015

When is the mesh partitioned?

@roystgnr
Contributor

roystgnr commented Feb 4, 2015

[0] src/mesh/mesh_tools.C, line 1337, compiled nodate at notime

Now I want to put an assert in at line 80085 somewhere.

Assertion `!node_touched_by_anyone[nodeid] || node_touched_by_me[nodeid]' failed.

What this means is that the mesh has a node in it which the current processor believes it owns, which is connected to an element owned by another processor, and which is not connected to an element owned by this processor.

The problem should be independent of EquationSystems. I'm not sure what could lead to this situation, though.
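
In pseudocode terms, the check that is tripping is roughly the following (a paraphrase built from the description above, not the actual mesh_tools.C implementation):

#include "libmesh/elem.h"
#include "libmesh/libmesh_common.h"
#include "libmesh/mesh_base.h"
#include "libmesh/node.h"

#include <vector>

// Rough paraphrase: if any element at all touches a node this processor owns,
// then at least one element owned by this processor must touch it too.
void assert_owned_nodes_touched_locally(const libMesh::MeshBase & mesh)
{
  std::vector<bool> node_touched_by_anyone(mesh.max_node_id(), false);
  std::vector<bool> node_touched_by_me(mesh.max_node_id(), false);

  for (const auto & elem : mesh.active_element_ptr_range())
    for (unsigned int n = 0; n < elem->n_nodes(); ++n)
    {
      const libMesh::dof_id_type nodeid = elem->node_id(n);
      node_touched_by_anyone[nodeid] = true;
      if (elem->processor_id() == mesh.processor_id())
        node_touched_by_me[nodeid] = true;
    }

  // The assertion that is failing: a node we own that is touched by anyone
  // must also be touched by one of our own elements.
  for (const auto & node : mesh.node_ptr_range())
    if (node->processor_id() == mesh.processor_id())
      libmesh_assert(!node_touched_by_anyone[node->id()] || node_touched_by_me[node->id()]);
}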

@permcody
Member Author

permcody commented Feb 4, 2015

How about doing several levels of refinement at once? I removed this logic since the equation systems don't exist at the point where I'm doing the uniform refinements.
https://github.com/idaholab/moose/blob/devel/framework/src/base/Adaptivity.C#L217

If that isn't it, I'll dig into https://github.com/idaholab/moose/blob/devel/framework/src/mesh/MooseMesh.C#L1378
Derek builds several independent meshes for each element type in the simulation to create "closest qp maps" for copying stateful materials when necessary. This is the only other thing that I can think of that might be screwing libMesh up.

@permcody
Member Author

permcody commented Feb 4, 2015

OK - just to recheck for my own sanity: I changed the logic temporarily to force the uniform refinement in the original location (i.e. in FEProblem::initialSetup()), which is way late and after all systems have been set up. The test ran just fine without any assertions or other errors. Simply moving the refinement to an earlier spot is causing this error.

@roystgnr
Contributor

roystgnr commented Feb 4, 2015

I believe doing several levels of refinement at once should be supported if and only if there are no solutions being projected from the old to the new mesh.

Typically we do have solutions to be projected, so even when we have multiple refinement levels to do (e.g. when performing a fine restart from a very coarse solution) we typically do the refinement one level at a time. But "start with a coarse mesh, then do uniformly_refine(N>1), then build systems on it and do further AMR" is such a common use case that I'd be astonished if it had regressed.
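
Roughly, the one-level-at-a-time pattern looks like this against the public libMesh API (a sketch for illustration, not MOOSE's Adaptivity code):

#include "libmesh/equation_systems.h"
#include "libmesh/mesh_base.h"
#include "libmesh/mesh_refinement.h"

// Refine one level per pass so EquationSystems can project the existing
// solutions from each old mesh onto the next refined one.
void uniformly_refine_with_projection(libMesh::EquationSystems & es,
                                      unsigned int n_levels)
{
  libMesh::MeshRefinement refinement(es.get_mesh());
  for (unsigned int level = 0; level < n_levels; ++level)
  {
    refinement.uniformly_refine(1);
    es.reinit();   // projects the old solutions onto the refined mesh
  }
}

// Before any systems exist there is nothing to project, so all levels can be
// applied in a single call: libMesh::MeshRefinement(mesh).uniformly_refine(n_levels);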

Unless the independent meshes are sharing nodes (which would have been too huge a bug to have gone uncaught to begin with), using multiple meshes shouldn't create any confusion about which processors own what.

How do I replicate the current failure case? ./run_tests --something_or_other?

How small (n_elem) a failure case can you boil this down to?

@permcody
Member Author

permcody commented Feb 4, 2015

Thanks - I don't think there's any issue with several levels of refinement or anything else you proposed. The MOOSE and application test suites run fine with this change, including in parallel. The issue with this one test is something much more stupid 😄 Hopefully @friedmud has an idea to shed some light on the situation.

@friedmud
Contributor

friedmud commented Feb 4, 2015

I'm stumped too. I mean - what does this have to do with stateful material properties?!? Or can you get this to happen without that now?

@permcody
Member Author

permcody commented Feb 4, 2015

I believe you need stateful materials + adaptivity + uniform refinement to trigger this problem.


@friedmud
Contributor

friedmud commented Feb 5, 2015

Looking at this now

@permcody
Member Author

permcody commented Feb 5, 2015

Cool, if you need something let me know. I've been doing a lot of this work on one of the hpcbuild boxes. I just used X-forwarding and -start_in_debugger to trip the error. It's 100% reproducible on more than one processor.

One more thing. Just for fun I commented out the buildRefinementandCoarsening() maps call in FEProblem::initialSetup(). I STILL hit this error after the first timestep but before it would need those maps to do the material property copies.

@friedmud
Contributor

friedmud commented Feb 5, 2015

Ok - thanks for the heads up on that. I'm going to go hardcore... commenting out huge chunks of the system until I figure out where shit is getting screwed up...


@permcody
Member Author

permcody commented Feb 5, 2015

Other things I tried:

  • Did one level of refinement at a time, even before the equation systems exist, just to make sure doing all the levels at once wasn't the source of the problem - NO CHANGE
  • Put an "if (false)" in the first refinement step, SetupMeshChanged::act(), to disable early refinement, and a corresponding "if (true)" in FEProblem::initialSetup() to restore the original ordering while leaving the rest of my PR in place. That indeed fixes it.

@permcody
Member Author

permcody commented Feb 5, 2015

@roystgnr and @jwpeterson - @friedmud discovered that the issue had to do with our use of MeshBase::skip_partitioning(true), so I started messing around with that. We have partitioning turned off because of #4532. It turns out that we are apparently calling that method "too late", but we have always called it rather late.

I hardcoded the skip partitioning call right after the uniform refinement calls and this test case ran. The problem is that we only want to disable partitioning for the very specific case spelled out in #4532. Now that we are doing these refinements first, I don't have all of the information available to make that decision, so I'm trying to understand exactly why this doesn't work in order to figure out the right place for the call.

Current MOOSE workflow looks like this (simplified):

  1. build or read mesh
  2. prepare
  3. modify mesh (optional)
  4. prepare again (if mesh was modified)
  5. setup the rest of MOOSE
  6. skip partitioning if applicable (Adaptivity + Stateful Materials + Repartitioning #4532)
  7. start running simulation
  8. build coarsening and refinement maps using mesh and mesh refinement utilities
  9. uniform refine

This PR moves step 9 to step 3.5, which breaks due to the skip partitioning in 6. I already mentioned that if I hardcode 6 at 3.6 it fixes this case. What I don't understand is why this is an issue. Clearly we are getting away with preparing the mesh in 2 and 4 without the skip partitioning flag now.

In a different attempt to fix the issue, I added yet another mesh.prepare() call at 6.1, which did not change anything. I'm failing to understand why the partitioning appears to get hosed when moving the refinement around, even after calling prepare yet again. Any insight would be helpful.

@permcody
Member Author

permcody commented Feb 5, 2015

Wait! Even more interesting: I successfully put step 6 at 4.1 and it worked. That leaves only step 5 in the middle, so there's something else going on, possibly with setting up the equation systems. Still looking...

@jwpeterson
Member


I'm not sure I understood all that, but what I think you are saying is that uniform refining before preparing the mesh (while skipping partitioning) breaks stateful material properties in MOOSE, while uniform refining later does not?

@permcody
Member Author

permcody commented Feb 5, 2015

Don't look at it as "breaks stateful materials"; instead, I'm theorizing that it's actually breakage due to the specific ordering of libMesh calls. I'm getting a libMesh assertion (which I posted way up above) before it gets to any of the material prolongation/restriction logic.

@friedmud
Contributor

friedmud commented Feb 5, 2015

Yes - this isn't actually a "stateful material properties" problem at all. It's just that the codepath for stateful material properties is calling skip_partitioning(). This problem happens WITHOUT stateful material properties if you call skip_partitioning()...

Like Cody says: there is some order dependence to uniform refine, eq::setup(), and skip_partitioning().

I suspect that basically no one else in the world ever uses skip_partitioning() and some subtle side effect has crept in...


@permcody
Member Author

permcody commented Feb 5, 2015

Whoa - I lied. It turns out that, due to multiple registration, my disabling of the partitioning was happening earlier than I thought.

I have found that I have to disable partitioning before uniform refinement. The old workflow did that, since the uniform refinement happened very late, and it appears to still be a requirement. I don't have an easy solution to this problem unless I do something really hacky like disabling it and then turning it back on later (yuck). While I can snoop most pieces of information earlier if need be, we don't have a good way of snooping information about stateful material properties, since they are added programmatically when Materials are added to the system.

@permcody
Member Author

permcody commented Feb 9, 2015

I modified libMesh adaptivity ex2 to skip partitioning and do the refinements before building the equation system, and it ran just fine on multiple processors, so there's still something in MOOSE causing this issue, not libMesh.
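
The modification boiled down to this ordering (a sketch in the spirit of the modified example, not the actual adaptivity_ex2 source): skip partitioning, refine, and only then build the systems.

#include "libmesh/enum_order.h"
#include "libmesh/equation_systems.h"
#include "libmesh/libmesh.h"
#include "libmesh/linear_implicit_system.h"
#include "libmesh/mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/mesh_refinement.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init(argc, argv);

  Mesh mesh(init.comm());
  MeshTools::Generation::build_square(mesh, 4, 4);

  // The ordering under test: turn off repartitioning and apply the uniform
  // refinements before the EquationSystems object even exists.
  mesh.skip_partitioning(true);
  MeshRefinement refinement(mesh);
  refinement.uniformly_refine(2);

  // Only now build a system on the already-refined mesh, as the example
  // normally does before any refinement.
  EquationSystems equation_systems(mesh);
  LinearImplicitSystem & system =
    equation_systems.add_system<LinearImplicitSystem>("System");
  system.add_variable("u", FIRST);
  equation_systems.init();

  return 0;
}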

@friedmud
Contributor

friedmud commented Feb 9, 2015

I don't know - I wasn't able to trigger it in other MOOSE tests either... I suspect that this is a real corner case. Can you try to make that libMesh example as much like the MOOSE test as possible? Make it start with the same mesh and do the same operations to it?

@permcody
Member Author

permcody commented Feb 9, 2015

I got it! Takes four processors in the libMesh example. It won't fail on 2 or 3...

@moosebuild
Contributor

Results of testing e099710 using moose_PR_pre_check recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11870

@moosebuild
Contributor

Results of testing e099710 using moose_PR_test recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11871

@moosebuild
Contributor

Results of testing e099710 using moose_PR_app_tests recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11873

@moosebuild
Contributor

Results of testing 0bfe906 using moose_PR_pre_check recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11885

@moosebuild
Contributor

Results of testing 0bfe906 using moose_PR_app_tests recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11888

@moosebuild
Contributor

Results of testing 0bfe906 using moose_PR_test recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11886

@moosebuild
Contributor

Results of testing e099710 using moose_PR_test_dbg recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11872

@moosebuild
Contributor

Results of testing 0bfe906 using moose_PR_test_dbg recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11887

@moosebuild
Contributor

Results of testing 4a9c2e7 using moose_PR_pre_check recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11904

@moosebuild
Contributor

Results of testing 4a9c2e7 using moose_PR_app_tests recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11907

@moosebuild
Contributor

Results of testing 4a9c2e7 using moose_PR_test recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11905

@moosebuild
Contributor

Results of testing 7cc9faa using moose_PR_pre_check recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11910

@moosebuild
Contributor

Results of testing 7cc9faa using moose_PR_test recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11911

@moosebuild
Contributor

Results of testing 7cc9faa using moose_PR_app_tests recipe:

Failed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11913

@moosebuild
Contributor

Results of testing 4a9c2e7 using moose_PR_test_dbg recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11906

aeslaughter added a commit that referenced this pull request Feb 18, 2015
Integration branch (Uniform refine earlier)
@aeslaughter merged commit e2ca5d3 into devel Feb 18, 2015
@aeslaughter deleted the integration branch February 18, 2015 22:55
@moosebuild
Contributor

Results of testing 7cc9faa using moose_PR_test_dbg recipe:

Passed on: linux-gnu

View the results here: https://www.moosebuild.com/view_job/11912

@permcody mentioned this pull request Sep 15, 2015
@YaqiWang mentioned this pull request Aug 3, 2017