Carbon macro atom models are failing #425
The model .pf files that caused this problem have to be updated to reflect the fact that pairwise_pow has been removed as an ionization mode. After I made these modifications, I still see problems in multiprocessor mode, but it seems to be running fine in single processor mode. The failure mode was not quite the same: for the 34 model there was a segmentation fault in multiprocessor mode, while for the 345 model I got a more generic mpi failure message.
I tried running in multiprocessor mode with a smaller wind (30 x 30 instead of 100 x 100). This time, in multiprocessor mode I still get a segmentation fault for the 34 model. The last lines of the output file before the segfault look like this:
On my Mac, in single processor mode, short_34.pf fails with
and short_345 fails with
This is an update on where we are with this problem. The short runs of the agn model appear to have been fixed, in the sense that the program no longer falls over, by a change that @jhmatthews made to macro atoms, which he will explain later. Errors associated with this macro atom problem are still being reported, and James is investigating them. At least two other problems have been seen: For long runs, photon_checks is seeing too many photons outside the frequency boundaries. This is most likely due to the effects of Doppler shifts on the original distribution, as is discussed in #434. For now, ksl has modified dev so it does not fall over at this point. A second problem in long runs is that we are getting an error in some multiprocessor runs of (Additionally, in multiprocessor mode, the program is often failing with either a segmentation fault in one of the channels or a general mpi failure, suggesting some kind of problem with mpi)
@jhmatthews @ssim So, on linux now: the only difference that I know of between this .pf file and the one that has historically been run is that we are now using ionization mode 9 instead of 7 (which Nick removed). We clearly need a better way of diagnosing these types of problems, e.g. to dump the windsave file and all of the information about the photon at the point where the problem occurred.
Do you know if this fails in multiprocessor mode with just one process? It might help narrow down whether something weird is happening when MPI is turned on.
Ed, I have just posted a lot more information about this on the github issue page.
There is one group of failures that appears in single processor mode. These often seem to be “unusual” events during the photon cycles that occur well into the calculation. Unfortunately, we do not write out much about what is actually happening then; we simply exit the code.
In multiprocessor mode, there appear to be two types of failures. One arises when one of the “unusual” events happens and that thread is exited. In these cases, mpi tells you which thread crashed and you can see the error that caused it to exit in the diag file. Sometimes, however, mpi just goes away and gives a generic error message. In these cases, which typically do not occur during the photon flight, there may be an mpi problem.
Separately, you will see a note on github about memory leaks. I have been trying to use valgrind to look for these, and there is some indication that they do exist.
Cheers,
Knox
Here is a complete output from Valgrind for the following .pf file, which ran to completion with the following errors: The valgrind result is I'm still not sure how to interpret this, but if you look, say, at the first set of results, it looks like one is trying to change a value, at line 762 of recomb.c, in something that was allocated by calloc_dyn_plasma at line 648 of grid_wind. Later on, it looks like one is trying to use an uninitialized value of something in macro_pops at line 493. At the end, there is a leak summary.
According to the Valgrind documentation, any problem that results in "definitely lost" memory should be fixed.
So I'm struggling to keep up a bit here. As far as I can tell we have a number of different issues with likely different causes and it might be easier to debug if we document or investigate these one by one, possibly in separate threads. Here are some observations, separated out by issue.
@jhmatthews Basically, I agree here, but I think we need to figure out a path forward where we can work on the various issues at once, since they all contribute to the problem of getting the C34 and C345 models to work. There are now two types of problems, or at least two basic ways the problems are manifested:
It is clear that the rare problems are causing problems with multi-processing. This is because we have adopted a more "rigorous" approach when the macro-atom problems are discovered than we have elsewhere in the program; that is, in most cases we exit the program. Our philosophy elsewhere has been to try to get the program to "soldier" on. (Note that my comment here is not exactly fair; there are other places where we would exit if we got an error, but generally when we have encountered these in the past, we either fixed the error or figured out a way to get the program to continue.) An important point here is that the macro atom issues we are encountering are rare. Unless there is a deeper underlying problem, which there could be, they are just like any of the other errors we have bypassed in the past. Therefore, unless just looking at the code will permit a solution, I think we need a way to issue an error message and have the program move on. We should print out as much as we can about the situation at the time of the error, but we should "soldier" on. At the point that we can get Python not to crash, I agree with James that a separate git issue is warranted, since as he points out these errors may have different causes and solutions, and this thread will be impossible to dissect at a later date.
So after making a small mod so that negative jbar would not cause the program to crash, I ran short_345 and short_34. When compiled with mpicc, short_34 had a segmentation fault in single processor mode; when compiled with gcc it seems to be running past where the segmentation fault occurred. short_345, compiled with mpicc, has ultimately failed with the following at the end of the log:
I have been checking what happens with short_34 and short_345 with the new scheme for handling simple atoms in macro atom mode turned off. short_345 seems to run fine, with few errors. short_34 still has the odd problem where it produces a segfault when compiled with mpicc and run with a single processor, but not when compiled and run with gcc.
After the changes @jhmatthews has made for #436, short_345 appears to run using mpi, but short_34 is still having segmentation faults. James fixed one of these, but we are still seeing them at another point in the program, which I have finally located as occurring in init_freebound. I doubt this has anything to do with the jbar problem. I am suspicious, @ssim, that it has something to do with the atomic data file, something that get_atomic_data is not catching. The strange thing about this segmentation fault is that it does not always appear, depending on how exactly the code is compiled. In particular, it always occurs with a standard compilation, but it does not happen in single processor mode, and it does not seem to happen when compiled in multiprocessor mode with debugging turned on (which changes some of the compilation defaults).
Progress, I think. @jhmatthews @ssim I believe I understand now why the segfault occurred sometimes with short_34.pf and sometimes not. It all depends on whether the -O3 switch is included. Our standard has been to not include this when the D option is used. When I add the -O3 switch to the debug options in the Makefile, then I get the segfault on this file: The segfault on linux occurs in single and multiprocessor mode when compiled with Given that it occurs in multiprocessor mode, I have been able to hone in on the error. It was not quite where I thought. Instead it was in these lines
The actual error is coming from line 373, which is returning a very large negative value. This would be expected to cause a segfault. Note that with -O3, gdb does not work as well as without it, as some variables have been "optimized out". Consequently, I cannot look at a lot of the values in the routine. So one is going to need to be clever to actually figure out what the problem is, and it may be that just adding Log statements will cause the program to work again. I remain of the opinion that there is something not quite right about the C 34 data file, or something that get_atomic_data is not catching.
@ssim @jhmatthews I am betting you all are in a position to figure this out now. I added some print statements to the loop where the segfault was: and here is what got printed out:
I'm not sure if this is a lead or not, but I had a quick look at the data files. The CIIICIVCV elem_ions file has:

#IonV C 6 1 1 11.26000 1000 5 1s^22s^22p^2(3P_0)
#IonV C 6 2 2 24.38400 1000 5 1s^22s^22p(2P^o_{1/2})
IonM C 6 3 1 47.88800 1000 20 1s^22s^2(1S_0)
IonM C 6 4 2 64.49400 1000 20 1s^22s(2S_{1/2})
IonM C 6 5 1 392.09000 1000 20 1s^2(1S_0)
IonM C 6 6 2 489.99700 1000 1 1s(2S_{1/2})
#IonV C 6 7 2 1.0000e+20 0 0 Bare

while CIIICIV has:

#IonV C 6 1 1 11.26000 1000 5 1s^22s^22p^2(3P_0)
#IonV C 6 2 2 24.38400 1000 5 1s^22s^22p(2P^o_{1/2})
IonM C 6 3 1 47.88800 1000 10 1s^22s^2(1S_0)
IonM C 6 4 2 64.49400 1000 5 1s^22s(2S_{1/2})
IonM C 6 5 1 392.09000 1000 1 1s^2(1S_0)
#IonV C 6 6 2 489.99700 1000 0 1s(2S_{1/2})
#IonV C 6 7 2 1.0000e+20 0 0 Bare

i.e. nlte is set to 5 for CIV, which is the ion that causes the issue. I'm wondering if this is causing a problem? I don't know if this info is actually used, though. Knox, can you change the numbers and try again? I can't seem to reproduce the segfault.
Hi James,
That sounds like a good spot to me - could well be it!
Stuart
Yes, making this change does appear to eliminate the problem. So if that is really it, then we need to determine how/if python should have caught the issue and failed more gracefully.
This change causes get_atomic_data to ignore macro levels that are in excess of the number specified in the corresponding ion line.
As noted by @jhmatthews, the problem with the data file was that there were more macro-levels than were specified in the ion line. The code did not check for this. After a conversation with @ssim, I have made changes to get_atomic_data.c so that macro-levels in excess of the number given in the corresponding ion line result in an error message but are ignored. This follows our general philosophy of ignoring extra lines in the atomic data file. In this case, it means one can easily reduce the complexity of a model by reducing the number of levels (at least under the assumption that the first n levels are the most important). So, for example, the change from h20 to h10 could be carried out by changing a single entry in the ions data.
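For concreteness, here is a minimal self-contained sketch of the kind of guard described above. The field names (nlte, nlte_read) and the function are illustrative assumptions, not the actual identifiers in get_atomic_data.c:

#include <stdio.h>

/* Sketch: accept a new macro level for ion nion only if the ion line declared
   room for it; otherwise report an error and ignore the level, mirroring the
   behaviour described above.  Struct and field names are assumptions. */
struct ion_stub
{
  int nlte;                     /* number of macro levels declared on the ion line */
  int nlte_read;                /* number of macro levels actually read so far */
};

int
add_macro_level (struct ion_stub *ion, int nion)
{
  if (ion[nion].nlte_read >= ion[nion].nlte)
    {
      fprintf (stderr, "get_atomic_data (sketch): more macro levels than declared for ion %d; ignoring\n", nion);
      return 0;                 /* level ignored */
    }
  ion[nion].nlte_read++;        /* level accepted */
  return 1;
}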
So, I tried running the full simulations which Keara began with, on 8 processors each, on linux. fiducial_agn_34 ran to completion, but fiducial_agn_345 failed after about 8 ionization cycles on one of the threads, which caused the program to crash. The diagnostic messages presented before the crash were: Here is what that portion of the code says: Evidently d2, the "density" of the upper level here, is a lot bigger than d1, and since the g's of the upper and lower levels are both 3, this is going to cause xden_ion to be negative, which ultimately gets to the exit statement.
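To make the arithmetic in that last sentence concrete, here is a tiny self-contained illustration. The expression is only an assumed general form of such a correction, d1 - (g1/g2)*d2, not the actual formula in the code; the point is simply that with g1 = g2 = 3 and d2 much larger than d1, any expression of this shape goes negative:

#include <stdio.h>

/* Hypothetical illustration: with equal statistical weights, an upper-level
   "density" d2 much larger than d1 drives a correction of the assumed form
   d1 - (g1/g2)*d2 negative, which is what sends the code to the exit path. */
int
main (void)
{
  double g1 = 3.0, g2 = 3.0;
  double d1 = 1.0e-12;          /* lower level "density" (illustrative value) */
  double d2 = 1.0e-6;           /* upper level "density", much larger */
  double xden_ion = d1 - (g1 / g2) * d2;

  printf ("xden_ion = %e (negative)\n", xden_ion);
  return 0;
}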
I'm not 100% sure if this is a robust fix, but Knox, can you try this with the code alteration in the PR I just made: #438. If this helps, we could accept the PR as a step forward. If it doesn't help, then we'll close it.
I have started the long runs of fiducial_agn_c34 and fiducial_agn_c345 for this. So far there are no problems, but I will need to let them run.
Unfortunately, @ssim @jhmatthews, this did not solve the problem. Although fiducial_agn_c34 is still running, fiducial_agn_345 failed in one of the 8 threads I ran at the end of cycle 19 of 30. The messages at the end of the diag file for that thread are shown below: Note that I have not yet merged the changes Stuart made in macro_pops into the dev branch; I just ran his updated version. So if that needs to be incorporated, then one of us needs to merge it.
Hmmmm - this seems impossible to me - there must be a logical flaw in the code I’ve written here, but I can’t see it...
I also don't understand this. I'll try and work through the code a bit before the telecon.
@jhmatthews Could it be that we need a fabs() around somewhere? I.e., can the "levden_temp" variable we use for sanity checking before writing the populations get messed up if "this_ion_density" and "populations[]" are both negative? Maybe we should do the "sane check" directly on "populations" without the division into a fractional population?
Hi Stuart. Finally had a look at this. I think you are right in that the sane check should really be done on the absolute populations rather than the fractional ones, but I'm not sure this can explain the error, since we do also check for negative "this_ion_density", so that situation would still lead to "insane", and we know the error still happens even with your fix in #438. Let's discuss at Friday's telecon, but I'm basically totally stuck on how to proceed here. One thing is that before I come to Belfast, I think we need to be able to reproduce this problem without running on 30 cores overnight, say by restarting from a wind save file.
My above comment is incorrect, because the ordering of the checks wasn't quite what I had thought. Instead, we should check all populations and densities for sanity before we copy anything over, and we should do the population checks on absolute rather than fractional pops. I've pushed what I think is the correct change here. It would be good if @ssim could review my change and @kslong could run the model that produces the error with the latest commit from #438.
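A minimal sketch of the ordering described here (validate everything, then copy), with illustrative names; the isfinite tests simply stand in for whatever sanity checks the real macro_pops code uses:

#include <math.h>

/* Sketch: check the absolute populations and the ion density first, and only
   copy into levden if the whole solution is sane; otherwise keep the old
   values.  Array names and the levden destination are illustrative. */
int
copy_populations_if_sane (double *levden, const double *populations,
                          double this_ion_density, int n_levels)
{
  int i;

  if (!isfinite (this_ion_density) || this_ion_density <= 0.0)
    return 0;                   /* insane: keep the old values */

  for (i = 0; i < n_levels; i++)
    if (!isfinite (populations[i]) || populations[i] < 0.0)
      return 0;                 /* insane: keep the old values */

  for (i = 0; i < n_levels; i++)        /* everything checked out: copy */
    levden[i] = populations[i] / this_ion_density;

  return 1;
}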
With the latest fix in pull #438 we are still getting errors and perhaps different ones. Now we are getting
followed slightly later by
Having looked at the first error in the source code, the quantity that is equal to 65.9711 is the optical depth, which for a filling factor of one is calculated from this formula: kap_bf[nn] * weight_of_packet; but there are also some multiplications by variables and divisions by the same variable (because we record a different quantity in the estimators). It's not clear to me if this is definitely wrong, but it's suspicious. The second error is obviously more serious. How could lots of jumps occur?
Would be good to know more about the transitions/ions in question. Mayyyybe worth running the models with the latest fix but with BF_SIMPLE_EMISSIVITY_APPROACH set to 0 in python.h, to ensure that none of the recent bound-free changes are messing anything up here. However, this is unlikely, since we are in macro-atom rather than simple machinery. Confusing.
Having delved into this further I've verified that nans are still being copied into the levden array on occasion in macro_pops, which implies that my proposed catch is still not working and there must be a basic error in my coding.
Ok, so I think this is happening because the rate to the C5 ion is zero, which means the total ion density is zero. Because we store things as a fractional population, we do 0.0/0.0, which is nan. I will fix this by catching the case where total_ion_density is zero. Note that we normally have a DENSITY_MIN for other non-macro ions, which I can use if we prefer. It's not clear if this will fix Knox's problem on more threads, but it explains what I am seeing.
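A self-contained illustration of the failure mode and the proposed catch (the constant and names here are placeholders; the real code may handle the zero-density case differently):

#include <stdio.h>

/* If every rate into an ion is zero, the total ion density is zero and the
   fractional population 0.0/0.0 evaluates to nan, which then propagates into
   levden.  The guard mirrors the fix described above: catch the zero-density
   case before dividing.  Names and the constant are illustrative. */
#define DENSITY_MIN 1e-20

int
main (void)
{
  double populations[3] = { 0.0, 0.0, 0.0 };    /* nothing in C V, say */
  double total = populations[0] + populations[1] + populations[2];
  int i;

  for (i = 0; i < 3; i++)
    {
      double frac;
      if (total <= 0.0)
        frac = DENSITY_MIN;     /* guarded: small floor instead of 0.0/0.0 */
      else
        frac = populations[i] / total;
      printf ("level %d fractional population %g\n", i, frac);
    }
  return 0;
}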
The issue in my comment above got fixed in pull #438 - but there are some comments on that pull request about this error still persisting:
this is happening for a fairly convoluted sequence of reasons as follows:
The actual jump that causes the nan is CIV ground state to CV ground state. We should check that it is reasonable that there is nothing in CIV or CV (I have checked that the gamma for that transition is zero). To fix this I have added a density floor in levden (agnwinds/python@0a05836), which ensures that you don't have 0s in the denominators of density ratios. This runs much better (currently on cycle 16 of fiducial_agn_c34.pf). The code is still spitting out a bunch of errors in estimators.c due to stimulated correction factors being out of bounds. It's not surprising you get the error, since if all fractional populations in an ion are set to 1e-20 then the code thinks that is a population inversion - in fact it also multiplies 1e-20 in levden by 1e-20 in density[nion]. So I think we just have a couple of decisions to make about how we flag ions where this is happening and how we ensure that errors are suppressed where they should be, but that we still catch genuine issues. It's not clear to me exactly what the best decision is.
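A sketch of what a density floor in levden might look like; the names and the 1e-20 value follow the discussion above, but this is not the actual diff in agnwinds/python@0a05836:

/* Sketch: after the level populations have been computed, clamp any zero (or
   tiny) fractional population to a small positive value so that later density
   ratios never divide by zero. */
#define LEVDEN_FLOOR 1e-20

void
apply_levden_floor (double *levden, int nlevels)
{
  int n;
  for (n = 0; n < nlevels; n++)
    {
      if (levden[n] < LEVDEN_FLOOR)
        levden[n] = LEVDEN_FLOOR;
    }
}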
We are now able to run Keara's agn problems, having solved #196, and the results are plausible. Although we clearly expect to do more testing of these input files, most notably for CVs, I am going to declare victory here and close this issue. Changes associated with this issue have been merged into dev.
@jhmatthews @ssim Reviewing what is happening with my attempt to run a CV model with C345, it is clear that although the cause may not be the same, the error that is being produced is the same as we were seeing before. So rather than open a new issue about this, I've reopened the old issue. This takes overnight to run on 16 processors. It also seems to fail during the spectral cycles, but that is not associated with the 'no way out' error. I have also run the same model with C34 only, and this does not fail.
I've done a bit more looking at where the 'no way out' error was occurring. During the ionization cycles, it occurs randomly, once or twice an ionization cycle at most. I also said above that the C345 model fails in the spectral cycles. This is due to "Error: cdf_gen_from_array: all y's were zero or xmin xmax out of range of array x -- returning uniform distribution 0", which appears to be happening while "Calculating macro-atom and k-packet sensitivities - this might take a while..." is being reported. Finally, and this may or may not be helpful, I tried to restart the model and it failed almost immediately, with the same type of errors, but now there are more of the 'no way out' errors as well.
I'm not sure exactly where best to document this, but here is an issue I have uncovered which is related to #451 #486. The issue is as follows. When one reads in VFKY cross-sections in the C34 or C345 models, the VFKY data can incorrectly be used for the top ion. Here's the offending piece of code, around line 1400 of get_atomic_data.c, activated by the PhotVfkyS keyword:

/* Check that there is an ion which has the same ionization state as this record
   otherwise it must be a VFKY style record and so read with that format */
else if (strncmp (word, "PhotVfkyS", 8) == 0)
  {
    // It's a VFKY style photoionization record, beginning with the summary record
    sscanf (aline, "%*s %d %d %d %d %le %d\n", &z, &istate, &islp, &ilv, &exx, &np);
    for (n = 0; n < np; n++)
      {
        //Read the topbase photoionization records
        if (fgets (aline, LINELENGTH, fptr) == NULL)
          {
            Error ("Get_atomic_data: Problem reading Vfky photoionization record\n");
            Error ("Get_atomic_data: %s\n", aline);
            exit (0);
          }
        sscanf (aline, "%*s %le %le", &xe[n], &xx[n]);
        lineno++;
      }
    for (nion = 0; nion < nions; nion++)
      {
        if (ion[nion].z == z && ion[nion].istate == istate)
          {
            if (ion[nion].phot_info == -1)
              {
                [COPY TO PHOT_TOP STRUCTURE]

Let's take the C345 model as an example. In that case, the C6 ion is the top one and should act like the "continuum" - i.e. there's no way you can go above C6. But if you still provide a C6 xsection in the data, then it gets to this if statement and has no way of knowing not to read in the xsection. One can change the if statement to this:
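The actual replacement is not reproduced above, but given the note below that macro_info differed between the ion and phot structures, a plausible form might look like the following (a hypothetical sketch, not the committed fix):

/* Hypothetical sketch only: refuse to attach VFKY data to an ion flagged as a
   macro ion, so the top "continuum" ion of a macro-atom (e.g. C VI in the C345
   model) never picks up a cross-section.  The comment below also suggests
   checking against the highest ion of the element instead. */
if (ion[nion].z == z && ion[nion].istate == istate)
  {
    if (ion[nion].phot_info == -1 && ion[nion].macro_info == 0)
      {
        [COPY TO PHOT_TOP STRUCTURE]
      }
  }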
Note that this initial issue could have feasibly caused bad errors, because the macro_info was different in the ion and phot structures, and I could foresee weird numbers being stored in the wrong places. So we should rerun a C345 model once we fix this. The calc_pi_rate error is caused by trying to calculate the rate matrix for Carbon even though it's a macro-atom. It never actually stores these populations, but it does calculate them. It still wants to calculate a photoionization rate for this ion because it only stops calculating them when Z = istate - 1 (see ...). How to fix this... I guess we should probably mask macro ions out of the rate matrix? We need to ensure the code is smart enough not to try and calculate CVII in that instance then. We should probably also change the if statement so it checks against the highest ion rather than the actual charge, to be consistent with the elem_ions files? @Higginbottom @ssim @kslong
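As a concrete sketch of the masking idea (illustrative names only; whether this is actually the right approach is debated in the next comment):

/* Illustrative sketch of "mask macro ions out of the rate matrix": when
   assembling the simple-ion ionization rate matrix, skip any ion flagged as a
   macro ion, since its ionization fractions come from macro_pops instead.
   Flag and structure names are assumptions, not the actual code. */
for (nion = 0; nion < nions; nion++)
  {
    if (ion[nion].macro_info)
      continue;                 /* populations handled by the macro machinery */
    /* ... add this ion's photoionization and recombination rates to the matrix ... */
  }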
Hard for me to believe we can actually mask macro atoms out of the rate matrix calculation, especially H and He. How then do we calculate things? I do think some of what is going on points to a need to do some extreme vetting of the atomic data, perhaps with some kind of ancillary code.
Possibly, although of course the original error was actually being caught by my test routines. The rate matrix calculation is twofold. We have a rate matrix for the ionization state (if you use matrix ionization modes) for simple ions, and we have a rate matrix for all macro-ions which is for the level populations, including the ionization fractions. The two systems of equations are both solved inside the same loop, but the data copied to the macro-ion levels comes from the second calculation (macro_pops).
These seem to be working well now. There's a caveat that we should add collision strengths for Carbon, but this issue can be closed.
This is an issue that Keara discovered. While the CV models seem to be working with her C34 and C345 macro atom models, the AGN models do not. Her discussion is here
ksl confirms that some of these models are failing. In particular, running in multiprocessor mode, fiducial_agn_c34.pf had a segmentation fault. The same model running in single processor mode died after too many sane_checks, as did fiducial_agn_345.pf. It would seem as if the first step is to track down the cause of all of the sane_checks.