Carbon macro atom models are failing #425
The model .pf files that caused this problem have to be updated to reflect the fact that pairwise_pow has been removed as an ionization mode. After I made these modifications, I still see problems in multiprocessor mode, but it seems to be running fine in single processor mode. The failure mode was not quite the same: for the 34 model there was a segmentation fault in multiprocessor mode, while for the 345 model I got a more generic mpi failure message.
I tried running in multiprocessor mode with a smaller wind (30 x 30 instead of 100 x 100). This time, in multiprocessor mode I still get a segmentation fault for the 34 model. The last lines of the output file before the segfault look like this:
On my Mac, in single processor mode, short_34.pf fails with
and short_345 fails with
This is an update on where we are with this problem. The short runs of the agn model appear to have been fixed, in the sense that the program no longer falls over, by a change that @jhmatthews made to macro atoms, which he will explain later. Errors associated with this macro atom problem are still being reported, and James is investigating them. At least two other problems have been seen: For long runs, photon_checks is seeing too many photons outside the frequency boundaries. This is most likely due to the effects of Doppler shifts on the original distribution, as is discussed in #434. For now, ksl has modified dev so it does not fall over at this point. A second problem in long runs is that we are getting an error in some multiprocessor runs of (Additionally, in multiprocessor mode, the program is often failing with either a segmentation fault in one of the channels or a general mpi failure, suggesting some kind of problem with mpi)
@jhmatthews @ssim So, on linux now: the only difference that I know of between this .pf file and the one that has historically been run is that we are now using ionization mode 9 instead of 7 (which Nick removed). We clearly need a better way of diagnosing these types of problems, e.g. to dump the windsave file and all of the information about the photon at the point where the problem occurred.
Do you know if this fails in multiprocessor mode with just one process? It might help narrow down whether something weird is happening when MPI is turned on.
Ed, I have just posted a lot more information about this on the github issue page.
There is one group of failures that appears in single processor mode. These often seem to be “unusual” events during the photon cycles that occur well into the calculation. Unfortunately, we do not write out much about what is actually happening then; we simply exit the code.
In multiprocessor mode, there appear to be two types of failures. One arises when one of the “unusual” events happens and that thread is exited. In these cases, mpi tells you which thread crashed and you can see the error that caused it to exit in the diag file. Sometimes, however, mpi just goes away and gives a generic error message. In these cases, which typically do not occur during the photon flight, there may be an mpi problem.
Separately, you will see a note on github about memory leaks. I have been trying to use valgrind to look for these, and there is some indication that they do exist.
Cheers,
Knox
Here is a complete output from Valgrind for the following .pf file, which ran to completion with the following errors: The valgrind result is I'm still not sure how to interpret this, but if you look, say, at the first set of results, it looks like one is trying to change a value, at line 762 of recomb.c, in something that was allocated by calloc_dyn_plasma at line 648 of grid_wind. Later on, it looks like one is trying to use an uninitialized value of something in macro_pops at line 493. At the end, there is a leak summary.
According to the Valgrind documentation, any problem that results in "definitely lost" memory should be fixed.
So I'm struggling to keep up a bit here. As far as I can tell we have a number of different issues with likely different causes and it might be easier to debug if we document or investigate these one by one, possibly in separate threads. Here are some observations, separated out by issue.
@jhmatthews Basically, I agree here, but I think we need to figure out a path forward where we can work on the various issues at once, since they all contribute to the problem of getting the C34 and C345 models to work. There are now two types of problems, or at least two basic ways the problems are manifested:
It is clear that the rare problems are causing problems with multi-processing. This is because we have adopted a more "rigorous" approach when the macro-atom problems are discovered than we have elsewhere in the program; that is, in most cases we exit the program. Our philosophy elsewhere has been to try to get the program to "soldier" on. (Note that my comment here is not exactly fair; there are other places where we would exit if we got an error, but generally when we have encountered these in the past, we either fixed the error or figured out a way to get the program to continue.) An important point here is that the macro atom issues we are encountering are rare. Unless there is a deeper underlying problem, which there could be, they are just like any of the other errors we have bypassed in the past. Therefore, unless just looking at the code will permit a solution, I think we need a way to issue an error message and have the program move on. We should print out as much as we can about the situation at the time of the error, but we should "soldier" on. At the point that we can get Python not to crash, I agree with James that a separate git issue is warranted, since as he points out these errors may have different causes and solutions, and this thread will be impossible to dissect at a later date.
So after making a small mod so that negative jbar would not cause the program to crash, I ran short_345 and short_34. When compiled with mpicc, short_34 had a segmentation fault in single processor mode; when compiled with gcc it seems to be running past where the segmentation fault occurred. short_345, compiled with mpicc, has ultimately failed with the following at the end of the log:
I have been checking what happens with short_34 and short_345 with the new scheme for handling simple atoms in macro atom mode turned off. short_345 seems to run fine, with few errors. short_34 still has the odd problem where it produces a segfault when compiled with mpicc and run with a single processor, but not when compiled and run with gcc.
After the changes @jhmatthews has made for #436, short_345 appears to run using mpi, but short_34 is still having segmentation faults. James fixed one of these, but we are still seeing them at another point in the program, which I have finally located as occurring in init_freebound. I doubt this has anything to do with the jbar problem. I am suspicious, @ssim, that it has something to do with the atomic data file, something that get_atomic_data is not catching. The strange thing about this segmentation fault is that it does not always appear, depending on how exactly the code is compiled. In particular, it always occurs with a standard compilation, but it does not happen in single processor mode, and it does not seem to happen when compiled in multiprocessor mode with debugging turned on (which changes some of the compilation defaults).
Progress, I think. @jhmatthews @ssim I believe I understand now why the segfault occurred sometimes with short_34.pf and sometimes not. It all depends on whether the -O3 switch is included. Our standard has been to not include this when the D option is used. When I add the -O3 switch to the debug options in the Makefile, then I get the segfault on this file: The segfault on linux occurs in single and multiprocessor mode when compiled with Given that it occurs in multiprocessor mode, I have been able to hone in on the error. It was not quite where I thought. Instead it was in these lines
The actual error is coming from line 373, which is returning a very large negative value. This would be expected to cause a segfault. Note that with -O3, gdb does not work as well as without it, as some variables have been "optimized out". Consequently, I cannot look at a lot of the values in the routine. So one is going to need to be clever to actually figure out what the problem is, and it may be that just adding Log statements will cause the program to work again. I remain of the opinion that there is something not quite right about the C 34 data file, or something that get_atomic_data is not catching.
@ssim @jhmatthews I am betting you all are in a position to figure this out now. I added some print statements to the loop where the segfault was: and here is what got printed out:
I'm not sure if this is a lead or not, but I had a quick look at the data files. The CIIICIVCV elem_ions file has:

#IonV C 6 1 1 11.26000 1000 5 1s^22s^22p^2(3P_0)
#IonV C 6 2 2 24.38400 1000 5 1s^22s^22p(2P^o_{1/2})
IonM C 6 3 1 47.88800 1000 20 1s^22s^2(1S_0)
IonM C 6 4 2 64.49400 1000 20 1s^22s(2S_{1/2})
IonM C 6 5 1 392.09000 1000 20 1s^2(1S_0)
IonM C 6 6 2 489.99700 1000 1 1s(2S_{1/2})
#IonV C 6 7 2 1.0000e+20 0 0 Bare

while CIIICIV has:

#IonV C 6 1 1 11.26000 1000 5 1s^22s^22p^2(3P_0)
#IonV C 6 2 2 24.38400 1000 5 1s^22s^22p(2P^o_{1/2})
IonM C 6 3 1 47.88800 1000 10 1s^22s^2(1S_0)
IonM C 6 4 2 64.49400 1000 5 1s^22s(2S_{1/2})
IonM C 6 5 1 392.09000 1000 1 1s^2(1S_0)
#IonV C 6 6 2 489.99700 1000 0 1s(2S_{1/2})
#IonV C 6 7 2 1.0000e+20 0 0 Bare

i.e. nlte is set to 5 for CIV, which is the ion that causes the issue. I'm wondering if this is causing a problem? I don't know if this info is actually used, though. Knox, can you change the numbers and try again? I can't seem to reproduce the segfault.
Hi James,
That sounds like a good spot to me - could well be it!
Stuart
Yes, making this change does appear to eliminate the problem. So if that is really it, then we need to determine how/if python should have caught the issue and failed more gracefully.
This change causes get_atomic_data to ignore macro levels that are in excess of the number specified in the corresponding ion line.
As noted by @jhmatthews, the problem with the data file was that there were more macro-levels than were specified in the ion line. The code did not check for this. After a conversation with @ssim, I have made changes to get_atomic_data.c so that macro-levels in excess of the number given in the corresponding ion line result in an error message but are ignored. This follows our general philosophy of ignoring extra lines in the atomic data file. In this case, it means one can easily reduce the complexity of a model by reducing the number of levels (at least under the assumption that the first n levels are the most important). So, for example, the change from h20 to h10 could be carried out by changing a single entry in the ions data.
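For concreteness, here is a minimal self-contained sketch of the kind of guard described above. The field names (nlte, nlte_read) and the function are illustrative assumptions, not the actual identifiers in get_atomic_data.c:

#include <stdio.h>

/* Sketch: accept a new macro level for ion nion only if the ion line declared
   room for it; otherwise report an error and ignore the level, mirroring the
   behaviour described above.  Struct and field names are assumptions. */
struct ion_stub
{
  int nlte;                     /* number of macro levels declared on the ion line */
  int nlte_read;                /* number of macro levels actually read so far */
};

int
add_macro_level (struct ion_stub *ion, int nion)
{
  if (ion[nion].nlte_read >= ion[nion].nlte)
    {
      fprintf (stderr, "get_atomic_data (sketch): more macro levels than declared for ion %d; ignoring\n", nion);
      return 0;                 /* level ignored */
    }
  ion[nion].nlte_read++;        /* level accepted */
  return 1;
}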
So, I tried running the full simulations which Keara began with, on 8 processors each, on linux. fiducial_agn_34 ran to completion, but fiducial_agn_345 failed after about 8 ionization cycles on one of the threads, which caused the program to crash. The diagnostic messages presented before the crash were: Here is what that portion of the code says: Evidently d2, the "density" of the upper level here, is a lot bigger than d1, and since the g's of the upper and lower levels are both 3, this is going to cause xden_ion to be negative, which ultimately gets to the exit statement.
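To make the arithmetic in that last sentence concrete, here is a tiny self-contained illustration. The expression is only an assumed general form of such a correction, d1 - (g1/g2)*d2, not the actual formula in the code; the point is simply that with g1 = g2 = 3 and d2 much larger than d1, any expression of this shape goes negative:

#include <stdio.h>

/* Hypothetical illustration: with equal statistical weights, an upper-level
   "density" d2 much larger than d1 drives a correction of the assumed form
   d1 - (g1/g2)*d2 negative, which is what sends the code to the exit path. */
int
main (void)
{
  double g1 = 3.0, g2 = 3.0;
  double d1 = 1.0e-12;          /* lower level "density" (illustrative value) */
  double d2 = 1.0e-6;           /* upper level "density", much larger */
  double xden_ion = d1 - (g1 / g2) * d2;

  printf ("xden_ion = %e (negative)\n", xden_ion);
  return 0;
}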
I'm not 100% sure if this is a robust fix, but Knox, can you try this with the code alteration in the PR I just made: #438. If this helps, we could accept the PR as a step forward. If it doesn't help, then we'll close it.
I have started the long runs of fiducial_agn_c34 and fiducial_agn_c345 for this. So far there are no problems, but I will need to let them run.
Unfortunately, @ssim @jhmatthews, this did not solve the problem. Although fiducial_agn_c34 is still running, fiducial_agn_345 failed in one of the 8 threads I ran at the end of cycle 19 of 30. The messages at the end of the diag file for that thread are shown below: Note that I have not yet merged the changes Stuart made in macro_pops into the dev branch; I just ran his updated version. So if that needs to be incorporated, then one of us needs to merge it.
Hmmmm - this seems impossible to me - there must be a logical flaw in the code I’ve written here, but I can’t see it...
I also don't understand this. I'll try and work through the code a bit before the telecon.
@jhmatthews Could it be that we need a fabs() around somewhere? I.e., can the "levden_temp" variable we use for sanity checking before writing the populations get messed up if "this_ion_density" and "populations[]" are both negative? Maybe we should do the "sane check" directly on "populations" without the division into a fractional population?
Hi Stuart. Finally had a look at this. I think you are right in that the sane check should really be done on the absolute populations rather than the fractional ones, but I'm not sure this can explain the error, since we do also check for negative "this_ion_density", so that situation would still lead to "insane", and we know the error still happens even with your fix in #438. Let's discuss at Friday's telecon, but I'm basically totally stuck on how to proceed here. One thing is that before I come to Belfast, I think we need to be able to reproduce this problem without running on 30 cores overnight, say by restarting from a wind save file.
My above comment is incorrect, because the ordering of the checks wasn't quite what I had thought. Instead, we should check all populations and densities for sanity before we copy anything over, and we should do the population checks on absolute rather than fractional pops. I've pushed what I think is the correct change here. It would be good if @ssim could review my change and @kslong could run the model that produces the error with the latest commit from #438.
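A minimal sketch of the ordering described here (validate everything, then copy), with illustrative names; the isfinite tests simply stand in for whatever sanity checks the real macro_pops code uses:

#include <math.h>

/* Sketch: check the absolute populations and the ion density first, and only
   copy into levden if the whole solution is sane; otherwise keep the old
   values.  Array names and the levden destination are illustrative. */
int
copy_populations_if_sane (double *levden, const double *populations,
                          double this_ion_density, int n_levels)
{
  int i;

  if (!isfinite (this_ion_density) || this_ion_density <= 0.0)
    return 0;                   /* insane: keep the old values */

  for (i = 0; i < n_levels; i++)
    if (!isfinite (populations[i]) || populations[i] < 0.0)
      return 0;                 /* insane: keep the old values */

  for (i = 0; i < n_levels; i++)        /* everything checked out: copy */
    levden[i] = populations[i] / this_ion_density;

  return 1;
}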
With the latest fix in pull #438 we are still getting errors and perhaps different ones. Now we are getting
followed slightly later by
Having looked at the first error in the source code, the quantity that is equal to 65.9711 is the optical depth, which for a filling factor of one is calculated from this formula: kap_bf[nn] * weight_of_packet; but there are also some multiplications by variables and divisions by the same variable (because we record a different quantity in the estimators). It's not clear to me if this is definitely wrong, but it's suspicious. The second error is obviously more serious. How could lots of jumps occur?
Would be good to know more about the transitions/ions in question. Mayyyybe worth running the models with the latest fix but with BF_SIMPLE_EMISSIVITY_APPROACH set to 0 in python.h, to ensure that none of the recent bound-free changes are messing anything up here. However, this is unlikely, since we are in macro-atom rather than simple machinery. Confusing.
Having delved into this further I've verified that nans are still being copied into the levden array on occasion in macro_pops, which implies that my proposed catch is still not working and there must be a basic error in my coding.
Ok, so I think this is happening because the rate to the C5 ion is zero, which means the total ion density is zero. Because we store things as a fractional population, we do 0.0/0.0, which is nan. I will fix this by catching the case where total_ion_density is zero. Note that we normally have a DENSITY_MIN for other non-macro ions, which I can use if we prefer. It's not clear if this will fix Knox's problem on more threads, but it explains what I am seeing.
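A self-contained illustration of the failure mode and the proposed catch (the constant and names here are placeholders; the real code may handle the zero-density case differently):

#include <stdio.h>

/* If every rate into an ion is zero, the total ion density is zero and the
   fractional population 0.0/0.0 evaluates to nan, which then propagates into
   levden.  The guard mirrors the fix described above: catch the zero-density
   case before dividing.  Names and the constant are illustrative. */
#define DENSITY_MIN 1e-20

int
main (void)
{
  double populations[3] = { 0.0, 0.0, 0.0 };    /* nothing in C V, say */
  double total = populations[0] + populations[1] + populations[2];
  int i;

  for (i = 0; i < 3; i++)
    {
      double frac;
      if (total <= 0.0)
        frac = DENSITY_MIN;     /* guarded: small floor instead of 0.0/0.0 */
      else
        frac = populations[i] / total;
      printf ("level %d fractional population %g\n", i, frac);
    }
  return 0;
}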
The issue in my comment above got fixed in pull #438 - but there are some comments on that pull request about this error still persisting:
this is happening for a fairly convoluted sequence of reasons as follows:
The actual jump that causes the nan is CIV ground state to CV ground state. We should check that it is reasonable that there is nothing in CIV or CV (I have checked that the gamma for that transition is zero). To fix this I have added a density floor in levden (agnwinds/python@0a05836), which ensures that you don't have 0s in the denominators of density ratios. This runs much better (currently on cycle 16 of fiducial_agn_c34.pf). The code is still spitting out a bunch of errors in estimators.c due to stimulated correction factors being out of bounds. It's not surprising you get the error, since if all fractional populations in an ion are set to 1e-20 then the code thinks that is a population inversion - in fact it also multiplies 1e-20 in levden by 1e-20 in density[nion]. So I think we just have a couple of decisions to make about how we flag ions where this is happening and how we ensure that errors are suppressed where they should be, but that we still catch genuine issues. It's not clear to me exactly what the best decision is.
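A sketch of what a density floor in levden might look like; the names and the 1e-20 value follow the discussion above, but this is not the actual diff in agnwinds/python@0a05836:

/* Sketch: after the level populations have been computed, clamp any zero (or
   tiny) fractional population to a small positive value so that later density
   ratios never divide by zero. */
#define LEVDEN_FLOOR 1e-20

void
apply_levden_floor (double *levden, int nlevels)
{
  int n;
  for (n = 0; n < nlevels; n++)
    {
      if (levden[n] < LEVDEN_FLOOR)
        levden[n] = LEVDEN_FLOOR;
    }
}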
We are now able to run Keara's agn problems, having solved #196, and the results are plausible. Although we clearly expect to do more testing of these input files, most notably for CVs, I am going to declare victory here and close this issue. Changes associated with this issue have been merged into dev.
@jhmatthews @ssim Reviewing what is happening with my attempt to run a CV model with C345, it is clear that although the cause may not be the same, the error that is being produced is the same as we were seeing before. So rather than open a new issue about this, I've reopened the old issue. This takes overnight to run on 16 processors. It also seems to fail during the spectral cycles, but that is not associated with the 'no way out' error. I have also run the same model with C34 only, and this does not fail.
I've done a bit more looking at where the 'no way out' error was occurring. During the ionization cycles, it occurs randomly, once or twice an ionization cycle at most. I also said above that the C345 model fails in the spectral cycles. This is due to "Error: cdf_gen_from_array: all y's were zero or xmin xmax out of range of array x -- returning uniform distribution 0", which appears to be happening while "Calculating macro-atom and k-packet sensitivities - this might take a while..." is being reported. Finally, and this may or may not be helpful, I tried to restart the model and it failed almost immediately, with the same type of errors, but now there are more of the 'no way out' errors as well.
I'm not sure exactly where best to document this, but here is an issue I have uncovered which is related to #451 #486. The issue is as follows. When one reads in VFKY cross-sections in the C34 or C345 models, the VFKY data can incorrectly be used for the top ion. Here's the offending piece of code, around line 1400 of get_atomic_data.c, activated by the PhotVfkyS keyword:

/* Check that there is an ion which has the same ionization state as this record
   otherwise it must be a VFKY style record and so read with that format */
else if (strncmp (word, "PhotVfkyS", 8) == 0)
  {
    // It's a VFKY style photoionization record, beginning with the summary record
    sscanf (aline, "%*s %d %d %d %d %le %d\n", &z, &istate, &islp, &ilv, &exx, &np);
    for (n = 0; n < np; n++)
      {
        //Read the topbase photoionization records
        if (fgets (aline, LINELENGTH, fptr) == NULL)
          {
            Error ("Get_atomic_data: Problem reading Vfky photoionization record\n");
            Error ("Get_atomic_data: %s\n", aline);
            exit (0);
          }
        sscanf (aline, "%*s %le %le", &xe[n], &xx[n]);
        lineno++;
      }
    for (nion = 0; nion < nions; nion++)
      {
        if (ion[nion].z == z && ion[nion].istate == istate)
          {
            if (ion[nion].phot_info == -1)
              {
                [COPY TO PHOT_TOP STRUCTURE]

Let's take the C345 model as an example. In that case, the C6 ion is the top one and should act like the "continuum" - i.e. there's no way you can go above C6. But if you still provide a C6 xsection in the data, then it gets to this if statement and has no way of knowing not to read in the xsection. One can change the if statement to this:
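The actual replacement is not reproduced above, but given the note below that macro_info differed between the ion and phot structures, a plausible form might look like the following (a hypothetical sketch, not the committed fix):

/* Hypothetical sketch only: refuse to attach VFKY data to an ion flagged as a
   macro ion, so the top "continuum" ion of a macro-atom (e.g. C VI in the C345
   model) never picks up a cross-section.  The comment below also suggests
   checking against the highest ion of the element instead. */
if (ion[nion].z == z && ion[nion].istate == istate)
  {
    if (ion[nion].phot_info == -1 && ion[nion].macro_info == 0)
      {
        [COPY TO PHOT_TOP STRUCTURE]
      }
  }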
Note that this initial issue could have feasibly caused bad errors, because the macro_info was different in the ion and phot structures, and I could foresee weird numbers being stored in the wrong places. So we should rerun a C345 model once we fix this. The calc_pi_rate error is caused by trying to calculate the rate matrix for Carbon even though it's a macro-atom. It never actually stores these populations, but it does calculate them. It still wants to calculate a photoionization rate for this ion because it only stops calculating them when Z = istate - 1 (see ...). How to fix this... I guess we should probably mask macro ions out of the rate matrix? We need to ensure the code is smart enough not to try and calculate CVII in that instance then. We should probably also change the if statement so it checks against the highest ion rather than the actual charge, to be consistent with the elem_ions files? @Higginbottom @ssim @kslong
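As a concrete sketch of the masking idea (illustrative names only; whether this is actually the right approach is debated in the next comment):

/* Illustrative sketch of "mask macro ions out of the rate matrix": when
   assembling the simple-ion ionization rate matrix, skip any ion flagged as a
   macro ion, since its ionization fractions come from macro_pops instead.
   Flag and structure names are assumptions, not the actual code. */
for (nion = 0; nion < nions; nion++)
  {
    if (ion[nion].macro_info)
      continue;                 /* populations handled by the macro machinery */
    /* ... add this ion's photoionization and recombination rates to the matrix ... */
  }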
Hard for me to believe we can actually mask macro atoms out of the rate matrix calculation, especially H and He. How then do we calculate things? I do think some of what is going on points to a need to do some extreme vetting of the atomic data, perhaps with some kind of ancillary code.
Possibly, although of course the original error was actually being caught by my test routines. The rate matrix calculation is twofold. We have a rate matrix for the ionization state (if you use matrix ionization modes) for simple ions, and we have a rate matrix for all macro-ions which is for the level populations, including the ionization fractions. The two systems of equations are both solved inside the same loop, but the data copied to the macro-ion levels comes from the second calculation (macro_pops).
These seem to be working well now. There's a caveat that we should add collision strengths for Carbon, but this issue can be closed.
This is an issue that Keara discovered. While the CV models seem to be working with her C34 and C345 macro atom models, the AGN models do not. Her discussion is here
ksl confirms that some of these models are failing. In particular, running in multiprocessor mode, fiducial_agn_c34.pf had a segmentation fault. The same model running in single processor mode died after too many sane_checks, as did fiducial_agn_345.pf. It would seem as if the first step is to track down the cause of all of the sane_checks.