
Carbon macro atom models are failing #425

Closed
kslong opened this issue Aug 19, 2018 · 41 comments
Labels
macro-atom Issues to do with macro-atoms

Comments

@kslong
Collaborator

kslong commented Aug 19, 2018

This is an issue that Keara discovered. While the CV models seem to be working with her C34 and C345 macro atom models, the AGN models do not. Her discussion is here

ksl confirms that some of these models are failing. In particular, running in multiprocessor mode, fiducial_agn_c34.pf had a segmentation fault. The same model running in single processor mode died after too many sane_checks, as did fiducial_agn_345.pf. It would seem that the first step is to track down the cause of all of the sane_checks.

@kslong
Collaborator Author

kslong commented Sep 6, 2018

The model.pf files that caused this problem have to be updated to reflect the fact that pairwise_pow has been removed as an ionization mode.

After I made these modifications, I still see problems in multiprocessor mode, but the models seem to be running fine in single processor mode. The failure mode was not quite the same: for the 34 model there was a segmentation fault in multiprocessor mode, while for the 345 model I got a more generic MPI failure message.

@kslong
Collaborator Author

kslong commented Sep 8, 2018

I tried running in multiprocessor mode with a smaller wind (30 x 30 instead of 100). This time in multiprocessor mode I still get a segmentation fault for the 34 model. The last lines of the output file before the segfault look like this:

Thread 0 happy after broadcast.
!!python: Number of ionizing photons 2.03478e+56 lum of ionizing photons 6.34341e+45
mc_estimator_normalise: bb stimulated correction factor is out of bound. Abort.
stimfac -nan, i 26, line[config[i].bbu_jump[j]].nconfigu 29
estimators: den_config (xplasma, i) 0  den_config (xplasma, line[config[i].bbu_jump[j]].nconfigu) 0 
mc_estimator_normalise: bb stimulated correction factor is out of bound. Abort.
stimfac -nan, i 27, line[config[i].bbu_jump[j]].nconfigu 30
estimators: den_config (xplasma, i) 0  den_config (xplasma, line[config[i].bbu_jump[j]].nconfigu) 0 
mc_estimator_normalise: bb stimulated correction factor is out of bound. Abort.
stimfac -nan, i 29, line[config[i].bbu_jump[j]].nconfigu 30
estimators: den_config (xplasma, i) 0  den_config (xplasma, line[config[i].bbu_jump[j]].nconfigu) 0 
init_freebound: Creating recombination coefficients
init_freebound: Creating recombination emissivites between 0.000000e+00 and 1.000000e+50
[science4:584842] *** Process received signal ***
[science4:584842] Signal: Segmentation fault (11)

@kslong
Collaborator Author

kslong commented Sep 11, 2018

On my Mac, in single processor mode, short_34.pf fails with

error_count: This error will no longer be logged: macro_pops: found unreasonable populations in cell %i, so adopting dilute BBody excitation


Error summary: Something is drastically wrong for any error to occur so much!

Recurrences --  Description
        1 -- rdchoice: Deprecated use of rdchoice, please replace answer to %s with its string equivalent %s 
        1 -- Over 1 GIGABYTE of photon structure allocated. Could cause serious problems.
       59 -- Get_atomic_data: Simple line data ignored for Macro-ion: %d
        7 -- getatomic_data: line input incomplete: %s
      248 -- get_atomicdata: Could not interpret line %d in file %s: %s
       12 -- error_count: This error will no longer be logged: %s
     1969 -- Get_atomic_data: more than one electron yield record for inner_cross %i
       32 -- get_atomicdata: No inner electron yield data for inner cross section %i
        1 -- get_atomicdata: macro atom but no topbase xsection! ion %i z %i istate %i, yet marked as topbase xsection!
        1 -- get_wind_params: zdom[ndom].rmax 0 for wind type %d
        1 -- wind2d: Cell %3d (%2d,%2d) in domain %d has %d corners in wind, but zero volume
        1 -- check_grid: velocity changes by >1,000 km/s in %i cells
        1 -- check_grid: some cells have large changes. Consider modifying zlog_scale or grid dims
       12 -- trans_phot: %ld photons were lost due to DFUDGE (=%8.4e) pushing them outside of the wind after scatter
     4899 -- mc_estimator_normalise: bb stimulated correction factor is out of bound. Abort.
     4899 -- stimfac %g, i %d, line[config[i].bbu_jump[j]].nconfigu %d
      544 -- spectral_estimators: too few photons (1 or 0) in cell %d band %d to produce a model
        3 -- spectral_estimators: Alpha cannot be bracketed (%e %e)in band %i cell %i- setting w to zero
       11 -- photon_checks: nphot  origin  freq     freqmin    freqmax
      381 -- photon_checks: %6d %5d %10.4e %10.4e %10.4e freq out of range
      957 -- walls: distance %g<0. Position %g %g %g 
        2 -- spectral_estimators: multiple photons but only one frequency seen in %d band %d
   100001 -- sane_check: %d %e
    41030 -- macro_pops: level %i has frac. pop. %8.4e in cell %i
     2565 -- macro_pops: found unreasonable populations in cell %i, so adopting dilute BBody excitation
    51278 -- macro_pops: bbd rate is %8.4e in cell/matom %i
     7692 -- macro_pops: ion %i has frac. pop. %8.4e in cell %i

and short_345 fails with

error_count: This error will no longer be logged: macro_pops: found unreasonable populations in cell %i, so adopting dilute BBody excitation


Error summary: Something is drastically wrong for any error to occur so much!

Recurrences --  Description
        1 -- rdchoice: Deprecated use of rdchoice, please replace answer to %s with its string equivalent %s 
        1 -- Over 1 GIGABYTE of photon structure allocated. Could cause serious problems.
       77 -- Get_atomic_data: Simple line data ignored for Macro-ion: %d
        7 -- getatomic_data: line input incomplete: %s
      248 -- get_atomicdata: Could not interpret line %d in file %s: %s
       10 -- error_count: This error will no longer be logged: %s
     1969 -- Get_atomic_data: more than one electron yield record for inner_cross %i
       32 -- get_atomicdata: No inner electron yield data for inner cross section %i
        1 -- get_atomicdata: macro atom but no topbase xsection! ion %i z %i istate %i, yet marked as topbase xsection!
        1 -- get_wind_params: zdom[ndom].rmax 0 for wind type %d
        1 -- wind2d: Cell %3d (%2d,%2d) in domain %d has %d corners in wind, but zero volume
        1 -- check_grid: velocity changes by >1,000 km/s in %i cells
        1 -- check_grid: some cells have large changes. Consider modifying zlog_scale or grid dims
       10 -- trans_phot: %ld photons were lost due to DFUDGE (=%8.4e) pushing them outside of the wind after scatter
      364 -- spectral_estimators: too few photons (1 or 0) in cell %d band %d to produce a model
        9 -- photon_checks: nphot  origin  freq     freqmin    freqmax
      323 -- photon_checks: %6d %5d %10.4e %10.4e %10.4e freq out of range
      671 -- walls: distance %g<0. Position %g %g %g 
       13 -- mc_estimator_normalise: bb stimulated correction factor is out of bound. Abort.
       13 -- stimfac %g, i %d, line[config[i].bbu_jump[j]].nconfigu %d
        2 -- spectral_estimators: multiple photons but only one frequency seen in %d band %d
    36912 -- macro_pops: level %i has frac. pop. %8.4e in cell %i
     4761 -- macro_pops: ion %i has frac. pop. %8.4e in cell %i
     1192 -- macro_pops: found unreasonable populations in cell %i, so adopting dilute BBody excitation
       12 -- sobolev: den_ion has negative density %g %g %g %g %g %g
       12 -- sobolev: continuing by setting tau to zero
   100001 -- sane_check: %d %e
    58329 -- macro_pops: bbd rate is %8.4e in cell/matom %i

@kslong kslong mentioned this issue Sep 15, 2018
@kslong
Collaborator Author

kslong commented Sep 15, 2018

This is an update on where we are with this problem. The short runs of the AGN model appear to have been fixed, in the sense that the program no longer falls over, by a change that @jhmatthews made to macro atoms, which he will explain later. Errors associated with this macro atom problem are still being reported, and James is investigating them.

At least two other problems have been seen:

For long runs, photon_checks is seeing too many photons outside the frequency boundaries. This is most likely due to the effects of Doppler shifts on the original distribution, as is discussed in #434. For now, ksl has modified dev so it does not fall over at this point.

A second problem, in long runs, is that in some multiprocessor runs we are getting the error
Error: Matom: jumped 1000000 times with no emission abort. As with the photon_checks, when this happens the program aborts (because, as @ssim has noted, something is terribly wrong). ksl notes that there may be a related problem, #385. At any rate this may be very difficult to debug, as it has only appeared (at least so far) in multiprocessor mode.

(Additionally, in multiprocessor mode, the program is often failing with either a segmentation fault in one of the threads or a general MPI failure, suggesting some kind of problem with MPI.)

@kslong
Collaborator Author

kslong commented Sep 16, 2018

@jhmatthews @ssim So, on linux now,
h10_standard80.pf.txt
fails in multiprocessor mode. The error that occurs on one of the threads is
Error: bb_estimators_increment: trying to add negative contribution to jbar. Abort.
With 16 processors, the failure is in ionization cycle 7 as photons are going through the wind.

The only difference that I know of between this .pf file and the one that has historically been run is that we are now using ionization mode 9, not 7 (which Nick removed).

We clearly need a better way of diagnosing these types of problems, e.g. to dump the windsave file and all of the information about the photon at the point where the problem occurred.

@kslong
Collaborator Author

kslong commented Sep 16, 2018

@jhmatthews @ssim

I installed valgrind on the linux machine. It shows errors, most of which I do not know how to interpret, even for the 1d_sn model.
screenshot_825

And many more for a short macro-atom run. For the macro runs, things like the plasma cells are implicated.

@saultyevil
Collaborator

Do you know if this fails in multiprocessor mode with just one process? It might help narrow down whether something weird is happening when MPI is turned on.

@kslong
Collaborator Author

kslong commented Sep 16, 2018 via email

@kslong
Collaborator Author

kslong commented Sep 17, 2018

Here is a complete output from Valgrind, for the following .pf file

short_34.pf.txt

which ran to completion with the following errors:

screenshot_835

The valgrind result is

valgrind.txt

I'm still not sure how to interpret this, but if you look, say, at the first set of results, it looks like something is trying to change a value at line 762 of recomb.c in memory that was allocated by calloc_dyn_plasma at line 648 of grid_wind.

Later on, it looks like something is trying to use an uninitialized value in macro_pops at line 493.

At the end, there is a leak summary.

==3518128== LEAK SUMMARY:
==3518128==    definitely lost: 242,968 bytes in 1,047 blocks
==3518128==    indirectly lost: 0 bytes in 0 blocks
==3518128==      possibly lost: 0 bytes in 0 blocks
==3518128==    still reachable: 125,958,908 bytes in 4,739 blocks
==3518128==         suppressed: 0 bytes in 0 blocks

According to some of the Valgrind documentation any problem that results in definitely lost memory should be fixed.

@jhmatthews
Collaborator

So I'm struggling to keep up a bit here. As far as I can tell we have a number of different issues with likely different causes and it might be easier to debug if we document or investigate these one by one, possibly in separate threads. Here are some observations, separated out by issue.

  • The issue caused by a negative jbar (model h10_standard80.pf.txt) can only be caused by a non-positive tau_sobolev, a non-positive fabs(dvds), or a non-positive weight_of_packet, each of which is something to worry about, so we should figure out which.
  • The errors in mc_estimator_normalise (model vshort_34.pf.txt) actually occur in CV runs too, and I think they are caused by having zero populations in the upper levels of certain carbon ions -- the densities of the upper and lower ions are both zero, so the stim_fac quantity is nan and doesn't get used (as it has to be between 0 and 1 to be used). In that case it isn't really an error at all, and the logic of the code should be tweaked slightly (see the sketch below). That is, unless we are really supposed to have a minimum density in upper ions?
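
To illustrate the suggested tweak, here is a small, self-contained sketch (the correction formula and all names here are assumptions, not the real mc_estimator_normalise code) in which the empty-level case is treated as "no correction" rather than an error:

#include <math.h>
#include <stdio.h>

/* Schematic stimulated correction: 1 - (n_u * g_l) / (n_l * g_u).
 * If both levels are empty the ratio is 0/0 = nan, but no correction is
 * needed anyway, so return 1 (no correction) instead of flagging an error. */
double
stim_correction (double n_upper, double n_lower, double g_upper, double g_lower)
{
  if (n_upper <= 0.0 && n_lower <= 0.0)
    return 1.0;

  double stimfac = (n_upper * g_lower) / (n_lower * g_upper);
  if (stimfac < 0.0 || stimfac > 1.0)
  {
    fprintf (stderr, "stim_correction: factor %g out of bounds\n", stimfac);
    return 1.0;
  }
  return 1.0 - stimfac;
}

int
main (void)
{
  /* both level populations zero, as in the log output quoted above */
  printf ("empty levels -> %g\n", stim_correction (0.0, 0.0, 3.0, 1.0));
  printf ("normal case  -> %g\n", stim_correction (1e-6, 1e-3, 3.0, 1.0));
  return 0;
}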

@kslong
Collaborator Author

kslong commented Sep 17, 2018

@jhmatthews Basically, I agree, but I think we need to figure out a path forward where we can work on the various issues at once, since they all contribute to the problem of getting the C34 and C345 models to work.

There are now two types of problems, or at least there are two basic ways the problems are manifested:

  • Rare problems associated with the macro atom machinery
  • Problems associated with multiprocessing.

It is clear that the rare problems are causing problems with multiprocessing. This is because we have adopted a more "rigorous" approach when the macro-atom problems are discovered than we have elsewhere in the program; that is, in most cases we exit the program. Our philosophy elsewhere has been to try to get the program to "soldier" on. (Note that my comment here is not exactly fair; there are other places where we would exit if we got an error, but generally when we have encountered these in the past, we either fixed the error or figured out a way to get the program to continue.)

An important point here is that the macro atom issues we are encountering are rare. Unless there is a deeper underlying problem, which there could be, they are just like any of the other errors we have bypassed in the past. Therefore, unless just looking at the code will permit a solution, I think we need a way to issue an error message and have the program move on. We should print out as much as we can about the situation at the time of the error, but we should "soldier" on. At the point that we can get Python not to crash, I agree with James that a separate git issue is warranted, since as he points out these errors may have different causes and solutions, and this thread will be impossible to dissect at a later date.

@kslong
Collaborator Author

kslong commented Sep 21, 2018

So, after making a small mod so that a negative jbar would not cause the program to crash, I ran short_345 and short_34. When compiled with mpicc, short_34 had a segmentation fault in single processor mode; when compiled with gcc it seems to be running past where the segmentation fault occurred. short_345, compiled with mpicc, ultimately failed with the following end of the log:

screenshot_839

@kslong
Collaborator Author

kslong commented Sep 22, 2018

I have been checking what happens with short_34 and short_345 with the new scheme for handling simple atoms in macro atom mode turned off. short_345 seems to run fine, with few errors. short_34 still has the odd problem where it produces a segfault when compiled with mpicc and run with a single processor, but not when compiled and run with gcc.

@kslong
Collaborator Author

kslong commented Sep 24, 2018

After the changes @jhmatthews has made for #436, short_345 appears to run using MPI, but short_34 is still having segmentation faults. James fixed one of these, but we are still seeing them at another point in the program, which I have finally located as occurring in init_freebound. I doubt this has anything to do with the jbar problem. I am suspicious, @ssim, that it has something to do with the atomic data file, something that get_atomic_data is not catching.

The strange thing about this segmentation fault is that it does not always appear, depending on how exactly the code is compiled. In particular, it always occurs with a standard compilation, but it does not happen in single processor mode, and it does not seem to happen when compiled in multiprocessor mode with debugging turned on (which changes some of the compilation defaults).

@kslong
Collaborator Author

kslong commented Sep 25, 2018

Progress, I think, @jhmatthews @ssim. I believe I understand now why the segfault occurred sometimes with short_34.pf and sometimes not. It all depends on whether the -O3 switch is included. Our standard has been to not include this when the D option is used. When I add the -O3 switch to the debug options in the Makefile, then I get the segfault on this file:
vshort_34.pf.txt

The segfault on linux occurs in single and multiprocessor mode, when compiled with
make -D python (so this is mpicc and -O3).

Given that it also occurs in single processor mode, I have been able to home in on the error. It was not quite where I thought. Instead it was in these lines:

screenshot_842
specifically at line 376 in macro_gov

The actual error is coming from line 373, which is returning a very large negative value.

This would be expected to cause a segfault.

Note that with -O3, gdb does not work as well as without it, as some variables have been "optimized out". Consequently, I cannot look at a lot of the values in the routine. So one is going to need to be clever to actually figure out what the problem is, and it may be that just adding Log statements will cause the program to work again. I remain of the opinion that there is something not quite right about the C34 data file, or something that get_atomic_data is not catching.

@kslong
Collaborator Author

kslong commented Sep 25, 2018

@ssim @jhmatthews I am betting you all are in a position to figure this out now. I added some print statements to the loop where the seg fault was:

screenshot_844

and here is what got printed out:

screenshot_843

@jhmatthews
Collaborator

I'm not sure if this is a lead or not, but I had a quick look at the data files. The CIIICIVCV elem_ions file has:

#IonV    C   6   1   1   11.26000 1000   5     1s^22s^22p^2(3P_0) 
#IonV    C   6   2   2   24.38400 1000   5     1s^22s^22p(2P^o_{1/2}) 
IonM    C   6   3   1   47.88800 1000   20     1s^22s^2(1S_0) 
IonM    C   6   4   2   64.49400 1000   20     1s^22s(2S_{1/2}) 
IonM    C   6   5   1  392.09000 1000   20     1s^2(1S_0) 
IonM    C   6   6   2  489.99700 1000   1     1s(2S_{1/2}) 
#IonV    C   6   7   2 1.0000e+20   0   0     Bare 

while CIIICIV has:

#IonV    C   6   1   1   11.26000 1000   5     1s^22s^22p^2(3P_0) 
#IonV    C   6   2   2   24.38400 1000   5     1s^22s^22p(2P^o_{1/2}) 
IonM    C   6   3   1   47.88800 1000   10     1s^22s^2(1S_0) 
IonM    C   6   4   2   64.49400 1000   5    1s^22s(2S_{1/2}) 
IonM    C   6   5   1  392.09000 1000   1     1s^2(1S_0) 
#IonV    C   6   6   2  489.99700 1000   0     1s(2S_{1/2}) 
#IonV    C   6   7   2 1.0000e+20   0   0     Bare 

i.e. nlte is set to 5 for CIV, which is the ion that causes the issue. I'm wondering if this is causing a problem? I don't know if this info is actually used, though. Knox, can you change the numbers and try again? I can't seem to reproduce the segfault.

@ssim
Collaborator

ssim commented Sep 26, 2018 via email

@kslong
Collaborator Author

kslong commented Sep 26, 2018 via email

kslong added a commit that referenced this issue Sep 26, 2018
This change causes get_atomic_data to ignore macro levels that are in
excess of the number specified in the corresponding ion line.
@kslong
Collaborator Author

kslong commented Sep 26, 2018

As noted by @jhmatthews, the problem with the data file was that there were more macro-levels than were specified in the ion line. The code did not check for this.

After a conversation with @ssim, I have made changes to get_atomic_data.c so that macro-levels in excess of the number given in the corresponding ion line result in an error but are ignored. This follows our general philosophy of ignoring extra lines in the atomic data file. In this case, it means one can easily reduce the complexity of an ion by reducing the number of levels, at least under the assumption that the first n levels are the most important. So, for example, the change from h20 to h10 could be carried out by changing a single entry in the ions data.
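
For illustration, here is a schematic, self-contained sketch of the kind of check that was added (not the actual get_atomic_data.c code; the struct and names are placeholders):

#include <stdio.h>

struct ion_stub
{
  int nlte_declared;            /* macro levels promised by the IonM line (the nlte column) */
  int nlte_read;                /* macro level records read so far */
};

/* Ignore level records beyond the count declared on the ion line,
 * logging an error rather than overrunning the allocated slots. */
int
add_macro_level (struct ion_stub *ion, int nion)
{
  if (ion->nlte_read >= ion->nlte_declared)
  {
    fprintf (stderr, "get_atomic_data: extra macro level for ion %d ignored (only %d declared)\n",
             nion, ion->nlte_declared);
    return -1;
  }
  ion->nlte_read++;
  return 0;
}

int
main (void)
{
  struct ion_stub civ = { 5, 0 };       /* e.g. CIV with nlte = 5, as in the data file above */
  for (int i = 0; i < 7; i++)           /* try to read 7 level records */
    add_macro_level (&civ, 3);
  printf ("kept %d of 7 level records\n", civ.nlte_read);
  return 0;
}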

@kslong
Collaborator Author

kslong commented Sep 27, 2018

So, I tried running the full simulations that Keara began with, on 8 processors each, on linux. fiducial_agn_34 ran to completion, but fiducial_agn_345 failed after about 8 ionization cycles on one of the threads, which caused the program to crash. The diagnostic messages presented before the crash were:

screenshot_845

Here is what that portion of the code says:

screenshot_847

Evidently d2, the "density" of the upper level here, is a lot bigger than d1, and since the g's of the upper and lower levels are both 3, this is going to cause xden_ion to be negative, which ultimately gets to the exit statement.
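
For reference, a tiny standalone illustration of the quantity being discussed (the formula and the numbers are schematic, not taken from sobolev()):

#include <stdio.h>

int
main (void)
{
  double d1 = 1.0e-4;           /* lower level "density" (illustrative value) */
  double d2 = 5.0e-3;           /* upper level "density", much larger than d1 */
  double gl = 3.0, gu = 3.0;    /* equal statistical weights, as in the diag output */

  /* lower-level density corrected for stimulated emission */
  double xden_ion = d1 - d2 * gl / gu;

  if (xden_ion < 0.0)
    printf ("xden_ion = %g < 0: apparent population inversion -> negative opacity\n", xden_ion);
  return 0;
}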

@ssim
Collaborator

ssim commented Oct 4, 2018

I'm not 100% sure if this is a robust fix but Knox, can you try this with the code alteration in the PR I just made: #438

If this helps we could accept the PR as a step forward. If it doesn't help then we'll close it

@kslong
Collaborator Author

kslong commented Oct 4, 2018 via email

@kslong
Collaborator Author

kslong commented Oct 5, 2018

Unfortunately, @ssim @jhmatthews, this did not solve the problem. Although fiducial_agn_c34 is still running, fiducial_agn_345 failed in one of the 8 threads I ran at the end of cycle 19 of 30. The messages at the end of the diag file for that thread are shown below:

screenshot_857

Note that I have not yet merged the changes Stuart made in macro_pops into the dev branch; I just ran his updated version. So if that needs to be incorporated, then one of us needs to merge it.

@ssim
Collaborator

ssim commented Oct 5, 2018 via email

@jhmatthews
Collaborator

I also don't understand this. I'll try and work through the code a bit before the telecon.

@ssim
Collaborator

ssim commented Oct 7, 2018

@jhmatthews Could this be that we need a fabs() somewhere? I.e. the "levden_temp" variable we use for sanity checking before writing the populations ... can this get messed up if "this_ion_density" and "populations[]" are both negative? Maybe we should do the "sane check" directly on "populations" without the division into a fractional population?

@jhmatthews
Collaborator

Hi Stuart. Finally had a look at this. I think you are right that the sane check should really be done on the absolute populations rather than the fractional ones, but I'm not sure this can explain the error, since we do also check for negative "this_ion_density", so that situation would still lead to "insane", and we know the error still happens even with your fix in #438.

Let's discuss at Friday's telecon, but I'm basically totally stuck on how to proceed here. One thing is that before I come to Belfast I think we need to be able to reproduce this problem without running on 30 cores overnight, say by restarting from a windsave file.

@jhmatthews
Collaborator

My above comment is incorrect, because the ordering of the checks wasn't quite what I had thought. Instead we should check all populations and densities for sanity before we copy anything over, and we should do the population checks on absolute rather than fractional pops.

I've pushed what I think is the correct change here. It would be good if @ssim could review my change and @kslong could run the model that produces the error with the latest commit from #438
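
For orientation, here is a minimal sketch of the check-then-copy ordering described above (assumed structure and names, not the actual change in #438):

#include <math.h>
#include <stdio.h>

/* Validate all absolute populations and the total ion density first;
 * only convert to fractional populations and copy once everything passes. */
int
copy_if_sane (double *levden, const double *populations, int nlevels, double ion_density)
{
  int i;

  for (i = 0; i < nlevels; i++)
    if (!isfinite (populations[i]) || populations[i] < 0.0)
      return -1;                /* insane: leave the live levden array untouched */

  if (!isfinite (ion_density) || ion_density <= 0.0)
    return -1;

  for (i = 0; i < nlevels; i++)
    levden[i] = populations[i] / ion_density;
  return 0;
}

int
main (void)
{
  double pops[3] = { 2.0, 1.0, -0.5 };  /* one negative absolute population */
  double levden[3] = { 0.1, 0.1, 0.1 };

  if (copy_if_sane (levden, pops, 3, 2.5) != 0)
    printf ("populations rejected; levden left as it was (%g)\n", levden[0]);
  return 0;
}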

@jhmatthews
Collaborator

With the latest fix in pull #438 we are still getting errors and perhaps different ones. Now we are getting

bf_estimator_increment: A packet survived an optical depth of 65.9711
bf_estimator_increment: freq_av 5.66755e+15, ft 3.28805e+15

followed slightly later by

Error: Matom: jumped 1000000 times with no emission. Abort.

Having looked at the first error in the source code, the quantity that is equal to 65.9711 is the optical depth, which for a filling factor of one is calculated from kap_bf[nn] * weight_of_packet, but there are also some multiplications by variables and divisions by the same variable (because we record a different quantity in the estimators). It's not clear to me whether this is definitely wrong, but it's suspicious.

The second error is obviously more serious. How could lots of jumps occur?

  • if two levels are close to detailed balance?
  • if the energy difference between the levels is vanishingly small w.r.t. the absolute energy of the transitions - p_emit is proportional to the energy gap but p_jump is proportional to the absolute energy. Even so, you would think you would jump to another level that had a reasonable emission probability
  • possibly a combination of the above with some incorrect normalisation that artificially inflates the jumping probability for one specific transition.
  • a flat-out bug or book keeping error
  • ???

Would be good to know more about the transitions/ions in question.

Maybe worth running the models with the latest fix but with BF_SIMPLE_EMISSIVITY_APPROACH set to 0 in python.h, to ensure that none of the recent bound-free changes are messing anything up here. However, this is unlikely, since we are in the macro-atom rather than the simple machinery.

Confusing.

@jhmatthews
Collaborator

Having delved into this further I've verified that nans are still being copied into the levden array on occasion in macro_pops, which implies that my proposed catch is still not working and there must be a basic error in my coding.

@jhmatthews
Collaborator

Ok, so I think this is happening because the rate to the C5 ion is zero, which means the total ion density is zero. Because we store things as a fractional population, we do 0.0/0.0 which is nan.

I will fix this by catching the case where total_ion_density is zero. Note that we normally have a DENSITY_MIN for other, non-macro ions, which I can use if we prefer. It's not clear if this will fix the problem Knox sees on more threads, but it explains what I am seeing.
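
A small standalone illustration of the failure and the kind of guard described (the 1e-20 floor and the names are assumptions, not the actual macro_pops fix):

#include <stdio.h>

#define DENSITY_FLOOR 1e-20     /* assumed floor, analogous to DENSITY_MIN for simple ions */

/* Converting an absolute level population to a fractional one divides by the
 * total ion density; if nothing feeds the ion, that is 0.0/0.0 = nan. */
double
frac_pop (double level_pop, double total_ion_density)
{
  if (total_ion_density <= 0.0)
    return DENSITY_FLOOR;       /* catch the empty-ion case instead of storing nan */
  return level_pop / total_ion_density;
}

int
main (void)
{
  double zero = 0.0;
  printf ("unguarded: %g\n", zero / zero);      /* prints nan */
  printf ("guarded:   %g\n", frac_pop (0.0, 0.0));
  return 0;
}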

@jhmatthews
Collaborator

The issue in my comment above got fixed in pull #438 - but there are some comments on that pull request about this error still persisting:

Error: macro atom level has no way out 11 0 0 

This is happening for a fairly convoluted sequence of reasons, as follows:

  • in macro_pops, I had a density floor for ions but not for levels. As a result, the levden array can contain 0.
  • when you calculate the stimulated recombination rate there's a factor of n_u/n_l - if n_l = 0 then you get a nan for the jumping probability. This happens in a bound-free upwards jump.
  • As a result, the normalisation of the jumping probabilities is also nan. So when you try to figure out where the packet jumps to, you are asking whether a finite number is less than a nan, which is never true, so the "n" counter in matom() never gets incremented and you jump to the first level, even if that level had nothing to do with the levels that caused the issue (see the small demo below).
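
A small standalone demo of that comparison pitfall (the loop shape is schematic, not copied from matom()):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  double jprbs[3] = { 0.2, 0.3, 0.5 };  /* jumping probabilities for three routes */
  double pjnorm = NAN;                  /* normalisation poisoned by a 0/0 upstream */
  double threshold = 0.6 * pjnorm;      /* nan propagates into the selection threshold */
  double run_tot = 0.0;
  int n = 0;

  /* any ordered comparison with nan is false, so the loop body never runs,
   * n stays 0, and the packet "jumps" via the first route regardless */
  while (run_tot < threshold && n < 3)
  {
    run_tot += jprbs[n];
    n++;
  }

  printf ("selected jump n = %d\n", n);
  return 0;
}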

The actual jump that causes the nan is from the CIV ground state to the CV ground state. We should check that it is reasonable for there to be nothing in CIV or CV (I have checked that the gamma for that transition is zero).

To fix this I have added a density floor in levden (agnwinds/python@0a05836) which ensures that you don't have 0s in denominators of density ratios. This runs much better (currently on cycle 16 of fiducial_agn_c34.pf).

The code is still spitting out a bunch of errors in estimators.c due to stimulated correction factors being out of bounds. It's not surprising that you get the error, since if all fractional populations in an ion are set to 1e-20 then the code thinks that is a population inversion - in fact it also multiplies the 1e-20 in levden by the 1e-20 in density[nion].

So I think we just have a couple of decisions to make about how we flag ions where this is happening, and how we ensure that errors are suppressed where they should be but that we still catch genuine issues. It's not clear to me exactly what the best decision is.

@kslong
Collaborator Author

kslong commented Dec 2, 2018

We are now able to run Keara's AGN problems, having solved #196, and the results are plausible. Although we clearly expect to do more testing of these input files, most notably for CVs, I am going to declare victory here and close this issue.

Changes associated with this issue have been merged into dev.

@kslong kslong closed this as completed Dec 2, 2018
@kslong kslong reopened this Mar 19, 2019
@kslong
Collaborator Author

kslong commented Mar 19, 2019

@jhmatthews @ssim Reviewing what is happening with my attempt to run a CV model with C345, it is clear that although the cause may not be the same, the error that is being produced is the same as we were seeing before. So rather than open a new issue about this, I've reopened the old one.

ixvel_c345.pf.txt

This takes overnight to run on 16 processors. It also seems to fail during the spectral cycles, but that is not associated with the 'no way out' error.

I have also run the same model with c34 only, and this does not fail.

@kslong
Collaborator Author

kslong commented Mar 19, 2019

I've done a bit more looking at where the 'no way out' error was occurring. During the ionization cycles, it occurs randomly, once or twice per ionization cycle at most.

I also said above that the c345 model fails in the spectral cycles. This is due to

Error: cdf_gen_from_array: all y's were zero or xmin xmax out of range of array x-- returning uniform distribution 0

which appears to be happening while

Calculating macro-atom and k-packet sensitivities- this might take a while...


Finally, and this may or may not be helpful, I tried to restart the model and it failed almost immediately, with the same type of errors, but now there are more of the 'no way out' errors as well.

@jhmatthews
Collaborator

I'm not sure exactly where best to document this, but here is an issue I have uncovered which is related to #451 and #486. The issue is as follows: when one reads in VFKY cross-sections in the C34 or C345 models, the VFKY data can incorrectly be used for the top ion.

Here's the offending piece of code, around line 1400 of get_atomicdata.c, activated by the PhotVfkyS keyword:

          /* Check that there is an ion which has the same ionization state as this record
             otherwise it must be a VFKY style record and so read with that format */

          else if (strncmp (word, "PhotVfkyS", 8) == 0)
          {
            // It's a VFKY style photoionization record, beginning with the summary record
            sscanf (aline, "%*s %d %d %d %d %le %d\n", &z, &istate, &islp, &ilv, &exx, &np);


            for (n = 0; n < np; n++)
            {
              //Read the topbase photoionization records
              if (fgets (aline, LINELENGTH, fptr) == NULL)
              {
                Error ("Get_atomic_data: Problem reading Vfky photoionization record\n");
                Error ("Get_atomic_data: %s\n", aline);
                exit (0);
              }
              sscanf (aline, "%*s %le %le", &xe[n], &xx[n]);
              lineno++;

            }

            for (nion = 0; nion < nions; nion++)
            {
              if (ion[nion].z == z && ion[nion].istate == istate)
              {
                if (ion[nion].phot_info == -1)
                {
                 [COPY TO PHOT_TOP STRUCTURE] 

Let's take the C345 model as an example. In that case, the C6 ion is the top one and should act like the "continuum" - i.e. there's no way you can go above C6. But, if you still provide a C6 xsection in the data then it gets to this if statement and has no way of knowing not to read in the xsection.

One can change the if statement to this: if (ion[nion].phot_info == -1 && ion[nion].macro_info != 1). If I do that, the original error goes away, but I get this error: calc_pi_rate: No photoionization xsection for ion 8 (element 6, ion state 6), which causes the code to exit.

@jhmatthews
Collaborator

Note that this initial issue could feasibly have caused bad errors, because the macro_info was different in the ion and phot structures, and I could foresee weird numbers being stored in the wrong places. So we should rerun a C345 model once we fix this.

The calc_pi_rate error is caused by trying to calculate the rate matrix for carbon even though it's a macro-atom. It never actually stores these populations, but it does calculate them. It still wants to calculate a photoionization rate for this ion because it only stops calculating them when istate = z + 1 (see if (ion[mm].istate != ion[mm].z + 1) at line 128 of matrix_ion.c).

How to fix this... I guess we should probably mask macro-ions out of the rate matrix? We need to ensure the code is smart enough not to try to calculate CVII in that instance, then. We should probably also change the if statement so it checks against the highest ion rather than the actual charge, to be consistent with the elem_ions files. @Higginbottom @ssim @kslong
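
To make the suggestion concrete, here is a rough, self-contained sketch of the masking idea (placeholder struct and field names; not the real matrix_ion.c):

#include <stdio.h>

struct ion_stub
{
  int z;                        /* element charge */
  int istate;                   /* ionization stage, 1 = neutral */
  int macro_info;               /* 1 if treated with the macro-atom machinery */
  int highest_stage_present;    /* highest stage of this element in the data file */
};

/* Decide whether to compute a photoionization rate out of this ion:
 * skip macro-ions (their populations come from macro_pops) and skip the
 * highest stage actually present in the data, rather than assuming the
 * data always runs up to the bare ion (istate = z + 1). */
int
needs_pi_rate (const struct ion_stub *ion)
{
  if (ion->macro_info == 1)
    return 0;
  if (ion->istate >= ion->highest_stage_present)
    return 0;
  return 1;
}

int
main (void)
{
  struct ion_stub cvi = { 6, 6, 1, 6 }; /* CVI, the top macro-ion in a C345 model */
  printf ("compute a PI rate out of CVI? %d\n", needs_pi_rate (&cvi));
  return 0;
}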

@kslong
Collaborator Author

kslong commented Apr 1, 2019

It is hard for me to believe we can actually mask macro atoms out of the rate matrix calculation, especially H and He. How then do we calculate things?

I do think some of what is going on points to a need to do some extreme vetting of the atomic data, perhaps with some kind of ancillary code.

@jhmatthews
Collaborator

Possibly, although of course the original error was actually being caught by my test routines.

The rate matrix calculation is two-fold. We have a rate matrix for the ionization state (if you use matrix ionization modes) for simple ions, and we have a rate matrix for all macro-ions, which is for the level populations including the ionization fractions. The two systems of equations are both solved inside the same loop, but the data copied to the macro-ion levels is from the second calculation (macro_pops).

@jhmatthews jhmatthews added the macro-atom Issues to do with macro-atoms label Apr 3, 2019
@kslong kslong mentioned this issue Apr 6, 2019
kslong added a commit that referenced this issue Apr 6, 2019
Corrections to enable C345 macro atom models to avoid various errors.   See #486 #385 #423 #425 

* changes to set a max ion stage if it is not Z+1

* fixed if statement in matrix_ion

* Changed version to 83a
@jhmatthews
Collaborator

These seem to be working well now. There's a caveat that we should add collision strengths for carbon, but this issue can be closed.
