"index 0 is out of bounds" ERROR in newref "train_gender_model" #24

Closed
rgiannico opened this issue Nov 29, 2018 · 18 comments

@rgiannico

rgiannico commented Nov 29, 2018

Hi leraman,
I'm getting this weird error while using 'newref' on my 88 training samples:
$ WisecondorX newref *.npz myref.npz --nipt --binsize 50000 --cpus 12

[INFO - 2018-11-29 14:50:00]: Creating new reference
[INFO - 2018-11-29 14:50:00]: Importing data ...
[INFO - 2018-11-29 14:50:00]: Loading: Sample_01.npz
[INFO - 2018-11-29 14:50:00]: Binsize: 5000
[...]
[INFO - 2018-11-29 14:50:18]: Loading: Sample_88.npz
[INFO - 2018-11-29 14:50:18]: Binsize: 5000
Traceback (most recent call last):
  File "/storage/conda/anaconda2/envs/wisecondorx_v1.0.1/bin/WisecondorX", line 11, in <module>
    load_entry_point('WisecondorX==1.0.1', 'console_scripts', 'WisecondorX')()
  File "/storage/conda/anaconda2/envs/wisecondorx_v1.0.1/lib/python2.7/site-packages/wisecondorX/main.py", line 361, in main
    args.func(args)
  File "/storage/conda/anaconda2/envs/wisecondorx_v1.0.1/lib/python2.7/site-packages/wisecondorX/main.py", line 55, in tool_newref
    genders, trained_cutoff = train_gender_model(samples)
  File "/storage/conda/anaconda2/envs/wisecondorx_v1.0.1/lib/python2.7/site-packages/wisecondorX/newref_tools.py", line 48, in train_gender_model
    cut_off = gmm_x[local_min_i][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

For debugging purposes I also added this line logging.info('function train_gender_model sorted_gmm_y: {} local_min_i: {} gmm_x: {}' .format(sorted_gmm_y, local_min_i, gmm_x)) to newref_tools.py, and it printed this:

[INFO - 2018-11-29 14:50:20]: function train_gender_model sorted_gmm_y: [1.83156350e-16 2.15748202e-16 2.54060382e-16 ... 0.00000000e+00
 0.00000000e+00 0.00000000e+00] local_min_i: (array([], dtype=int64),) gmm_x: [0.00000000e+00 4.00080016e-06 8.00160032e-06 ... 1.99919984e-02
 1.99959992e-02 2.00000000e-02]

Do you have any idea on what is going on?
If you need more debugging prints just tell me :)

@rgiannico rgiannico changed the title train_gender_model index 0 is out of bounds error "index 0 is out of bounds" ERROR in newref "train_gender_model" Nov 29, 2018
@leraman
Collaborator

leraman commented Nov 29, 2018

Hi @rgiannico,

This function fits a Gaussian mixture model with two components to the Y-read-fraction, which is used to separate male from female feti.

I'm not quite sure what's going on yet, but we could try two things: you could share your .npz files, and I'll push an update to make WisecondorX robust against this, or, if that's not possible, would you mind uncommenting the 'plotting code' and re-running the software locally? It should yield an image like this (could you share yours?):

nipt_gmm

@rgiannico
Author

rgiannico commented Nov 30, 2018

Thank you @leraman ,
I actually had to add 3 lines to the plotting script to fix a couple of errors (reported here with comments, in case you are interested):

    import matplotlib           # I added this and the following line
    matplotlib.use('Agg')       # needed to fix a matplotlib error discussed here: https://stackoverflow.com/questions/37604289/tkinter-tclerror-no-display-name-and-no-display-environment-variable
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(y_fractions, bins=50, normed=True)
    ax.plot(gmm_x, gmm_y, 'r-', label='Gaussian mixture fit')
    ax.set_xlim([0.001, 0.01])
    ax.legend(loc='best')
    plt.savefig('gender_model_gaussian.png')    # I need it because I'm in a server without X11
    plt.show()

And this is my Gaussian:
image

I suppose it could be related to the fact that only 12 of my 88 NIPT samples are male fetuses.
I can ask for permission to send you my 88 npz files, but, just an idea: don't you think a better solution would be to let the user define the fetal sex of each training sample in a metadata file instead of guessing?
Do you think this is possible, or do you need my npz files?

Thank you :)

@leraman
Collaborator

leraman commented Nov 30, 2018

Hi @rgiannico

Indeed, you probably have to add more male cases. The problem is that WisecondorX looks for a local minimum in the bimodal distribution, and there is none. I didn't realize this was possible. I'll look into it. In the meantime, maybe these parameters will work, at line 28:
gmm = GaussianMixture(n_components=2, covariance_type='full', reg_covar=1e-99, max_iter=10000, tol=1e-99)

Anyway, as the manual states, it's always a good idea to include roughly as many males as females. Nevertheless, manual gender assignment during reference creation could indeed be a solution; I'll think about it.
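To illustrate the failure mode described above, here is a minimal sketch, not the WisecondorX implementation. It evaluates a two-component Gaussian mixture density on a grid and searches for an interior local minimum to use as the male/female cutoff. The function names and all mixture parameters are made up for illustration. When the two components overlap heavily (as can happen with few male samples), the density has a single peak, the search returns nothing, and indexing the empty result is what raises the IndexError.

```python
import numpy as np

def mixture_pdf(x, weights, means, sds):
    """Density of a 1-D Gaussian mixture evaluated on the grid x."""
    x = np.asarray(x, dtype=float)[:, None]
    w, m, s = (np.asarray(a, dtype=float) for a in (weights, means, sds))
    comp = np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return (w * comp).sum(axis=1)

def find_cutoff(x, y):
    """Return x at the first strict interior local minimum of y, or None."""
    idx = np.where((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:]))[0] + 1
    if idx.size == 0:
        # No valley between two peaks: there is nothing to index into.
        return None
    return float(x[idx[0]])

# Illustrative (hypothetical) Y-read-fraction grid and mixture parameters.
x = np.linspace(0.0, 0.02, 5000)
bimodal = mixture_pdf(x, [0.5, 0.5], [0.002, 0.008], [0.0005, 0.001])
unimodal = mixture_pdf(x, [0.85, 0.15], [0.002, 0.003], [0.001, 0.002])

print(find_cutoff(x, bimodal))   # a value between the two peaks
print(find_cutoff(x, unimodal))  # None: no local minimum exists
```

Returning None instead of indexing blindly lets the caller fail with a clear message or fall back to a manually chosen cutoff.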

@rgiannico
Author

rgiannico commented Nov 30, 2018

Ok thank you @leraman ,
Great! It's working now, there is a local minimum and the reference has been produced!
image
Two more questions to better understand how important this gender bias in the training samples is:

  1. Do you think I can use this "max_iter=10000, tol=1e-99" reference for the prediction step without further code modifications, or could it lead to wrong predictions on test samples (I suppose mostly for fetal sex prediction)?
  2. I had also planned some more female fetus training samples; they are already extracted and ready for sequencing. Do you suggest proceeding with sequencing and just using the new parameters for reference creation to fix the bias? Or do you strongly suggest NOT adding more female fetuses to the training pool, to avoid feeding the gender bias?

Thank you

@leraman
Collaborator

leraman commented Nov 30, 2018

  1. You can, but don't forget the reg_covar=1e-99. This won't lead to any 'wrong' predictions. The gender prediction is only used for the reference creation (for NIPT anyway): for the autosomal reference, all samples are used, however, only females are used for the gonosomal reference. For you, this only implies that fewer female samples will be used for the gonosomal reference than what's actually present in your set.
  2. Well, the more reference samples the better I guess, yet, I would opt for male feti: we noted that normalization performance generally (and slightly) increased when using both female and male feti compared to using e.g. only female or only male feti.

Good luck!
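As a rough sketch of the selection rule in point 1 (in NIPT mode, all samples go into the autosomal reference and only predicted females go into the gonosomal reference), assuming genders are already assigned. The function name and sample names are hypothetical and do not come from the WisecondorX code base.

```python
def reference_pools(genders):
    """Split samples into autosomal and gonosomal reference pools.

    genders maps a sample name to 'F' or 'M'. Every sample contributes to
    the autosomal pool; only samples labelled 'F' contribute to the
    gonosomal pool, mirroring the NIPT behaviour described above.
    """
    autosomal = sorted(genders)
    gonosomal = sorted(name for name, sex in genders.items() if sex == "F")
    return autosomal, gonosomal

# Hypothetical example: one sample predicted male, two predicted female.
auto, gono = reference_pools({"s1": "F", "s2": "M", "s3": "F"})
print(auto)  # ['s1', 's2', 's3']
print(gono)  # ['s1', 's3']
```

Under this rule, a female sample that is misclassified as male is only lost from the gonosomal pool; it still contributes to the autosomal reference.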

@chantisakee

Hi, I have the same error as you.
Could you please tell me what you finally did to fix the code?

Thanks in advance :)

@leraman
Collaborator

leraman commented Sep 11, 2019

Hi @chantisakee

Which version are you using? How many samples are in your reference? Did you include both male and female feti?

@rgiannico
Author

Hi @chantisakee ,
With the current version you should not get this error, because leraman added the line gmm = GaussianMixture(n_components=2, covariance_type='full', reg_covar=1e-99, max_iter=10000, tol=1e-99) to the wisecondorX/newref_tools.py script to make it more 'stringent' when discerning male from female feti.

I had this error because my training samples had an unbalanced fetal sex distribution (too many female feti compared to male feti, or vice versa).
I suggest you use the latest WisecondorX version and make sure your training samples have a more balanced fetal sex distribution.

( p.s: nice dog though :P ^^ )

@chantisakee

chantisakee commented Sep 11, 2019

  1. I'm quite sure I'm using the latest version, and I found that newref.py had already been modified as you described, but I still get the same error as in this issue.
  2. Unfortunately, I have only 10 healthy samples for creating the reference set. Is that enough for using WisecondorX?
  3. My goal is to find copy number variations, not NIPT, so my input samples are human WGS, not maternal cfDNA (i.e. from a pregnant woman), and I'm not sure about the gender of my input data.

@leraman
Collaborator

leraman commented Sep 11, 2019

Hi @chantisakee

I believe 10 samples might be too few for the Gaussian mixture model to work reliably. I'll implement a workaround so you can make a reference anyway.

@chantisakee

Yeah, I found that the version I had already downloaded was modified ;w; but I still got the error. I'm not sure whether it comes from my input data. Unfortunately, I have only 10 healthy samples for reference creation, and the gender of my input samples is unknown. My goal with this software is to find CNVs in human WGS data.
Btw, thanks for your answer ^-^ and for the compliment about my dog lol :)

@chantisakee

Thank you very much for your help :)

@leraman leraman mentioned this issue Sep 11, 2019
@leraman
Collaborator

leraman commented Sep 11, 2019

Hi @chantisakee

I've updated WisecondorX. You can download the latest version using

pip install -U git+https://github.com/CenterForMedicalGeneticsGhent/WisecondorX

During reference creation, you can now manually set the chromosome Y fraction cutoff using --yfrac, which overrules Gaussian mixture modeling. I'm guessing (not sure) you only have female samples, so I would try --yfrac 1.
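The effect of a manual cutoff can be sketched as follows. This is an illustration only, not the WisecondorX source: the function name, the strict `>` comparison, and the sample values are assumptions. It shows why a cutoff of 1 labels every sample female, since a Y-read fraction cannot exceed 1.

```python
def classify_by_y_fraction(y_fractions, yfrac=None, model_cutoff=None):
    """Label samples 'M' or 'F' by comparing their Y-read fraction to a cutoff.

    A manually supplied yfrac takes precedence over a model-derived cutoff.
    The strict '>' comparison is an assumption made for this sketch.
    """
    cutoff = yfrac if yfrac is not None else model_cutoff
    if cutoff is None:
        raise ValueError("no cutoff available: pass yfrac or model_cutoff")
    return {name: ("M" if frac > cutoff else "F")
            for name, frac in y_fractions.items()}

# Hypothetical Y-read fractions for three samples.
fractions = {"s1": 0.0004, "s2": 0.0061, "s3": 0.0005}

print(classify_by_y_fraction(fractions, model_cutoff=0.003))
# {'s1': 'F', 's2': 'M', 's3': 'F'}
print(classify_by_y_fraction(fractions, yfrac=1))
# {'s1': 'F', 's2': 'F', 's3': 'F'}
```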

@chantisakee

Hi @leraman,
thanks so much for your help :}. The reference creation step now works fine,
but there are still some problems in the prediction step, and I got this error message:

[INFO - 2019-09-11 21:48:50]: Starting CNA prediction
[INFO - 2019-09-11 21:48:50]: Importing data ...
[INFO - 2019-09-11 21:48:51]: Normalizing autosomes ...
[INFO - 2019-09-11 21:50:05]: Normalizing gonosomes ...
[WARNING - 2019-09-11 21:51:16]: Non-numeric values found in weights -- reference too small. Circular binary segmentation and z-scoring will be unweighted
[INFO - 2019-09-11 21:51:16]: Executing circular binary segmentation ...
Error in parse_con(txt, bigint_as_char) :
lexical error: malformed number, a digit is required after the minus sign.
gender": "F", "results_r": [[-Infinity, -Infinity, -Infinity
(right here) ------^
Calls: read_json ... parse_json -> parse_and_simplify -> parseJSON -> parse_con
Execution halted
[CRITICAL - 2019-09-11 21:51:17]: Rscript failed: Command '['Rscript', '/tarafs/biobank/data/modules/.local/easybuild/software/Miniconda3/4.4.10/envs/wisecondorX/lib/python2.7/site-packages/wisecondorX/include/CBS.R', '--infile', '/tarafs/biobank/bio0001-human/NIPT/WISECONDORX/Thalassemia/DownSampling/script/HS06006_CBS_tmp_01.json']' returned non-zero exit status 1

What should I do?

Thanks,
Chantisa

PS: Do I have to create a new topic?

@leraman
Collaborator

leraman commented Sep 12, 2019

Can you take a look at your .npz files? Are you sure they are not empty? Which reference genome did you use during mapping?
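One generic way to check this is to list what an archive contains with NumPy. The sketch below assumes nothing about the internal layout of WisecondorX files: it only reports each stored array's shape, dtype and whether it is empty, and checks for keys the caller expects. The helper names are hypothetical.

```python
import numpy as np

def summarize_npz(path):
    """Map each array name in a .npz archive to (shape, dtype, is_empty)."""
    summary = {}
    # allow_pickle=True is needed when the archive stores Python objects;
    # only use it on files you created yourself or otherwise trust.
    with np.load(path, allow_pickle=True) as archive:
        for key in archive.files:
            arr = archive[key]
            summary[key] = (arr.shape, str(arr.dtype), arr.size == 0)
    return summary

def missing_keys(path, required):
    """Return the names in `required` that are absent from the archive."""
    with np.load(path, allow_pickle=True) as archive:
        return [key for key in required if key not in archive.files]
```

For example, `summarize_npz("Sample_01.npz")` would return a dictionary with one entry per stored array, and an entry ending in `True` points to an empty array.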

@chantisakee

Hi leraman, sorry for the late answer.
Yes, you are right! I had prepared the test samples incorrectly, and now it works fine :).

Anyway, I was just wondering: what is the minimum bin size for copy number variation prediction with WisecondorX? Can I go down to 2000 bp?

@leraman
Collaborator

leraman commented Sep 23, 2019

It depends on your sequencing depth. WisecondorX was developed for bin sizes of 15 kb and up, but if your coverage is >1x you might get good results at 2000 bp. Running time will increase, though.
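A back-of-the-envelope calculation shows why depth matters here. Assuming coverage is defined as sequenced bases divided by genome size, the expected number of reads per bin is roughly coverage × bin size / read length. The 100 bp read length and the Poisson noise model below are assumptions for illustration, not WisecondorX parameters.

```python
import math

def expected_reads_per_bin(coverage, bin_size, read_length):
    """Approximate mean read count per bin: coverage * bin_size / read_length."""
    return coverage * bin_size / read_length

def poisson_cv(mean_reads):
    """Coefficient of variation of an idealized Poisson count with this mean."""
    return 1.0 / math.sqrt(mean_reads)

# Assumed 100 bp reads at 1x coverage, for two bin sizes.
for bin_size in (2000, 15000):
    reads = expected_reads_per_bin(1.0, bin_size, 100)
    print(bin_size, reads, round(poisson_cv(reads), 3))
# 2000 20.0 0.224
# 15000 150.0 0.082
```

Under these assumptions, a 2000 bp bin at 1x holds about 20 reads, so sampling noise alone is around 22% per bin, compared with about 8% for a 15 kb bin.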

@chantisakee

Thanks for your suggestion @leraman

So I tried setting the bin size to 2000 bp during reference creation. After that, I ran the CNA prediction step and got this:

[INFO - 2019-09-23 10:46:59]: Starting CNA prediction
[INFO - 2019-09-23 10:46:59]: Importing data ...
Traceback (most recent call last):
  File "/tarafs/biobank/data/modules/.local/easybuild/software/Miniconda3/4.4.10/envs/wisecondorX/bin/WisecondorX", line 12, in <module>
    sys.exit(main())
  File "/tarafs/biobank/data/modules/.local/easybuild/software/Miniconda3/4.4.10/envs/wisecondorX/lib/python2.7/site-packages/wisecondorX/main.py", line 400, in main
    args.func(args)
  File "/tarafs/biobank/data/modules/.local/easybuild/software/Miniconda3/4.4.10/envs/wisecondorX/lib/python2.7/site-packages/wisecondorX/main.py", line 155, in tool_test
    if not ref_file['is_nipt']:
  File "/tarafs/biobank/data/modules/.local/easybuild/software/Miniconda3/4.4.10/envs/wisecondorX/lib/python2.7/site-packages/numpy/lib/npyio.py", line 262, in __getitem__
    raise KeyError("%s is not a file in the archive" % key)
KeyError: 'is_nipt is not a file in the archive'
