Requesting information on how to create sample and panel files. #9

Open
vlakhujani opened this issue Mar 16, 2020 · 3 comments

@vlakhujani

Where can I find more information on how to create the panel and sample files?

I went through the paper, and it lists these inputs:

  • (1) Genome Analysis Toolkit (GATK) DoC interval summary files
  • (2) a panel design containing target exons, and
  • (3) a sample file with gender and/or midpool groupings.

The first file is not mentioned in the GitHub README.

I am really confused. Please help.

@theodorc
Owner

(1) We use version 3 of the GATK software from the Broad Institute to compute the depth of coverage (DoC) for a given BAM file.
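
As a rough illustration of that step, here is a minimal Python sketch that assembles a GATK 3 `DepthOfCoverage` command line. The jar name, file paths, and the helper `depth_of_coverage_cmd` are hypothetical placeholders, and the exact options used for this project are not stated in this thread, so treat it as one plausible invocation rather than the project's own. When an interval list is passed with `-L`, GATK 3 writes a `<prefix>.sample_interval_summary` file next to its other outputs.

```python
def depth_of_coverage_cmd(bam, reference, intervals, out_prefix,
                          gatk_jar="GenomeAnalysisTK.jar"):
    """Return the argument list for a GATK 3 DepthOfCoverage run.

    All paths and the jar name are placeholders (assumptions); adjust
    them to your own installation.
    """
    return [
        "java", "-jar", gatk_jar,
        "-T", "DepthOfCoverage",   # GATK 3 walker name
        "-R", reference,           # reference FASTA
        "-I", bam,                 # input BAM file
        "-L", intervals,           # target intervals (the panel regions)
        "-o", out_prefix,          # prefix shared by all output files
    ]


def interval_summary_path(out_prefix):
    """Path of the per-interval summary that GATK 3 writes for a prefix."""
    return out_prefix + ".sample_interval_summary"


cmd = depth_of_coverage_cmd(
    bam="sample1.bam",
    reference="reference.fasta",
    intervals="panel_targets.interval_list",
    out_prefix="GATK_DoC/sample1.DATA",
)
summary = interval_summary_path("GATK_DoC/sample1.DATA")
# Pass `cmd` to subprocess.run(cmd, check=True) to actually run GATK.
```

The sketch only builds the argument list, so it can be inspected or logged before anything is executed.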

@vlakhujani
Author

vlakhujani commented Mar 20, 2020

@theodorc

I am looking at the usage doc. Where is the GATK v3 file used as input?

Additionally, how do I create the panel file?

Exon_Target          Gene_Exon      Call_CNV  RefSeq
1:1220087-1220186    SNP_1          N         rs2144440
1:3083663-3083762    SNP_2          N         rs2651899
1:3611843-3611942    SNP_3          N         rs3765731
1:6279321-6279420    RNF207-001_18  N         rs846111
1:8487274-8487373    SNP_4          N         rs301797
1:11850737-11850955  MTHFR-001_11   Y         NM_005957_cds_0
1:11851264-11851383  MTHFR-001_10   Y         NM_005957_cds_1
1:11852335-11852436  MTHFR-001_9    Y         NM_005957_cds_2
1:11853964-11854146  MTHFR-001_8    Y         NM_005957_cds_3

What does the Gene_Exon column contain: SNP IDs or gene/exon IDs? Also, does the "RefSeq" column contain dbSNP rs IDs? I also see NM IDs (transcript IDs) there.

And finally, the Call_CNV column contains Y/N values. How do I make that decision?

@theodorc
Owner

theodorc commented Apr 1, 2020

Sorry for the late response. I hope the comments below help.

  1. For GATK, see the config file. It has a variable that specifies the directory (and file name format) of your GATK Depth of Coverage files: GATKDIR=GATK_DoC/[SAMPLE_FCLBC].DATA.sample_interval_summary

  2. You create the panel file yourself in your favorite editor. It is usually based on the capture design you used for sequencing. For example, a cancer panel will contain cancer genes and the coordinates of their target exons.

  3. The Gene_Exon column is the name of the target exon. In the example, I used the gene name MTHFR, -001 for the transcript ID, and _11 for the exon. The same idea applies to the RefSeq column.

  4. Finally, Call_CNV designates whether you want to include a given target in the analysis. Usually you set it to N if you know the target is not reliable in the data produced (i.e. the target is too small or its data is known to be noisy).
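
To make points 2 to 4 concrete, here is a minimal Python sketch that writes a panel file with the four columns from the example above. The helper names `panel_row` and `write_panel` are hypothetical, and the tab delimiter is an assumption, since the thread does not say how columns must be separated. The Y/N choice is left to you as a plain `include` flag, mirroring the rule of thumb in point 4.

```python
import csv
import io

# Column names taken from the example panel in this thread.
HEADER = ["Exon_Target", "Gene_Exon", "Call_CNV", "RefSeq"]


def panel_row(chrom, start, end, gene_exon, refseq, include=True):
    """Build one panel row.

    `include` is your own judgement: pass False for targets you know
    to be unreliable (too small, noisy data), which yields Call_CNV = N.
    """
    call_cnv = "Y" if include else "N"
    return [f"{chrom}:{start}-{end}", gene_exon, call_cnv, refseq]


def write_panel(rows, handle):
    """Write header and rows; tab-separated output is an assumption."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Two rows copied from the example panel, including their Call_CNV values.
rows = [
    panel_row("1", 1220087, 1220186, "SNP_1", "rs2144440", include=False),
    panel_row("1", 11850737, 11850955, "MTHFR-001_11", "NM_005957_cds_0"),
]

buffer = io.StringIO()
write_panel(rows, buffer)
panel_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, open a file with `open("panel.txt", "w", newline="")` and pass it to `write_panel`; the file name here is only a placeholder.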
