Bradysia coprophila is a fungus gnat (fly) that carries germline restricted chromosomes (GRCs or 'L' chromosomes). We sequenced the germ tissue in B. coprophila and identified the GRCs using K-mer and coverage based techniques. We then compared the genes on the GRCs to those in the core genome (i.e. autosomes and X chromosome) to explore how the GRCs evolved in this species. The code in this repository was used in the analyses in this manuscript (https://doi.org/10.1101/2021.02.08.430288).
Within this repository, L is a synonym of germ-line restricted chromosome (GRC).
Later paths should be replaced by URLs in public repositories.
-
raw reads: ENA project PRJEB44837 https://www.ebi.ac.uk/ena/browser/view/PRJEB44837?show=reads
-
data used to generate all figures in manuscript: tables/figure_data.tar.gz
-
softmasked genome (Illumina):
/data/ross/mealybugs/analyses/sciara_coprophila/18_repeat/repeats/Scop_repeatmask/scop.spades2.min1kb.trimmed.fasta.masked
-
annotation:
/data/ross/mealybugs/analyses/sciara_coprophila/20_braker/braker.gff3
-
pacbio genome:
/data/ross/mealybugs/analyses/sciara_coprophila/Pacbio/4_racon2_sr/sciara.germ.pb.illumina1.rb.racon6pe.fasta
-
RNAseq of germ and soma (Males and Females):
/data/ross/mealybugs/raw/11791_Ross_Laura/raw_data/20190812/
and Mgerm/body or Fgerm/body directory -
all vs all blasts
- genes (nucleotides)
data/genome_wide_paralogy/genes_all_vs_all.blast
- proteins (transcribed genes)
data/genome_wide_paralogy/proteins_all_vs_all.blast
- genes (nucleotides)
-
repeatmodeler/masker for genome annotation:
/data/ross/mealybugs/analyses/Sciara-L-chromosome/data/repeats
Below is a list of manuscript sections containing analysis documentation and scripts used for those analyses
- associated data:
- table linking Illumina contigs to annotated genes with assignments:
data/gene.assignment.tab.tsv
- associated r script:
scripts/gene.num.tab.R
- associated r script:
- gene ID and mean coverage for that gene:
data/gene_cov_table.tsv
- analysis documentation
- associated scripts
- table with scores and assignments for all Illumina contigs
data/scaffold_assignment_tab_full.tsv
-
associated data
- table with gene pairs in reciprocal blasts and gene cov, percent alignment between the blast pairs, length of genes, and assignments for genes. Before filtering based or gene/alignment length:
data/ntgene_recip_blast_cov.tsv
- table with all info from paralog exploration, paralog ID, cov, type, and paralog freq after filtering based on gene/alignment length :
data/filtered_paralog_tab.tsv
- table with gene pairs in reciprocal blasts and gene cov, percent alignment between the blast pairs, length of genes, and assignments for genes. Before filtering based or gene/alignment length:
-
associated data for colinearity analysis
- table of colinear blocks between chromosomes. Contains gene ID, chromosome assignment, scf in Illumina, scf in colinear block, block info, order in block and paralog ID, and assignment, and mean gene cov:
data/full_colinear_tab.tsv
- associated r script:
scripts/colinear_exploration.R
- table of colinear blocks between chromosomes. Contains gene ID, chromosome assignment, scf in Illumina, scf in colinear block, block info, order in block and paralog ID, and assignment, and mean gene cov:
-
associated data
- per gene coverage:
data/gene.cov.braker.annotation.tsv
-
associated scripts
-
associated data
- Google drive "Sciara_L_chromosome" (written documents; you should have access if you are supposed to have an access)
- Code that decides about assigments what is L / Lc / Lk / Lp / AX / AXp (we can latter one separate A and X using either male coverage or the reference genome). In total the assigment fractions now are (Mbp):
L | Lc | A | Ac | X | Xc | NA |
---|---|---|---|---|---|---|
154.1 | 6.8 | 162.4 | 9 | 52.9 | 2.4 | 10.1 |