SexChrLab/XYalign

Background

The high degree of similarity between gametologous sequences on the sex chromosomes can lead to the misalignment of sequencing reads and substantially affect variant calling. Here we present XYalign, a new tool that (1) quickly infers sex chromosome ploidy in NGS data, (2) remaps reads based on the inferred sex chromosome complement of the individual, and (3) outputs quality, depth, and allele-balance metrics across chromosomes.

Citation

Webster TH; Couse M; Grande BM; Karlins E; Phung T; Richmond PA; Whitford W; Wilson MA. 2019. Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data. GigaScience 8(7): giz074. DOI: https://doi.org/10.1093/gigascience/giz074

If you use XYalign or discuss/correct for bias in mapping on the sex chromosomes, please cite this article.

Using XYalign

See full documentation at Read The Docs -- Under construction

Post any questions you have at the XYalign Google Group

Post any bugs/issues to XYalign's issues page on GitHub

Quick start and examples

Installing XYalign

XYalign has only been tested on Linux and Mac systems. We recommend that users install and manage XYalign (and programming environments) using Conda. To do this:

  1. First download and install either Anaconda or Miniconda (both work well, Miniconda is a lightweight version of Anaconda).

  2. Finish installation with the following commands to install XYalign and all of its dependencies in an environment called "xyalign_env":


conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n xyalign_env xyalign

  3. Load your new environment (containing XYalign and all related programs) with:

source activate xyalign_env

See Bioconda and Conda documentation for more information.

Prepare a sex-specific reference genome

Assuming XYalign is installed correctly with all associated programs and is available in your PATH (see "Installing XYalign" above), you can use the command (assume the following is on one line):

xyalign --PREPARE_REFERENCE --ref reference.fasta
--xx_ref_out /path/to/reference.XXonly.fasta
--xy_ref_out /path/to/reference.XY.fasta
--x_chromosome chrX
--y_chromosome chrY
--reference_mask mask.bed
--output_dir output_directory

In the above command, reference.fasta is the original reference genome; /path/to/reference.XXonly.fasta and /path/to/reference.XY.fasta are the full paths to and names of the desired output references for XX and XY samples, respectively. chrX and chrY are the exact names of the X and Y chromosome scaffolds in the assembly. mask.bed is a BED file containing regions that should be masked in both output fastas. output_directory is the name of a directory into which the logfile and other intermediate files will be deposited.
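
If you would rather not type the command on one line, it can be split with backslash continuations. The sketch below wraps the same example command in a small dry-run helper so it can be inspected before anything is executed. The `run` and `prepare_reference` functions and the `DRY_RUN` variable are illustrative assumptions, not part of XYalign; the flags and placeholder paths are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of XYalign): print the command when
# DRY_RUN is 1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The PREPARE_REFERENCE example, split across lines with backslashes.
# All file names and paths are the placeholders used in the text.
prepare_reference() {
  run xyalign --PREPARE_REFERENCE --ref reference.fasta \
    --xx_ref_out /path/to/reference.XXonly.fasta \
    --xy_ref_out /path/to/reference.XY.fasta \
    --x_chromosome chrX \
    --y_chromosome chrY \
    --reference_mask mask.bed \
    --output_dir output_directory
}

prepare_reference
```

With `DRY_RUN=0` set in the environment, the helper runs the command instead of printing it. The printed form does not preserve quoting, so paths containing spaces would need extra care.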

Analyze a single bam file to explore sex chromosome content, etc.

You can use the command (assume the following is on one line):

xyalign --CHARACTERIZE_SEX_CHROMS
--ref reference.fasta
--bam sample1.bam
--output_dir sample1_results
--sample_id sample1
--cpus 4
--window_size 5000
--chromosomes chr19 chrX chrY
--x_chromosome chrX
--y_chromosome chrY

In the above command, reference.fasta is the full path to the reference genome used to generate the bam file, sample1.bam is the full path to the bam file, sample1_results is our desired output directory, and sample1 is the name of our sample. We're using four cores (--cpus 4) and 5 kb nonoverlapping windows (--window_size 5000) for analysis. We're analyzing three chromosomes named chr19, chrX, and chrY, and our X and Y scaffolds in the reference are named chrX and chrY.

Our output of interest will be in sample1_results/plots and sample1_results/results. Tables (.csv) of depth and mapq measurements per window will be in sample1_results/bed with "full_dataframe" in their file names. BED files containing windows passing ("highquality") and failing ("lowquality") filtering thresholds will also be in sample1_results/bed.
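
As a quick way to find these files after a run, the sketch below lists everything in the bed directory whose name contains a given substring. `list_bed_outputs` is a hypothetical helper, not something XYalign provides, and it relies only on the naming described above ("full_dataframe", "highquality", "lowquality"); exact file names may differ between versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of XYalign): list files in <output_dir>/bed
# whose names contain a given substring.
list_bed_outputs() {
  # $1: XYalign output directory, $2: substring to look for in file names
  find "$1/bed" -maxdepth 1 -type f -name "*$2*" | sort
}

# Only query the example output directory if it exists.
if [ -d sample1_results/bed ]; then
  list_bed_outputs sample1_results full_dataframe   # per-window depth/mapq tables
  list_bed_outputs sample1_results highquality      # windows passing filters
  list_bed_outputs sample1_results lowquality       # windows failing filters
fi
```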

Relevant flags for filtering variants include:

	--variant_site_quality
	--variant_genotype_quality
	--variant_depth

Relevant flags for filtering windows include:

	--mapq_cutoff
	--min_depth_filter
	--max_depth_filter
	--min_variant_count

You can get details about these (and more) flags with the command:

	xyalign -h

Analyze multiple bam files to determine sex chromosome complement, identify sex chromosome scaffolds, etc.

xyalign --CHROM_STATS
--chromosomes chr1 chr8 chr19 chrX chrY
--bam sample1.bam sample2.bam sample3.bam
--ref null
--sample_id bam_comparison1
--output_dir bam_comparison1_results

In the above command, we're analyzing five chromosomes in three different bam files. We provide null as our reference because it's not used in these analyses. --sample_id now becomes the name of our comparison (it's used in file names, etc.) and our output will be located in bam_comparison1_results/results. We could also use --use_counts to force XYalign to simply use counts of reads on each chromosome in comparisons.
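
For comparisons with a varying number of bam files, the command can be assembled from a comparison name and a list of files. The sketch below only prints the resulting command for review; `chrom_stats_cmd` is a hypothetical helper, and it assumes `--use_counts` is a bare switch with no argument, as the description above suggests (check `xyalign -h` to confirm).

```shell
#!/bin/sh
# Hypothetical helper (not part of XYalign): print a CHROM_STATS command
# for a named comparison and any number of bam files.
chrom_stats_cmd() {
  comparison_id="$1"   # used for --sample_id and the output directory name
  shift                # remaining arguments are the bam files
  echo xyalign --CHROM_STATS \
    --chromosomes chr1 chr8 chr19 chrX chrY \
    --bam "$@" \
    --ref null \
    --sample_id "$comparison_id" \
    --output_dir "${comparison_id}_results" \
    --use_counts
}

chrom_stats_cmd bam_comparison1 sample1.bam sample2.bam sample3.bam
```

To execute rather than print, drop the `echo` inside the function. The chromosome list is fixed to the five from the example and would need editing for other assemblies.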