The high degree of similarity between gametologous sequences on the sex chromosomes can lead to the misalignment of sequencing reads and substantially affect variant calling. Here we present XYalign, a new tool that (1) quickly infers sex chromosome ploidy in NGS data, (2) remaps reads based on the inferred sex chromosome complement of the individual, and (3) outputs quality, depth, and allele-balance metrics across chromosomes.
Webster TH; Couse M; Grande BM; Karlins E; Phung T; Richmond PA; Whitford W; Wilson MA. 2019. Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data. GigaScience 8(7): giz074. DOI: https://doi.org/10.1093/gigascience/giz074
If you use XYalign or discuss/correct for bias in mapping on the sex chromosomes, please cite this article.
See full documentation at Read The Docs -- Under construction
Post any questions you have at the XYalign Google Group
Post any bugs/issues to XYalign's issues page on Github
XYalign has only been tested on Linux and Mac systems. We recommend that users install and manage XYalign (and programming environments) using Conda. To do this:
- First download and install either Anaconda or Miniconda (both work well; Miniconda is a lightweight version of Anaconda).
- Finish installation with the following commands, which install XYalign and all of its dependencies in an environment called "xyalign_env":
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n xyalign_env xyalign
- Load your new environment (containing XYalign and all related programs) with:
source activate xyalign_env
See Bioconda and Conda documentation for more information.
Assuming XYalign is installed correctly with all associated programs and is available in your PATH (see "Installing XYalign" above), you can prepare separate XX and XY versions of a reference genome with the command (assume the following is on one line):
xyalign --PREPARE_REFERENCE --ref reference.fasta
--xx_ref_out /path/to/reference.XXonly.fasta
--xy_ref_out /path/to/reference.XY.fasta
--x_chromosome chrX
--y_chromosome chrY
--reference_mask mask.bed
--output_dir output_directory
In the above command, reference.fasta is the original reference genome, and /path/to/reference.XXonly.fasta and /path/to/reference.XY.fasta are the full paths to and names of the desired output references for XX and XY samples, respectively. chrX and chrY are the exact names of the X and Y chromosome scaffolds in the assembly. mask.bed is a BED file containing regions that should be masked in both output fastas. output_directory is the name of a directory into which the logfile and other intermediate files will be deposited.
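Because the command above has to be entered as a single line, it can help to write it with backslash line continuations in a shell script. The sketch below does that using only the flags and placeholder file names shown above. The run helper and its RUN variable are hypothetical conveniences for this example, not part of XYalign: by default the command is only printed, and setting RUN=1 executes it.

```shell
# Print the command instead of running it unless RUN=1 is set.
# (run and RUN are assumptions made for this sketch, not XYalign features.)
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Each trailing backslash joins the next line, so the shell sees one command.
run xyalign --PREPARE_REFERENCE \
    --ref reference.fasta \
    --xx_ref_out /path/to/reference.XXonly.fasta \
    --xy_ref_out /path/to/reference.XY.fasta \
    --x_chromosome chrX \
    --y_chromosome chrY \
    --reference_mask mask.bed \
    --output_dir output_directory
```

Running the script as-is prints the assembled one-line command, which is a cheap way to check paths and scaffold names before starting a real run.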
To characterize the sex chromosomes in a single sample, you can use the command (assume the following is on one line):
xyalign --CHARACTERIZE_SEX_CHROMS
--ref reference.fasta
--bam sample1.bam
--output_dir sample1_results
--sample_id sample1
--cpus 4
--window_size 5000
--chromosomes chr19 chrX chrY
--x_chromosome chrX
--y_chromosome chrY
In the above command, reference.fasta is the full path to the reference genome used to generate the bam file, sample1.bam is the full path to the bam file, sample1_results is our desired output directory, and sample1 is the name of our sample. We're using four cores (--cpus 4) and 5 kb nonoverlapping windows for analysis. We're analyzing three chromosomes named chr19, chrX, and chrY, and our X and Y scaffolds in the reference are named chrX and chrY.
Our output of interest will be in sample1_results/plots and sample1_results/results. Tables (.csv) of depth and mapq measurements per window will be in sample1_results/bed with "full_dataframe" in their file names. BED files containing windows passing ("highquality") and failing ("lowquality") filtering thresholds will also be in sample1_results/bed.
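As a quick check after a run, the per-window files described above can be located by the substrings in their names. The following is a minimal sketch: list_window_outputs is a hypothetical helper written for this example, and it assumes only that the files sit in the bed subdirectory and contain "full_dataframe", "highquality", or "lowquality" somewhere in their names.

```shell
# List per-window tables and BED files inside an XYalign output directory.
# (list_window_outputs is a hypothetical helper, not an XYalign command.)
list_window_outputs() {
    # $1: output directory passed to --output_dir (e.g. sample1_results)
    bed_dir="$1/bed"
    for pattern in full_dataframe highquality lowquality; do
        for f in "$bed_dir"/*"$pattern"*; do
            # An unmatched glob stays literal, so check the file exists.
            if [ -e "$f" ]; then
                echo "$pattern: $f"
            fi
        done
    done
    return 0
}

list_window_outputs sample1_results
```

If nothing is printed, the directory does not contain any matching files, which usually means the run has not finished or a different --output_dir was used.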
Relevant flags for filtering variants include:
--variant_site_quality
--variant_genotype_quality
--variant_depth
Relevant flags for filtering windows include:
--mapq_cutoff
--min_depth_filter
--max_depth_filter
--min_variant_count
You can get details about these (and more) flags with the command:
xyalign -h
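The filtering flags can be appended to the CHARACTERIZE_SEX_CHROMS command shown earlier. The sketch below wraps that command in a small function so it can be reused per sample. Several things here are assumptions: characterize_sample and the XYALIGN variable are hypothetical, each filter flag is assumed to take a single numeric value, and the numbers are placeholders for illustration rather than recommendations. Check xyalign -h for the actual meaning and default of each flag.

```shell
# XYALIGN defaults to a dry run that only prints the command.
# Set XYALIGN=xyalign to execute it. (Hypothetical variable for this sketch.)
XYALIGN="${XYALIGN:-echo xyalign}"

characterize_sample() {
    # $1: sample name; expects $1.bam in the current directory
    # $XYALIGN is left unquoted on purpose so "echo xyalign" splits in two.
    $XYALIGN --CHARACTERIZE_SEX_CHROMS \
        --ref reference.fasta \
        --bam "$1.bam" \
        --output_dir "$1_results" \
        --sample_id "$1" \
        --cpus 4 \
        --window_size 5000 \
        --chromosomes chr19 chrX chrY \
        --x_chromosome chrX \
        --y_chromosome chrY \
        --variant_site_quality 30 \
        --variant_genotype_quality 30 \
        --variant_depth 4 \
        --mapq_cutoff 20
}

characterize_sample sample1
```

Keeping the filter values in one place like this makes it easier to apply the same thresholds to every sample in a project.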
Analyze multiple bam files to determine sex chromosome complement, identify sex chromosome scaffolds, etc.
xyalign --CHROM_STATS
--chromosomes chr1 chr8 chr19 chrX chrY
--bam sample1.bam sample2.bam sample3.bam
--ref null
--sample_id bam_comparison1
--output_dir bam_comparison1_results
In the above command, we're analyzing five chromosomes in three different bam files. We provide null as our reference because it's not used in these analyses. --sample_id now becomes the name of our comparison (it's used in file names, etc.) and our output will be located in bam_comparison1_results/results. We could also use --use_counts to force XYalign to simply use counts of reads on each chromosome in comparisons.
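For reference, here is one way the read-count variant of the comparison might be assembled for an arbitrary list of bam files. This is a sketch under two assumptions: chrom_stats_cmd is a hypothetical helper that only prints the command, and --use_counts is treated as a bare switch that takes no value.

```shell
# Print a CHROM_STATS command that compares read counts across bam files.
# (chrom_stats_cmd is a hypothetical helper; it does not run XYalign.)
chrom_stats_cmd() {
    # $1: comparison name; remaining arguments: bam files to compare
    name="$1"
    shift
    echo xyalign --CHROM_STATS \
        --use_counts \
        --chromosomes chr1 chr8 chr19 chrX chrY \
        --bam "$@" \
        --ref null \
        --sample_id "$name" \
        --output_dir "${name}_results"
}

chrom_stats_cmd bam_comparison1 sample1.bam sample2.bam sample3.bam
```

To run the comparison for real, remove the echo inside the function or paste the printed line into a shell where xyalign_env is active.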