This project is currently at an experimental stage and has not been validated by an experienced geneticist. Please use it at your own risk.
With the advent of stricter data privacy laws in many jurisdictions, it has become impossible for some researchers to use the Michigan Imputation Server to phase and impute genotype data. This project allows you to use the open source code behind the server on your local workstation or high-performance compute cluster, together with the 1000 Genomes Phase 3 v5 reference and the rest of the ENIGMA Imputation Protocol.
This document assumes that you have either Singularity or Docker installed on your system.
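If you are not sure which container platform is available on your system, a quick way to check is to ask each tool for its version; the exact version strings will of course differ from system to system.

```bash
# Check for Singularity (version 3.x or newer is expected)
singularity --version

# Or check for Docker
docker --version
```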
The container comes with all the software needed to run the imputation protocol. You can either run the protocol manually based on the official instructions, or use the step-by-step guide below.
The guide makes some suggestions for folder names (e.g. `mds`, `raw`, `qc`), but these can also be chosen freely. The only exceptions to this rule are the folders `cloudgene`, `hadoop`, `downloads`, `input` and `output` inside the working directory (at `/data` inside the container).
- You need to download the container file using one of the following commands. This will use approximately one gigabyte of storage.

| Container platform | Version | Command |
| --- | --- | --- |
| Singularity | 3.x | `wget http://download.gwas.science/singularity/imputation-protocol-latest.sif` |
| Docker | | `docker pull gwas.science/imputation-protocol:latest` |
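To confirm that the download completed, you can check that the image is present. This is just a sanity check; the exact size will vary slightly between releases.

```bash
# Singularity: the .sif file should be roughly one gigabyte
ls -lh imputation-protocol-latest.sif

# Docker: the image should appear in the local image list
docker image ls gwas.science/imputation-protocol
```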
- You will now need to create a working directory that can be used for intermediate files and outputs. This directory should be empty and should have sufficient space available. We will store the path of the working directory in the variable `working_directory`, and then create the new directory and some subfolders. Usually, you will only need to do this once, as you can re-use the working directory for multiple datasets. Note that this variable will only exist for the duration of your terminal session, so you should re-define it if you exit and then resume later.

export working_directory=/mnt/scratch/imputation
mkdir -p -v ${working_directory}/{raw,mds,qc}
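If you want to double-check the layout before continuing, listing the working directory should show the three subfolders created above; the path here matches the example value of `working_directory`.

```bash
# The working directory should now contain the mds, qc and raw subfolders
ls ${working_directory}
```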
- Copy your raw data to the `raw` subfolder of the working directory. If you have multiple `.bed` file sets that you want to process, copy them all.

cp -v my_sample.bed my_sample.bim my_sample.fam ${working_directory}/raw
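If you have several file sets, a small shell loop saves some typing. The dataset names below are placeholders; substitute your own file set names.

```bash
# Copy each PLINK file set (.bed/.bim/.fam) into the raw subfolder.
# "sample_a" and "sample_b" are placeholder dataset names.
for name in sample_a sample_b; do
    cp -v ${name}.bed ${name}.bim ${name}.fam ${working_directory}/raw
done
```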
- Next, start an interactive shell inside the container using one of the following commands. The `--bind` (Singularity) or `--volume` (Docker) parameters are used to make the working directory available inside the container at the path `/data`. This means that in all subsequent commands, we can use the path `/data` to refer to the working directory.

| Container platform | Command |
| --- | --- |
| Singularity | `singularity shell --hostname localhost --bind ${working_directory}:/data --bind /tmp imputation-protocol-latest.sif` |
| Docker | `docker run --interactive --tty --volume ${working_directory}:/data gwas.science/imputation-protocol:latest /bin/bash` |
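Once the shell is open, you can quickly confirm that the working directory was mounted correctly; the listing should show the subfolders created earlier.

```bash
# Run inside the container shell: /data should contain mds, qc and raw
ls /data
```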
- Inside the container, we will first go to `/data/mds` and run the script `enigma-mds` for your `.bed` file set. The script creates the files `mdsplot.pdf` and `HM3_b37mds2R.mds.csv`, which are summary statistics that you will need to share with your working group as per the ENIGMA Imputation Protocol. Note that this script will create all output files in the current folder, so you should use `cd` to change to the `/data/mds/sample` folder before running it. If you have multiple `.bed` file sets, you should run the script in a separate folder for each one. Otherwise the script may overwrite previous results when you run it again.

If you have just one dataset:

mkdir /data/mds/sample
cd /data/mds/sample
enigma-mds --bfile /data/raw/my_sample

Alternatively, for multiple datasets:

mkdir /data/mds/{sample_a,sample_b}

cd /data/mds/sample_a
enigma-mds --bfile /data/raw/sample_a

cd /data/mds/sample_b
enigma-mds --bfile /data/raw/sample_b
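After the script finishes, it can be worth checking that the two expected summary files were written before you move on. The paths below are for the single-dataset case; adjust the folder name for multiple datasets.

```bash
# Both summary files should exist in the folder where enigma-mds was run
ls -lh /data/mds/sample/mdsplot.pdf /data/mds/sample/HM3_b37mds2R.mds.csv
```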
- Next, we will set up our local instance of the Michigan Imputation Server. The `setup-hadoop` command will start a Hadoop instance on your computer, which consists of four background processes. When you are finished processing all your samples, you can stop them with the `stop-hadoop` command. If you are using Docker, these processes will be stopped automatically when you exit the container shell. The `setup-imputationserver` script will then verify that the Hadoop instance works, and install the 1000 Genomes Phase 3 v5 genome reference that will be used for imputation (around 15 GB of data, so this may take a while). If you are resuming analyses in an existing working directory and the Hadoop background processes are no longer running, you should re-run the setup commands. If they are still running, you can skip this step.

setup-hadoop --n-cores 8
setup-imputationserver

If you encounter any warnings or messages while running these commands, you should consult with an expert to find out what they mean and whether they may be important. However, processing will usually complete without issues, even if some warnings occur. If something important goes wrong, you will usually see a clear error message that contains the word "error". Please note that if the setup commands take more than an hour to run, that may also indicate that an error occurred.
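If you are unsure whether the Hadoop background processes from a previous session are still running, one way to check is to list the running Java processes. This assumes the `jps` tool from the JDK is available inside the container; if it is not, a plain `ps` works as a fallback.

```bash
# List Java processes; a running Hadoop instance typically shows several
# daemons (e.g. NameNode, DataNode and the YARN services)
jps

# Fallback: look for Hadoop-related processes directly
ps aux | grep -i hadoop
```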
- Next, go to `/data/qc` and run `enigma-qc` for your `.bed` file sets. This will drop any strand-ambiguous SNPs, then screen for low minor allele frequency, missingness and *Hardy-Weinberg equilibrium*, then remove duplicate SNPs (if necessary), and finally convert the data to sorted `.vcf.gz` format for imputation. The script places intermediate files in the current folder, and the final `.vcf.gz` files in `/data/input/my_sample`, where they can be accessed by the `imputationserver` script in the next step. Note that this script will create some output files in the current folder, so you should use `cd` to change to the `/data/qc/sample` folder (or similar) before running it. The input path is hard-coded, because the imputation server is quite strict and expects a directory with just the `.vcf.gz` files and nothing else, so to avoid any problems we create that directory automatically.

If you have just one dataset:

mkdir /data/qc/sample
cd /data/qc/sample
enigma-qc --bfile /data/raw/my_sample --study-name my_sample

Alternatively, for multiple datasets:

mkdir /data/qc/{sample_a,sample_b}

cd /data/qc/sample_a
enigma-qc --bfile /data/raw/sample_a --study-name sample_a

cd /data/qc/sample_b
enigma-qc --bfile /data/raw/sample_b --study-name sample_b
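Before starting the imputation, you can verify that the quality-controlled files ended up in the expected input folder. The exact file names depend on your data, but the folder should contain only `.vcf.gz` files.

```bash
# The imputation server reads its input from this hard-coded location
ls -lh /data/input/my_sample
```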
- Finally, run the `imputationserver` command with the correct sample population. The `--population` parameter can be set to `mixed` if the sample has multiple populations or the population is unknown. If a specific population (not `mixed`) is specified, the imputation server will compare the minor allele frequency of each variant in the sample to the population reference using a chi-squared test, and exclude outliers. If the population is `mixed`, this step is skipped. Regardless of the population parameter, the imputation step always uses the entire 1000 Genomes Phase 3 v5 genome reference. If you see any errors (such as `obvious strand flips detected`), you may need to follow one or more of the workarounds in the troubleshooting section.

imputationserver --study-name my_sample --population mixed
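If you processed several datasets in the previous steps, the command is simply repeated once per study name. The study names below are the placeholder datasets used earlier in this guide; adjust `--population` per dataset if the population is known.

```bash
# Run the imputation once per quality-controlled dataset
imputationserver --study-name sample_a --population mixed
imputationserver --study-name sample_b --population mixed
```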
This process will likely take a few hours, and once it finishes for all your `.bed` file sets, you can exit the container using the `exit` command. All outputs can be found in the working directory created earlier. The quality control report can be found at `${working_directory}/output/my_sample/qcreport/qcreport.html` (only if the population is not `mixed`), and the imputation results at `${working_directory}/output/my_sample/local`. The `.zip` files are encrypted with the password `password`.
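If you want to inspect the imputed data directly, the result archives can be unpacked with the password given above. The exact archive names depend on your data, so the loop below is just a sketch.

```bash
# Unpack every result archive for one study (run on the host, outside the container)
cd ${working_directory}/output/my_sample/local
for archive in *.zip; do
    unzip -P password "${archive}"
done
```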
- To merge all output files into a compact and portable `.zip` archive, the container includes the `make-archive` command. It will create a single output file at `${working_directory}/my_sample.zip` with all output files.

make-archive --study-name my_sample
Add the `--zstd-compress` option to the command to use a more efficient compression algorithm. This will take longer to run, but will create a smaller output file (around 70% smaller).
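For example, with the more efficient compression enabled:

```bash
# Slower, but produces a substantially smaller archive
make-archive --study-name my_sample --zstd-compress
```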
- Once you have finished processing all your datasets, you can stop all background processes of the imputation server with the `stop-hadoop` command. Then you can exit the container, move the output files to a safe location, and delete the working directory (in case you need the disk space).

stop-hadoop
exit

mv ${working_directory}/my_sample.zip /storage
rm -rI ${working_directory}
If your raw data is not yet a binary PLINK file set (`.bed`/`.bim`/`.fam`), you will need to convert it before you start. Usually, this is very straightforward with the `plink` command.

plink --ped my_sample.ped --map my_sample.map --make-bed --out my_sample
Error: No chunks passed the QC step. Imputation cannot be started!
This error happens when your input data uses a different genome reference from the one the imputation server expects (`hg19`), so the data needs to be lifted over to `hg19`. You can do this manually with LiftOver. To do it automatically, the container comes with the `check hg19` command, which is based on the RICOPILI command `buigue`.

check hg19 --bfile /data/raw/my_sample
The command will create a `.bed` file set at `/data/raw/my_sample.hg19` which will have the correct genome reference. You should now use this corrected file set to re-run the `enigma-qc` script, and then retry running the `imputationserver` command.
Error: More than 100 obvious strand flips have been detected. Please check strand. Imputation cannot be started!
If the `imputationserver` command fails with this error, then you will need to resolve strand flips in your data. To do that automatically, the container comes with the `check flip` command, which is based on the RICOPILI commands `checkflip4` and `check-bim`.

check flip --bfile /data/raw/my_sample
The command will create a `.bed` file set at `/data/raw/my_sample.check-flip` which will have all strand flips resolved. You should now use this corrected file set to re-run the `enigma-qc` script, and then retry running the `imputationserver` command.
Job execution failed: Velocity could not be initialized!
You likely have bad permissions in your working directory. You can either try to fix them, or start over with a fresh working directory.
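One possible way to reset the permissions, assuming your user owns the working directory, is to make everything readable and writable for yourself. This is a sketch, not a guaranteed fix.

```bash
# Make the working directory fully accessible to the current user
# (run on the host; adjust the path if you used a different location)
chmod -R u+rwX ${working_directory}
```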
Warning: At least one VCF allele code violates the official specification; other tools may not accept the file. (Valid codes must either start with a '<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or represent a breakend.)
This message occurs, for example, when you have indels in your raw data. You can ignore it, because the Michigan Imputation Server will remove these invalid variants in its quality control step.
This suggests that your Hadoop instance may not be accepting new jobs. The fastest way to solve this is to stop and delete the instance, and then to re-run the setup.
# stop and delete
stop-hadoop
rm -rf /data/hadoop
# re-run setup
setup-hadoop --n-cores 8
setup-imputationserver
Please delete the `/data/cloudgene` folder inside the container and run `setup-imputationserver` again.
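For example:

```bash
# Remove the cloudgene folder and redo the imputation server setup
rm -rf /data/cloudgene
setup-imputationserver
```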