-
Notifications
You must be signed in to change notification settings - Fork 22
Beginner's Guide to the Demonstration Analysis
##Introduction For demonstration purposes, we provide a set of BAM files to illustrate running of commands and pipelines. Complete running of all pipelines may take a few hours to several days. This tutorial assumes that you are evaluating the system on "consumer hardware" with limited resources. However, we highly recommend a system with at least 8 cpus, 64gb of ram, and three hard drives with at least 2TB of space each.
This tutorial also assumes that you have completed the GMS installation as detailed in the Beginner's Guide to Installation. For a brief demonstration of the GMS you may wish to instead start with the: Quick Tour in a Pre-configured Virtual Machine.
###Setting up the GMS installation for low resources
If you have already run the prime-system.pl
script you should skip to the section 'Running the GMS Pipelines' below. To run the GMS in a low resources environment, you should have used the "low_resources" and "memory" options described in the Install instructions or Beginner's Guide to Installation tutorial:
./setup/prime-system.pl --data=hcc1395_1tenth_percent --sync=tarball --low_resources --memory=Xgb --metadata=setup/metadata/b6e5b2395bab44289123e9114817a92d.dat
If you wish you may also set GENOME_USER_EMAIL to the GMS-user's email address in /etc/genome.conf. This is the email address where the user wishes to receive emails when open-lava jobs complete.
After you have made these changes, reboot or send reboot signal (if working remotely):
sudo reboot
###Using the down-sampled data Rather than run this tutorial on the complete demonstration dataset (as described in the GMS paper) we offer the option to use an artificially down-sampled data set, for instructional purposes only. These down-sampled data should have been downloaded along with the complete dataset during GMS installation. The data set was created as follows:
- Sample 0.1% of total reads from all chromosomes.
- Sample a similar number of reads from just chromosomes 21 and 22.
- Extract all reads aligned to a targeted set of 50 genes representing all chromosomes.
The reads from (1) ensure at least some minimum coverage across all windows of the genome. This is necessary to avoid errors from certain pipeline steps in whole genome somatic variation. The reads from (2) allow sufficient depth of a large enough region (chr21 and chr22) to identify some putative CNVs and SVs. The reads from (3) ensure sufficient depth for high-quality SNV and Indel calls to be made for at least some genes.
To install the demo data you should have chosen the "hcc1395_1tenth_percent" option described in the Install instructions or Beginner's Guide to Installation tutorial:
./setup/prime-system.pl --data=hcc1395_1tenth_percent --sync=tarball --low_resources --memory=Xgb --metadata=setup/metadata/b6e5b2395bab44289123e9114817a92d.dat
##Running the GMS pipelines ###Starting a build First, we will build the genotype microarray models with the following commands:
genome model build start "name='hcc1395-normal-snparray'"
genome model build start "name='hcc1395-tumor-snparray'"
This should look something like:
Once those have completed, let's start exome reference alignments:
genome model build start "name='hcc1395-normal-refalign-exome'"
genome model build start "name='hcc1395-tumor-refalign-exome'"
When exome reference alignments have completely successfully we can run the exome somatic variation pipeline:
genome model build start "name='hcc1395-somatic-exome'"
Again, once that is completed we can repeat the pattern for WGS reference alignments:
genome model build start "name='hcc1395-normal-refalign-wgs'"
genome model build start "name='hcc1395-tumor-refalign-wgs'"
And, then WGS somatic variation:
genome model build start "name='hcc1395-somatic-wgs'"
Similarly for RNAseq we will first run top-hat alignments and cufflinks expression estimation for both tumor and normal:
genome model build start "name='hcc1395-normal-rnaseq'"
genome model build start "name='hcc1395-tumor-rnaseq'"
Then calculate differential expression metrics:
genome model build start "name='hcc1395-differential-expression'"
Finally, when all builds above have succeeded we can launch the medseq (aka clinseq) build:
genome model build start "name='hcc1395-clinseq'"
###Tracking the progress of a build
Each time you run a genome model build start
command you may wish to follow the progress of the resulting build. This can be done using several linux, openlava, and genome commands. If started successfully, the build ID will have been reported upon running the genome model build start
command. You can use this id to 'view' progress of the build. If you didn't take note of it you can find it by using a genome model build list
command or by searching/browsing in the web viewer. Let's see what builds we currently have underway. In this case I have limited the columns of output for display purposes. Run without the --show
option to see a more detailed output. As you can see, there are several existing succeeded builds and two running builds.
genome model build list --show id,model_id,model_name,run_by,status
Next, using the build ID and genome model build view
command we can see details of any of these builds. In this case, we are looking at the report for a 'Running' exome reference alignment build. At top are general details about the build including the 'Data Directory' where all build data will be stored. See the documentation on the location and description of results files in GMS pipelines. The workflow section lists all steps in the pipeline and their current status. Finally, paths to error logs are shown for running steps and crashed steps (if any).
genome model build view 77bf344b12d94186b7a19191fcfe65ea
Each step in the workflow represents a job that is submitted to the job queue. With the openlava bjobs
command we can see what jobs have been submitted to the queue and their status. As you can see, at the moment there are 13 jobs in the queue, all running concurrently. You can also use bhosts
to get a general summary of status in the queue and the linux command top
to see what jobs are actually running on the system.
###Trouble-shooting a failed build
If a build fails it will be listed as "Failed" status when queried by genome model build list
, genome model build view
or if selected in the web viewer. As described above, a genome model build view
for the failed build id will list failed steps in the workflow and provide paths to detailed error log files. This is normally the starting point for trouble-shooting a failed build. The nature of the error may indicate a problem with the input data or a bug in the software. Bugs can be submitted to https://github.com/genome/gms/issues.
###Abandoning a running build
In some cases, for example running the demonstration data on a computer with limited resources, it can sometimes occur that a workflow enters a "hung state". This happens if there are parent workflow steps running that require child processes to succeed but insufficient job slots are available to start those child processes. The workflow may get into a state where it is waiting for additional job slots to become available that never will. The final results of the build workflow are non-stochastic. However the exact timing and order of workflow steps are subject to some stochasticity. Therefore, this problem can sometimes be resolved by simply abandoning and restarting the build. The other option is obviously to provide more resources. To abandon a build you can issue a command like:
genome model build abandon 77bf344b12d94186b7a19191fcfe65ea
Then restart the build as before with a command like:
genome model build start "name='hcc1395-somatic-wgs'"
###Viewing results Results files will be associated with each build that has completed. To view the paths to each results directory you can do something like this:
genome model build list --show model_name,status,data_directory
Or to view the last complete build of a specific clin-seq model you might do something like this:
genome model clin-seq list --filter name='hcc1395-clinseq' --show +last_complete_build.data_directory
Or you can view the build summary where the 'Data Directory' will be listed near the top like so:
genome model build view model.name='hcc1395-clinseq'
Furthermore, if you have access to a GUI environment you could open the GMS Web Viewer in your internet browser. Additional commands to help you browse results can also be found in the Useful GMS Commands tutorial. Finally a detailed description of where to find important results file for various pipelines (including example paths) can be found in the Location and decription of results files page.