Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

Closed
hyunhwan-bcm opened this issue Aug 15, 2024 · 4 comments
Closed
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@hyunhwan-bcm
Copy link
Contributor

hyunhwan-bcm commented Aug 15, 2024

Summary:

We propose adding support for gVCF files in our Nextflow pipeline. This feature includes building a reference genome index (for both hg19 and hg38) and converting gVCF files to VCF format when <NON_REF> tags are encountered - I believe this is specific for gvcf file by gatk. The conversion process will be part of a new VCF_PRE_PROCESS process, ensuring streamlined integration into the pipeline.

Tasks:

  1. Build Reference Index for hg19 and hg38:

    • Either use pre-built reference indexes or build them during the pipeline's execution (only once).
    • Here’s a basic script for building the index:
    #!/bin/bash
    wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    gunzip hg19.fa.gz
    sed 's/>chr/>/g' hg19.fa > num_prefix_hg19.fa
    samtools faidx num_prefix_hg19.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M > final_hg19.fa
    gatk CreateSequenceDictionary R=final_hg19.fa
  2. gVCF to VCF Conversion:

    • Check if the input VCF contains <NON_REF> entries within the first 10,000 lines.
    • If found, convert gVCF to VCF using the following steps:
      • Rename chromosomes and remove the ID field.
      • Sort the VCF.
      • Use GATK's GenotypeGVCFs to process the file.
      • Apply VariantFiltration to filter out low-quality variants.
    • The following bash script demonstrates the conversion process:
    #!/bin/bash
    set -e
    
    # Define input/output paths and reference genome
    chrmap_file="/home/hwan/aim_data_dependencies_04_24_2024_sasi/bcf_annotate/chrmap.txt"
    reference_genome="./ref/final_hg19.fa"
    input_file="$1"
    output_file="${input_file%.vcf.gz}.no_g.vcf.gz"
    
    # Step 0: Check for <NON_REF>
    zcat "$input_file" | head -n 10000 | grep -q "<NON_REF>" || { echo "It's not gVCF"; exit 1; }
    
    # Step 1: Annotate and remove ID field
    echo "Step 1: Annotating and removing ID field..." | tee -a "$log_file"
    bcftools annotate --rename-chrs "$chrmap_file" -x ID "$input_file" -Oz -o step1.vcf.gz 2>> "$log_file"
    tabix step1.vcf.gz
    bcftools view -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y step1.vcf.gz -Oz -o step1_1.vcf.gz
    
    # Step 2: Sort the VCF file
    echo "Step 2: Sorting the VCF file..." | tee -a "$log_file"
    bcftools sort step1_1.vcf.gz -Oz -o step2.vcf.gz 2>> "$log_file"
    
    # Step 2.1: Index step2.vcf.gz with tabix
    echo "Indexing step2.vcf.gz with tabix..." | tee -a "$log_file"
    tabix -p vcf step2.vcf.gz 2>> "$log_file"
    
    # Step 3: Genotype GVCFs with GATK
    echo "Step 3: Running GenotypeGVCFs with GATK..." | tee -a "$log_file"
    gatk GenotypeGVCFs -R "$reference_genome" -V step2.vcf.gz -O step3.vcf.gz --allow-old-rms-mapping-quality-annotation-data 2>> "$log_file"
    
    # Step 4: Filter based on quality with GATK
    echo "Step 4: Running VariantFiltration with GATK..." | tee -a "$log_file"
    gatk VariantFiltration -V step3.vcf.gz -O step4.vcf.gz --filter-expression "QUAL < 30.0" --filter-name "LowQual" 2>> "$log_file"
    
    # Rename the final output file
    echo "Renaming the final output file..." | tee -a "$log_file"
    mv step4.vcf.gz "$output_file"
    
    # Index the final output using tabix
    echo "Indexing the final output with tabix..." | tee -a "$log_file"
    tabix -p vcf "$output_file" 2>> "$log_file"
    
    # Display the number of non-comment lines (ignore lines starting with #)
    lines=$(zcat "$output_file" | grep -v '^#' | wc -l)
    echo "File: $output_file has $lines lines (excluding comment lines)." | tee -a "$log_file"
    
    # Clean up intermediate files
    echo "Cleaning up intermediate files..." | tee -a "$log_file"
    rm -f step*.vcf.gz*
    
    # Calculate and print the running time
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "Script completed in $runtime seconds." | tee -a "$log_file"
  3. Incorporation into VCF_PRE_PROCESS:

    • The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.
    • Ensure efficient handling of large datasets and minimize redundant computation.

Benefits:

  • Streamlines the conversion of gVCF to VCF files while preserving necessary data.

Additional Information:

  • The reference genome index needs to be built only once per workflow.
  • Please ensure that all file paths and environment dependencies are correctly defined before integrating the scripts.

Would you like to make any modifications or additions to this draft before posting it?

@hyunhwan-bcm hyunhwan-bcm added the enhancement New feature or request label Aug 15, 2024
@hyunhwan-bcm hyunhwan-bcm added this to the v1.0 milestone Aug 15, 2024
@jylee-bcm
Copy link
Contributor

jylee-bcm commented Aug 15, 2024

Hi, @hyunhwan-bcm can you also review the other options listed here: https://chatgpt.com/share/43eb88c5-9850-49a7-a0db-793289daea0f

I think it would be great if we can use vcftools or bcftools options so we don't need to install, and it does not require us to get reference files.

@jylee-bcm
Copy link
Contributor

Sorry but I am in difficulty at understanding the meaning of Taks2 and Task3.

It seems the task2 already incorporated some logics of VCF_PRE_PROCESS, but there is something missed too.

Those are in

  1. bcftools annotate --rename-chrs ... command

Those are missed

  1. bcftools annotate --set-id +'%CHROM\\_%POS\\_%REF\\_%FIRST_ALT' command
  2. bcftools filter ${params.run_id}-add-id.vcf.gz -i'FILTER == "PASS"'

Is it intended as they are already covered by new commands? or just mistakes by ChatGPT?

@hyunhwan-bcm
Copy link
Contributor Author

This needs to be executed before VCF_PRE_PROCESS, and bcftools annotate --rename-chrs is required to drastically reduce many non-variant records. After that, we will proceed with the other two parts.

@jylee-bcm
Copy link
Contributor

The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.

Then this is irrelevant, and safe to be ignored, right?

jylee-bcm added a commit that referenced this issue Aug 15, 2024
hyunhwan-bcm added a commit that referenced this issue Aug 15, 2024
Add GVCF to VCF logic. Close #56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants