Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

hyunhwan-bcm · 2024-08-15T05:27:29Z

Summary:

We propose adding support for gVCF files in our Nextflow pipeline. This feature includes building a reference genome index (for both hg19 and hg38) and converting gVCF files to VCF format when <NON_REF> tags are encountered - I believe this is specific for gvcf file by gatk. The conversion process will be part of a new VCF_PRE_PROCESS process, ensuring streamlined integration into the pipeline.

Tasks:
Build Reference Index for hg19 and hg38:

Either use pre-built reference indexes or build them during the pipeline's execution (only once).

Here’s a basic script for building the index:
#!/bin/bash
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
sed 's/>chr/>/g' hg19.fa > num_prefix_hg19.fa
samtools faidx num_prefix_hg19.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M > final_hg19.fa
gatk CreateSequenceDictionary R=final_hg19.fa
gVCF to VCF Conversion:

Check if the input VCF contains <NON_REF> entries within the first 10,000 lines.

If found, convert gVCF to VCF using the following steps:

Rename chromosomes and remove the ID field.

Sort the VCF.

Use GATK's GenotypeGVCFs to process the file.

Apply VariantFiltration to filter out low-quality variants.

The following bash script demonstrates the conversion process:
#!/bin/bash
set -e

# Define input/output paths and reference genome
chrmap_file="/home/hwan/aim_data_dependencies_04_24_2024_sasi/bcf_annotate/chrmap.txt"
reference_genome="./ref/final_hg19.fa"
input_file="$1"
output_file="${input_file%.vcf.gz}.no_g.vcf.gz"

# Step 0: Check for <NON_REF>
zcat "$input_file" | head -n 10000 | grep -q "<NON_REF>" || { echo "It's not gVCF"; exit 1; }

# Step 1: Annotate and remove ID field
echo "Step 1: Annotating and removing ID field..." | tee -a "$log_file"
bcftools annotate --rename-chrs "$chrmap_file" -x ID "$input_file" -Oz -o step1.vcf.gz 2>> "$log_file"
tabix step1.vcf.gz
bcftools view -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y step1.vcf.gz -Oz -o step1_1.vcf.gz

# Step 2: Sort the VCF file
echo "Step 2: Sorting the VCF file..." | tee -a "$log_file"
bcftools sort step1_1.vcf.gz -Oz -o step2.vcf.gz 2>> "$log_file"

# Step 2.1: Index step2.vcf.gz with tabix
echo "Indexing step2.vcf.gz with tabix..." | tee -a "$log_file"
tabix -p vcf step2.vcf.gz 2>> "$log_file"

# Step 3: Genotype GVCFs with GATK
echo "Step 3: Running GenotypeGVCFs with GATK..." | tee -a "$log_file"
gatk GenotypeGVCFs -R "$reference_genome" -V step2.vcf.gz -O step3.vcf.gz --allow-old-rms-mapping-quality-annotation-data 2>> "$log_file"

# Step 4: Filter based on quality with GATK
echo "Step 4: Running VariantFiltration with GATK..." | tee -a "$log_file"
gatk VariantFiltration -V step3.vcf.gz -O step4.vcf.gz --filter-expression "QUAL < 30.0" --filter-name "LowQual" 2>> "$log_file"

# Rename the final output file
echo "Renaming the final output file..." | tee -a "$log_file"
mv step4.vcf.gz "$output_file"

# Index the final output using tabix
echo "Indexing the final output with tabix..." | tee -a "$log_file"
tabix -p vcf "$output_file" 2>> "$log_file"

# Display the number of non-comment lines (ignore lines starting with #)
lines=$(zcat "$output_file" | grep -v '^#' | wc -l)
echo "File: $output_file has $lines lines (excluding comment lines)." | tee -a "$log_file"

# Clean up intermediate files
echo "Cleaning up intermediate files..." | tee -a "$log_file"
rm -f step*.vcf.gz*

# Calculate and print the running time
end_time=$(date +%s)
runtime=$((end_time - start_time))
echo "Script completed in $runtime seconds." | tee -a "$log_file"
Incorporation into VCF_PRE_PROCESS:

The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.

Ensure efficient handling of large datasets and minimize redundant computation.
Benefits:

Streamlines the conversion of gVCF to VCF files while preserving necessary data.

Additional Information:

The reference genome index needs to be built only once per workflow.

Please ensure that all file paths and environment dependencies are correctly defined before integrating the scripts.

Would you like to make any modifications or additions to this draft before posting it?

The text was updated successfully, but these errors were encountered:

jylee-bcm · 2024-08-15T15:13:22Z

Hi, @hyunhwan-bcm can you also review the other options listed here: https://chatgpt.com/share/43eb88c5-9850-49a7-a0db-793289daea0f

I think it would be great if we can use vcftools or bcftools options so we don't need to install, and it does not require us to get reference files.

jylee-bcm · 2024-08-15T18:30:40Z

Sorry but I am in difficulty at understanding the meaning of Taks2 and Task3.

It seems the task2 already incorporated some logics of VCF_PRE_PROCESS, but there is something missed too.

Those are in

bcftools annotate --rename-chrs ... command

Those are missed

bcftools annotate --set-id +'%CHROM\\_%POS\\_%REF\\_%FIRST_ALT' command
bcftools filter ${params.run_id}-add-id.vcf.gz -i'FILTER == "PASS"'

Is it intended as they are already covered by new commands? or just mistakes by ChatGPT?

hyunhwan-bcm · 2024-08-15T18:34:24Z

This needs to be executed before VCF_PRE_PROCESS, and bcftools annotate --rename-chrs is required to drastically reduce many non-variant records. After that, we will proceed with the other two parts.

jylee-bcm · 2024-08-15T18:39:17Z

The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.

Then this is irrelevant, and safe to be ignored, right?

Add GVCF to VCF logic. Close #56

hyunhwan-bcm added the enhancement New feature or request label Aug 15, 2024

hyunhwan-bcm added this to the v1.0 milestone Aug 15, 2024

hyunhwan-bcm assigned jylee-bcm Aug 15, 2024

jylee-bcm added a commit that referenced this issue Aug 15, 2024

Add GVCF to VCF logic. Close #56

5b8c651

hyunhwan-bcm mentioned this issue Aug 15, 2024

Add GVCF to VCF logic. Close #56 #58

Merged

hyunhwan-bcm added a commit that referenced this issue Aug 15, 2024

Merge pull request #58 from LiuzLab/issue56

658fdbb

Add GVCF to VCF logic. Close #56

hyunhwan-bcm closed this as completed Aug 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

hyunhwan-bcm commented Aug 15, 2024 •

edited

Loading

Summary:

Tasks:

Benefits:

Additional Information:

jylee-bcm commented Aug 15, 2024 •

edited

Loading

jylee-bcm commented Aug 15, 2024

hyunhwan-bcm commented Aug 15, 2024

jylee-bcm commented Aug 15, 2024

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

Comments

hyunhwan-bcm commented Aug 15, 2024 • edited Loading

Summary:

Tasks:

Benefits:

Additional Information:

jylee-bcm commented Aug 15, 2024 • edited Loading

jylee-bcm commented Aug 15, 2024

Those are in

Those are missed

hyunhwan-bcm commented Aug 15, 2024

jylee-bcm commented Aug 15, 2024

hyunhwan-bcm commented Aug 15, 2024 •

edited

Loading

jylee-bcm commented Aug 15, 2024 •

edited

Loading