Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve maxatac prepare #107

Closed
4 tasks done
tacazares opened this issue Jun 10, 2022 · 1 comment
Closed
4 tasks done

Improve maxatac prepare #107

tacazares opened this issue Jun 10, 2022 · 1 comment
Labels
bug Something isn't working enhancement New feature or request

Comments

@tacazares
Copy link
Member

tacazares commented Jun 10, 2022

The maxatac prepare function was initially created as a convenience function for filtering, inferring Tn5 sites, and converting cut -site level coverage to min-max normalized bigwig tracks in one step. This function calls on the bash scripts that were used by our snakemake/cwl/bash workflows for ATAC-seq data processing. This will most likely be how most users prepare data as opposed to going through each step individually, so we should think about improving the user experience.

  • The biggest bug we should work on is adding error reporting and some code to catch if the proper outputs were not produced when using python to run our shell scripts. @anthonybejjani was using maxatac prepare to prepare scATAC-seq fragment files. He was able to run the script to completion and did not get an error message that:
  1. A problem with bedGraphToBigWig finding the correct libssl library bedGraphToBigWig: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory #102
  2. All of the expected output files were not produced.

The problem is that it appears to the user that the run completes correctly, despite having encountered some error during running the shell script. We should add more logging information during processing. We should also look into whether we should use python to execute the shell commands as opposed to just running a shell script from python with the commands internally. We could also add code to the shell script to catch problems with execution.

  • We should also add a test for maxatac functions that will make sure pybigwig is installed and can find numpy correctly, before running through the entire workflow. This is related to maxatac prepare erroring out at normalization #96. We should point to the fix if the issue is detected.

  • Double check that all of the unnecessary bedgraphs and intermediate files are removed to save space. We might want to add flags for whether to save specific intermediate files.

  • Update and add better logging messages for different processes running. At least add messages for major events like saving files or removing files. We could also have a final printout that has the names and locations of all files and their sizes.

@tacazares tacazares added bug Something isn't working enhancement New feature or request labels Jun 10, 2022
@tacazares tacazares changed the title Improvemaxatac prepare Improve maxatac prepare Jun 10, 2022
@diamremk
Copy link

Hey, I gave maxatac prepare this input:

`#!/bin/bash
#BSUB -W 12:00
#BSUB -M 200000
#BSUB -n 24
#BSUB -J kessi_maxATAC_prepare
#BSUB -o /data/miraldiLab/team/kessi/out_files/maxATAC_prepare_test_071522.out
#BSUB -e /data/miraldiLab/team/kessi/err_files/maxATAC_prepare_test_071522.err

module load anaconda

#module load anaconda (I don't remember the exact number but I think it is 3 you can check with module avail)
module load anaconda3/1.0.0

conda activate your env

#conda activate YOUR_CONDA_ENV_NAME # This env should have maxatac installed
source activate maxatac_test

No need for the sh infront of maxatac

maxatac prepare
-i /data/miraldiLab/team/kessi/snakeATAC/output_dir/Th0_48h_r1/alignments/Th0_48h_r1.bam
-o /data/miraldiLab/team/kessi/maxATAC
-prefix Th0_48h_r1
--blacklist_bed /users/dia6sx/opt/maxatac/data/mm10/mm10_maxatac_blacklist.bed
--blacklist_bw /users/dia6sx/opt/maxatac/data/mm10/mm10_maxatac_blacklist.bw
--chrom_sizes /users/dia6sx/opt/maxatac/data/mm10/mm10.chrom.sizes
-chroms chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19`

`[2022-07-15 16:28:03,370]
Input file: /data/miraldiLab/team/kessi/snakeATAC/output_dir/Th0_48h_r1/alignments/Th0_48h_r1.bam
Input chromosome sizes file: /users/dia6sx/opt/maxatac/data/mm10/mm10.chrom.sizes
Tn5 cut sites will be slopped 20 bps on each side
Input blacklist file: /users/dia6sx/opt/maxatac/data/hg38/hg38_maxatac_blacklist.bw
Output filename: Th0_48h_r1
Output directory: /data/miraldiLab/team/kessi/maxATAC
Using a millions factor of: 20000000
Using 12 threads to run job.
[2022-07-15 16:28:03,471]
Generate the normalized signal tracks.
[2022-07-15 16:28:03,471]
Working on a bulk ATAC-seq BAM file
Getting the number of reads in the BAM file
[2022-07-15 16:28:10,236]
Processing BAM to bigwig. Running eduplication
fixmate: invalid option -- '@'
Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>
Options:
-r Remove unmapped reads and secondary alignments
-p Disable FR proper pair check
-c Add template cigar ct tag
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]

As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input
file must be grouped by read name (e.g. sorted by name). Coordinated sorted
input is not accepted.
index: invalid option -- '@'
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14]
[main] unrecognized command 'markdup'
index: invalid option -- '@'
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14]
rm: cannot remove ‘Th0_48h_r1_fixmate.bam.bai’: No such file or directory
[main_samview] random alignment retrieval only works for indexed BAM or CRAM files.
index: invalid option -- '@'
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14]
rm: cannot remove ‘Th0_48h_r1_deduped.bam.bai’: No such file or directory
bedGraphToBigWig: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
[2022-07-15 16:29:36,562]
Min-max normalize signal tracks
[2022-07-15 16:29:36,565]
Normalization
Input bigwig file: /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M.bw
Output filename: /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M_minmax01.bw
Output directory: /data/miraldiLab/team/kessi/maxATAC
Using min-max normalization
[2022-07-15 16:29:36,583]
Calculating stats per chromosome
[urlOpen] Couldn't open /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M.bw for reading
[urlOpen] Couldn't open /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M.bw for reading
[pyBwOpen] bw is NULL!
Traceback (most recent call last):
File "/users/dia6sx/.conda/envs/maxatac_test/bin/maxatac", line 24, in
sys.exit(main(sys.argv[1:]))
File "/users/dia6sx/.conda/envs/maxatac_test/bin/maxatac", line 20, in main
args.func(args)
File "/users/dia6sx/.conda/envs/maxatac_test/lib/python3.9/site-packages/maxatac/analyses/prepare.py", line 142, in run_prepare
run_normalization(args)
File "/users/dia6sx/.conda/envs/maxatac_test/lib/python3.9/site-packages/maxatac/analyses/normalize.py", line 64, in run_normalization
min_value, max_value, median, mad, mean_value, std_value = get_genomic_stats(bigwig_path=args.signal,
File "/users/dia6sx/.conda/envs/maxatac_test/lib/python3.9/site-packages/maxatac/utilities/normalization_tools.py", line 24, in get_genomic_stats
with pyBigWig.open(bigwig_path) as input_bigwig:
RuntimeError: Received an error during file opening!`

There seems to be a problem when preparing a bam file from snakeATAC. Note I am able to bypass the prepare step by using the bw generated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants