bug related to bowtie indexing #2

cooketho · 2016-10-05T19:39:17Z

I'm getting the error shown below. The proximate cause seems to be that hic_exp.py populates the dictionary dict_fragments from a file in the main output directory named "list_contig_names.txt". In my directory that file is empty, so the dictionary is empty, and line 76 therefore raises a KeyError. I can see on line 39 that this file is generated by a call to bowtie2-inspect -n. The problem in my case is that there is no genome index.

This is where some more detailed documentation would come in handy. Am I supposed to generate the index myself? If so, this needs to be stated explicitly in the README. Also, what is "Bowtie folder" in the advanced options? When I enter values into the advanced options and click "apply", I get the warning message "Could not find bowtie folder". This also needs to be explained.

Also, as an unrelated side note, 22270 seconds (about 6 hours) seems like a long time just to make a restriction map of a 1.2 Gb genome. This should probably be optimized. I'm guessing the bottleneck is the Biopython restriction module. In the past I've noticed that some (but not all) of its functions were very slow, to the point that I just used regular expressions as a workaround. Just an idea.
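A minimal sketch of the regular-expression workaround mentioned above, assuming a plain recognition sequence with no ambiguity codes. The function name `find_sites` is hypothetical and this is not the HiC-Box code; GATC (the DpnII recognition site) is used only as an example.

```python
import re

def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of `site`.

    A zero-width lookahead is used so that overlapping occurrences are
    also reported. IUPAC ambiguity codes (N, R, Y, ...) are not handled.
    """
    pattern = re.compile("(?=" + re.escape(site) + ")", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(sequence)]

def fragment_lengths(sequence, site):
    """Lengths of the pieces obtained by splitting at each site start.

    This ignores the exact cut offset inside the site, which differs
    between enzymes; it is only meant to illustrate the idea.
    """
    bounds = [0] + find_sites(sequence, site) + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

Because the scan is a single pass of a compiled pattern over the sequence, it scales linearly with contig length. Whether it is faster than a given Biopython call on a particular genome would need to be measured.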

Restriction map generated in 22270.876373 s
filling list of contigs ..
[]
filling dictionnary of fragments ...
Traceback (most recent call last):
File "main.py", line 278, in OnAlign
ncpu=self.ncpu)
File "/home/tom/Desktop/HiC-Box-master/analysis_main.py", line 102, in analyze
len_paired_wise_fastq)
File "/home/tom/Desktop/HiC-Box-master/hic_exp.py", line 76, in __init__
dict_fragments[a_tmp[1]].append(int(a_tmp[0]))
KeyError: 'gi|526059867|ref|NW_004823088.1| Melopsittacus undulatus
unplaced genomic scaffold, Melopsittacus_undulatus_6.3
budgerigar_v6.3_scf900160251875, whole genome shotgun sequence'
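The failure mode in this traceback can be sketched in a few lines: a dictionary keyed by contig name is built from the contig list, and any fragment whose contig is missing raises a KeyError. The function below is a hypothetical stand-in, not the actual code in hic_exp.py; it only shows how an empty contig list could be reported up front with a clearer message.

```python
def build_fragment_dict(contig_names, fragment_rows):
    """Group fragment positions by contig name.

    contig_names  : list of names, e.g. the lines of list_contig_names.txt
    fragment_rows : iterable of (position, contig_name) pairs
                    (an assumed layout, for illustration only)
    """
    if not contig_names:
        # An empty list here means the index listing produced nothing.
        raise ValueError(
            "contig name list is empty; was the bowtie2 index built?"
        )
    dict_fragments = {name: [] for name in contig_names}
    for position, name in fragment_rows:
        if name not in dict_fragments:
            raise KeyError(
                "contig %r is in the fragment list but not in the index" % name
            )
        dict_fragments[name].append(int(position))
    return dict_fragments
```

With a check like this, an empty list_contig_names.txt would stop the run with a message pointing at the index rather than at the first contig name encountered.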

baudrly · 2016-10-10T07:35:06Z

HiC-Box expects a folder named "bowtie", containing bowtie2, in the box's main folder. Extracting a bowtie archive into the main folder will do. That is the default path HiC-Box searches when it launches the alignment step; the advanced option named "bowtie folder" is only needed if you want to use an alternative folder path. I tried to clarify the index building step.
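For readers who want to build and check the index by hand, here is a rough sketch using the standard `bowtie2-build <reference> <index_base>` and `bowtie2-inspect -n <index_base>` commands. The helper names, the `bowtie_dir` argument, and the choice of index base name are assumptions for illustration; the sketch does not claim to reproduce where HiC-Box itself looks for index files. It needs Python 3.7 or later for `capture_output`.

```python
import os
import subprocess

def bowtie2_command(tool, args, bowtie_dir=None):
    """Build the argument list for a bowtie2 tool.

    If bowtie_dir is given, the executable is taken from that folder;
    otherwise it is looked up on PATH.
    """
    exe = os.path.join(bowtie_dir, tool) if bowtie_dir else tool
    return [exe] + list(args)

def build_index(fasta_path, index_base, bowtie_dir=None):
    """Run: bowtie2-build <fasta_path> <index_base>."""
    cmd = bowtie2_command("bowtie2-build", [fasta_path, index_base], bowtie_dir)
    subprocess.run(cmd, check=True)

def index_contig_names(index_base, bowtie_dir=None):
    """Return the reference names printed by: bowtie2-inspect -n <index_base>."""
    cmd = bowtie2_command("bowtie2-inspect", ["-n", index_base], bowtie_dir)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

If `index_contig_names` returns an empty list, or the call fails, the index is missing or unreadable, which would also leave list_contig_names.txt empty.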

KeyErrors at this step of the generation process usually happen when there's some kind of mismatch between the bowtie index files and the genome itself. Was the genome modified in any way between the index being built and the alignment being launched? In any case, please try deleting the index files before rebuilding them and running the alignment right away, and report back if you encounter any further issues.
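One way to check for such a mismatch is to compare the sequence names in the genome FASTA file with the names listed for the index. The sketch below assumes the index names are available as a list of strings (for example the lines of list_contig_names.txt); the function names are hypothetical.

```python
def fasta_headers(path):
    """Return the header of every record in a FASTA file, without the '>'."""
    with open(path) as handle:
        return [line[1:].rstrip("\r\n") for line in handle if line.startswith(">")]

def compare_names(fasta_names, index_names):
    """Return (only_in_fasta, only_in_index) as sorted lists."""
    fasta_set, index_set = set(fasta_names), set(index_names)
    return sorted(fasta_set - index_set), sorted(index_set - fasta_set)
```

Two empty lists mean the names agree. Any entries in the first list are contigs present in the genome but unknown to the index, which is the situation that would produce a KeyError like the one reported.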

About the side note, you're right: some parts of the box are unoptimized, and we're currently rewriting them, including the one you mentioned. The main reason is that the code was originally written years ago with mostly small genomes in mind. The original paper on GRAAL (which uses much of the same code) tests it on small genomes such as those of yeast or Trichoderma, and speeds were deemed sufficient at the time. However, GRAAL itself does run very well on large genomes. We have an internal, optimized version in the works that quickly reassembles gigabyte-sized genomes with little to no issue, and it should be put online once most of the rough edges are smoothed out.
