Bio.SeqIO: AttributeError when parsing GenBank file #4274

Closed
MrTomRod opened this issue Apr 4, 2023 · 3 comments · Fixed by #4275

MrTomRod commented Apr 4, 2023

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

3.10.10 (main, Feb  8 2023, 00:00:00) [GCC 13.0.1 20230208 (Red Hat 13.0.1-0)]
CPython
Linux-6.2.9-300.fc38.x86_64-x86_64-with-glibc2.37
1.81

Steps to reproduce

import urllib.request
from io import StringIO
from Bio import SeqIO

# download file
gbk_content = urllib.request.urlopen('https://cloud.bioinformatics.unibe.ch/index.php/s/eD7ycWjz6p628C3/download/assembly.phages_combined.gbk').read().decode()

# try to parse file
for seq_record in SeqIO.parse(StringIO(gbk_content), "genbank"):
    pass

Expected behaviour

GenBank file is parsed normally.

Actual behaviour

Traceback (most recent call last):
  File "/opt/JetBrains/apps/PyCharm-P/ch-0/231.8109.197/plugins/python/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/GenBank/Scanner.py", line 470, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/GenBank/Scanner.py", line 420, in _feed_feature_table
    consumer.location(location_string)
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/GenBank/__init__.py", line 727, in location
    location = Location.fromstring(location_line, length, is_circular, stranded)
  File "/home/username/PycharmProjects/venvs/vibr_annotate/lib64/python3.10/site-packages/Bio/SeqFeature.py", line 859, in fromstring
    loc.strand = strand
AttributeError: 'NoneType' object has no attribute 'strand'
import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)

Workaround

Downgrade Biopython from 1.81 to 1.79 (1.80 has the same bug).
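
A minimal sketch of a guard for this workaround, assuming only what is reported in this thread: releases 1.80 and 1.81 raise the AttributeError, and 1.79 does not. The helper name `is_affected` is hypothetical and not part of Biopython.

```python
def is_affected(version: str) -> bool:
    """Return True for 1.80.x and 1.81.x, the releases reported as affected here."""
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        return False  # unrecognised version string: make no claim
    return major_minor in {(1, 80), (1, 81)}
```

In a script, the result of `is_affected(Bio.__version__)` could be used to print a hint to downgrade before parsing VIBRANT output.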

More info

The ugly .gbk file was generated by VIBRANT 1.2.1.

peterjc (Member) commented Apr 4, 2023

This also breaks on the current code, so the current release, Biopython 1.81, is affected as well. The problem is a badly malformed first feature, where the source entry has no location and its /organism qualifier sits in the location column:

LOCUS       scaffold_0_fragment_5                 35432 bp    DNA     linear   VRL 2023-04-03
DEFINITION  scaffold_0_fragment_5.
COMMENT     Annotated using VIBRANT v1.2.1
FEATURES             Location/Qualifiers
     source          /organism="scaffold_0_fragment_5"
     CDS             1..684
                     /locus_tag="scaffold_0_fragment_5_145"
                     /protein_id="PF07553.11"
                     /product="Host cell surface-exposed lipoprotein"
                     /translation="MQQERQSWYQKSWFIILTLLFIFPLGLFLMWRYAHWKNWLKLIV
                     SSVYIISLVLILLFQVSLLNENKTNQIEHASTMKEKSNINNVKTTKNKNIEKSTQTDK
                     QNSVNLKQNTKDQNNNANDEETSTTSEQNVAIAQAKSYANTLPISKKSLYKQLTSEYG
                     EKYPADVAQYAVDHISVDYKMNALRLAKSYVKNINISNQALYDQLVSENGEGFTPEEA
                     QYAINHLDR*"
...

This should probably be:

LOCUS       scaffold_0_fragment_5                 35432 bp    DNA     linear   VRL 2023-04-03
DEFINITION  scaffold_0_fragment_5.
COMMENT     Annotated using VIBRANT v1.2.1
FEATURES             Location/Qualifiers
     source          1..35432
                     /organism="scaffold_0_fragment_5"
     CDS             1..684
                     /locus_tag="scaffold_0_fragment_5_145"
                     /protein_id="PF07553.11"
                     /product="Host cell surface-exposed lipoprotein"
                     /translation="MQQERQSWYQKSWFIILTLLFIFPLGLFLMWRYAHWKNWLKLIV
                     SSVYIISLVLILLFQVSLLNENKTNQIEHASTMKEKSNINNVKTTKNKNIEKSTQTDK
                     QNSVNLKQNTKDQNNNANDEETSTTSEQNVAIAQAKSYANTLPISKKSLYKQLTSEYG
                     EKYPADVAQYAVDHISVDYKMNALRLAKSYVKNINISNQALYDQLVSENGEGFTPEEA
                     QYAINHLDR*"
...

With Biopython 1.79 this gave a warning, BiopythonParserWarning: Couldn't parse feature location: '/organism="scaffold_0_fragment_5"', and then continued, with lots more warnings.
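
As a rough illustration of the correction suggested above, the missing source location could be patched in before parsing. This is only a sketch: the function name `add_missing_source_location` is hypothetical, it assumes the LOCUS line carries the sequence length as `<N> bp`, and it handles only a `source` feature whose first line holds a qualifier instead of a location. It does not touch the file's other problems.

```python
import re

# Sequence length as written on the LOCUS line, e.g. "... 35432 bp ...".
LOCUS_LENGTH = re.compile(r"^LOCUS\s+\S+\s+(\d+) bp")
# A source feature whose location column holds a qualifier ("/...") instead.
SOURCE_WITHOUT_LOCATION = re.compile(r"^ {5}source\s+(/.*)$")


def add_missing_source_location(gbk_text: str) -> str:
    """Give location-less source features a full-length location (1..N)."""
    fixed = []
    length = None
    for line in gbk_text.splitlines():
        locus = LOCUS_LENGTH.match(line)
        if locus:
            length = int(locus.group(1))  # remembered for this record
        source = SOURCE_WITHOUT_LOCATION.match(line)
        if source and length is not None:
            fixed.append(f"     source          1..{length}")
            # Move the qualifier to its own line, indented to column 22.
            fixed.append(" " * 21 + source.group(1))
        else:
            fixed.append(line)
    return "\n".join(fixed) + "\n"
```

The repaired text can then be wrapped in `StringIO` and passed to `SeqIO.parse(..., "genbank")` as in the reproduction steps.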

peterjc added a commit to peterjc/biopython that referenced this issue Apr 4, 2023
Should close biopython#4274, a regression in Biopython 1.80
peterjc added a commit that referenced this issue Apr 5, 2023
Should close #4274, a regression in Biopython 1.80

peterjc (Member) commented Apr 5, 2023

Are you able to update your copy of Biopython by installing the latest source code from GitHub?

The latest code (which will be in Biopython 1.82) now parses the file again, with the following warnings:

  • BiopythonParserWarning: Could not parse feature location '/organism="scaffold_0_fragment_5"'; setting feature location to None.
  • BiopythonParserWarning: The NCBI states double-quote characters like " should be escaped as "" (two double - quotes), but here it was not: '"Phage portal protein, SPP1 Gp6-like"'
  • BiopythonParserWarning: The NCBI states double-quote characters like " should be escaped as "" (two double - quotes), but here it was not: '"Phage tail assembly chaperone protein, TAC"'
  • BiopythonParserWarning: The NCBI states double-quote characters like " should be escaped as "" (two double - quotes), but here it was not: '"dut, DUT; dUTP pyrophosphatase [EC:3.6.1.23]"'
  • BiopythonParserWarning: The NCBI states double-quote characters like " should be escaped as "" (two double - quotes), but here it was not: '"sinR; XRE family transcriptional regulator, master regulator for biofilm formation"'
  • BiopythonParserWarning: Invalid indentation for sequence line
  • BiopythonParserWarning: Premature end of file in sequence data

The last one is about missing the // line at the end of the record.
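
For anyone who wants to inspect these warnings programmatically rather than read them off the console, here is a small sketch using only the standard `warnings` module. The helper name `consume_with_warnings` is hypothetical. It takes a zero-argument callable so that the parser's iterator is both created and exhausted inside the recording context, since parser warnings are emitted lazily during iteration.

```python
import warnings


def consume_with_warnings(make_iterator):
    """Exhaust make_iterator() and return (items, formatted warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record repeated warnings too
        items = list(make_iterator())
    messages = [f"{w.category.__name__}: {w.message}" for w in caught]
    return items, messages
```

With Biopython installed, this could be called as `consume_with_warnings(lambda: SeqIO.parse(StringIO(gbk_content), "genbank"))`. Going by the first warning above, the malformed source feature should then have a location of None, so code that uses feature locations may need to skip such features.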

MrTomRod (Author) commented Apr 6, 2023

> Are you able to update your copy of Biopython by installing the latest source code from GitHub?

Did that, it works nicely as it used to with the old version.

Thanks for fixing this. Bioinformaticians produce some strange code... -.-
