AssertionError: unknown status keyword 'dsgvo_service_control' in marked section #468

Open
snarfed opened this issue Aug 26, 2024 · 2 comments

snarfed commented Aug 26, 2024

Hi! First off, huge thanks for maintaining feedparser. It's legendary! We're all lucky to have it.

I hit a new (to me) AssertionError today when parsing the RSS at https://snrk.de/feed/ . Here's the relevant RSS snippet:

<content:encoded><![CDATA[
  ...
  <p><strong>If you don&#8217;t like that, don&#8217;t use snrk.de!</strong><![dsgvo_service_control]></p>
  ...
]]></content:encoded>

...and here's the assert:

>>> feedparser.parse(rss)
Traceback (most recent call last):
  File ".../site-packages/feedparser/api.py", line 263, in parse
    saxparser.parse(source)
  File ".../python3.11/xml/sax/expatreader.py", line 111, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File ".../python3.11/xml/sax/xmlreader.py", line 125, in parse
    self.feed(buffer)
  File ".../python3.11/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
  File "/private/tmp/pythonA3.11-20240402-4978-3ygh5v/Python-3.11.9/Modules/pyexpat.c", line 477, in EndElement
  File ".../python3.11/xml/sax/expatreader.py", line 395, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File ".../site-packages/feedparser/parsers/strict.py", line 124, in endElementNS
    self.unknown_endtag(localname)
  File ".../site-packages/feedparser/mixin.py", line 321, in unknown_endtag
    method()
  File ".../site-packages/feedparser/namespaces/_base.py", line 488, in _end_content
    value = self.pop_content('content')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/feedparser/mixin.py", line 629, in pop_content
    value = self.pop(tag)
            ^^^^^^^^^^^^^
  File ".../site-packages/feedparser/mixin.py", line 548, in pop
    output = _sanitize_html(output, self.encoding, self.contentparams.get('type', 'text/html'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/feedparser/sanitizer.py", line 883, in _sanitize_html
    p.feed(html_source)
  File ".../site-packages/feedparser/html.py", line 156, in feed
    super(_BaseHTMLProcessor, self).feed(data)
  File ".../site-packages/sgmllib.py", line 98, in feed
    self.goahead(0)
  File ".../site-packages/sgmllib.py", line 168, in goahead
    k = self.parse_declaration(i)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/feedparser/html.py", line 351, in parse_declaration
    return sgmllib.SGMLParser.parse_declaration(self, i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/_markupbase.py", line 91, in parse_declaration
    return self.parse_marked_section(i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/_markupbase.py", line 154, in parse_marked_section
    raise AssertionError(
AssertionError: unknown status keyword 'dsgvo_service_control' in marked section

Is this expected? Should I catch AssertionError everywhere I use feedparser? Any other thoughts?

feedparser 6.0.11, Python 3.11.9. Maybe related to #378...but not exactly the same. Thanks in advance!

kurtmckee self-assigned this Aug 26, 2024

kurtmckee (Owner) commented

Thanks for the kind words! This is definitely unexpected, and I'll take a look at this. For now, it may be necessary to catch AssertionError. 😞
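
As a stopgap along those lines, here is a minimal sketch of a wrapper that catches AssertionError around a parse call. The name `parse_or_none` and the print-and-return-None handling are hypothetical choices for illustration, not part of feedparser; the parser is passed in as a callable so the sketch does not assume anything about feedparser's internals.

```python
from typing import Any, Callable, Optional


def parse_or_none(parse: Callable[[Any], Any], source: Any) -> Optional[Any]:
    """Call parse(source); return None if the parser raises AssertionError.

    Hypothetical helper: only AssertionError is caught, so any other
    exception still propagates to the caller.
    """
    try:
        return parse(source)
    except AssertionError as exc:
        # Illustrative handling only: report the problem and skip this feed.
        print(f"feed could not be parsed: {exc}")
        return None
```

With feedparser installed, this would be called as `parse_or_none(feedparser.parse, rss)`. A caller that needs partial results would have to handle the failure differently, since the whole feed is dropped here.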

PaulKalbitzer commented

We were able to trigger a similar assertion.

"unknown status keyword 'n' in marked section"

We were able to narrow down the cause of the problem to the following segment in our input.

<description >XC#&lt;![n%</description>

We think the trigger is the character sequence <![, or its escaped forms &lt;![ and &#60;![, which decode to <![.

The problem seems to lie in the parsing of marked sections: the error trace shows that 'parse_marked_section' is called even though the input is not a marked section.
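
To check an input for this pattern before parsing, the following is a rough sketch that decodes entities and looks for <![ followed by a name that is not a recognized marked-section keyword. The function name is hypothetical, and the keyword set is an assumption based on a reading of CPython 3.11's `_markupbase.py`; it may differ in other Python versions. The sketch does not cover <![ followed by a non-letter.

```python
import html
import re

# Assumption: keywords accepted by parse_marked_section in CPython 3.11's
# _markupbase.py. Verify against the Python version actually in use.
KNOWN_KEYWORDS = {
    "temp", "cdata", "ignore", "include", "rcdata",  # SGML marked sections
    "if", "else", "endif",                           # MS Office style sections
}

# "<![" followed by a name token (a letter, then letters/digits/-/_/.)
_MARKED_SECTION = re.compile(r"<!\[([a-zA-Z][-_.a-zA-Z0-9]*)")


def unknown_marked_section_keywords(text: str) -> list[str]:
    """Return names following "<![" that are not known keywords.

    Entities are decoded first, so "&lt;![" and "&#60;![" are treated
    the same as a literal "<![".
    """
    decoded = html.unescape(text)
    return [
        match.group(1)
        for match in _MARKED_SECTION.finditer(decoded)
        if match.group(1).lower() not in KNOWN_KEYWORDS
    ]
```

For the two inputs reported in this issue, `unknown_marked_section_keywords` returns `['n']` for `XC#&lt;![n%` and `['dsgvo_service_control']` for the snrk.de snippet, while a regular `<![CDATA[...]]>` section yields an empty list.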
