Calling HTML with file_object=sys.stdin raises TypeError #2023

Closed
ViktorShev opened this issue Dec 14, 2023 · 1 comment
Labels
bug Existing features not working as expected

Contributor

ViktorShev commented Dec 14, 2023

This is the issue I am facing while trying to pass sys.stdin as the file_obj argument to the HTML class:

cat abc.html | python3 ./weasyprint_wrapper.py
Traceback (most recent call last):
  File "/home/viktor/Trabajo/wedevelop-hr/workspaces/backend/src/service_providers/profile_template_pdf_creator/./weasyprint_wrapper.py", line 6, in <module>
    HTML(file_obj=sys.stdin).write_pdf(target=sys.stdout.buffer)
  File "/home/viktor/.local/lib/python3.9/site-packages/weasyprint/__init__.py", line 165, in __init__
    result = html5lib.parse(
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 46, in parse
    return p.parse(doc, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 129, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/_tokenizer.py", line 42, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/_inputstream.py", line 141, in HTMLInputStream
    raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings)
TypeError: Cannot set an encoding with a unicode input, set ['override_encoding', 'transport_encoding']

weasyprint_wrapper.py is a thin wrapper around the weasyprint Python library; our backend runs it through Node.js spawn to generate PDFs:

import sys

from weasyprint import HTML


HTML(file_obj=sys.stdin).write_pdf(target=sys.stdout.buffer)

I ran into this while updating the old version of weasyprint_wrapper.py, which called sys.stdin.read() and passed the result as the string= argument to the HTML class. That had to change because the cloud machine running the backend was running out of memory: every request loaded its entire HTML document into memory, and several PDF generation requests were issued at once. This is the old version:

import sys

from weasyprint import HTML


html_content = sys.stdin.read()
htmlPDF = HTML(string=html_content).write_pdf()
sys.stdout.buffer.write(htmlPDF)

Now, after some debugging I reached this piece of code in the html5lib source code:

if isUnicode:
    encodings = [x for x in kwargs if x.endswith("_encoding")]
    if encodings:
        raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings)

    return HTMLUnicodeInputStream(source, **kwargs)
else:
    return HTMLBinaryInputStream(source, **kwargs)

As you can see, there's a list comprehension that loops over kwargs and collects the keys that end with _encoding, but it completely ignores the values those keys hold. In my case the values were None, since the new version of weasyprint_wrapper.py calls the HTML class with a single argument.
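For illustration, here is a minimal standard-library sketch of that behaviour. It is not html5lib's code: `open_input_stream` only mirrors the shape of the excerpt above, and `looks_like_text` is an assumption about how the `isUnicode` flag gets set (a zero-length read that returns `str` for a text stream), which the excerpt does not show. The simulated stream stands in for `sys.stdin`, which is a text wrapper around a binary buffer.

```python
import io


def looks_like_text(source):
    # Assumed stand-in for the isUnicode detection: a zero-length read
    # returns "" for a text stream and b"" for a binary one.
    if hasattr(source, "read"):
        return isinstance(source.read(0), str)
    return isinstance(source, str)


def open_input_stream(source, **kwargs):
    # Same shape as the excerpt: only the keyword *names* are inspected,
    # so a keyword passed as None still counts as "set".
    if looks_like_text(source):
        encodings = [name for name in kwargs if name.endswith("_encoding")]
        if encodings:
            raise TypeError(
                "Cannot set an encoding with a unicode input, set %r" % encodings)
        return "unicode stream"
    return "binary stream"


# Simulate sys.stdin: a text wrapper on top of a binary buffer.
text_stdin = io.TextIOWrapper(io.BytesIO(b"<p>hello</p>"), encoding="utf-8")

try:
    open_input_stream(text_stdin, override_encoding=None, transport_encoding=None)
    outcome = "no error"
except TypeError as error:
    outcome = str(error)

print(outcome)

# The underlying binary buffer takes the other branch, so no error is raised.
print(open_input_stream(text_stdin.buffer, override_encoding=None))
```

If the assumption about the detection holds, passing the binary `sys.stdin.buffer` instead of `sys.stdin` might sidestep the error as well, though that is untested here.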

The list ended up non-empty, which is truthy and therefore raises the TypeError, because of how the HTML class's __init__ calls html5lib's parse function:

with result as (source_type, source, base_url, protocol_encoding):
    if isinstance(source, str):
        result = html5lib.parse(source, namespaceHTMLElements=False)
    else:
        result = html5lib.parse(
            source, override_encoding=encoding,
            transport_encoding=protocol_encoding,
            namespaceHTMLElements=False)

Once it determines that the source is not a string, it calls parse with keyword arguments that may be None, as they are in my case. html5lib then sees that override_encoding or transport_encoding is present and raises the error, even though neither was actually specified. To fix this for our own use, I patched this section to filter out keyword arguments whose value is None before calling parse. I also opened a PR just in case; maybe it's worth reporting this to html5lib too.
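The filtering idea can be sketched as follows. This is only an illustration and not necessarily what the PR does: `without_none` is a hypothetical helper, and `fake_parse` is a hypothetical stand-in for html5lib.parse that simply reports which keyword arguments reached it.

```python
def without_none(**kwargs):
    # Keep only the keyword arguments that were given a real value.
    return {name: value for name, value in kwargs.items() if value is not None}


def fake_parse(source, **kwargs):
    # Hypothetical stand-in for html5lib.parse: returns the keyword names
    # it received, so the effect of the filtering is visible.
    return sorted(kwargs)


# Both encodings are None when HTML is called with a single argument.
encoding = None
protocol_encoding = None

received = fake_parse(
    "<p>hello</p>",
    namespaceHTMLElements=False,
    **without_none(
        override_encoding=encoding,
        transport_encoding=protocol_encoding,
    ),
)
print(received)  # ['namespaceHTMLElements']
```

With the None values dropped, no key ending in _encoding reaches the callee, so a check based on key names alone has nothing to object to.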

#2022 <- I must say the last time I used Python was around 3.1, so I'm not caught up on best practices or typing in Python, as I primarily work with Node.js and TypeScript. This is also my first contribution to open source 😄

Thank you for your time!

Member

liZe commented Dec 16, 2023

Closed by #2022.

liZe closed this as completed Dec 16, 2023
liZe added this to the 61.0 milestone Dec 16, 2023
liZe added the "bug" (Existing features not working as expected) label Dec 16, 2023