Calling HTML with file_object=sys.stdin raises TypeError #2023

ViktorShev · 2023-12-14T21:23:57Z

I get the following error when passing sys.stdin as the file_obj argument to the HTML class:

cat abc.html | python3 ./weasyprint_wrapper.py
Traceback (most recent call last):
  File "/home/viktor/Trabajo/wedevelop-hr/workspaces/backend/src/service_providers/profile_template_pdf_creator/./weasyprint_wrapper.py", line 6, in <module>
    HTML(file_obj=sys.stdin).write_pdf(target=sys.stdout.buffer)
  File "/home/viktor/.local/lib/python3.9/site-packages/weasyprint/__init__.py", line 165, in __init__
    result = html5lib.parse(
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 46, in parse
    return p.parse(doc, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 129, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/_tokenizer.py", line 42, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/_inputstream.py", line 141, in HTMLInputStream
    raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings)
TypeError: Cannot set an encoding with a unicode input, set ['override_encoding', 'transport_encoding']

weasyprint_wrapper.py is a thin wrapper around the WeasyPrint Python library. Our backend runs it through Node.js spawn to generate PDFs:

import sys

from weasyprint import HTML


HTML(file_obj=sys.stdin).write_pdf(target=sys.stdout.buffer)

I ran into this while updating the old version of weasyprint_wrapper.py, which called sys.stdin.read() and passed the result as the string= argument to the HTML class. That had to change because the cloud machine running the backend was running out of memory: every request loaded its entire HTML document into memory, and several PDF generation requests could arrive at once. This is the "old" version I'm talking about:

import sys

from weasyprint import HTML


html_content = sys.stdin.read()
htmlPDF = HTML(string=html_content).write_pdf()
sys.stdout.buffer.write(htmlPDF)

After some debugging, I reached this piece of the html5lib source code:

if isUnicode:
    encodings = [x for x in kwargs if x.endswith("_encoding")]
    if encodings:
        raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings)

    return HTMLUnicodeInputStream(source, **kwargs)
else:
    return HTMLBinaryInputStream(source, **kwargs)

As you can see, there's a list comprehension that loops over kwargs and collects every key ending in _encoding, but it completely ignores the value each key holds. In my case both values were None because, as the "new" version of weasyprint_wrapper.py shows, I call the HTML class with a single argument.
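To make the point concrete, here is a minimal sketch of that key check in isolation. The helper name encoding_keys is hypothetical and is not part of html5lib; it only mirrors the list comprehension quoted above.

```python
def encoding_keys(**kwargs):
    """Collect keyword names ending in "_encoding", ignoring their values."""
    return [name for name in kwargs if name.endswith("_encoding")]


# Both encodings are None, yet both names are still collected.
found = encoding_keys(
    override_encoding=None,
    transport_encoding=None,
    namespaceHTMLElements=False,
)
print(found)  # ['override_encoding', 'transport_encoding']
print(bool(found))  # True, so an `if encodings:` check would pass
```

Because only the names are inspected, a caller that always passes these keywords cannot opt out by passing None.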

That list ends up non-empty, which is truthy and triggers the TypeError, because of how the HTML class's __init__ calls html5lib's parse function:

with result as (source_type, source, base_url, protocol_encoding):
    if isinstance(source, str):
        result = html5lib.parse(source, namespaceHTMLElements=False)
    else:
        result = html5lib.parse(
            source, override_encoding=encoding,
            transport_encoding=protocol_encoding,
            namespaceHTMLElements=False)

After determining that the source is not a str, __init__ calls parse with keyword arguments whose values may be None, as they are in my case. html5lib then treats the input as unicode, sees the override_encoding and transport_encoding keys, and raises the error even though no encoding was actually specified. To fix this for our own use, I patched this section to filter out keyword arguments whose value is None before calling parse. I also opened a PR just in case. Maybe it's worth reporting this to html5lib too.
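The idea of filtering out None-valued keyword arguments can be sketched roughly as follows. This is not the code from the PR: drop_none and fake_parse are hypothetical names, and fake_parse is only a stand-in that reproduces the encoding check, on the assumption that a zero-length read is enough to tell a text stream from a binary one.

```python
import io


def drop_none(**kwargs):
    """Return only the keyword arguments whose value is not None."""
    return {name: value for name, value in kwargs.items() if value is not None}


def fake_parse(source, **kwargs):
    """Hypothetical stand-in for html5lib.parse: only the encoding check."""
    # Assumption: a zero-length read tells text streams from binary ones.
    is_text = isinstance(source.read(0), str)
    encodings = [name for name in kwargs if name.endswith("_encoding")]
    if is_text and encodings:
        raise TypeError(
            "Cannot set an encoding with a unicode input, set %r" % encodings)
    return source.read()


options = {"override_encoding": None, "transport_encoding": None}

# Passing the None values through reproduces the error for a text stream.
try:
    fake_parse(io.StringIO("<p>hi</p>"), **options)
    raised = False
except TypeError:
    raised = True

# Filtering them out first lets the same text stream through.
parsed = fake_parse(io.StringIO("<p>hi</p>"), **drop_none(**options))
print(raised, parsed)  # True <p>hi</p>
```

A binary stream is unaffected either way, since the check in this sketch only fires for text input.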

#2022 <- I must say the last time I used Python was around 3.1, so I'm not caught up on best practices or types in Python, as I primarily work with Node.js and TypeScript. This is also my first contribution to open source 😄

Thank you for your time!

liZe · 2023-12-16T22:20:39Z

Closed by #2022.

ViktorShev mentioned this issue Dec 14, 2023

Allow text-based file objects for HTML and CSS classes #2022

Merged

liZe closed this as completed Dec 16, 2023

liZe added this to the 61.0 milestone Dec 16, 2023

liZe added the bug Existing features not working as expected label Dec 16, 2023
