This is the issue I am currently facing when passing sys.stdin as the file_obj argument to the HTML class:
cat abc.html | python3 ./weasyprint_wrapper.py
Traceback (most recent call last):
File "/home/viktor/Trabajo/wedevelop-hr/workspaces/backend/src/service_providers/profile_template_pdf_creator/./weasyprint_wrapper.py", line 6, in <module>
HTML(file_obj=sys.stdin).write_pdf(target=sys.stdout.buffer)
File "/home/viktor/.local/lib/python3.9/site-packages/weasyprint/__init__.py", line 165, in __init__
result = html5lib.parse(
File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 46, in parse
return p.parse(doc, **kwargs)
File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 284, in parse
self._parse(stream, False, None, *args, **kwargs)
File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/html5parser.py", line 129, in _parse
self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/_tokenizer.py", line 42, in __init__
self.stream = HTMLInputStream(stream, **kwargs)
File "/home/viktor/.local/lib/python3.9/site-packages/html5lib/_inputstream.py", line 141, in HTMLInputStream
raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings)
TypeError: Cannot set an encoding with a unicode input, set ['override_encoding', 'transport_encoding']
weasyprint_wrapper.py is just a thin wrapper around the weasyprint Python library; our backend invokes it through Node.js spawn to generate PDFs:
import sys
from weasyprint import HTML
HTML(file_obj=sys.stdin).write_pdf(target=sys.stdout.buffer)
This started when I had to update the old version of weasyprint_wrapper.py, which called sys.stdin.read() and passed the result as the string= argument to the HTML class. That had to change because the cloud machine running the backend was running out of memory: it had to load each entire HTML document into memory whenever multiple PDF generation requests arrived at once. This is the "old" version I'm talking about:
import sys
from weasyprint import HTML
html_content = sys.stdin.read()
htmlPDF = HTML(string=html_content).write_pdf()
sys.stdout.buffer.write(htmlPDF)
Now, after some debugging I reached this piece of code in the html5lib source code:
if isUnicode:
    encodings = [x for x in kwargs if x.endswith("_encoding")]
    if encodings:
        raise TypeError("Cannot set an encoding with a unicode input, set %r" % encodings)
    return HTMLUnicodeInputStream(source, **kwargs)
else:
    return HTMLBinaryInputStream(source, **kwargs)
As you can see, there's a list comprehension that loops over kwargs and collects every key ending in _encoding, but it completely ignores the value each key holds. In my case both values were None since, as the "new" version of weasyprint_wrapper.py shows, I am calling the HTML class with a single argument.
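To make the behaviour concrete, here is a minimal standalone sketch of that check. The function name find_encoding_kwargs is hypothetical and only mirrors the list comprehension quoted above; it is not part of html5lib.

```python
def find_encoding_kwargs(kwargs):
    """Return the keyword names ending in "_encoding".

    Hypothetical helper mirroring the quoted check: only the key
    names are inspected, the values are never looked at.
    """
    return [name for name in kwargs if name.endswith("_encoding")]


# Both values are None, yet both names are still collected, so the
# resulting list is non-empty and therefore truthy.
flagged = find_encoding_kwargs(
    {"override_encoding": None, "transport_encoding": None}
)
print(flagged)
```

Because the list is non-empty, the `if encodings:` branch is taken even though no encoding was actually requested.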
We ended up with a non-empty list (which is truthy and therefore triggers the TypeError) because of the call from the HTML class's __init__ to html5lib's parse function:
with result as (source_type, source, base_url, protocol_encoding):
    if isinstance(source, str):
        result = html5lib.parse(source, namespaceHTMLElements=False)
    else:
        result = html5lib.parse(
            source, override_encoding=encoding,
            transport_encoding=protocol_encoding,
            namespaceHTMLElements=False)
After determining that the source is not a string, it calls parse with keyword arguments that may be None, as they are in my case. html5lib then sees that override_encoding and transport_encoding are present and raises the error, even though I never specified them. To fix this for our own use, I patched this section to filter out keyword arguments whose value is None before calling parse. I also opened a PR just in case; maybe it's worth reporting this in html5lib too.
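Roughly, the idea behind the patch can be sketched like this. The helper name drop_none_kwargs is hypothetical and this is a simplified illustration of the filtering, not the exact code from the PR.

```python
def drop_none_kwargs(**kwargs):
    """Return only the keyword arguments whose value is not None.

    Hypothetical helper: a simplified sketch of filtering out unset
    encodings before they are forwarded to html5lib.parse.
    """
    return {name: value for name, value in kwargs.items() if value is not None}


# With no encodings specified, neither *_encoding key survives, so the
# name-based check in html5lib would have nothing to flag.
filtered = drop_none_kwargs(
    override_encoding=None,
    transport_encoding=None,
    namespaceHTMLElements=False,
)
print(filtered)

# Assumed usage inside the HTML class (not executed here):
#   html5lib.parse(source, **drop_none_kwargs(
#       override_encoding=encoding,
#       transport_encoding=protocol_encoding,
#       namespaceHTMLElements=False))
```

An encoding that was explicitly set still passes through untouched, so behaviour only changes for the case where nothing was specified.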
The PR is #2022. I must say the last time I used Python was around 3.1, so I'm not caught up on Python best practices or typing, as I primarily work with Node.js and TypeScript. This is also my first contribution to open source 😄
Thank you for your time!