Text repeated after some line breaks #2016
Hi, and thanks for the report! The problem with the tables is not a technical bug in WeasyPrint: browsers don’t resize tables automatically. The table’s content doesn’t fit in the page width, so the table is wider than the page. In a browser you can have a scroll bar, but in a PDF you can’t. The usual solution is to use smaller font sizes and paddings for tables in printed media. You can also put tables on pages that have a different size (e.g. landscape). The bug about repeating text is a real bug we can track in this issue.
Hi, thank you for replying. : ) I have gone through many different libraries and this one is by far the easiest. Can you suggest some options for me to try to fit the page content and not have repeated text? Is there a way for me to detect in WeasyPrint when content is too big and then switch to landscape, but only for the page that overflows? I know the best solution in my case is to use a headless instance, but I already have so many dependencies that I could not ask someone to go through configuring that too. I've gone through this repo and it seems there is no flexbox support? It is also worth mentioning that the render was done with the pip wikipedia package, using the html method on the wikipedia object. I have been reading this repo for a few days and I saw that many had issues with td not being broken and similar syntax errors. This repo has been like the light at the end of the tunnel, as I have been struggling for a whole week just to get PDF to work. Interestingly, epub in pypandoc doesn't complain, but pdflatex doesn't work as wkhtmltopdf is deprecated. Sorry if I was a bit verbose; I got my question rejected on Stack Overflow as it wasn't descriptive enough.
Sorry, I read this again. I already have so much documentation and so many implementations in my head from this past week that it is hard to focus. Do you have any thoughts about the repeating text? Could it be because I get a lot of CSS errors of "} expected" when I request the site? When I validate it, it doesn't seem like anything should be repeating the text: https://validator.w3.org/nu/?doc=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FGrand_Rapids%2C_Michigan
Good to know!
There’s no magical solution. Using smaller font sizes and reducing paddings for tables is often enough.
Repeated text is a bug in WeasyPrint and will be fixed really soon: the fix is already written, I’ll just add some tests to avoid any regressions, and commit everything.
There’s no way to detect the size in CSS, but you can for example use the number of columns to set landscape pages, with something like:
@page big-table {
  size: A4 landscape;
}
table:has(td:nth-child(6)) {
  page: big-table;
}
(Not tested for real, but you get the idea.) There are other possible hacks, but it goes further than what we should talk about in a bug report!
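For anyone who prefers to make this decision before rendering, here is a minimal Python sketch of the same column-count heuristic, using only the standard library. It is not WeasyPrint functionality: `TableWidthScanner`, `wide_table_indexes` and the threshold of 6 columns are hypothetical names and values chosen for illustration, and the sketch assumes reasonably well-formed table markup.

```python
from html.parser import HTMLParser


class TableWidthScanner(HTMLParser):
    """Record the widest row (in columns) of each top-level <table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # current <table> nesting depth
        self.row_cells = 0   # columns counted in the current row
        self.widest = 0      # widest row seen in the current table
        self.tables = []     # widest row of each finished top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.widest = 0
                self.row_cells = 0
        elif tag == "tr" and self.depth == 1:
            self.row_cells = 0
        elif tag in ("td", "th") and self.depth == 1:
            # A cell with colspan="3" occupies three columns.
            span = dict(attrs).get("colspan") or "1"
            self.row_cells += max(int(span), 1) if span.isdigit() else 1
            self.widest = max(self.widest, self.row_cells)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            if self.depth == 1:
                self.tables.append(self.widest)
            self.depth -= 1


def wide_table_indexes(html, min_columns=6):
    """Return the positions of top-level tables with at least min_columns columns."""
    scanner = TableWidthScanner()
    scanner.feed(html)
    scanner.close()
    return [i for i, cols in enumerate(scanner.tables) if cols >= min_columns]


# Tiny made-up document: a 2-column table followed by a 6-column one.
sample = (
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "<table><tr><th colspan='4'>h</th><th>i</th><th>j</th></tr></table>"
)
print(wide_table_indexes(sample))
```

The returned indexes could then be used to decide whether a landscape stylesheet is worth applying at all. Column count is only a rough proxy for width, so long cell contents can still overflow.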
Flexbox is supported (but its support is far from perfect) in
It’s common to find non-valid HTML in Wikipedia, but browsers are often designed to do anything they can to render something, even with these errors.
I hope that you’ll finally get the rendering you want!!
Your HTML files are probably encoded with UTF-8 and you can configure WeasyPrint to use it as well with the |
Thank you very much, you really answered everything I need. I don't want to sound too bold, but I got the impression that PDF is, if not evil, then downright menacing and cruel. :) I gave Prince XML a whirl in the meantime and was surprised that even it rendered the page 1:1 based on the HTML and not the styling. I am very confused but intrigued: I spent a week on something that should have taken a couple of hours at most, since I am annoyed if I don't understand something. Why is the output so different from a URL and from HTML? I would love to be able to plug in a file and get a similar result. And may I also ask for some clarification on "smaller font sizes and reducing paddings for tables"?
gives error TypeError: can't multiply sequence by non-int of type 'float' I look forward to your commit. |
The stylesheets are probably different for some reason. Or some resources are broken because paths don’t work with files (and only work behind an HTTP server). Or… It’s difficult to know what’s going on, you have to carefully check the differences in HTML and in CSS to know what’s going on.
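To illustrate the point about paths, here is a small sketch using Python's standard `urllib.parse.urljoin`. The local file location and the stylesheet path are made-up values for illustration. The point is only that the same root-relative reference resolves to different places depending on where the document was loaded from.

```python
from urllib.parse import urljoin

# The page address comes from the thread; the local path is hypothetical.
PAGE_URL = "https://en.wikipedia.org/wiki/Grand_Rapids,_Michigan"
LOCAL_FILE = "file:///home/user/wiki_page.html"

# A root-relative stylesheet reference (made-up path).
STYLESHEET_HREF = "/static/site.css"


def resolve(base, href):
    """Resolve a resource reference against the document's base URL."""
    return urljoin(base, href)


online = resolve(PAGE_URL, STYLESHEET_HREF)
offline = resolve(LOCAL_FILE, STYLESHEET_HREF)

print(online)   # points at the web server that hosts the page
print(offline)  # points at the root of the local file system instead
```

If this turns out to be the cause, WeasyPrint's `HTML` class accepts a `base_url` argument, so passing the original page address there when rendering the local file may be worth a try.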
Something like:
table { font-size: 0.8em }
th, td { padding: 0.1em }
But again, it won’t work in all cases (and it’s common in Wikipedia to find very large tables that don’t even fit in a browser window).
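As a rough sketch of how these rules could be applied from a script, the Python below builds the stylesheet as a string and hands it to WeasyPrint as an extra stylesheet. `compact_table_css` and `render` are hypothetical helper names, the default values simply mirror the ones suggested in this thread, and the optional landscape rule is the untested idea from earlier in the discussion.

```python
def compact_table_css(font_size_em=0.8, padding_em=0.1, landscape_columns=None):
    """Build a user stylesheet that shrinks tables for print."""
    rules = [
        f"table {{ font-size: {font_size_em}em }}",
        f"th, td {{ padding: {padding_em}em }}",
    ]
    if landscape_columns is not None:
        # Untested idea from the thread: put tables with at least this
        # many columns on a landscape page.
        rules.append("@page big-table { size: A4 landscape }")
        rules.append(
            f"table:has(td:nth-child({landscape_columns})) {{ page: big-table }}"
        )
    return "\n".join(rules)


def render(html_path, pdf_path, css_text):
    """Render html_path to pdf_path with an extra stylesheet (needs WeasyPrint)."""
    # Imported lazily so that compact_table_css works without WeasyPrint installed.
    from weasyprint import CSS, HTML

    HTML(filename=html_path).write_pdf(pdf_path, stylesheets=[CSS(string=css_text)])


print(compact_table_css(landscape_columns=6))
```

A call such as `render("wiki_page.html", "wiki_page.pdf", compact_table_css())` would then apply the smaller fonts and paddings. The values will likely need tuning per document.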
It may be a problem caused by the repeating text (I had an equivalent bug before the fix). If you still have a crash after the fix, please open a new issue.
It’s technically OK, but adding a padding on everything is a bit strange, and changing the font size on |
The bug should now be fixed on the |
At first I did not see a difference, and then I realized I had forgotten to clone main. After that, no matter what I do, there is no repeated text. Outstanding work! Some tables are indeed too big to do without landscape. Really looking forward to seeing further development of flexbox. I originally thought that they ran an ML algorithm and had their engineers write low-level catch-alls, but, procrastinating as I do when I hit a wall, I later found out that Mercury is a Prolog alternative engineered at an Australian university, from where the Prince developers started their business. I am writing this to point out (although I'm sure someone with your experience came to the same conclusion long ago) that flexbox may be a very difficult topic to implement by yourself. Maybe languages like Prolog are the sanest alternative, where the code is compiled based on given constraints. Thank you so much for your help and patience.
Even if it’s not based on WebKit, it’s based on the same specifications! Maybe there’s a bug.
Prince is a great piece of software. With Håkon in the team, they know CSS quite well. 😄
That’s actually not that difficult, because the specification is recent and really well written. Some parts about pagination were missing when the first drafts were written, but it’s now pretty complete. Just as The limiting factor is dedicated time (i.e. sponsors), not technical difficulties … at least for this feature. Compared to flex, good old tables are a real nightmare, because they’ve been implemented in browsers before CSS was even born!
Well, 1 million downloads per month is already quite a great success for us! 😄 But yes, logic (pun intended) programming is theoretically interesting for web rendering. I suppose that browsers don’t use it mainly because it’s harder to find Prolog/Mercury developers than C++ developers!
Have fun with WeasyPrint 💜
There’s a bug: #2019
Thank you, looking forward to trying this when it's done. Right now I'm calling my render good enough and will patiently wait for SO to downvote and close my topic (again) because it was not descriptive enough. I can't provide a code cell to run if I can't install a dependency on the platform (in this case Prince), unless they expect me to dockerize the thing and host it for their own viewing pleasure. As ruthless a force of nature as PDF is, I am a bit of a masochist too: I couldn't help telling myself that maybe that h3 heading should not be at the bottom of the page (on Wikipedia, h2 is the title), and that maybe I should have no more than 10 lines of widows and orphans. I tried to put it past me, but Prince's documentation drove the point further, along the lines of "this and that is not pretty, this is even worse", etc., and now it really irks me. When that inevitably fails, I hope that this is an easy fix and that I can use your solution in the future. Right now I am blatantly crossing the line, and I understand if you don't reply, because you went above and beyond and closed the issue a few messages ago. I just didn't want to send unsolicited PMs. As far as I can figure, your solution is mostly for helping people cut costs for their company or start a business, and is a great way (as in, people would expect it to cost money) to generate receipts and invoices.
That’s something you can do with
WeasyPrint supports a lot of paged media features, including of course
Widows, orphans and page breaks, but also footnotes, page margins, page counters, running elements, leaders, cloned box decorations… There are many, many more features related to pagination supported by WeasyPrint. A simple way to learn more about some of them is to read CourtBouillon’s blog, particularly the "CSS tricks" entries.
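As a minimal sketch of the widows, orphans and page-break features listed above, the Python helper below assembles a small stylesheet from standard CSS fragmentation properties (`break-after`, `orphans`, `widows`). `pagination_css`, the heading levels and the line count of 3 are assumptions made for illustration, not a recommendation from the maintainers.

```python
def pagination_css(heading_levels=(2, 3), min_lines=3):
    """Build CSS that keeps headings with the text that follows them
    and limits isolated paragraph lines at page boundaries."""
    headings = ", ".join(f"h{level}" for level in heading_levels)
    return "\n".join([
        # Avoid a page break right after a heading.
        f"{headings} {{ break-after: avoid }}",
        # Require at least min_lines lines of a paragraph on each side of a break.
        f"p {{ orphans: {min_lines}; widows: {min_lines} }}",
    ])


print(pagination_css())
```

The resulting string can be added to the page's own styles or passed to the renderer as an extra stylesheet.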
There’s a Matrix channel for longer discussions! 😄
Providing software (and sometimes advice 😁) for free is an important part of our vision of FOSS, but that’s only the tip of the iceberg: what we want to do with WeasyPrint goes beyond this. We build a powerful PDF generator based on open standards we want to defend and develop. We work hard to keep the code simple and maintainable, so that everyone can learn and contribute. We find great partners (💜) that we help to grow, and they help us back financially and morally to provide a better tool for them, and for everyone. |
This is what I tried to do, and I later found out that many break properties are not even part of a standard, and that I should focus on using break-inside. I am really struggling with that; I have read as much as I could find, and at this point I'm basically trying different things to see if anything sticks. As you can see, I'm not that well versed in web technologies, and I am trying to find a community that could help me work out whether I should clean up my HTML or whether I am using the rule wrong. I hope that is OK to ask in the Matrix channel?
Your passion is contagious; hopefully one day I too can find my special way of helping others.
Hello.
I must thank everyone involved as what was supposed to be a simple project turned out to be a nightmare.
I tried most HTML-to-PDF libraries, and so far this one seems to be the best maintained and the one that stays truest to conventional styling, i.e. what I would get using CTRL + P on a webpage.
File I'm using is parsed wikipedia html using wikipedia package.
I named it wiki_page.html. I used a local HTML file because I originally used pypandoc to convert HTML to EPUB, but ran into problems with pdflatex and later with wkhtmltopdf from pdfkit. I decided I wouldn't use pypandoc for PDF, as I am building a command-line script and I can't expect users to download even MiKTeX, let alone anything else.
I will attach images of the repeated text and the cut-off table, along with the HTML file and the PDF.
Grand Rapids, Michigan.pdf
I suspect the HTML file has syntax errors, but I am not that well acquainted with that language.
Please let me know if I need to improve this report, and thank you for reading this issue.
https://www.mediafire.com/file/i935dpvn8t3lhlc/wiki_page.html/file