
Possible memory leak following rendering #220

Closed
ivanprice opened this issue Sep 25, 2014 · 14 comments
Labels
performance Too slow renderings

Comments

@ivanprice

First off: awesome work with WeasyPrint, it's super great!

We're using WeasyPrint under Django and are experiencing a memory usage issue. If we make a big PDF (~360 pages with graphics), memory jumps up a lot (about 1.4 GB) at the moment we call document.render(). The problem is that once the PDF is made and the request is finished, the web server is still hanging onto that memory; it never seems to be released.

Is anyone aware of memory release issues when using WeasyPrint? The relevant code is here:

import gc
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO instead

import weasyprint

# low mem here
html = weasyprint.HTML(StringIO(html_string.encode('utf-8')),
                       encoding='utf8', base_url='file://')
# low mem here
html.write_pdf(target=out_filename)
# high mem here
del html
gc.collect()
# high mem here still

Maybe we should be spawning a new process to run WeasyPrint in for such large PDFs? We've tried calling gc.collect() to no effect.

any ideas would be appreciated, cheers

-ivan
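
A minimal sketch of the process-isolation approach suggested above (not code from this thread; the function names _render_worker and render_pdf are illustrative). Rendering in a child process means all the memory it allocates is returned to the OS when the child exits:

from multiprocessing import Process

import weasyprint


def _render_worker(html_string, out_filename):
    # Runs in a child process: everything WeasyPrint allocates here is
    # freed for good when this process exits.
    weasyprint.HTML(string=html_string, base_url='file://').write_pdf(out_filename)


def render_pdf(html_string, out_filename):
    worker = Process(target=_render_worker, args=(html_string, out_filename))
    worker.start()
    worker.join()
    if worker.exitcode != 0:
        raise RuntimeError('PDF rendering failed in the worker process')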

@SimonSapin
Member

WeasyPrint is known to use a lot of memory during rendering. Maybe that can be reduced, but I suspect not dramatically without deep refactoring.

Now, memory not being freed is another story. These may be relevant:

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
http://stackoverflow.com/questions/20489585/python-is-not-freeing-memory

It’s also possible that there is indeed a leak in WeasyPrint or one of its dependencies. Unfortunately, if that is the case, I don’t have an immediate idea of where to look, nor much time available to investigate. I’m interested in your findings if you want to look into it, though.

@liZe liZe added the performance Too slow renderings label Mar 8, 2016
@liZe liZe changed the title possible memory leak following html.render() ? Possible memory leak following rendering Jul 28, 2016
@liZe
Member

liZe commented Mar 26, 2017

@ivanprice I've been using WeasyPrint to generate lots of big reports and documents; it can use a lot of memory, but it doesn't leak, at least for me. I've also spent a lot of time tracking memory usage in #384 and fixed all the memory leaks introduced by adding @font-face support, so I'm pretty sure there's nothing wrong now. Do you have anything new about this issue?

@excieve

excieve commented May 29, 2017

I'm having a similar issue with WeasyPrint running in a Celery task. I'm using the following line of code to render a PDF into a file-like object:

content_file = ContentFile(HTML(string=html, encoding='utf-8').write_pdf())

The html variable contains an HTML string with a long (but simple) table.
Then it's saved on S3 and the task returns. The memory (all 1.4 GB of it) doesn't get reclaimed until I manually restart the Celery worker process. I've tried deleting objects, running the GC manually, etc., but nothing seems to help.

None of the WeasyPrint objects should even exist by the time write_pdf() returns, and the resulting PDF is less than 500 KB.

Any hints about where to even start looking into it?

@liZe
Member

liZe commented May 29, 2017

Any hints about where to even start looking into it?

As explained in the comment above, there's no reliable way to free the memory in Python, even with the garbage collector. The way to know if there's a "real" memory leak is to call your rendering multiple times and see if the memory is growing more and more. I've tried hard to hunt and kill memory leaks in #384 but I may have missed some of them, especially with tables (see #70).

I'll try with big tables and see if I can reproduce. If you can provide a sample HTML file, it may help too.
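
A minimal sketch of that repeated-rendering check, assuming the third-party psutil package and a local big_table.html test file; a real leak would show the RSS climbing on every run instead of plateauing after the first one:

import gc

import psutil  # third-party, used here only to read the process RSS
import weasyprint

process = psutil.Process()

for i in range(5):
    weasyprint.HTML('big_table.html', encoding='utf-8').write_pdf('/dev/null')
    gc.collect()
    rss_mib = process.memory_info().rss / (1024 * 1024)
    print('run %d: RSS %.1f MiB' % (i + 1, rss_mib))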

@excieve

excieve commented May 29, 2017

Here it is:
big_table.zip

Thanks for looking into it! Just for reference, I tried running the same function that generates this PDF in an IPython shell, and it also doesn't reclaim the memory after the function returns. However, on multiple runs it only grows slightly, from 1.4 GB to 1.5 GB. So the bulk of it probably isn't really a memory leak, but it's still odd that such a huge amount of memory doesn't get reclaimed.

@excieve

excieve commented May 29, 2017

The number of objects seems to increase substantially between the runs:

In [13]: len(gc.get_objects())
Out[13]: 162393
...
In [16]: len(gc.get_objects())
Out[16]: 274496

@liZe
Member

liZe commented May 29, 2017

I tried this code with your sample:

import gc
import weasyprint

print(len(gc.get_objects()))
for i in range(3):
    weasyprint.HTML('/tmp/big_table.html', encoding='utf-8').write_pdf('/dev/null')
    print(len(gc.get_objects()))

I get:

60100
60391
60386
60386

It looks normal. I'm using:

  • Linux,
  • Python 3.5.3,
  • WeasyPrint 0.36,
  • Cairo 1.14.8,
  • Pango 1.40.6.

Note that using an interactive shell may introduce side effects. For example, the default Python shell keeps the result of the last command in a variable called _, keeping references to most of the rendering objects (that's huge, but that's another problem).

@excieve

excieve commented May 30, 2017

I can confirm that this minimal example doesn't increase the object count between runs, so this might be something else. However, I can see that the memory is still not being reclaimed. The environment running this code for me is rather old: Ubuntu 12.04, Python 2.7.3, latest WeasyPrint, Cairo 1.10.2, Pango 1.30.0.

I thought it might be better with more recent dependencies, so I ran it on my host with Fedora 25, Python 3.5.3, latest WeasyPrint, Cairo 1.14.8, Pango 1.40.5. In this case total memory usage is somewhat better, but nothing is reclaimed until process shutdown.

I also tried it with Fedora's bundled WeasyPrint 0.22; that one actually had the object count increasing, while using no more than 900 MB of RSS. It didn't reclaim memory either, though.

@SimonSapin
Member

When objects are garbage-collected, the space they occupied becomes available for new Python objects, but it is not necessarily returned to the operating system, because of details of how the memory allocator works:

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm

@excieve

excieve commented May 30, 2017

@SimonSapin Thanks, I've read it, but this can actually be a problem in a memory-constrained environment (such as typical EC2 instances). Basically, there seem to be two possible ways forward from here:

  1. Somehow make the worker restart after a task that uses WeasyPrint
  2. Look into decreasing WeasyPrint's memory footprint when dealing with large tables

@liZe
Member

liZe commented May 30, 2017

Somehow make the worker restart after a task, which utilises WeasyPrint

Killing the process is the only reliable way to get your memory back.

Look into decreasing WeasyPrint's memory footprint when dealing with large tables

#70 is for you!

@liZe
Member

liZe commented May 30, 2017

As there's no evidence of a memory leak (even though I've tried hard many times to find one), I think we can close the bug. Decreasing WeasyPrint's memory footprint with large tables is a good idea; #70 is open to track this improvement.

@liZe liZe closed this as completed May 30, 2017
@excieve

excieve commented May 30, 2017

Thank you @liZe and @SimonSapin for looking into it!

@jimr

jimr commented Jun 2, 2017

@excieve FWIW, we pass --maxtasksperchild 10 to our celery worker command to avoid an ever-growing memory footprint for workers using WeasyPrint. This restarts workers every ten tasks, which works pretty well for our workload and means we don't have any issues with memory consumption any more.
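
For reference, a minimal sketch of the equivalent Celery configuration, assuming Celery 4 (where the setting is worker_max_tasks_per_child; older releases spell it CELERYD_MAX_TASKS_PER_CHILD). The module name and broker URL below are only examples:

# celery_app.py (example module name)
from celery import Celery

app = Celery('pdf_tasks', broker='redis://localhost:6379/0')  # example broker URL

# Recycle each worker process after 10 tasks so the memory used by
# WeasyPrint rendering is returned to the OS when the process exits.
app.conf.worker_max_tasks_per_child = 10

The same limit can also be set on the command line: --maxtasksperchild in older Celery releases, --max-tasks-per-child in newer ones.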
