PDF Windows out of memory #338

Open
opoudjis opened this issue Feb 3, 2025 · 12 comments

@opoudjis
Contributor

opoudjis commented Feb 3, 2025

Under Windows in GHA, but not OSX or Ubuntu, 001-v5 and 002-v5 are running out of heap space when generating PDF; this is new. There seem to be processing issues with the SVG as well.

https://github.com/metanorma/metanorma-cli/actions/runs/13111529625/job/36587857593

@opoudjis opoudjis added the bug Something isn't working label Feb 3, 2025
@github-project-automation github-project-automation bot moved this to 🆕 New in Metanorma Feb 3, 2025
@opoudjis
Contributor Author

opoudjis commented Feb 3, 2025

It is also catastrophically slow for Plateau, but I have already identified that this is a lutaml issue: metanorma/metanorma-plateau#159

@Intelligent2013 Intelligent2013 moved this from 🆕 New to 🏗 In progress in Metanorma Feb 4, 2025
@Intelligent2013
Contributor

I tried generating the PDF locally (Win 10) from the XML in [github-pages](https://github.com/metanorma/mn-samples-plateau/actions/runs/13110934213/artifacts/2526423501), and the process completed successfully.

Differences between GH and local machine:

From https://github.com/metanorma/metanorma-cli/actions/runs/13111529625/job/36587857593:

  • java -Xss10m -Xmx3g ...
  • from log [mn2pdf] Rendered page #1355.

Locally:

  • java -Xss5m -Xmx2072m ...
  • from log Rendered page #534.

I don't understand why there are 1355 pages on GH...
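To compare the two environments, it can help to print the settings the JVM actually ended up with rather than relying on the command line alone. The following is a minimal sketch using only standard JDK calls; the class name `JvmSettingsReport` is made up for illustration and is not part of mn2pdf. Note that the reported maximum heap is usually close to, but not always exactly equal to, the `-Xmx` value.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Illustrative helper (hypothetical name): reports the heap limit and JVM flags in effect. */
final class JvmSettingsReport {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Builds a one-line summary of the effective max heap and the JVM input arguments. */
    static String describe() {
        long maxHeapMiB = toMiB(Runtime.getRuntime().maxMemory());
        // Flags such as -Xss10m and -Xmx3g show up here if they were passed to this JVM.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        return "max heap: " + maxHeapMiB + " MiB, JVM args: " + jvmArgs;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it with the same flags as the failing job (for example `java -Xss10m -Xmx3g JvmSettingsReport.java` on Java 11+) shows whether the flags reached the JVM.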

@ReesePlews
Contributor

hello @Intelligent2013 thank you for continuing to check this issue.

i will add the following information (based on my local build - win10) about the Plateau documents:

  • doc01 (sources/001-v5) has 1355 pages
  • doc02 (sources/002-v5) has 537 pages

perhaps each document is being built?

not sure if that is helpful information or not.

@Intelligent2013
Copy link
Contributor

@ReesePlews thank you! Actually, the issue occurs during PDF generation for document 001-v5 (1355 pages).

On my machine with the JVM settings java -Xss5m -Xmx2072m ... I get the error java.lang.OutOfMemoryError: Java heap space at the PDF generation stage. With java -Xss10m -Xmx3g ... the PDF is created in 5 minutes (file size is 59 MB), but the next step is PDF validation by the veraPDF tool. I have already waited 35 minutes and the process is still running:

Image

I think I'll get the same java.lang.OutOfMemoryError: Java heap space exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOfRange(Arrays.java:3664)
	at java.lang.String.<init>(String.java:207)
	at java.lang.StringBuilder.toString(StringBuilder.java:412)
	at org.verapdf.pdfa.validation.validators.BaseValidator.addAllLinkedObjects(BaseValidator.java:268)
	at org.verapdf.pdfa.validation.validators.BaseValidator.checkNext(BaseValidator.java:199)
	at org.verapdf.pdfa.validation.validators.BaseValidator.validate(BaseValidator.java:144)
	at org.verapdf.pdfa.validation.validators.BaseValidator.validate(BaseValidator.java:108)
	at org.metanorma.fop.VeraPDFValidator.validate(VeraPDFValidator.java:37)
	at org.metanorma.fop.PDFGenerator.convertmn2pdf(PDFGenerator.java:504)
	at org.metanorma.fop.PDFGenerator.process(PDFGenerator.java:320)
	at org.metanorma.fop.mn2pdf.main(mn2pdf.java:338)

The issue occurs in the veraPDF API.
The tip from veraPDF/veraPDF-library#952 is to increase the Java heap space.
I'll try checking the PDF in a simple standalone application that does only the PDF check.

@ReesePlews
Contributor

hello @Intelligent2013 thank you for the additional information.

i don't recall checking or installing java when i installed mn on my machine.

i am currently running in powershell (after the Plateau project is completed i will change to MS Linux WSL)

here is my java information from within powershell

PS E:\github\mn-samples-plateau> java --version
openjdk 17.0.14 2025-01-21
OpenJDK Runtime Environment Temurin-17.0.14+7 (build 17.0.14+7)
OpenJDK 64-Bit Server VM Temurin-17.0.14+7 (build 17.0.14+7, mixed mode, sharing)
PS E:\github\mn-samples-plateau>

but when i look at "java settings" app i see this:

User tab
Image

System tab

Image

could there be an issue with my java install between windows and powershell?

@Intelligent2013
Contributor

here is my java information from within powershell
but when i look at "java settings" app i see this:

hello @ReesePlews the different versions in the PowerShell output (17.0.14) and in the Java settings (1.8.0_441 and 1.8.0_431) mean that you have 3 JVMs installed. The path to version 17.0.14 comes first in the PATH environment variable, which is why you see 17.0.14 in the java --version output.

could there be an issue with my java install between windows and powershell?

no, the mn2pdf tool works well with any Java version from 1.8 onwards.
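When several JVMs are installed, a quick way to confirm which one a given shell actually launches is to ask the running JVM itself. This is a small sketch using standard system properties only; the class name `WhichJvm` is hypothetical.

```java
/** Illustrative helper (hypothetical name): reports which installed JVM is running this code. */
final class WhichJvm {

    /** Returns the version and installation directory of the current JVM. */
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
                + ", java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The `java.home` value points at the installation that was picked up first from PATH, so it distinguishes the Temurin 17 install from the two 1.8 installs.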

@Intelligent2013
Contributor

Intelligent2013 commented Feb 5, 2025

Checking the PDF with veraPDF in a simple standalone application takes 113 sec.
mn2pdf fails with the error java.lang.OutOfMemoryError: Java heap space.
After refactoring mn2pdf (memory optimization), the PDF check works, but takes 947 sec.

All unused objects are set to null before the veraPDF check:

src = null;
...
xsltConverter = null;
fontcfg = null;
System.gc();

but the used heap size is still 650 MB:

Image

I'll investigate further.
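For anyone who wants to reproduce this kind of measurement without a profiler, used heap can be estimated from inside the JVM. The sketch below uses only `java.lang.Runtime`; the class name `UsedHeap` and the 64 MiB test buffer are illustrative assumptions, not code from mn2pdf. `System.gc()` is only a hint to the JVM, so the numbers are approximate.

```java
/** Illustrative helper (hypothetical name): estimates used heap before and after dropping a reference. */
final class UsedHeap {

    /** Returns the heap currently in use, in bytes (allocated heap minus its free part). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        byte[] big = new byte[64 * 1024 * 1024]; // stand-in for a large object that is no longer needed
        big[0] = 1;
        long before = usedBytes();

        big = null;   // drop the only reference, as in the snippet above
        System.gc();  // a request, not a guarantee, that unreachable objects are collected

        long after = usedBytes();
        System.out.println("used before: " + before / (1024 * 1024) + " MiB");
        System.out.println("used after:  " + after / (1024 * 1024) + " MiB");
    }
}
```

If the "after" figure stays high in a real run, something still holds a reference to the data, which is the situation a heap dump analysis is meant to pin down.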

@ReesePlews
Copy link
Contributor

hello @Intelligent2013 thank you for the additional information and checking. i will await your answer.

@Intelligent2013
Copy link
Contributor

Intelligent2013 commented Feb 9, 2025

Found memory leaks in Apache FOP.

A simple test program:

File fPDF = new File("D:\\Work\\Metanorma\\XML\\PLATEAU\\test.pdf");

OutputStream out = null;

try {

    TransformerFactory factory = TransformerFactory.newInstance();
    Transformer transformer = factory.newTransformer(); // identity transformer

    FopFactory fopFactory = FopFactory.newInstance(new File("D:\\Work\\Metanorma\\XML\\PLATEAU\\document.presentation.pdf.pdf_fonts_config.xml.out"));

    JEuclidFopFactoryConfigurator.configure(fopFactory);

    FOUserAgent foUserAgent = fopFactory.newFOUserAgent();

    foUserAgent.setProducer("Ribose Metanorma mn2pdf version " + Util.getAppVersion());

    foUserAgent.getEventBroadcaster().addEventListener(new LoggingEventListener());

    out = new FileOutputStream(fPDF);
    out = new BufferedOutputStream(out);
    
    String mime = MimeConstants.MIME_PDF;
    
    Fop fop = fopFactory.newFop(mime, foUserAgent, out);

    Source src = new StreamSource(new File("D:\\Work\\Metanorma\\XML\\PLATEAU\\test.pdf.fo.xml"));

    Result res = new SAXResult(fop.getDefaultHandler());

    transformer.transform(src, res);

} catch (Exception e) {
    System.out.println(e.toString());
} finally {
    if (out != null) {
        out.close(); // out is still null if an exception was thrown before the stream was opened
    }
}

System.gc();

VeraPDFValidator v = new VeraPDFValidator();
v.validate(fPDF, PDF_UA_MODE);

Between System.gc(); and VeraPDFValidator v = new VeraPDFValidator(); the used Java heap size is 790 MB.
The Eclipse Memory Analyzer tool found the problem:

Image

Image

Image

So far, I haven't figured out how to find the cause in the code.

Possible workarounds:

  • increase the heap size (via the JAVA_OPTS environment variable, but I'm not sure that it will work in Docker)
  • extract the veraPDF validation step into a separate application.
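The second workaround could look roughly like the sketch below: the validation runs in a child JVM with its own heap limit, so memory still held by the PDF generation step does not count against it. This is only an illustration under assumptions. The class name, the jar name `pdf-validator.jar`, and its command-line shape are hypothetical; only `ProcessBuilder` and the `-Xmx` / `-jar` launcher options are standard.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch (hypothetical names): run PDF validation in a separate JVM process. */
final class SeparateValidatorProcess {

    /** Builds the command line for a child JVM with its own heap limit. */
    static List<String> buildCommand(String maxHeap, String validatorJar, String pdfPath) {
        // Reuse the JVM that is running this code instead of whatever is first on PATH.
        String javaBin = System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.add("-Xmx" + maxHeap);
        command.add("-jar");
        command.add(validatorJar); // hypothetical standalone validator jar
        command.add(pdfPath);
        return command;
    }

    /** Starts the child process, forwards its output to this console, and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand("3g", "pdf-validator.jar", "test.pdf");
        System.out.println(String.join(" ", command));
        // Call run(command) once a real validator jar exists at that path.
    }
}
```

Because the child process exits when validation finishes, its whole heap is returned to the operating system regardless of any leak in the generation step.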

@ronaldtse
Contributor

  1. I believe VeraPDF should always be run in a separate container, they have official containers.
  2. FOP memory leaks. Are these related to our latest patches for vertical layout or irrelevant? We have to get someone to fix them.

@Intelligent2013
Copy link
Contributor

  1. I believe VeraPDF should always be run in a separate container, they have official containers.

Currently, PDF checking with the veraPDF Docker container is integrated into the mn-native-pdf repository:
https://github.com/metanorma/mn-native-pdf/blob/master/.github/workflows/verapdfcheck.yml
I'll add it to mn-samples-plateau as well.

Regarding,

... VeraPDF should always be run in a separate container...

We have a scenario where the user runs the metanorma process from the command line on their own machine. Do we need to require the user to install Docker for PDF checking? Or would it be better to integrate one more jar (verapdfchecker.jar, for example) into the mn2pdf-ruby package and run it immediately after mn2pdf.jar?

  1. FOP memory leaks. Are these related to our latest patches for vertical layout or irrelevant? We have to get someone to fix them.

I have to check the original Apache FOP with minimal dependencies.

@Intelligent2013
Contributor

  1. FOP memory leaks. Are these related to our latest patches for vertical layout or irrelevant? We have to get someone to fix them.

@ronaldtse the memory leaks occur in the original, unmodified Apache FOP.
I've added an issue with a simple test application: metanorma/xmlgraphics-fop#41
