Text Documents are altered when merged #1058

Phiwatec · 2022-07-04T21:04:42Z

When using PdfMerger to merge (append) two simple PDF files, the result is incorrect.
The first file is appended without a problem, but the second one is a mixture of both. This does not happen if the PDFs are dissimilar.
It happens both with PDFs created by LibreOffice Writer and with PDFs from an online conversion service.

$ python -m platform
Linux-5.10.0-13-amd64-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.1

$ python --version
Python 3.9.2

This is a minimal, complete example that shows the issue:

# Basic example from the docs
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
merger.append("LO_first.pdf")
merger.append("LO_second.pdf")
merger.write("LO_out.pdf")
merger.close()

There are six attached files:
LO_first.pdf and LO_second.pdf are files created with LibreOffice Writer.
LO_out.pdf is the resulting incorrect file.
online_first.pdf and online_second.pdf were created from a plaintext file using an online conversion service.
online_out.pdf is the resulting incorrect file.

When using Firefox to view the resulting document, it does not show a hidden character:

When using Chrome or Okular, it shows a non-printable character:

In both cases it should be bcde and not abc.

If the files contain very similar content ("test1", "test2", "test3", etc.), the first file is used for all pages.

Thanks in advance for looking at this behavior.

MartinThoma · 2022-07-04T21:14:28Z

Thank you for reporting the issue. I'll have a look this week (and I hope @MasterOdin / @pubpub-zz have some time as well 😅)

Phiwatec · 2022-07-04T21:24:19Z

Thank you :)

pubpub-zz · 2022-07-04T21:46:09Z

I think the issue arises because both input PDF files define a font named /F1, but with different definitions. In the first file the font covers only three characters, "a", "b" and "c", encoded as 01 02 03, whereas in the second file the codes 01 02 03 04 map to "b", "c", "d" and "e".
After merging, page 2 refers to the same font object as page 1, which produces the incorrect text.
Bug confirmed.
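
To make the collision easier to picture, here is a toy model in plain Python. It is not PyPDF2 code and does not reproduce how the fonts actually came to be shared inside the library; the dictionaries and the helpers `decode`, `merge_by_name`, `merge_per_page` and `render` are hypothetical names made up for this sketch. It only shows why decoding page 2 with page 1's /F1 gives "abc" plus an unmapped character instead of "bcde".

```python
# Toy model of the /F1 collision; NOT PyPDF2 code, all names are hypothetical.
# Each "font" maps character codes to glyphs, as described in the comment above.
FONT_DOC1 = {1: "a", 2: "b", 3: "c"}
FONT_DOC2 = {1: "b", 2: "c", 3: "d", 4: "e"}

page1 = {"codes": [1, 2, 3], "fonts": {"/F1": FONT_DOC1}}
page2 = {"codes": [1, 2, 3, 4], "fonts": {"/F1": FONT_DOC2}}


def decode(codes, font):
    """Turn character codes into text; unmapped codes become U+FFFD."""
    return "".join(font.get(code, "\ufffd") for code in codes)


def merge_by_name(pages):
    """Faulty merge: the first font seen under a name is reused everywhere."""
    seen = {}
    merged = []
    for page in pages:
        fonts = {name: seen.setdefault(name, font)
                 for name, font in page["fonts"].items()}
        merged.append({"codes": page["codes"], "fonts": fonts})
    return merged


def merge_per_page(pages):
    """Expected merge: every page keeps the font from its own document."""
    return [{"codes": p["codes"], "fonts": dict(p["fonts"])} for p in pages]


def render(pages):
    """Decode each page's codes with whatever /F1 the page points to."""
    return [decode(p["codes"], p["fonts"]["/F1"]) for p in pages]


buggy = render(merge_by_name([page1, page2]))
fixed = render(merge_per_page([page1, page2]))
print(buggy)  # page 2 is decoded with the first document's font
print(fixed)  # page 2 is decoded with its own font
```

In this model the unmapped code 04 stands in for the hidden or non-printable character that the viewers display differently.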

MartinThoma · 2022-07-05T12:44:58Z

@Phiwatec This issue was likely the same as #1062. It was introduced in 2.4.1 via #207. It was fixed in 2.4.2 (released moments ago) via #1063. A test was added to prevent this issue from happening again.

Could you please check if things work with PyPDF2==2.4.2 for you again?
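
For anyone unsure whether their environment is affected, a small stdlib-only check along these lines may help. It assumes, as stated above, that only 2.4.1 carries the regression and that the version string is a plain dotted release number; `parse_version` and `is_affected` are hypothetical helper names, not part of PyPDF2.

```python
def parse_version(text):
    """Parse a plain dotted release string such as '2.4.1' into a tuple."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_affected(version_text):
    """True only for 2.4.1, the release named in this thread as affected."""
    return parse_version(version_text) == (2, 4, 1)


print(is_affected("2.4.1"))
print(is_affected("2.4.2"))
```

With PyPDF2 installed, the same check can be run as `is_affected(PyPDF2.__version__)`.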

Phiwatec · 2022-07-05T13:59:37Z

Thanks for the quick reply. It now works perfectly fine :)
Thank you for your time.

MartinThoma added the is-bug and PdfMerger labels Jul 4, 2022

Phiwatec closed this as completed Jul 5, 2022
