PDF attachments are not PDF/A-3 compliant #1869

Merged (14 commits) on Feb 8, 2024

Conversation

timoramsauer
Contributor

I compile a file with file attachments using anchors.
pdf_variant is set to pdf/a-3b.

Checking the PDF with veraPDF reveals problems:
[image]

Furthermore, ModDate is missing even though it is required by spec.

The following attributes were added:

  • EmbeddedFile/CreationDate (technically not required for compliance)
  • EmbeddedFile/ModDate
  • EmbeddedFile/MimeType
  • Filespec/AFRelationship
  • Catalog/AF

This pull request works for me as I only compile to pdf/a.
I understand that this might not be a general fit and I'm happy to adjust it.

However, I have some open questions:
Which attributes should be part of all PDFs, and which belong only to PDF/A and should therefore likely reside in pdfa.py?

Where should the information come from?
Currently:

  • EmbeddedFile/CreationDate: manually defined when creating Attachment, file attribute (for local files) or current timestamp
    • Alternative: DocumentMetadata.CreationDate
  • EmbeddedFile/ModDate: same as CreationDate
  • EmbeddedFile/MimeType: mimetypes.guess_type(url, strict=False)
    • Alternative 1: for remote files, the HTTP MIME type may also be used
    • Alternative 2: the HTML type attribute should likely be preferred
  • Filespec/AFRelationship: manually defined when creating Attachment, currently 'Source' by default.
    • Alternative: Add an additional rel attribute. According to the HTML spec, several rel types can be given (space-separated)
  • Catalog/AF: All attachments are also linked to the whole document:
    • Alternative: Attachments may also be linked to pages or marked content (Source). This seems more accurate. However, it needs more implementation work, and I currently prefer to have my attachments also show up in the attachment section of my document viewer.
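
The fallback order described in the list above could be sketched roughly as follows. This is only an illustration, not the code in this pull request: the helper names (`pdf_date`, `guess_mime_type`, `attachment_dates`) are hypothetical, using the file's modification time as the "file attribute" is an assumption, and so is the `application/octet-stream` fallback.

```python
import mimetypes
import os
from datetime import datetime, timezone


def pdf_date(moment):
    """Format an aware datetime as a PDF date string expressed in UTC."""
    return moment.astimezone(timezone.utc).strftime("D:%Y%m%d%H%M%SZ")


def guess_mime_type(url, declared_type=None):
    """Prefer an explicitly declared type (e.g. an HTML type attribute),
    then fall back to guessing from the URL's extension."""
    if declared_type:
        return declared_type
    guessed, _encoding = mimetypes.guess_type(url, strict=False)
    # Assumption: use a generic binary type when nothing can be guessed.
    return guessed or "application/octet-stream"


def attachment_dates(path=None, created=None):
    """Return (CreationDate, ModDate) as PDF date strings.

    Order: manually given date, then a local file's modification time
    (assumed here), then the current timestamp.
    """
    if created is None:
        if path is not None and os.path.isfile(path):
            created = datetime.fromtimestamp(
                os.path.getmtime(path), tz=timezone.utc)
        else:
            created = datetime.now(timezone.utc)
    stamp = pdf_date(created)
    # ModDate is the same as CreationDate, as described above.
    return stamp, stamp
```

Keeping the declared type as the first choice mirrors "Alternative 2" above; dropping the `declared_type` argument gives the behaviour currently described for this pull request.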

There is currently no option in this pull request to not include these attributes.

@liZe
Member

liZe commented Dec 6, 2023

Sorry for taking such an awfully loooooooong time to comment on this PR.

We need some time and some real-life testers to improve PDF/A and PDF/UA. We have other related issues we’d like to solve too, we’ll probably take care of all of them at the same time, sometime in the future.

@kesara
Contributor

kesara commented Jan 31, 2024

I've tested this by generating a PDF for an RFC and producing PDF/A-3B compliant documents. 🎉

verapdf output:

  <jobs>
    <job>
      <item size="91980">
        <name>/docs/rfc9527.pdf</name>
      </item>
      <validationReport jobEndStatus="normal" profileName="PDF/A-3B validation profile" statement="PDF file is compliant with Validation Profile requirements." isCompliant="true">
        <details passedRules="144" failedRules="0" passedChecks="42171" failedChecks="0"></details>
      </validationReport>
      <duration start="1706726655753" finish="1706726656329">00:00:00.576</duration>
    </job>
  </jobs>
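
For anyone who wants to check such a report automatically, here is a minimal sketch that reads the excerpt above with Python's standard library. It assumes the report looks exactly like this excerpt; a full veraPDF report may wrap these elements differently, and `compliance_summary` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The excerpt quoted above, reproduced as a string.
REPORT = """
<jobs>
  <job>
    <item size="91980">
      <name>/docs/rfc9527.pdf</name>
    </item>
    <validationReport jobEndStatus="normal" profileName="PDF/A-3B validation profile" statement="PDF file is compliant with Validation Profile requirements." isCompliant="true">
      <details passedRules="144" failedRules="0" passedChecks="42171" failedChecks="0"></details>
    </validationReport>
    <duration start="1706726655753" finish="1706726656329">00:00:00.576</duration>
  </job>
</jobs>
"""


def compliance_summary(xml_text):
    """Return one dict per job with the file name and compliance result."""
    root = ET.fromstring(xml_text)
    results = []
    for job in root.iter("job"):
        report = job.find("validationReport")
        if report is None:
            continue  # job without a validation report
        details = report.find("details")
        results.append({
            "file": job.findtext("item/name"),
            "profile": report.get("profileName"),
            "compliant": report.get("isCompliant") == "true",
            "failed_rules": (int(details.get("failedRules", "0"))
                             if details is not None else None),
        })
    return results


for entry in compliance_summary(REPORT):
    print(entry["file"], entry["profile"], entry["compliant"])
```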

@liZe
Member

liZe commented Feb 2, 2024

@timoramsauer @kesara Thanks!

I’ve updated the PR to handle PDF/A specificities only when actually generating PDF/A files. I’ve also (hopefully!) followed the different rules for A-1b, A-2b and A-3b (A-4b is not tested yet by VeraPDF). Tests are welcome!

@timoramsauer
Contributor Author

This works for me. Thanks!

@kesara
Contributor

kesara commented Feb 5, 2024

@liZe, For example in #2052, I get an error with the latest changes:

$ weasyprint --pdf-identifier foobar  --pdf-variant "pdf/a-3b" foobar.html foobar.pdf
Traceback (most recent call last):
  File "/Users/kesara/lab/WeasyPrint/venv/bin/weasyprint", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/kesara/lab/WeasyPrint/venv/lib/python3.11/site-packages/weasyprint/__main__.py", line 183, in main
    html.write_pdf(output, **options)
  File "/Users/kesara/lab/WeasyPrint/venv/lib/python3.11/site-packages/weasyprint/__init__.py", line 259, in write_pdf
    self.render(font_config, counter_style, **options)
  File "/Users/kesara/lab/WeasyPrint/venv/lib/python3.11/site-packages/weasyprint/document.py", line 391, in write_pdf
    pdf = generate_pdf(self, target, zoom, **options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kesara/lab/WeasyPrint/venv/lib/python3.11/site-packages/weasyprint/pdf/__init__.py", line 297, in generate_pdf
    variant_function(
  File "/Users/kesara/lab/WeasyPrint/venv/lib/python3.11/site-packages/weasyprint/pdf/pdfa.py", line 59, in pdfa
    relationships = {
                    ^
  File "/Users/kesara/lab/WeasyPrint/venv/lib/python3.11/site-packages/weasyprint/pdf/pdfa.py", line 61, in <dictcomp>
    for attachment in attachments if attachment.md5}
                                     ^^^^^^^^^^^^^^
AttributeError: 'tuple' object has no attribute 'md5'
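
The last frames of the traceback show a dict comprehension reading `attachment.md5` from an item that is a plain tuple. The standalone sketch below reproduces that class of error outside WeasyPrint. `FakeAttachment`, `relationships` and `safe_relationships` are hypothetical stand-ins, the comprehension is only modelled on the two lines visible in the traceback, and the defensive variant is one possible guard, not necessarily the fix that was merged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeAttachment:
    """Hypothetical stand-in for an attachment object with an md5 attribute."""
    url: str
    md5: Optional[str] = None
    relationship: str = "Source"


def relationships(attachments):
    # Assumes every item exposes .md5, like the comprehension in the traceback.
    return {
        attachment.md5: attachment.relationship
        for attachment in attachments if attachment.md5}


def safe_relationships(attachments):
    # Defensive variant: skip items that have no md5 attribute at all.
    return {
        attachment.md5: attachment.relationship
        for attachment in attachments
        if getattr(attachment, "md5", None)}


# A list mixing an attachment object with a plain tuple (assumed shape).
mixed = [FakeAttachment("a.xml", md5="abc"), ("b.xml", "description")]

try:
    relationships(mixed)
except AttributeError as exc:
    message = str(exc)
    print(message)

print(safe_relationships(mixed))
```

Silently skipping items hides the mismatch rather than resolving it; converting every item to a single attachment type before this step would be the stricter design.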

@liZe
Member

liZe commented Feb 8, 2024

Everything should be fixed now, thanks for the feedback!

(And don’t hesitate to add a comment if there’s anything wrong.)
