Non-break space character in PDF causing load problem with text Brat sentence view #5269

Closed
GiantEnemyCrab opened this issue Jan 29, 2025 · 11 comments

@GiantEnemyCrab
Contributor

Describe the bug and To Reproduce
When a PDF file contains a non-breaking space (U+00A0), the document loads fine in the PDF view, but it fails to load after switching to the Brat sentence text view.

Expected behavior
Documents containing non-breaking spaces should load in the Brat sentence text view, just as they do in the PDF view.

Screenshots

Image

Please complete the following information:

  • Version and build ID: version 35
  • OS: Windows, Mac, Linux
  • Browser: Edge, Chrome
@GiantEnemyCrab GiantEnemyCrab added Triage 🐛 Bug Something isn't working labels Jan 29, 2025
@reckart
Member

reckart commented Jan 29, 2025

That should actually have been fixed in recent versions. If you still have a project lying around from an older version, you can try this to fix your project:

Open the CAS Doctor in the project settings and then follow the steps below (leaving all the repair options that are enabled by default active as well)

  1. enable TrimAnnotationsRepair & run repair
  2. disable TrimAnnotationsRepair & enable RemoveZeroSizeTokensAndSentencesRepair & run repair
  3. disable RemoveZeroSizeTokensAndSentencesRepair & enable RemoveDanglingRelationsRepair & run repair

Hopefully, no errors should be left now and the document should render.

@reckart
Member

reckart commented Jan 30, 2025

Probably duplicate of #5035

@GiantEnemyCrab
Contributor Author

GiantEnemyCrab commented Jan 30, 2025

Thanks for the guidance on the CAS Doctor repair feature, but this PDF seems trickier: the CAS Doctor approach did not resolve the issue.

I wish I could share the exact document here, but I can't because the file contains sensitive data.
The text in question looks like this: Hello world. \xa0Hello World. (where \xa0 is the non-breaking space).

It still loads fine in the PDF viewer, but not in the text view of the same PDF file. The view I tried was Sentence (BRAT).

I use pymupdf, and my workaround was to delete the \xa0 non-breaking space character with pymupdf's redaction feature (apply_redactions()). After that, the document worked fine in INCEpTION in both the PDF view and the text view.

I am not sure what exact condition triggers the error. There also seem to be other \xa0 characters that do not cause any issue.
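
Since only some of the \xa0 characters seem to matter, it can help to list where they occur in the extracted text. The following is a minimal sketch using only the Java standard library; the class NbspLocator and the method findNbspOffsets are hypothetical names for illustration and are not part of INCEpTION, pymupdf, or pdfbox. It assumes the document text is already available as a string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any library named in this thread.
final class NbspLocator {
    static final char NBSP = '\u00A0';

    /** Returns the character offsets of every U+00A0 in the given text. */
    static List<Integer> findNbspOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == NBSP) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The sample text from this comment, with the non-breaking space written as an escape.
        String text = "Hello world. \u00A0Hello World.";
        System.out.println(findNbspOffsets(text)); // prints [13]
    }
}
```

Comparing these offsets with the sentence and token boundaries shown by CAS Doctor may narrow down which occurrences are the problematic ones.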

@reckart
Member

reckart commented Jan 31, 2025

Does CAS Doctor detect an issue or does it claim everything to be in order?

@GiantEnemyCrab
Contributor Author

For INITIAL and CURATION, CAS Doctor has nothing to report for that document, but for annotation I am seeing UnreachableAnnotationsCheck saying Type webanno.custom.LayerName ~~~~~ has X unreachable instances, where X is an integer.

@reckart
Member

reckart commented Feb 3, 2025

The unreachable annotations are normal. This happens when a user deletes annotations. They stick around for a while until the document is opened for annotation again, at which point they are garbage-collected. I think this is only an info-level message.

@reckart
Member

reckart commented Feb 3, 2025

I have used pdfbox to create a little toy PDF that contains a 0xA0 char, but I don't seem to have problems with that one in either the PDF view or the Brat (sentence-based) view.

nbsp.pdf

Image

@reckart
Member

reckart commented Feb 3, 2025

Code I used with pdfbox v3:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;

public class PDFWithNBSP
{
    public static void main(String[] args) throws Exception
    {
        try (var document = new PDDocument()) {
            var page = new PDPage();
            document.addPage(page);

            var font = new PDType1Font(FontName.HELVETICA_BOLD);

            try (var contentStream = new PDPageContentStream(document, page)) {
                contentStream.setFont(font, 12);
                contentStream.beginText();
                contentStream.setLeading(14.5f);
                contentStream.newLineAtOffset(100, 700);
                var text = "Hello\u00A0World";
                contentStream.showText(text);
                contentStream.endText();
            }

            // Save the document
            document.save("output.pdf");
            System.out.println("PDF created successfully: output.pdf");
        }
    }
}

@GiantEnemyCrab
Contributor Author

redacted.pdf

I uploaded a file here that reproduces the issue.
What I found is that non-breaking spaces are in fact fine in most cases. Sorry for the misleading report earlier; I have investigated more cases since then. But with this one, I can't figure out what is going on.

Steps to reproduce the error with this redacted.pdf:

  • Import the document using the PDF file format
  • Open the document in the annotation view
  • Switch to BRAT (sentence-oriented)
  • This should result in an error like the one in the screenshot:

Image

@reckart reckart self-assigned this Feb 9, 2025
@reckart reckart removed the Triage label Feb 9, 2025
@reckart reckart added this to Kanban Feb 9, 2025
@github-project-automation github-project-automation bot moved this to 🔖 To do in Kanban Feb 9, 2025
@reckart reckart added this to the 35.3 milestone Feb 9, 2025
reckart added a commit that referenced this issue Feb 9, 2025
…xt Brat sentence view

- Use the central TrimUtils in the brat visualizer instead of a local copy (which was out-of-sync)
@reckart
Member

reckart commented Feb 9, 2025

Thanks for helping to track this down. There was some duplicate code and the recent fix wrt. nbsp issues only fixed one copy, not the other. I have removed the second copy now, using only the one that works. Looks like that fixes the issue.
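
As a rough illustration of how two trimming helpers can disagree about U+00A0: in Java, Character.isWhitespace('\u00A0') returns false and String.trim() leaves the character in place, so a helper built only on those will not trim a non-breaking space. The sketch below is not INCEpTION's TrimUtils. The class NbspTrimSketch and its methods are hypothetical names, and the assumption that the out-of-sync copy differed in this particular way is a guess made for the sake of the example.

```java
// Hypothetical sketch for illustration only; not the actual TrimUtils code.
final class NbspTrimSketch {
    /** Treats U+00A0 as blank in addition to everything Character.isWhitespace accepts. */
    static boolean isBlank(char c) {
        return Character.isWhitespace(c) || c == '\u00A0';
    }

    /** Shrinks the span [begin, end) until neither edge is blank; returns {begin, end}. */
    static int[] trim(CharSequence text, int begin, int end) {
        while (begin < end && isBlank(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && isBlank(text.charAt(end - 1))) {
            end--;
        }
        return new int[] { begin, end };
    }

    public static void main(String[] args) {
        String text = "Hello world. \u00A0Hello World.";

        // The JDK itself does not classify U+00A0 as whitespace.
        System.out.println(Character.isWhitespace('\u00A0')); // prints false

        // A span that starts on the non-breaking space is moved to the first letter.
        int[] span = trim(text, 13, text.length());
        System.out.println(span[0] + ".." + span[1]); // prints 14..26
    }
}
```

With the sample string from earlier in this thread, a span covering only the regular space and the non-breaking space (offsets 12 to 14) collapses to zero width under this helper, while a helper that relies on Character.isWhitespace alone would stop at the non-breaking space and leave a non-empty span.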

@GiantEnemyCrab
Contributor Author

Thank you for applying the fix so quickly!

reckart added a commit that referenced this issue Feb 9, 2025
…k-space-character-in-PDF-causing-load-problem-with-text-Brat-sentence-view

#5269 - Non-break space character in PDF causing load problem with text Brat sentence view
@reckart reckart closed this as completed Feb 9, 2025
@github-project-automation github-project-automation bot moved this from 🔖 To do to 🍹 Done in Kanban Feb 9, 2025
reckart added a commit that referenced this issue Feb 9, 2025
* release/35.x:
  #5269 - Non-break space character in PDF causing load problem with text Brat sentence view
  #5278 - Merge sometimes not working when switching between documents in integrated curation mode
reckart added a commit that referenced this issue Feb 9, 2025
…xt Brat sentence view

- Bit of cleaning up