PDF detected as application/octet-stream #285

Open
peric opened this issue May 10, 2022 · 5 comments
Comments

@peric

peric commented May 10, 2022

Attach the file for which the detection is inaccurate

Unfortunately, I am not able to share the original file, but I've tried to fake it and create a new one.

fake-pdf.pdf

Expected MIME type

application/pdf

Returned MIME type

application/octet-stream

Version of the library you are using

1.4.0

Output of go version

go version go1.18.1 darwin/arm6

Additional context

As mentioned above, I am not able to share the real PDF file, but I've tried to fake it.

If you try to open the file above, you'll see just a blank document. But, if you open the source of that file, you'll see that %PDF-1.4 is in the 2nd line and not in the 1st one. And this is exactly the same problem that I've found in my real PDF file - the only difference is that my real PDF file works locally (i.e. with Preview/MacOS) without any issues and the one I faked actually got blank after I moved the %PDF-1.4 to the 2nd line.

So I understand that my file is kinda corrupted (although the customer got it from some accounting system), but it works locally without any issues. Still, when I pass it to the Detect function, it returns application/octet-stream.

Not sure if this is something that can or should be fixed, but let's see. Thanks in advance.

gabriel-vasile added a commit that referenced this issue May 16, 2022
For newline and BOM prefixed signatures. For #285
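
For readers following along, here is a minimal sketch of what tolerating newline- and BOM-prefixed signatures could look like. It is not the code from the referenced commit: `looksLikePDF` is a hypothetical helper written for illustration, and the choice to skip only a UTF-8 BOM plus ASCII whitespace is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// pdfMagic is the usual PDF signature quoted later in this thread.
var pdfMagic = []byte("%PDF-")

// utf8BOM is the UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// looksLikePDF is a hypothetical helper, not part of the mimetype library.
// It reports whether data starts with the PDF signature once an optional
// UTF-8 BOM and any leading ASCII whitespace have been skipped.
func looksLikePDF(data []byte) bool {
	data = bytes.TrimPrefix(data, utf8BOM)
	data = bytes.TrimLeft(data, " \t\r\n\f")
	return bytes.HasPrefix(data, pdfMagic)
}

func main() {
	fmt.Println(looksLikePDF([]byte("%PDF-1.4\n")))             // true
	fmt.Println(looksLikePDF([]byte("\n%PDF-1.4\n")))           // true: signature on the 2nd line
	fmt.Println(looksLikePDF([]byte("\xEF\xBB\xBF%PDF-1.4\n"))) // true: BOM-prefixed
	fmt.Println(looksLikePDF([]byte("PK\x03\x04")))             // false
}
```

A strict prefix check would reject the second and third inputs, which matches the behaviour reported above for the file with %PDF-1.4 on its 2nd line.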
@gabriel-vasile
Owner

Hi, @peric

Thank you for reporting this issue. It should be fixed.
You can upgrade to the latest commit to test it if you don't want to wait until the next release:

go get -u github.com/gabriel-vasile/mimetype@<commit>

@peric
Author

peric commented Apr 17, 2023

@gabriel-vasile

Hey there, it's me again 🙃

I stumbled upon a similar example, so I'll mention it here instead of opening another issue (at least for now).

Basically, the beginning of the source for the file provided above looks like this:


%PDF-1.4
%�쏢
5 0 obj

In the example I currently have, the source starts like this:

-------------------------------28944242429299
Content-Disposition: form-data; name="example.pdf"; filename="example.pdf"
Content-Type: application/x-gzip

%PDF-1.4
%����

The file opens normally in a PDF reader, although mimetype.Detect returns application/octet-stream.

Is this also something that can be covered with a similar solution? Thanks in advance

@gabriel-vasile
Owner

gabriel-vasile commented Apr 20, 2023

Hi @peric,
Please help me debug this issue. The problem PDF should have been detected by the regular signature:

// usual pdf signature
[]byte("%PDF-"),

Please show the output of running xxd the_file.pdf | head -2 on the command line.

Also, it would help to know details about how this PDF was created.
If this is not against your privacy concerns, please show the output of
strings the_file.pdf | grep "Creator\|Producer".

@peric
Author

peric commented May 11, 2023

Hey @gabriel-vasile, sorry for the late reply. The notification got lost somewhere and I forgot to provide you an answer. Also, we found a workaround in the meantime, so that's one more reason why I forgot to answer.

The output of xxd the_file.pdf | head -2:

00000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d  ----------------
00000010: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d32  ---------------2

And the output of strings the_file.pdf | grep "Creator\|Producer" (I deleted the last line which refers to the vendor):

<pdf:Producer>Antenna House PDF Output Library 7.1.1639</pdf:Producer>
<xmp:CreatorTool>AH CSS Formatter V7.1 MR2 for Linux64 : 7.1.3.50324 (2021-04-26T09:47+09)</xmp:CreatorTool>
/Creator (AH CSS Formatter V7.1 MR2 for Linux64 : 7.1.3.50324 \(2021-04-26T09:47+09\))
/Producer (Antenna House PDF Output Library 7.1.1639)
  <xmp:CreatorTool>AH CSS Formatter V7.1 MR2 for Linux64 : 7.1.3.50324 (2021-04-26T09:47+09)</xmp:CreatorTool>

Hope that helps.

@gabriel-vasile
Owner

Thank you, @peric!

I generated some PDFs using Antenna House but couldn't reproduce the issue.

I have one more question for you: what's the output of

file --mime the_file.pdf

If it is application/pdf then I'll look more into the issue and how file does detection compared to mimetype.
If it is not application/pdf then I'm sorry, I don't think it's ok to add logic to detect any kind of corrupted files.
