Improve detection of Apple iWork 13 documents #1793

Closed
joao-fernando opened this issue Aug 1, 2023 · 20 comments
@joao-fernando

Hello!

Apple iWork documents are being detected with the content type application/vnd.apple.unknown.13 and identified as zip files:
[two screenshots]

When I export these files, they are saved as filename.numbers.zip.

I've seen this behavior with Numbers, Pages and Keynote documents.

@lfcnassif
Member

lfcnassif commented Aug 1, 2023

This seems to be non-ideal detection by the Tika library. Could you share some samples for testing and for validating a possible enhancement?

PS: I tagged this as an enhancement because iWork files are indeed zip files, just special ones.

@lfcnassif lfcnassif changed the title Apple iWork documents are identified incorrectly Improve detection of Apple iWork documents Aug 1, 2023
@lfcnassif lfcnassif added dependencies Pull requests that update a dependency file need info labels Aug 1, 2023
@joao-fernando
Author

Sure!
I've copied some samples into your network folder on EVIDENCIAS.

@lfcnassif
Member

Thanks!

@joao-fernando
Author

I tested Tika on my computer: using version 2.4.0 directly on the command line, the content type is detected correctly.
[screenshot of the command-line output]

@lfcnassif
Member

Hmm, thanks for testing. I was going to test the latest Tika version, but that no longer seems necessary. Maybe Tika gives different results depending on whether the input is a File or a byte stream; I have seen this before... I will try to run some tests when I have some time.

@joao-fernando
Author

I don't know if it helps to diagnose the case, but the file is inside a Time Machine backup.
Some of them are nested inside zip files: image.ad1/folder1/file1.zip>>folder2/file2.zip>>folder3/folder4/folder5/iwork_file.key

@lfcnassif
Member

lfcnassif commented Aug 3, 2023

I had time to take a look. I got the same Tika GUI client output you got. But running detection on the command line I got:
[screenshot of the command-line output]

Investigating, I found 2 issues here:

  • Two Tika iWork parsers are missing from our ParseConfig.xml file: org.apache.tika.parser.iwork.iwana.IWork13PackageParser and org.apache.tika.parser.iwork.iwana.IWork18PackageParser
  • Some parsers, like the 2 above, specialize during parsing the mediaType identified in the signature detection step (application/vnd.apple.unknown.13 in this case); this is not new to me. Unfortunately parsing is done after signature detection and categorization in IPED processing pipeline, and the specialized types returned by parsers are ignored today. Some parsers also depend on categorization (e.g. to decide if the file is a container to be expanded). I'm not sure how to work around this. Running categorization (and other tasks) again is one approach, but it doesn't seem like a good solution to me...
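
The ordering problem described in the second item can be pictured with a toy model. This is only a sketch: the class, the method names and the category table are hypothetical and are not IPED's API, and the `numbers.13` media type string is an assumption modelled on the `unknown.13` name reported above. It shows why a type refined during parsing arrives too late to influence a category that was assigned earlier.

```java
import java.util.Map;

// Toy model only: names below are hypothetical, not IPED's real classes.
class PipelineOrderSketch {
    static final String UNKNOWN13 = "application/vnd.apple.unknown.13";
    // Assumed string for the specific Numbers type.
    static final String NUMBERS13 = "application/vnd.apple.numbers.13";

    // Hypothetical category table keyed by media type.
    static final Map<String, String> CATEGORY = Map.of(
            UNKNOWN13, "Compressed Files",
            NUMBERS13, "Spreadsheets");

    static final class Item {
        final String name;
        String mediaType;
        String category;

        Item(String name) {
            this.name = name;
        }
    }

    // Step 1: signature detection cannot tell which iWork 13 application wrote the file.
    static void detectSignature(Item item) {
        item.mediaType = UNKNOWN13;
    }

    // Step 2: categorization uses whatever type is known at this point.
    static void categorize(Item item) {
        item.category = CATEGORY.getOrDefault(item.mediaType, "Other");
    }

    // Step 3: the parser works out a more specific type, but only returns it.
    static String parse(Item item) {
        return NUMBERS13;
    }

    static Item process(String name) {
        Item item = new Item(name);
        detectSignature(item);
        categorize(item);
        String refined = parse(item); // arrives after the category was assigned
        // The refined type is dropped here, mirroring the behaviour described above.
        return item;
    }

    public static void main(String[] args) {
        Item item = process("budget.numbers");
        System.out.println(item.mediaType + " -> " + item.category);
    }
}
```

In this model, feeding the refined type back would mean calling `categorize` a second time after `parse`, which is the "run categorization again" approach mentioned above.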

@lfcnassif lfcnassif removed need info dependencies Pull requests that update a dependency file labels Aug 3, 2023
@lfcnassif lfcnassif self-assigned this Aug 3, 2023
@lfcnassif
Member

lfcnassif commented Aug 3, 2023

Unfortunately parsing is done after signature detection and categorization in IPED processing pipeline

The "set true extension" step is also executed before parsing.

PS: Changing the execution order of some of those tasks in the pipeline could have other side effects...

@lfcnassif lfcnassif removed their assignment Aug 3, 2023
@lfcnassif
Member

lfcnassif commented Aug 9, 2023

I just tested Tika detection programmatically in a standalone program. The result was the same as the one I got on the command line: application/vnd.apple.unknown.13

I also explicitly tested Tika's IWorkDetector, since it has specific code to detect iWork 13 and 18 files:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/detect/apple/IWorkDetector.java

Result was the same: application/vnd.apple.unknown.13

Following the call stack of the method above, this is the relevant detection block in IWork13PackageParser:

        public static MediaType detectIfPossible(ZipEntry entry) {
            String name = entry.getName();
            if (!name.endsWith(".iwa")) {
                return null;
            }

            // Is it a uniquely identifying filename?
            if (name.equals("Index/MasterSlide.iwa") || name.startsWith("Index/MasterSlide-")) {
                return KEYNOTE13.getType();
            }
            if (name.equals("Index/Slide.iwa") || name.startsWith("Index/Slide-")) {
                return KEYNOTE13.getType();
            }

            // Is it the main document?
            if (name.equals("Index/Document.iwa")) {
                // TODO Decode the snappy stream, and check for the Message Type
                // =     2 (TN::SheetArchive), it is a numbers file;
                // = 10000 (TP::DocumentArchive), that's a pages file
                return UNKNOWN13.getType();
            }

            // Unknown
            return null;
        }

In the samples I found Index/Document.iwa but none of the entries checked before it. So the TODO above means this is a known Tika limitation (not a bug).
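
To repeat this kind of check on other samples, a small standalone program can list the entries of a package and apply the same name tests. The following is a minimal sketch using only `java.util.zip`; it is not Tika code, the string results (`keynote`, `unknown`, `ignored`) stand in for the media types in the block above, and the entry names written by `writeSample` are illustrative.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class IwaEntryCheck {
    // Same name tests as the Tika block quoted above, reduced to plain strings.
    static String classify(String name) {
        if (!name.endsWith(".iwa")) {
            return "ignored";
        }
        if (name.equals("Index/MasterSlide.iwa") || name.startsWith("Index/MasterSlide-")
                || name.equals("Index/Slide.iwa") || name.startsWith("Index/Slide-")) {
            return "keynote";
        }
        if (name.equals("Index/Document.iwa")) {
            return "unknown";
        }
        return "ignored";
    }

    // Returns one "entryName => result" line per entry of the given zip file.
    static List<String> report(Path zip) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            zf.stream().forEachOrdered(
                    e -> lines.add(e.getName() + " => " + classify(e.getName())));
        }
        return lines;
    }

    // Builds a tiny zip with the given entry names, for demonstration only.
    static Path writeSample(String... names) throws IOException {
        Path zip = Files.createTempFile("iwork-sample", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (String n : names) {
                zos.putNextEntry(new ZipEntry(n));
                zos.write(new byte[] {0});
                zos.closeEntry();
            }
        }
        return zip;
    }

    public static void main(String[] args) throws IOException {
        // Pass a real package path as the first argument, or fall back to a synthetic one.
        Path zip = args.length > 0
                ? Path.of(args[0])
                : writeSample("Index/Document.iwa", "Index/Slide-1.iwa", "Data/picture.jpg");
        report(zip).forEach(System.out::println);
    }
}
```

Running it against a sample with its path as the first argument shows which entries, if any, would match the Keynote checks and which only reach the Index/Document.iwa branch.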

And Tika is able to detect the specific type while parsing only because of the file extension:
https://github.com/apache/tika/blob/bf5da6691a7bf1044896e1c97f54c2ff94a8a422/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/iwana/IWork13PackageParser.java#L111

After removing the extension from the provided samples, the Tika App GUI detects them as application/vnd.apple.unknown.13.

I think we can at least refine the file types based on the extension as well. I'll try to implement that.
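
As a rough illustration of what extension-based refinement could look like, here is a minimal sketch. It is not the code from the commits that closed this issue: the class and method names are hypothetical, and the three specific media type strings are assumptions modelled on the `unknown.13` name seen in this thread.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not IPED's or Tika's implementation.
class IWork13ExtensionRefiner {
    static final String UNKNOWN13 = "application/vnd.apple.unknown.13";

    // Assumed media type strings for the three iWork 13 applications.
    static final Map<String, String> BY_EXTENSION = Map.of(
            "key", "application/vnd.apple.keynote.13",
            "numbers", "application/vnd.apple.numbers.13",
            "pages", "application/vnd.apple.pages.13");

    // Refines only the generic iWork 13 type; any other detected type is returned unchanged.
    static String refine(String detectedType, String fileName) {
        if (!UNKNOWN13.equals(detectedType)) {
            return detectedType;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return detectedType; // no extension to go on
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, detectedType);
    }

    public static void main(String[] args) {
        System.out.println(refine(UNKNOWN13, "iwork_file.key"));
        System.out.println(refine(UNKNOWN13, "iwork_file"));
        System.out.println(refine("application/zip", "archive.key"));
    }
}
```

The refinement is deliberately gated on the generic type, so a plain zip that merely carries a `.key` name is left alone, and a file whose extension was removed stays as application/vnd.apple.unknown.13.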

@lfcnassif
Member

lfcnassif commented Aug 9, 2023

The commits above worked: iWork 13 files were correctly classified based on extension. But they brought a side effect: since those files are no longer put in the "Compressed Files" category, they stopped being expanded by default. That resulted in less extracted information, since a lot of images weren't extracted anymore (and, surprisingly, some scanned PDF files from the Pages document). That is expected, and the user can also configure Documents/Spreadsheets/Presentations to be expanded if desired.

@lfcnassif
Member

Closed by commits:
1abb3aa
6249cfa
58f2d10
a545321

@lfcnassif lfcnassif changed the title Improve detection of Apple iWork documents Improve detection of Apple iWork 13 documents Aug 9, 2023
@joao-fernando
Author

I agree that the Document.iwa check is not implemented yet.

But when I unzipped a Keynote file, I got the following files inside the Index folder:
[screenshot of the folder listing]

Shouldn't Tika detect this as a Keynote file?

if (name.equals("Index/Slide.iwa") || name.startsWith("Index/Slide-")) {
    return KEYNOTE13.getType();
}

@lfcnassif
Member

lfcnassif commented Aug 9, 2023

I didn't see those entries. I think it should. Could you report the issue directly to the Tika project?

@lfcnassif
Member

If you have a sample Keynote file without sensitive info that can be shared publicly, I can report the issue in Tika Jira.

@joao-fernando
Author

I don't have one available now.
I'll try to create one later.

@wladimirleite
Member

If you have a sample Keynote file without sensitive info that can be shared publicly, I can report the issue in Tika Jira.

I have a few keynotes here, but I am not sure if they trigger the issue.

keynote.zip

@lfcnassif
Member

I have a few keynotes here, but I am not sure if they trigger the issue.

keynote.zip

Thanks @tc-wleite! I was able to reproduce it with your files after removing their extension. I'll open an issue at Tika and link it here for reference.

@lfcnassif
Member

I took a look at the Tika code again and I think I know what is happening. I'll report it directly there.

@lfcnassif
Member

lfcnassif commented Aug 9, 2023

Tika issue created, explanation and tested fix to be applied:
https://issues.apache.org/jira/browse/TIKA-4111

@lfcnassif
Member

Tika issue created, explanation and tested fix to be applied:
https://issues.apache.org/jira/browse/TIKA-4111

Just fixed it in Tika. It took much more work than expected because some tests started to fail, but it seems good now.
