Improve detection of Apple iWork 13 documents #1793

joao-fernando · 2023-08-01T18:31:52Z

Hello!

Apple iWork documents are being detected with the application/vnd.apple.unknown.13 contentType and identified as zip files.

When I export these files, they are saved as filename.numbers.zip.

I've seen this behavior with Numbers, Pages and Keynote documents.

lfcnassif · 2023-08-01T18:41:14Z

This seems to be a non-ideal detection by the Tika library. Could you share some samples for testing and to validate a possible enhancement?

PS: I tagged this as an enhancement because iWork files are indeed zip files, but special ones.

joao-fernando · 2023-08-02T11:32:39Z

Sure!
I've copied some samples into your network folder on EVIDENCIAS.

lfcnassif · 2023-08-02T11:37:17Z

Thanks!

joao-fernando · 2023-08-02T13:05:57Z

I was testing Tika on my computer, and using version 2.4.0 directly on the command line, the content-type is correctly detected.

lfcnassif · 2023-08-02T14:29:45Z

Hmm, thanks for testing. I was going to test the latest Tika version, but that seems unnecessary now. Maybe Tika gives different results depending on whether the input is a File or a byte stream. I have seen this before... I will try to do some tests when I have some time.

joao-fernando · 2023-08-02T16:20:40Z

I don't know if it helps to diagnose the case, but the file is inside a Time Machine backup.
Some of them are nested inside zip files: image.ad1/folder1/file1.zip>>folder2/file2.zip>>folder3/folder4/folder5/iwork_file.key

lfcnassif · 2023-08-03T13:42:40Z

I had time to take a look. I got the same Tika GUI client output you got. But running detection on the command line, I got application/vnd.apple.unknown.13.

Investigating, I found 2 issues here:

1. There are 2 Tika iWork parsers missing from our ParseConfig.xml file: org.apache.tika.parser.iwork.iwana.IWork13PackageParser and org.apache.tika.parser.iwork.iwana.IWork18PackageParser.
2. Some parsers, like the 2 above, specialize while parsing the mediaType identified in the signature detection step (application/vnd.apple.unknown.13 in this case). This is not new to me. Unfortunately, parsing is done after signature detection and categorization in the IPED processing pipeline, and specialized types returned by parsers are ignored today. Some parsers also depend on categorization (e.g. to decide if the file is a container to be expanded). I'm not sure how to work around this. Running categorization (and other tasks) again is an approach, but it doesn't seem a good solution to me...

lfcnassif · 2023-08-03T15:00:42Z

> Unfortunately parsing is done after signature detection and categorization in IPED processing pipeline

The "set true extension" step is also executed before parsing.

PS: Changing the execution order of some of those tasks in the pipeline can bring other side effects...

lfcnassif · 2023-08-09T02:11:38Z

I just tested Tika detection programmatically in a standalone program. The result was the same as I got using the command line: application/vnd.apple.unknown.13

I also explicitly tested Tika's IWorkDetector, since it has specific code to detect iWork 13 and 18 files:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/detect/apple/IWorkDetector.java

The result was the same: application/vnd.apple.unknown.13

Following the call stack of the method above, this is the important detection block in IWork13PackageParser:

        public static MediaType detectIfPossible(ZipEntry entry) {
            String name = entry.getName();
            if (!name.endsWith(".iwa")) {
                return null;
            }

            // Is it a uniquely identifying filename?
            if (name.equals("Index/MasterSlide.iwa") || name.startsWith("Index/MasterSlide-")) {
                return KEYNOTE13.getType();
            }
            if (name.equals("Index/Slide.iwa") || name.startsWith("Index/Slide-")) {
                return KEYNOTE13.getType();
            }

            // Is it the main document?
            if (name.equals("Index/Document.iwa")) {
                // TODO Decode the snappy stream, and check for the Message Type
                // =     2 (TN::SheetArchive), it is a numbers file;
                // = 10000 (TP::DocumentArchive), that's a pages file
                return UNKNOWN13.getType();
            }

            // Unknown
            return null;
        }

I found Index/Document.iwa in the samples and none of the entries checked before it. So the TODO above means this is a known Tika limitation (not a bug).
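To make the behavior above easy to reproduce without Tika on the classpath, here is a minimal sketch that applies the same entry-name checks to a zip read with `java.util.zip`. The class name, the `zipOf` helper and the media type strings are assumptions made for illustration (the quoted code only shows the `KEYNOTE13` and `UNKNOWN13` constants, not their string values), and preferring a Keynote match over the unknown type is a choice of this sketch, not a statement about how Tika iterates entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class IWork13EntryScan {
    // Assumed string values for the media types behind KEYNOTE13 / UNKNOWN13.
    static final String KEYNOTE13 = "application/vnd.apple.keynote.13";
    static final String UNKNOWN13 = "application/vnd.apple.unknown.13";

    /** Same name checks as the quoted block, applied to a plain entry name. */
    static String detectFromEntryName(String name) {
        if (!name.endsWith(".iwa")) {
            return null;
        }
        if (name.equals("Index/MasterSlide.iwa") || name.startsWith("Index/MasterSlide-")
                || name.equals("Index/Slide.iwa") || name.startsWith("Index/Slide-")) {
            return KEYNOTE13;
        }
        if (name.equals("Index/Document.iwa")) {
            return UNKNOWN13; // Numbers vs. Pages cannot be told apart by name alone
        }
        return null;
    }

    /** Scans every entry and keeps the most specific answer found. */
    static String detect(byte[] zipBytes) {
        String best = null;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String type = detectFromEntryName(entry.getName());
                if (KEYNOTE13.equals(type)) {
                    return type; // a uniquely identifying entry wins immediately
                }
                if (type != null) {
                    best = type;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return best;
    }

    /** Hypothetical test helper: builds an in-memory zip with empty entries. */
    static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Only Index/Document.iwa present, as in the samples discussed above.
        System.out.println(detect(zipOf("Index/Document.iwa", "Data/image.png")));
        // A slide entry is present, so the Keynote type is returned.
        System.out.println(detect(zipOf("Index/Document.iwa", "Index/Slide-1234.iwa")));
    }
}
```

With only `Index/Document.iwa` in the archive, the sketch can do no better than the unknown type, which matches what was observed for the samples.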

And Tika is able to detect the type while parsing just because of the file extension:
https://github.com/apache/tika/blob/bf5da6691a7bf1044896e1c97f54c2ff94a8a422/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/src/main/java/org/apache/tika/parser/iwork/iwana/IWork13PackageParser.java#L111

After removing the extension from the provided samples, they are detected as application/vnd.apple.unknown.13 by the Tika App GUI.

I think we can at least refine the file types based on extension as well. I'll try to implement that.
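The extension-based refinement proposed here could look roughly like the sketch below. It is not the code from the IPED commits; the class name, the `refine` method and the target media type strings are assumptions made for illustration. It only touches items already detected as application/vnd.apple.unknown.13 and leaves every other type alone.

```java
import java.util.Locale;
import java.util.Map;

final class IWork13ExtensionRefiner {
    static final String UNKNOWN13 = "application/vnd.apple.unknown.13";

    // Assumed mapping from file extension to a more specific iWork 13 type.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "key", "application/vnd.apple.keynote.13",
            "numbers", "application/vnd.apple.numbers.13",
            "pages", "application/vnd.apple.pages.13");

    /** Returns a more specific type when the signature result is unknown.13. */
    static String refine(String detectedType, String fileName) {
        if (!UNKNOWN13.equals(detectedType) || fileName == null) {
            return detectedType; // never override other signature results
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return detectedType; // no usable extension
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, detectedType);
    }

    public static void main(String[] args) {
        System.out.println(refine(UNKNOWN13, "budget.numbers"));
        System.out.println(refine(UNKNOWN13, "budget"));
    }
}
```

The trade-off is the one already noted in this thread: a file whose extension was removed or changed stays at the unknown type, because the name is the only extra evidence used.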

lfcnassif · 2023-08-09T03:24:24Z

The commits above worked: iWork 13 files were correctly classified based on extension. But they brought a side effect: since those files are no longer put in the "Compressed Files" category, they stopped being expanded by default. That resulted in less extracted information, since a lot of images weren't extracted anymore (and, surprisingly, some scanned PDF files from the Pages document). That is expected, and the user can also configure Documents/Spreadsheets/Presentations to be expanded if desired.

lfcnassif · 2023-08-09T03:31:27Z

Closed by commits:
1abb3aa
6249cfa
58f2d10
a545321

joao-fernando · 2023-08-09T13:01:11Z

I agree that Document.iwa is not implemented yet.

But when I unzipped a Keynote file, I got the following files inside the INDEX folder:

Shouldn't Tika detect this as a Keynote file?

if (name.equals("Index/Slide.iwa") || name.startsWith("Index/Slide-")) {
    return KEYNOTE13.getType();
}

lfcnassif · 2023-08-09T13:28:29Z

I didn't see those entries. I think it should. Could you report the issue directly to the Tika project?

lfcnassif · 2023-08-09T18:09:14Z

If you have a sample Keynote file without sensitive info that can be shared publicly, I can report the issue in Tika Jira.

joao-fernando · 2023-08-09T18:19:15Z

I don't have one available now.
I'll try to create one later.

wladimirleite · 2023-08-09T21:12:42Z

> If you have a sample Keynote file without sensitive info that can be shared publicly, I can report the issue in Tika Jira.

I have a few keynotes here, but I am not sure if they trigger the issue.

keynote.zip

lfcnassif · 2023-08-09T23:08:01Z

> I have a few keynotes here, but I am not sure if they trigger the issue.
>
> keynote.zip

Thanks @tc-wleite! I was able to reproduce with your files after removing their extension. I'll open an issue at Tika and point it here for reference.

lfcnassif · 2023-08-09T23:24:25Z

I took a look at the Tika code again and I think I know what is happening. I'll report it directly there.

lfcnassif · 2023-08-09T23:42:56Z

Tika issue created, explanation and tested fix to be applied:
https://issues.apache.org/jira/browse/TIKA-4111

lfcnassif · 2023-08-10T22:54:00Z

> Tika issue created, explanation and tested fix to be applied:
> https://issues.apache.org/jira/browse/TIKA-4111

Just fixed it on Tika. It took much more work than expected, since some tests started to fail, but it seems good now.

lfcnassif added the enhancement label Aug 1, 2023

lfcnassif changed the title ~~Apple iWork documents are identified incorrectly~~ Improve detection of Apple iWork documents Aug 1, 2023

lfcnassif added the dependencies and need info labels Aug 1, 2023

lfcnassif removed the need info and dependencies labels Aug 3, 2023

lfcnassif self-assigned this Aug 3, 2023

lfcnassif added a commit that referenced this issue Aug 3, 2023

#1793: add missing iWork Tika parsers

1abb3aa

lfcnassif removed their assignment Aug 3, 2023

lfcnassif self-assigned this Aug 9, 2023

lfcnassif added a commit that referenced this issue Aug 9, 2023

#1793: makes app/vnd.apple.unknown.13 subtype of app/vnd.apple.iwork

6249cfa

lfcnassif added a commit that referenced this issue Aug 9, 2023

#1793: classify more iWork mimetypes

58f2d10

lfcnassif added a commit that referenced this issue Aug 9, 2023

#1793: specialize application/vnd.apple.unknown.13 based on extension

a545321

lfcnassif closed this as completed Aug 9, 2023

lfcnassif mentioned this issue Aug 9, 2023

Protected docx, xlsx, pptx could have stopped to be classified properly #1805

Closed

lfcnassif changed the title ~~Improve detection of Apple iWork documents~~ Improve detection of Apple iWork 13 documents Aug 9, 2023

lfcnassif added a commit that referenced this issue Aug 16, 2023

#1793: add missing iWork Tika parsers

7c40838

lfcnassif added a commit that referenced this issue Aug 16, 2023

#1793: makes app/vnd.apple.unknown.13 subtype of app/vnd.apple.iwork

193c93c

lfcnassif added a commit that referenced this issue Aug 16, 2023

#1793: classify more iWork mimetypes

58e084d

lfcnassif added a commit that referenced this issue Aug 16, 2023

#1793: specialize application/vnd.apple.unknown.13 based on extension

9b4c462
