Improve detection of Apple iWork 13 documents #1793
Comments
This seems to be non-ideal detection by the Tika library. Could you share some samples for testing and to validate a possible enhancement? PS: I tagged this as an enhancement because iWork files are indeed zip files, just special ones.
Sure!
Thanks!
Hmm, thanks for testing. I was going to test the latest Tika version, so that seems unnecessary now. Maybe Tika gives different results depending on whether the input is a File or a byte stream; I have seen this before. I will try to run some tests when I have some time.
I don't know if it helps to diagnose the case, but the file is inside a Time Machine backup.
The "set true extension" step is also executed before parsing. PS: Changing the execution order of some of those tasks in the pipeline can bring other side effects...
I just tested Tika detection programmatically in a standalone program, and the result was the same as the one I got using the command line:
I also explicitly tested Tika's IWorkDetector, since it has specific code to detect iWork 13 and 18 files:
The result was the same:
Following the call stack of the method above, this is the important detection block in IWork13PackageParser:
I found
And Tika is able to detect the type while parsing only because of the file extension:
After removing the extension from the provided samples, they are detected as
I think we can at least refine the file types based on extension as well; I'll try to implement that.
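To make the extension-based refinement idea concrete, here is a minimal sketch. It is not the code that was committed: the class `IWorkTypeRefiner` and its `refine` method are hypothetical, and the three target type names are assumptions that follow the `application/vnd.apple.<app>.13` pattern of the unknown type reported in this issue. It only replaces the generic type when the file name carries a known iWork extension.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not IPED's or Tika's actual code.
class IWorkTypeRefiner {

    // Generic type reported in this issue when the specific iWork 13 kind is not recognized.
    static final String UNKNOWN_IWORK13 = "application/vnd.apple.unknown.13";

    // Extension -> refined type. The target names are assumptions that mirror
    // the naming pattern of the unknown type above.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pages", "application/vnd.apple.pages.13",
            "numbers", "application/vnd.apple.numbers.13",
            "key", "application/vnd.apple.keynote.13");

    // Returns a more specific type only when detection gave the generic
    // iWork 13 type and the file name has a known iWork extension.
    // In every other case the detected type is returned unchanged.
    static String refine(String detectedType, String fileName) {
        if (!UNKNOWN_IWORK13.equals(detectedType) || fileName == null) {
            return detectedType;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return detectedType; // no extension to work with
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, detectedType);
    }

    public static void main(String[] args) {
        System.out.println(refine(UNKNOWN_IWORK13, "budget.numbers"));
        System.out.println(refine(UNKNOWN_IWORK13, "budget"));
    }
}
```

Note the limitation this shares with any extension-based approach: a file with no extension, or one exported as filename.numbers.zip, keeps the generic type, so this only complements content-based detection.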
The commits above worked: iWork 13 files were correctly classified based on extension. But they brought a side effect: since those files are no longer put in the "Compressed Files" category, they are no longer expanded by default. That resulted in less extracted information, since a lot of images were no longer extracted (and, surprisingly, some scanned PDF files from the Pages document). That is expected, and the user can also configure Documents/Spreadsheets/Presentations to be expanded if desired.
I didn't see those entries; I think it should. Could you report the issue directly to the Tika project?
If you have a sample Keynote file without sensitive info that can be shared publicly, I can report the issue in the Tika Jira.
I don't have one available now.
I have a few Keynote files here, but I am not sure if they trigger the issue.
Thanks @tc-wleite! I was able to reproduce with your files after removing their extension. I'll open an issue at Tika and point it here for reference.
I took another look at the Tika code and I think I know what is happening; I'll report it directly there.
Tika issue created, with an explanation and a tested fix to be applied:
Just fixed it in Tika. It took much more work than expected and some tests started to fail, but it seems good now.
Hello!
Apple iWork documents are being detected with
![image](https://private-user-images.githubusercontent.com/54771774/257606892-5b2e50e6-b5c9-4165-bacc-63a2a5a272ec.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk4OTkwMjEsIm5iZiI6MTczOTg5ODcyMSwicGF0aCI6Ii81NDc3MTc3NC8yNTc2MDY4OTItNWIyZTUwZTYtYjVjOS00MTY1LWJhY2MtNjNhMmE1YTI3MmVjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE4VDE3MTIwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTRjMTY5MTQ5Y2U4OTkxMjE3ODY1ZjI5MDM5YzAwZmZmYTY2ZTY0MTNkNjE1YjQ3ZWUyNjc4YjExMThmMjVhYjAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.-K8ojulM8VGL2-0v-QLLf1sxmBd-R8kwSyA7hcsuKNE)
![image](https://private-user-images.githubusercontent.com/54771774/257606978-231485e3-d477-4658-87c8-9d6adcd5ce3c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk4OTkwMjEsIm5iZiI6MTczOTg5ODcyMSwicGF0aCI6Ii81NDc3MTc3NC8yNTc2MDY5NzgtMjMxNDg1ZTMtZDQ3Ny00NjU4LTg3YzgtOWQ2YWRjZDVjZTNjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE4VDE3MTIwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZhZDMxMmM0OWU5MTk3YWM1ZDkxOTA2NmRiNjUyODYzZTU0ZTUxNTlkNTc2MDJmZDgxMDQ3NGQ4NjQ1NzU1MGQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.owe37lG7FmV7ntZ8qd68A7F2uEguW-L-T9RxZyaK5UI)
the contentType application/vnd.apple.unknown.13 and being identified as zip files. When I export these files, they are saved as filename.numbers.zip.
I've seen this behavior with Numbers, Pages and Keynote documents.