Indexing shtml content #672
Comments
Could you give an example of a document that is in Elasticsearch?
I don't understand. The example you shared (please share the raw response instead of images next time) shows that a lot of things have been extracted, so I don't see what your "problem" is. Could you describe it in more detail?
In the first picture, under source, the content field is missing. I couldn't reproduce it with my test file, which is why I didn't share the raw response here, sorry.
I see. You could just try this file which is failing and launch FSCrawler with
Here is what it says with the trace option:
I've started from scratch with only one document.
{"_index":"fscrawler_job","_type":"_doc","_id":"26dd35f4bc771a5ec7145d6c084b3bf","_score":1.0,"_source":{"attachment":"removed","meta":{"raw":{"X-Parsed-By":"org.apache.tika.parser.DefaultParser","resourceName":"index.shtml","Content-Type":"application/xml"}},"file":{"extension":"shtml","content_type":"application/xml","created":"2019-01-31T20:09:32.000+0000","last_modified":"2019-01-31T20:09:32.000+0000","last_accessed":"2019-01-31T20:09:33.000+0000","indexing_date":"2019-01-31T20:10:13.717+0000","filesize":16502,"filename":"index.shtml","url":"file:///tmp/es/index.shtml"},"path":{"root":"824b64ab42d4b63cda6e747e2b80e5","virtual":"/index.shtml","real":"/tmp/es/index.shtml"}}}
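For anyone who wants to check this without reading the raw JSON by eye, here is a minimal Python sketch that tests whether a hit's `_source` carries a non-empty `content` field. It assumes the hit has already been parsed into a dict; `has_content` is a hypothetical helper name, and the sample is a trimmed copy of the hit from this thread.

```python
import json


def has_content(hit: dict) -> bool:
    """Return True if the hit's _source holds a non-empty 'content' field."""
    source = hit.get("_source", {})
    content = source.get("content")
    return isinstance(content, str) and content.strip() != ""


# Trimmed copy of the hit from this thread: there is no "content" key.
sample = json.loads(
    '{"_index": "fscrawler_job", "_id": "26dd35f4bc771a5ec7145d6c084b3bf",'
    ' "_source": {"attachment": "removed",'
    ' "file": {"extension": "shtml", "content_type": "application/xml",'
    ' "filename": "index.shtml"}}}'
)

if __name__ == "__main__":
    print(has_content(sample))
```

For the hit above this reports that no extracted text was indexed, which matches the missing content field described earlier.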
Hmmm. There is no trace here. Could you start again with
Here we go, sorry, I did it wrong the first time: https://gist.github.com/cloud1250x4/e6e6992daf04a27c476a9c000334ae5b We can see that it fails to parse the files.
I'll try and see if I can reproduce this with a file that I can actually share.
So the error is here:
Important part is:
I'm closing the issue as I don't feel it's an FSCrawler issue. If your document is well-formed, then you can open an issue in the Tika project.
Hm, okay. When I open the document and resave it, everything is fine, so it's probably some encoding problem. Sadly, I have over 10000 files like this. I really wonder what that other person did.
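With that many files, a quick scan might help narrow down which ones are likely to fail. The following is a rough Python sketch, assuming (this is not confirmed anywhere in this thread) that the failures come from invalid UTF-8 byte sequences or from XML that is not well-formed. `check_file` and `find_suspect_files` are hypothetical helper names. Python's XML parser is not the one Tika uses, so the result is only a hint, and a file that legitimately declares another encoding will also be flagged by the UTF-8 check.

```python
from pathlib import Path
from typing import Dict, Optional
import xml.etree.ElementTree as ET


def check_file(path: Path) -> Optional[str]:
    """Return a short reason if the file looks problematic, else None."""
    data = path.read_bytes()
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"not valid UTF-8: {exc.reason} at byte {exc.start}"
    try:
        # Raises ParseError when the bytes are not well-formed XML.
        ET.fromstring(data)
    except ET.ParseError as exc:
        return f"not well-formed XML: {exc}"
    return None


def find_suspect_files(root: str, pattern: str = "*.shtml") -> Dict[str, str]:
    """Map each problematic file under root to the reason it was flagged."""
    suspects: Dict[str, str] = {}
    for path in sorted(Path(root).rglob(pattern)):
        reason = check_file(path)
        if reason is not None:
            suspects[str(path)] = reason
    return suspects
```

Calling `find_suspect_files("/tmp/es")` would return a dict of paths and reasons for the directory used in this thread; an empty dict would suggest the cause lies elsewhere.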
Could you share one of those files?
I'm trying to index shtml files and it's working great, except for one thing:
it doesn't index the file content. It indexes it as a base64 attachment instead.
Steps to reproduce:
- Rename a text file as .shtml
- Use FSCrawler to index it
config file: