
Indexing shtml content #672

Closed
cloud1250x4 opened this issue Jan 30, 2019 · 13 comments
Labels: wait for feedback (Waiting for the user feedback)

Comments


cloud1250x4 commented Jan 30, 2019

I'm trying to index shtml files and it's working great, except for one thing: it doesn't index the file content, it indexes it as a base64 attachment.

Steps to reproduce:

- Rename a text file to .shtml
- Use FSCrawler to index it

config file:

{
  "name" : "fscrawler_job",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : true,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "elasticsearch",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
@dadoonet (Owner)

Could you give an example of a document which is in Elasticsearch?

dadoonet added the "wait for feedback" label on Jan 31, 2019

cloud1250x4 commented Jan 31, 2019

Okay, I think I know what's going on here.

Some of the files I'm trying to index don't have a Content-Encoding, apparently.

[Two screenshots attached, taken 2019-01-31 at 1:55 PM and 1:54 PM]

Any way I can solve this directly with fscrawler?

@dadoonet (Owner)

I don't understand. The example you shared (please share the raw response instead of images next time) shows that a lot of things have been extracted, so I don't understand what your "problem" is. Could you describe it in more detail?


cloud1250x4 commented Jan 31, 2019

In the first picture, under source, the content field is missing, probably because it "can't" read the file.

I couldn't reproduce it with my test file; that's why I didn't share the raw response here, sorry.

@dadoonet (Owner)

I see. You could just try this file which is failing and launch FSCrawler with the --trace option.


cloud1250x4 commented Jan 31, 2019

Here is what it says with the --trace option:

fscrawler_1      | 20:04:33,492 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
fscrawler_1      | 20:04:33,494 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
fscrawler_1      | 20:04:34,166 WARN  [o.e.c.RestClient] request [PUT http://elasticsearch:9200/fscrawler_job] returned 1 warnings: [299 Elasticsearch-6.5.1-8c58350 "the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template" "Thu, 31 Jan 2019 20:04:33 GMT"]
fscrawler_1      | 20:04:34,211 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [fscrawler_job] for [/tmp/es] every [15m]
fscrawler_1      | 20:04:34,568 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
fscrawler_1      | See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
fscrawler_1      | for optional dependencies.
fscrawler_1      | 

I've started from scratch with only one document.

{"_index":"fscrawler_job","_type":"_doc","_id":"26dd35f4bc771a5ec7145d6c084b3bf","_score":1.0,"_source":{"attachment”:”removed”,”meta":{"raw":{"X-Parsed-By":"org.apache.tika.parser.DefaultParser","resourceName":"index.shtml","Content-Type":"application/xml"}},"file":{"extension":"shtml","content_type":"application/xml","created":"2019-01-31T20:09:32.000+0000","last_modified":"2019-01-31T20:09:32.000+0000","last_accessed":"2019-01-31T20:09:33.000+0000","indexing_date":"2019-01-31T20:10:13.717+0000","filesize":16502,"filename":"index.shtml","url":"file:///tmp/es/index.shtml"},"path":{"root":"824b64ab42d4b63cda6e747e2b80e5","virtual":"/index.shtml","real":"/tmp/es/index.shtml"}}}

@dadoonet (Owner)

Hmmm. There is no trace here. Could you start again with --debug?


cloud1250x4 commented Jan 31, 2019

Here we go; sorry, I did it wrong the first time:

https://gist.github.com/cloud1250x4/e6e6992daf04a27c476a9c000334ae5b

We can see it fails to parse the files.

@cloud1250x4 (Author)

I'll try and see if I can reproduce this with a file that I can actually share.

@dadoonet (Owner)

So the error is here:

fscrawler_1      | 20:18:35,476 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [index.shtml]
fscrawler_1      | org.apache.tika.exception.TikaException: XML parse error
fscrawler_1      | 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:133) ~[fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:478) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:259) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
fscrawler_1      | Caused by: org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$TrailingMiscDriver.next(XMLDocumentScannerImpl.java:1395) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327) ~[?:1.8.0_181]
fscrawler_1      | 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) ~[?:1.8.0_181]
fscrawler_1      | 	at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:371) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:75) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1      | 	... 9 more

The important part is: "The markup in the document following the root element must be well-formed."
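
For reference, the same routing can be reproduced with plain Tika outside of FSCrawler. The sketch below (not FSCrawler code; the file path is just an example) first prints the MIME type that Tika auto-detection assigns to the file, which in this case is application/xml and is what hands it to the strict XMLParser, and then runs the lenient HtmlParser on the same file to check whether the text is recoverable once the strict XML path is bypassed:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class ShtmlCheck {
    public static void main(String[] args) throws Exception {
        // Example path; point this at the failing file.
        Path file = Paths.get("/tmp/es/index.shtml");

        // 1. Print the MIME type Tika auto-detection assigns. In this issue it is
        //    application/xml, which routes the file to the strict XMLParser.
        System.out.println("Detected type: " + new Tika().detect(file.toFile()));

        // 2. Parse the same file with the lenient HtmlParser to see whether the
        //    text is recoverable once the strict XML path is bypassed.
        try (InputStream is = Files.newInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler(100_000);
            new HtmlParser().parse(is, handler, new Metadata(), new ParseContext());
            System.out.println("Extracted text:");
            System.out.println(handler.toString());
        }
    }
}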


dadoonet commented Feb 1, 2019

I'm closing the issue as I don't feel it's an FSCrawler issue. If your document is well formed, then you can open an issue in the Tika project.
That being said, I think that when Tika fails this way we should have an option not to index the document, and also log a warning about this file.
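
As a rough illustration of that option (hypothetical class and method names, not FSCrawler's actual code), the idea is to catch the Tika failure, log a warning with the filename, and return null so the caller can skip the document or index it without content:

import java.io.InputStream;
import java.util.logging.Logger;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class LenientExtractor {
    private static final Logger LOGGER = Logger.getLogger(LenientExtractor.class.getName());

    // Tries to extract text with Tika. On a parse failure it logs a warning and
    // returns null so the caller can decide to skip (or index without content).
    public static String extractOrSkip(String filename, InputStream stream) {
        try {
            BodyContentHandler handler = new BodyContentHandler(100_000);
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } catch (TikaException | SAXException e) {
            LOGGER.warning("Tika could not parse [" + filename + "]: " + e.getMessage());
            return null;
        } catch (Exception e) {
            LOGGER.warning("Unexpected error while parsing [" + filename + "]: " + e.getMessage());
            return null;
        }
    }
}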

@cloud1250x4 (Author)

Hm, okay. When I open the document and resave it, everything is fine. Probably some encoding problem.

Sadly, I have over 10,000 files like this. I really wonder what that other person did.


dadoonet commented Feb 1, 2019

Could you share one of those files?
