
Indexing shtml content #672

Closed
cloud1250x4 opened this issue Jan 30, 2019 · 13 comments
Labels: wait for feedback (Waiting for the user feedback)

Comments


cloud1250x4 commented Jan 30, 2019

I'm trying to index shtml files and it's working great, except for one thing: it doesn't index the file content, it indexes it as a base64 attachment.

Steps to reproduce:

- Rename a text file to .shtml
- Use FSCrawler to index it

config file:

{
  "name" : "fscrawler_job",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : true,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "elasticsearch",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
@dadoonet (Owner)

Could you give an example of a document which is in Elasticsearch?

dadoonet added the "wait for feedback" label on Jan 31, 2019

cloud1250x4 commented Jan 31, 2019

Okay, I think I know what's going on here.

Some of the files I'm trying to index don't have a Content-Encoding, apparently.

[Two screenshots attached, taken 2019-01-31 at 1:55 PM and 1:54 PM]

Any way I can solve this directly with fscrawler?

@dadoonet (Owner)

I don't understand. The example you shared (please share the raw response instead of images next time) shows that a lot of things have been extracted, so I don't understand what your "problem" is. Could you describe it in more detail?


cloud1250x4 commented Jan 31, 2019

In the first picture, under source, the content field is missing, probably because it "can't" read the file.

I couldn't reproduce it with my test file; that's why I didn't share the raw response here, sorry.

@dadoonet (Owner)

I see. You could just try this file which is failing and launch FSCrawler with the --trace option.


cloud1250x4 commented Jan 31, 2019

Here is what it says with the --trace option:

fscrawler_1      | 20:04:33,492 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
fscrawler_1      | 20:04:33,494 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
fscrawler_1      | 20:04:34,166 WARN  [o.e.c.RestClient] request [PUT http://elasticsearch:9200/fscrawler_job] returned 1 warnings: [299 Elasticsearch-6.5.1-8c58350 "the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template" "Thu, 31 Jan 2019 20:04:33 GMT"]
fscrawler_1      | 20:04:34,211 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [fscrawler_job] for [/tmp/es] every [15m]
fscrawler_1      | 20:04:34,568 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
fscrawler_1      | See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
fscrawler_1      | for optional dependencies.
fscrawler_1      | 

I've started from scratch with only one document.

{"_index":"fscrawler_job","_type":"_doc","_id":"26dd35f4bc771a5ec7145d6c084b3bf","_score":1.0,"_source":{"attachment”:”removed”,”meta":{"raw":{"X-Parsed-By":"org.apache.tika.parser.DefaultParser","resourceName":"index.shtml","Content-Type":"application/xml"}},"file":{"extension":"shtml","content_type":"application/xml","created":"2019-01-31T20:09:32.000+0000","last_modified":"2019-01-31T20:09:32.000+0000","last_accessed":"2019-01-31T20:09:33.000+0000","indexing_date":"2019-01-31T20:10:13.717+0000","filesize":16502,"filename":"index.shtml","url":"file:///tmp/es/index.shtml"},"path":{"root":"824b64ab42d4b63cda6e747e2b80e5","virtual":"/index.shtml","real":"/tmp/es/index.shtml"}}}

@dadoonet (Owner)

Hmmm. There is no trace here. Could you start again with --debug?


cloud1250x4 commented Jan 31, 2019

Here we go; sorry, I did it wrong the first time:

https://gist.github.com/cloud1250x4/e6e6992daf04a27c476a9c000334ae5b

We can see it fails to parse the files.

@cloud1250x4 (Author)

I'll try and see if I can reproduce this with a file that I can actually share.

@dadoonet (Owner)

So the error is here:

fscrawler_1      | 20:18:35,476 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [index.shtml]
fscrawler_1      | org.apache.tika.exception.TikaException: XML parse error
fscrawler_1      | 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:133) ~[fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:478) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:259) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
fscrawler_1      | Caused by: org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$TrailingMiscDriver.next(XMLDocumentScannerImpl.java:1395) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327) ~[?:1.8.0_181]
fscrawler_1      | 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) ~[?:1.8.0_181]
fscrawler_1      | 	at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:371) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:75) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1      | 	... 9 more

The important part is: "The markup in the document following the root element must be well-formed."
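
For reference, the same routing can be reproduced with plain Tika outside of FSCrawler. The sketch below (not FSCrawler code; the file path is just an example) first prints the MIME type that Tika auto-detection assigns to the file, which in this case is application/xml and is what hands it to the strict XMLParser, and then runs the lenient HtmlParser on the same file to check whether the text is recoverable once the strict XML path is bypassed:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

public class ShtmlCheck {
    public static void main(String[] args) throws Exception {
        // Example path; point this at the failing file.
        Path file = Paths.get("/tmp/es/index.shtml");

        // 1. Print the MIME type Tika auto-detection assigns. In this issue it is
        //    application/xml, which routes the file to the strict XMLParser.
        System.out.println("Detected type: " + new Tika().detect(file.toFile()));

        // 2. Parse the same file with the lenient HtmlParser to see whether the
        //    text is recoverable once the strict XML path is bypassed.
        try (InputStream is = Files.newInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler(100_000);
            new HtmlParser().parse(is, handler, new Metadata(), new ParseContext());
            System.out.println("Extracted text:");
            System.out.println(handler.toString());
        }
    }
}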


dadoonet commented Feb 1, 2019

I'm closing the issue as I don't feel it's an FSCrawler issue. If your document is well formed, then you can open an issue in the Tika project.
That being said, I think that when Tika fails this way we should have an option not to index the document, and also log a warning about this file.
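
As a rough illustration of that option (hypothetical class and method names, not FSCrawler's actual code), the idea is to catch the Tika failure, log a warning with the filename, and return null so the caller can skip the document or index it without content:

import java.io.InputStream;
import java.util.logging.Logger;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class LenientExtractor {
    private static final Logger LOGGER = Logger.getLogger(LenientExtractor.class.getName());

    // Tries to extract text with Tika. On a parse failure it logs a warning and
    // returns null so the caller can decide to skip (or index without content).
    public static String extractOrSkip(String filename, InputStream stream) {
        try {
            BodyContentHandler handler = new BodyContentHandler(100_000);
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } catch (TikaException | SAXException e) {
            LOGGER.warning("Tika could not parse [" + filename + "]: " + e.getMessage());
            return null;
        } catch (Exception e) {
            LOGGER.warning("Unexpected error while parsing [" + filename + "]: " + e.getMessage());
            return null;
        }
    }
}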

@cloud1250x4 (Author)

Hm, okay. When I open the document and resave it, everything is fine. Probably some encoding problem.

Sadly, I have over 10,000 files like this. I really wonder what that other person did.


dadoonet commented Feb 1, 2019

Could you share one of those files?
