It seems that when Tika fails to extract content, we only log the failure at DEBUG level.
We should WARN the user instead (without the full stacktrace, though).
fscrawler_1 | 20:18:35,476 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [index.shtml]
fscrawler_1 | org.apache.tika.exception.TikaException: XML parse error
fscrawler_1 | at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1 | at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1 | at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1 | at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.jar:1.19]
fscrawler_1 | at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:133) ~[fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1 | at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1 | at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:478) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1 | at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:259) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1 | at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1 | at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
fscrawler_1 | Caused by: org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
fscrawler_1 | at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$TrailingMiscDriver.next(XMLDocumentScannerImpl.java:1395) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643) ~[?:1.8.0_181]
fscrawler_1 | at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327) ~[?:1.8.0_181]
fscrawler_1 | at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) ~[?:1.8.0_181]
fscrawler_1 | at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:371) ~[tika-core-1.19.jar:1.19]
fscrawler_1 | at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:75) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1 | ... 9 more
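A minimal sketch of the behavior requested above: emit a one-line warning that names the file and the root cause, and keep the full stacktrace at debug level only. This is an illustration under stated assumptions, not the actual FSCrawler change. The class `ExtractionLogging` and the methods `summarize` and `logExtractionFailure` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained (`WARNING` and `FINE` stand in for WARN and DEBUG).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of FSCrawler: shows one way to split a failure
// into a short WARN line and a detailed DEBUG entry.
class ExtractionLogging {
    private static final Logger logger =
            Logger.getLogger(ExtractionLogging.class.getName());

    /** Builds a one-line summary: top-level message plus root cause, no stack frames. */
    static String summarize(String filename, Throwable error) {
        // Walk the cause chain down to the root cause.
        Throwable root = error;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        StringBuilder sb = new StringBuilder();
        sb.append("Failed to extract text for [").append(filename).append("]: ")
          .append(error.getMessage());
        if (root != error) {
            sb.append(" (caused by ").append(root.getClass().getSimpleName())
              .append(": ").append(root.getMessage()).append(")");
        }
        return sb.toString();
    }

    /** Logs the summary at WARNING and the full stacktrace at FINE only. */
    static void logExtractionFailure(String filename, Throwable error) {
        // Short message: always visible at the default log level.
        logger.log(Level.WARNING, summarize(filename, error));
        // Passing the Throwable makes the logger print the stacktrace,
        // but only when FINE (debug-like) logging is enabled.
        logger.log(Level.FINE, "Full stacktrace for [" + filename + "]", error);
    }
}
```

With the failure shown in the log above, the warning would carry the `XML parse error` message plus the `SAXParseException` root cause on a single line, while the frames stay hidden unless debug logging is turned on.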
dadoonet changed the title from "Warn and optionally do not index a document in case of Tika error" to "Warn in case of Tika error" on Feb 1, 2019
From #672 (comment)