Warn in case of Tika error #674

Closed
dadoonet opened this issue Feb 1, 2019 · 0 comments
dadoonet commented Feb 1, 2019

From #672 (comment)

It seems that when Tika fails to extract content, we only log the failure at DEBUG level.
We should WARN the user instead (without the full stack trace, though).

fscrawler_1      | 20:18:35,476 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [index.shtml]
fscrawler_1      | org.apache.tika.exception.TikaException: XML parse error
fscrawler_1      | 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:81) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:133) ~[fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:478) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:259) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:160) [fscrawler-core-2.6-SNAPSHOT.jar:?]
fscrawler_1      | 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
fscrawler_1      | Caused by: org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$TrailingMiscDriver.next(XMLDocumentScannerImpl.java:1395) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643) ~[?:1.8.0_181]
fscrawler_1      | 	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327) ~[?:1.8.0_181]
fscrawler_1      | 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) ~[?:1.8.0_181]
fscrawler_1      | 	at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:371) ~[tika-core-1.19.jar:1.19]
fscrawler_1      | 	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:75) ~[tika-parsers-1.19.jar:1.19]
fscrawler_1      | 	... 9 more
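
A minimal sketch of the behaviour requested above: one short WARN line naming the file and the cause, with the full stack trace kept at the debug level. This is not FSCrawler's actual code. The class `TikaWarnSketch` and the methods `summarize` and `logExtractionFailure` are hypothetical names, and `java.util.logging` is used only so the example has no dependencies (its `WARNING` and `FINE` levels stand in for WARN and DEBUG).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class TikaWarnSketch {
    private static final Logger logger = Logger.getLogger(TikaWarnSketch.class.getName());

    /**
     * Builds a one-line summary of a failure: the top-level exception and,
     * when there is a distinct root cause, that root cause as well.
     */
    static String summarize(Throwable t) {
        // Walk the cause chain down to the root cause.
        Throwable root = t;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        String summary = t.getClass().getSimpleName() + ": " + t.getMessage();
        if (root != t) {
            summary += " (caused by " + root.getClass().getSimpleName()
                    + ": " + root.getMessage() + ")";
        }
        return summary;
    }

    /** Hypothetical helper: short warning for the user, full trace only at debug level. */
    static void logExtractionFailure(String filename, Throwable t) {
        logger.log(Level.WARNING, "Failed to extract text for [{0}]: {1}",
                new Object[] {filename, summarize(t)});
        // FINE is hidden by default, so the stack trace only shows up when debugging.
        logger.log(Level.FINE, "Full stack trace for [" + filename + "]", t);
    }

    public static void main(String[] args) {
        // Plain exceptions stand in for TikaException and SAXParseException here.
        Exception cause = new IllegalStateException(
                "The markup in the document following the root element must be well-formed.");
        Exception failure = new RuntimeException("XML parse error", cause);
        logExtractionFailure("index.shtml", failure);
    }
}
```

Including the root cause in the warning keeps the message actionable (here, the malformed markup) without printing the several dozen stack frames shown in the log above.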
@dadoonet dadoonet changed the title from "Warn and optionally do not index a document in case of Tika error" to "Warn in case of Tika error" Feb 1, 2019
@dadoonet dadoonet added the "update" (When updating an existing feature) label Feb 1, 2019
@dadoonet dadoonet added this to the 2.7 milestone Feb 1, 2019
@dadoonet dadoonet self-assigned this Feb 1, 2019
dadoonet added a commit that referenced this issue Feb 1, 2019
@dadoonet dadoonet removed this from the 2.7 milestone Feb 1, 2019