trying to index remote files with ssh - files seen as folder #680

Closed
sblanc0054 opened this issue Feb 12, 2019 · 4 comments · Fixed by #681

@sblanc0054

sblanc0054 commented Feb 12, 2019

I'm trying to index files using fscrawler 6-2.6.
It works fine with local folders.
When I try to crawl a remote server via SSH, it doesn't work.
With the debug option enabled, the log suggests that documents are being treated as folders.

debug log
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/...path.../folder, /...path.../folder) = /
DEBUG [f.p.e.c.f.FsParserAbstract] Indexing contenu_folder/_doc/6094a0b4ed1330ad3f73d69ef1d3f97?pipeline=null
DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/...path.../folder] content
DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /...path.../folder
DEBUG [f.p.e.c.f.c.FileAbstractor] 173 local files found
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/...path.../folder, /...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf) = /5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [true], filename = [/5e8c8404515d64513b5d2ce56ae3d9ec.pdf], includes = [[*.pdf]], excludes = [null]
DEBUG [f.p.e.c.f.FsParserAbstract] [/5e8c8404515d64513b5d2ce56ae3d9ec.pdf] can be indexed: [true]
DEBUG [f.p.e.c.f.FsParserAbstract]   - folder: 5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/...path.../folder, /...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf) = /5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.FsParserAbstract] Indexing contenu_folder/_doc/3af8b5835d2fe7605cc9fbca1bfe519c?pipeline=null
DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf] content
DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.c.FileAbstractor] 1 local files found
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/...path.../folder, /...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf) = /5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [true], filename = [/5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf], includes = [[*.pdf]], excludes = [null]
DEBUG [f.p.e.c.f.FsParserAbstract] [/5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf] can be indexed: [true]
DEBUG [f.p.e.c.f.FsParserAbstract]   - folder: 5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/...path.../folder, /...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf) = /5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf
DEBUG [f.p.e.c.f.FsParserAbstract] Indexing contenu_folder/_doc/bb49c3ae6067d131922716aa534261c?pipeline=null
DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf] content
DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from /...path.../folder/5e8c8404515d64513b5d2ce56ae3d9ec.pdf/5e8c8404515d64513b5d2ce56ae3d9ec.pdf
WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /...path.../folder: No such file
WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
com.jcraft.jsch.SftpException: No such file
        at com.jcraft.jsch.ChannelSftp.throwStatusError(ChannelSftp.java:2873) ~[jsch-0.1.54.jar:?]
        at com.jcraft.jsch.ChannelSftp._stat(ChannelSftp.java:2225) ~[jsch-0.1.54.jar:?]
        at com.jcraft.jsch.ChannelSftp._stat(ChannelSftp.java:2242) ~[jsch-0.1.54.jar:?]
        at com.jcraft.jsch.ChannelSftp.ls(ChannelSftp.java:1592) ~[jsch-0.1.54.jar:?]
        at com.jcraft.jsch.ChannelSftp.ls(ChannelSftp.java:1553) ~[jsch-0.1.54.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.crawler.ssh.FileAbstractorSSH.getFiles(FileAbstractorSSH.java:80) ~[fscrawler-crawler-ssh-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:241) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:157) [fscrawler-core-2.6.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
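
One reading of the log above, offered as an assumption rather than a confirmed diagnosis: every remote entry is reported with `directory = [true]`, so the crawler descends into the PDF as if it were a folder. Listing a file path over SFTP appears to return the file itself (the log shows `1 local files found`), and the next level, `.../x.pdf/x.pdf`, does not exist, which would explain the `No such file` error. The minimal Java simulation below reproduces that pattern with an in-memory tree. `CrawlSketch`, `ls`, `crawl`, and the `/folder/a.pdf` layout are all hypothetical names for illustration; this is not fscrawler or JSch code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Simulation of the recursion pattern in the debug log; not fscrawler code. */
class CrawlSketch {
    // Hypothetical remote tree: one folder holding one PDF.
    static final Set<String> DIRS = Collections.singleton("/folder");
    static final Set<String> FILES = Collections.singleton("/folder/a.pdf");

    /** Assumed SFTP-like "ls": a directory lists its children, a file lists itself. */
    static List<String> ls(String path) {
        List<String> names = new ArrayList<>();
        if (DIRS.contains(path)) {
            for (String f : FILES) {
                if (f.startsWith(path + "/")) {
                    String rest = f.substring(path.length() + 1);
                    if (!rest.contains("/")) {
                        names.add(rest); // direct children only
                    }
                }
            }
        } else if (FILES.contains(path)) {
            names.add(path.substring(path.lastIndexOf('/') + 1));
        } else {
            throw new IllegalStateException("No such file: " + path);
        }
        return names;
    }

    /**
     * Walks the tree and records each step in trace.
     * treatEverythingAsDir = true mimics the suspected bug.
     */
    static void crawl(String path, boolean treatEverythingAsDir, List<String> trace) {
        trace.add("list " + path);
        for (String name : ls(path)) {
            String child = path + "/" + name;
            boolean isDir = treatEverythingAsDir || DIRS.contains(child);
            if (isDir) {
                crawl(child, treatEverythingAsDir, trace); // recurse into "folder"
            } else {
                trace.add("index " + child); // regular file: index it
            }
        }
    }

    public static void main(String[] args) {
        List<String> ok = new ArrayList<>();
        crawl("/folder", false, ok);
        System.out.println("correct flag: " + ok);

        List<String> bad = new ArrayList<>();
        try {
            crawl("/folder", true, bad);
        } catch (IllegalStateException e) {
            System.out.println("buggy flag: " + bad + " then " + e.getMessage());
        }
    }
}
```

With the flag set to `false`, the trace contains one listing and one indexed file. With it set to `true`, the simulated crawl fails on the third listing, mirroring the sequence in the log.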

settings

```json
{
  "name" : "contenu",
  "fs" : {
    "url" : "/..path.../folder",
    "update_rate" : "1m",
    "includes": [ "*.pdf" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : true,
    "index_content" : true,
    "indexed_chars": "100%",
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "ignore_above": "5mb",
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : false,
    "ocr" : {
      "language" : "eng+fra"
    }
  },
  "server" : {
    "hostname" : "remoteip",
    "port" : "22",
    "username" : "account",
    "password" : "password",
    "protocol" : "ssh"
  },
  "elasticsearch" : {
    "nodes" : [ { "url" : "http://127.0.0.1:9200" } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  }
}
```

Versions:

  • OS: SLES15
  • fscrawler 6-2.6
@dadoonet dadoonet added the bug For confirmed bugs label Feb 12, 2019
@dadoonet dadoonet self-assigned this Feb 12, 2019
@dadoonet
Owner

Thanks for testing it. I must confess that I don't test the SSH mode very often, and I believe I haven't tested it for a year... That's probably why it's buggy. 🐛

I'll give it a look and fix it.

@dadoonet dadoonet added check_for_bug Needs to be reproduced and removed check_for_bug Needs to be reproduced labels Feb 12, 2019
@dadoonet
Owner

I found the issue. I'll come up with a patch soon.

dadoonet added a commit that referenced this issue Feb 12, 2019
Also fix when SSH date is null. It was generating a NPE.

Closes #680.
@dadoonet dadoonet added this to the 2.7 milestone Feb 12, 2019
@dadoonet
Owner

@sblanc0054 Could you download a recent version from https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es6/2.7-SNAPSHOT/ ? I think it will take a few minutes before a new SNAPSHOT is available.

If you don't see a new SNAPSHOT dated today (Tuesday) within a few hours, please ping me here and I'll generate it manually.

@sblanc0054
Author

Thanks a lot. Works like a charm.
