
Read from any FS Provider using the REST Service #1247

Closed
dadoonet opened this issue Sep 6, 2021 · 2 comments · Fixed by #1937
dadoonet commented Sep 6, 2021

We want to be able to send commands to FSCrawler that fetch a file from any provider, such as the local filesystem where FSCrawler is running, or S3...

FSCrawler supports the following services:

  • local: reads a file from the server where FSCrawler is running (a local file)
  • http: reads a file from a URL
  • s3: reads a file from an S3 compatible service

To upload a binary from a third-party service, you can call the POST /_document endpoint and pass
a JSON document that describes the service settings:

curl -XPOST http://127.0.0.1:8080/fscrawler/_document -H 'Content-Type: application/json' -d '{
  "type": "<TYPE>",
  "<TYPE>": {
    // Settings for the <TYPE>
  }
}'
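Any HTTP client can build this payload and call the endpoint. A minimal Python sketch using only the standard library (`build_document_request` is an illustrative helper, not part of FSCrawler; the host and port match the curl example above):

```python
import json

def build_document_request(provider_type: str, settings: dict) -> str:
    """Build the JSON body for POST /fscrawler/_document.

    The payload has a "type" field naming the provider, plus a key of
    the same name holding the provider-specific settings.
    """
    return json.dumps({"type": provider_type, provider_type: settings})

body = build_document_request("local", {"url": "/path/to/foo/bar.txt"})
print(body)

# To actually send it (requires a running FSCrawler REST service):
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8080/fscrawler/_document",
#     data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```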

Local plugin

The local plugin reads a file from the server where FSCrawler is running (a local file).
It needs the following parameter:

  • url: link to the local file

For example, we can read the file bar.txt from the /path/to/foo directory with:

curl -XPOST http://127.0.0.1:8080/fscrawler/_document -H 'Content-Type: application/json' -d '{
  "type": "local",
  "local": {
    "url": "/path/to/foo/bar.txt"
  }
}'

HTTP plugin

The http plugin reads a file from a given URL.
It needs the following parameter:

  • url: link to the file

For example, we can read the file robots.txt from the https://www.elastic.co/ website with:

curl -XPOST http://127.0.0.1:8080/fscrawler/_document -H 'Content-Type: application/json' -d '{
  "type": "http",
  "http": {
    "url": "https://www.elastic.co/robots.txt"
  }
}'

S3 plugin

The s3 plugin reads a file from an S3 compatible service.
It needs the following parameters:

  • url: URL of the S3 service
  • bucket: bucket name
  • object: object to read from the bucket
  • access_key: access key (or login)
  • secret_key: secret key (or password)

For example, we can read the file foo.txt from the bucket foo running on https://s3.amazonaws.com:

curl -XPOST http://127.0.0.1:8080/fscrawler/_document -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "s3": {
    "url": "https://s3.amazonaws.com",
    "bucket": "foo",
    "object": "foo.txt",
    "access_key": "ACCESS",
    "secret_key": "SECRET"
  }
}'

If you are using MinIO, you can use:

curl -XPOST http://127.0.0.1:8080/fscrawler/_document -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "s3": {
    "url": "http://localhost:9000",
    "bucket": "foo",
    "object": "foo.txt",
    "access_key": "minioadmin",
    "secret_key": "minioadmin"
  }
}'
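The AWS and MinIO examples above differ only in endpoint and credentials. A small sketch that assembles the s3 request body and rejects blank fields (`s3_document_body` is a hypothetical helper for illustration, not FSCrawler code):

```python
import json

def s3_document_body(url: str, bucket: str, obj: str,
                     access_key: str, secret_key: str) -> str:
    """Assemble the s3 body for POST /fscrawler/_document.

    All five settings are required by the s3 plugin, so raise early
    if any of them is empty.
    """
    settings = {"url": url, "bucket": bucket, "object": obj,
                "access_key": access_key, "secret_key": secret_key}
    missing = [k for k, v in settings.items() if not v]
    if missing:
        raise ValueError(f"missing s3 settings: {missing}")
    return json.dumps({"type": "s3", "s3": settings})

# The same call covers AWS S3 and a local MinIO endpoint:
aws = s3_document_body("https://s3.amazonaws.com", "foo", "foo.txt",
                       "ACCESS", "SECRET")
minio = s3_document_body("http://localhost:9000", "foo", "foo.txt",
                         "minioadmin", "minioadmin")
```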
dadoonet added the feature_request label Sep 6, 2021
dadoonet added this to the 2.8 milestone Sep 6, 2021
dadoonet commented Sep 6, 2021

This could maybe also solve #805.

dadoonet modified the milestones: 2.8, 2.9 Dec 14, 2021
dadoonet modified the milestones: 2.9, 2.10 Jan 10, 2022
dadoonet commented:
#1897 proposes one implementation for the http URL case.

dadoonet self-assigned this Sep 20, 2024