
AWS S3 Integration #1

Closed
2m opened this issue Oct 20, 2016 · 39 comments

@2m
Member

2m commented Oct 20, 2016

Continued from akka/akka-stream-contrib#75

@johanandren
Member

The one I started on but haven't really had time to move on with, not based on the aws java sdk but on top of akka http client: https://github.com/johanandren/awsync

Feel free to take any bits and pieces that are useful from it.

@agolubev
Contributor

I'm on it.
My intention is to implement an API similar to the one the HTTP client has (with vocabulary from AWS), something like S3().fromObject and S3().toObject.
I'm planning to use the AWS Java API, so please let me know if you strongly recommend REST.
I'll also take the FileIO API into account for file-based sinks/sources.

@juanjovazquez

Could Apache jclouds be under consideration? The idea would be to build the connector on top of the blobstore abstraction designed there. That way we could save a lot of time, as several integrations, e.g. AWS S3, Google Cloud Storage or Azure Blobs, would be solved at once. I have some code in this direction that I can share if you like.

@juanjovazquez

Another argument in favor of jclouds is that the local file system could be used as just another blobstore provider. That can be very useful for testing purposes, since you wouldn't incur cloud provider charges while keeping the code unaltered. Changing environments would be a matter of configuration. Just my two cents.

@agolubev
Contributor

jclouds uses the AWS S3 REST API with javax.ws.rs and a synchronous approach.
It looks good and is a wrapper that can be mocked for testing purposes. However, the several layers of abstraction can add extra latency.

@johanandren
Member

johanandren commented Oct 23, 2016

The AWS Java API is also a synchronous client using the AWS XML HTTP APIs, no? (Even if they have some "async" way of running a request and periodically checking on its status, AFAIR.)

@agolubev
Contributor

Yeah, going to look closely at https://github.com/aws/aws-sdk-java

@agolubev
Contributor

The main goal is to choose the fastest API.

@juanjovazquez

Working with blobstore cloud providers always introduces some inherent latency. I'm only saying that it might be worth it to leverage the work done by jclouds and avoid reinventing the wheel. Anyway, taking other blobstore providers into account might bring a better perspective on how to handle the problem in a more generic way.

@agolubev
Contributor

Did some digging into the AWS Java API.
It implements an async API with TransferManager for uploading/downloading objects, based on java.util.concurrent.Future. The minimum task is transferring the whole object (for a single download) or a portion of it in the case of multipart download/upload.
It uses the REST API underneath.
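As an aside on that Future-based model: a java.util.concurrent.Future can be bridged to a non-blocking CompletableFuture by polling on a scheduler instead of blocking in get(). A minimal stdlib sketch (the names here are illustrative, not part of the AWS SDK):

```java
import java.util.concurrent.*;

public class FutureBridge {
    // Adapt a blocking j.u.c.Future into a CompletableFuture by re-checking
    // isDone() on a scheduler, so no thread ever blocks inside get().
    public static <T> CompletableFuture<T> poll(
            Future<T> f, ScheduledExecutorService scheduler, long periodMillis) {
        CompletableFuture<T> cf = new CompletableFuture<>();
        Runnable check = new Runnable() {
            @Override public void run() {
                if (f.isDone()) {
                    try {
                        cf.complete(f.get()); // returns immediately: f is done
                    } catch (Exception e) {
                        cf.completeExceptionally(e);
                    }
                } else {
                    scheduler.schedule(this, periodMillis, TimeUnit.MILLISECONDS);
                }
            }
        };
        scheduler.schedule(check, periodMillis, TimeUnit.MILLISECONDS);
        return cf;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Future<String> slow = pool.submit(() -> { Thread.sleep(100); return "done"; });
        System.out.println(poll(slow, scheduler, 10).get(5, TimeUnit.SECONDS)); // prints "done"
        pool.shutdown();
        scheduler.shutdown();
    }
}
```

Polling trades a small completion delay (up to one period) for not tying up a thread per in-flight transfer, which is roughly the trade-off the discussion above is about.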

@agolubev
Contributor

Camel AWS uses the AWS Java API.
jclouds uses the AWS REST API directly.
So, a question to the core team (@ktoso @patriknw), do you want it:

  • soon and universal, with jclouds or the AWS Java API,
  • or fast-functioning, with akka-http and the AWS REST API?

@2m
Member Author

2m commented Oct 25, 2016

My vote goes for akka-http. It would be a great validation of the current Akka HTTP client, and it could also be a driving force for implementing missing features in it.

@patriknw
Member

I don't think we should re-invent the wheel. If there are good client libraries we should integrate with them instead of developing the same thing again. It might not always be the optimal solution, but I think time to market and maintenance cost are more important at this stage of the project.

I have no opinion (experience) of jclouds vs aws java api.

@johanandren
Member

I think pure async would be nice, but probably a lot of work. I don't see what jclouds improves over the AWS Java client except for an abundance of abstraction layers, so my vote is for building on top of the Java client for now.

@juanjovazquez

The blobstore abstraction carried out in jclouds comes from the fact that there are a lot of similarities among different cloud providers, so it's feasible, and maybe convenient, to create this abstraction layer. Something similar happens with traversing file systems, and that's the reason the Camel folks created an abstraction layer on which they built similar connectors, e.g. File or FTP. I plan to apply this same approach to the FTP connector after the first usable version is ready.

My vote is for thinking in time to market and taking advantage of previous efforts carried out by the community. Scala came with the promise of leveraging existing Java libraries and previous investments. IMHO, that's what the users expect. Having a bunch of connectors almost "for free" is something that deserves at least a closer look and evaluation.

@ktoso
Member

ktoso commented Oct 25, 2016

I agree that using an existing API is good as a first step. Then we have a working integration, and a more "reactive native" one can follow soon after if someone has time to do it.

@joearasin

We have an akka-http based implementation that works, but is a bit rough around the edges. This includes a library for signing Akka-HTTP requests for AWS. We'd love to contribute it to the project.

@joearasin

As far as question marks go, one (nontrivial) bit in S3-land is credentials. It'd be really sweet to implement a Source that produces a stream of AWS credentials, because it's a nice abstraction around S3 credential refreshing on EC2 boxes.
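The refresh-on-expiry logic underneath such a Source can be sketched independently of streams: a supplier that re-fetches credentials when they get close to expiring. A minimal stdlib sketch, assuming a hypothetical Credentials type (on a real EC2 box the fetch would hit the instance metadata endpoint):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class RefreshingCredentials {
    // Hypothetical credentials holder; on EC2 these fields would come from
    // the instance metadata service together with an expiration timestamp.
    public static final class Credentials {
        public final String accessKeyId, secretKey, sessionToken;
        public final Instant expiresAt;
        public Credentials(String id, String secret, String token, Instant exp) {
            this.accessKeyId = id; this.secretKey = secret;
            this.sessionToken = token; this.expiresAt = exp;
        }
    }

    private final Supplier<Credentials> fetch; // e.g. a metadata-service call
    private final Duration refreshAhead;       // refresh this long before expiry
    private volatile Credentials cached;

    public RefreshingCredentials(Supplier<Credentials> fetch, Duration refreshAhead) {
        this.fetch = fetch;
        this.refreshAhead = refreshAhead;
    }

    // Return cached credentials, re-fetching only when close to expiry.
    public synchronized Credentials get() {
        Instant deadline = Instant.now().plus(refreshAhead);
        if (cached == null || cached.expiresAt.isBefore(deadline)) {
            cached = fetch.get();
        }
        return cached;
    }
}
```

A stream-based Source would essentially wrap this get() so that each emitted element carries currently valid credentials.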

@ktoso
Member

ktoso commented Oct 27, 2016

So... looking forward to PRs. We're happy to accept either, I think; if one is more tested and used in the real world, we'd prefer that. The impl is yours, @joearasin, I assume, right?

The signing, AFAIR, is general for just credentials + requests, right? So it would be reusable for other AWS APIs as well?

Please coordinate here with the others who wanted to contribute.

@joearasin

The library is bluelabsio/s3-stream -- and yeah, the signing is reusable (I have the signing code split off into a separate module).
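For context on why the signing module is reusable across AWS services: AWS Signature Version 4 derives a per-request signing key through a fixed HMAC-SHA256 chain that is documented in the AWS spec and depends only on the secret key, date, region, and service name. A stdlib sketch of just that derivation step (the values used with it are examples, not real credentials):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class SigV4 {
    static byte[] hmacSha256(byte[] key, String data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
    }

    // Signature V4 signing-key derivation, per the AWS spec:
    // kSigning = HMAC(HMAC(HMAC(HMAC("AWS4"+secret, date), region), service),
    //                 "aws4_request")
    public static byte[] signingKey(String secret, String date,
                                    String region, String service) throws Exception {
        byte[] kDate = hmacSha256(("AWS4" + secret).getBytes(StandardCharsets.UTF_8), date);
        byte[] kRegion = hmacSha256(kDate, region);
        byte[] kService = hmacSha256(kRegion, service);
        return hmacSha256(kService, "aws4_request");
    }

    // Lowercase hex encoding, as SigV4 signatures are transmitted.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

Because nothing here is S3-specific (the service name is just a parameter), the same derivation serves any AWS API, which is what makes splitting it into its own module sensible.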

@agolubev
Contributor

Hmm, so the question is who will do the initial PR and when. Migrating the existing code should be faster. @joearasin, are you going to open the PR? Or I can do it (probably during the weekend) and you can review it then?

@agolubev
Contributor

Is anyone doing anything here? I've actually put together a small proof of concept with jclouds.
I'm still willing to move/enhance s3-stream within Alpakka (so we won't be starting from scratch here).

@joearasin

joearasin commented Oct 31, 2016

I'm putting something together. One thing I'd like to figure out before the PR: there was one issue we were having a bit of a debate over, and I wanted to open it up here, namely whether we should merge that PR over there before bringing things here.

Here's the issue at hand: bluelabsio/s3-stream#12. In particular, is caching incomplete upload chunks to the file system something that should be handled as part of this, or is it a piece of logic that should be pushed to "outside" the S3 upload flow?
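Wherever the chunk caching ends up living, the core multipart step is the same: splitting the incoming bytes into fixed-size parts (S3 requires every part of a multipart upload except the last to be at least 5 MiB). A minimal sketch of that splitting, using a tiny part size for illustration only:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Chunker {
    // Split a stream into fixed-size parts; every part except possibly the
    // last has exactly partSize bytes. S3 multipart uploads require parts
    // other than the last to be at least 5 MiB; small sizes are for tests.
    public static List<byte[]> chunk(InputStream in, int partSize) throws IOException {
        List<byte[]> parts = new ArrayList<>();
        byte[] buf = new byte[partSize];
        int filled = 0;
        int n;
        while ((n = in.read(buf, filled, partSize - filled)) != -1) {
            filled += n;
            if (filled == partSize) {      // part complete: emit and start fresh
                parts.add(buf);
                buf = new byte[partSize];
                filled = 0;
            }
        }
        if (filled > 0) parts.add(Arrays.copyOf(buf, filled)); // trailing partial part
        return parts;
    }
}
```

The debate above is then about what happens to the trailing, not-yet-full buffer when the stream pauses: keep it in memory inside the flow, spill it to disk, or hand that decision to the caller.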

@agolubev
Contributor

agolubev commented Oct 31, 2016

I think we should skip this for now and add it afterwards.
My opinion is that if this is a global setting, it should live in the config files where the akka config is.
Maybe this is a standard case, so the core team can guide us here.

@agolubev
Contributor

@joearasin have you considered an application that mocks the S3 service locally? I couldn't find tests for the S3 source itself.

@joearasin

That's another question worth asking. I absolutely want to test that code. Is there going to be a preferred testing approach in this repo for external services?

@patriknw
Member

patriknw commented Nov 1, 2016

We would like to use Travis, at least initially. Anything you can run on Travis is fine; we can start things from the Travis startup script. Lightweight testing is of course preferred, to keep build times short, and in the end there might be a limit on how much we can run on (free) Travis.

@filosganga
Contributor

I have used s3ninja (http://s3ninja.net/) successfully for testing. However, it is awkward to embed, as its main class is defined in the root package. I have written a Docker-based test that starts and stops the s3ninja Docker container using the Spotify Docker client, and I am quite happy with that.

s3ninja does not have all the S3 features, but in general it is enough.

@jypma
Member

jypma commented Nov 8, 2016

It doesn't look like s3ninja does multipart uploads, which is the main use case for s3-stream. I had it on my own list to try out https://github.com/ianbytchek/docker-riak-cs, which is supposed to be a much more complete S3 implementation, as long as it can be brought up on Travis.

@jypma
Member

jypma commented Nov 8, 2016

Or one could perhaps just use WireMock; the I/O isn't going to be that much.

@jypma
Member

jypma commented Nov 8, 2016

Just started #24 to get this rolling.

@agolubev
Contributor

agolubev commented Nov 8, 2016

Cool. Next questions:

  1. Are we going to merge this PR to master, or will a branch be used for some time? I'm voting for a branch.
  2. Do we need to support a Java API, with Java tests? We can postpone this, but at least we need a ticket for it.
  3. We probably also need a ticket for testing.
  4. Should we review the PR and create tickets based on the feedback? (Trying to jump in and take some ticket for myself =))

@jypma
Member

jypma commented Nov 8, 2016

@agolubev Feel free to branch off mine and add what you feel could do with additional tests. I can put them on top of my branch in the PR then.

@agolubev
Contributor

agolubev commented Nov 8, 2016

@jypma I would rather move some settings to a config file, if that's OK with you.

@jypma
Member

jypma commented Nov 8, 2016

@agolubev Makes total sense. Maybe even model them as a nice SettingsCompanion thingy that one can then pass along, to override them in code.
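The SettingsCompanion idea above could look like the following stdlib sketch: defaults are read from configuration, and an immutable copy method lets code override individual values before passing the settings along. java.util.Properties stands in here for the Typesafe Config that akka modules actually read, and all names are hypothetical:

```java
import java.util.Properties;

public final class S3Settings {
    // Hypothetical settings holder: defaults come from configuration,
    // individual values can be overridden in code via with* copies.
    public final String proxyHost;
    public final int chunkSize;

    private S3Settings(String proxyHost, int chunkSize) {
        this.proxyHost = proxyHost;
        this.chunkSize = chunkSize;
    }

    // Companion-style factory reading defaults from a config source.
    // The keys and the 5 MiB default are illustrative.
    public static S3Settings apply(Properties config) {
        return new S3Settings(
            config.getProperty("alpakka.s3.proxy-host", ""),
            Integer.parseInt(config.getProperty("alpakka.s3.chunk-size", "5242880")));
    }

    // Immutable copy with one value replaced, so settings can be tweaked
    // in code and handed to an individual stream without mutating defaults.
    public S3Settings withChunkSize(int chunkSize) {
        return new S3Settings(this.proxyHost, chunkSize);
    }
}
```

This mirrors the pattern other Akka modules use: config-file defaults for the common case, programmatic overrides for the exceptions.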

@patriknw
Member

patriknw commented Nov 8, 2016

Isn't it easier to collaborate if we just work on master? We can add something to the build to avoid releasing this module until it's ready.

@jypma
Member

jypma commented Nov 8, 2016

Sure, master works for me. I'll see if I can unbreak the build tomorrow :)

@filosganga
Contributor

@jypma s3ninja supports multipart upload; I am using it in another project. But if docker-riak-cs is more complete, I am happy with that as well.

@2m
Member Author

2m commented Nov 23, 2016

S3 support landed with #24

@2m 2m closed this as completed Nov 23, 2016
@2m 2m added this to the 0.2 milestone Nov 23, 2016
DanieleSassoli referenced this issue in seglo/alpakka Dec 10, 2019
Keeping up to date with alpakka