
S3 relay interface #833

Merged · 29 commits · Oct 30, 2024
Conversation

@cody-littley (Contributor) commented Oct 24, 2024

Why are these changes needed?

This PR adds utilities for uploading files to and downloading files from S3 in a way that supports breaking large files into smaller fragments.
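As a rough illustration of the fragmentation idea, here is a minimal sketch; the helper name and signature are hypothetical, not the PR's actual API:

```go
package main

import "fmt"

// breakIntoFragments splits data into fragments of at most fragmentSize bytes.
// Hypothetical helper illustrating the idea; not the PR's actual API.
func breakIntoFragments(data []byte, fragmentSize int) [][]byte {
	var fragments [][]byte
	for start := 0; start < len(data); start += fragmentSize {
		end := start + fragmentSize
		if end > len(data) {
			end = len(data)
		}
		fragments = append(fragments, data[start:end])
	}
	return fragments
}

func main() {
	fragments := breakIntoFragments([]byte("0123456789"), 4)
	fmt.Println(len(fragments)) // 3 fragments: "0123", "4567", "89"
}
```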

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

Signed-off-by: Cody Littley <[email protected]>
@dmanc (Contributor) left a comment:

Looks great, some initial comments

go.mod Outdated
@@ -6,6 +6,7 @@ toolchain go1.21.1

require (
github.com/Layr-Labs/eigensdk-go v0.1.7-0.20240507215523-7e4891d5099a
github.com/aws/aws-sdk-go v1.55.5
Contributor:

Is it possible to do it with just v2 library?

Contributor Author:

Yes, refactored to only use v2 API.

PrefixChars int
// This framework utilizes a pool of workers to help upload/download files. This value specifies the number of
// workers to use. Default is 32.
Parallelism int
Contributor:

Should we set the default as number of cpus in machine? Like by using gomaxprocs

Contributor Author:

I've turned this into two parameters: ParallelismFactor and ParallelismConstant. The total number of workers is computed as ParallelismFactor * numCores + ParallelismConstant.

We will want to have a lot more workers than cores, since most of the time the workers are blocked on IO tasks. This allows us to set a sane default that uses a good number of workers as the number of cores grows.

@ian-shim (Contributor) left a comment:

Looks good! Few comments

// that are consumed by utilities that are not aware of the multipart upload/download process.
//
// Implementations of this interface are required to be thread-safe.
type S3Client interface {
Contributor:

Why don't we implement new methods that support fragments in the existing s3 client (common/aws/s3)?

Contributor Author:

change made

if fragmentCount-1 == index {
postfix = "f"
}

Contributor:

should we also validate that index < fragmentCount ?

Contributor Author:

checks added
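A validated version of the key construction could look like the following sketch; the function name and key format are guesses — only the "f" postfix for the final fragment and the bounds check come from this thread:

```go
package main

import "fmt"

// fragmentKey builds a key for one fragment, marking the final fragment with
// an "f" postfix as in the snippet above, and rejecting out-of-range indices.
// The "prefix-index" key layout is illustrative, not the PR's actual scheme.
func fragmentKey(prefix string, index, fragmentCount int) (string, error) {
	if index < 0 || index >= fragmentCount {
		return "", fmt.Errorf("index %d out of range for %d fragments", index, fragmentCount)
	}
	postfix := ""
	if fragmentCount-1 == index {
		postfix = "f"
	}
	return fmt.Sprintf("%s-%d%s", prefix, index, postfix), nil
}

func main() {
	key, _ := fragmentKey("blob", 2, 3)
	fmt.Println(key) // blob-2f
}
```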

Comment on lines 25 to 28
// tasks are placed into this channel to be processed by workers.
tasks chan func()
// this wait group is completed when all workers have finished.
wg *sync.WaitGroup
Contributor:

I think what you're doing here can be simplified & abstracted out by using workerpool (ex. https://github.com/Layr-Labs/eigenda/blob/master/api/clients/retrieval_client.go#L181)

Contributor Author:

This is a neat little library, good to know about. Done.

@cody-littley mentioned this pull request Oct 29, 2024
@ian-shim (Contributor) left a comment:

lgtm!
Please resolve the test failures before merging

)

type ClientConfig struct {
Region string
AccessKey string
// The region to use when interacting with S3. Default is "us-east-2".
Contributor:

nit: start comment with the name of the field..

Contributor Author:

fixed

Comment on lines 34 to 41
// This framework utilizes a pool of workers to help upload/download files. A non-zero value for this parameter
// adds a number of workers equal to the number of cores times this value. Default is 8. In general, the number
// of workers here can be a lot larger than the number of cores because the workers will be blocked on I/O most
// of the time.
FragmentParallelismFactor int
// This framework utilizes a pool of workers to help upload/download files. A non-zero value for this parameter
// adds a constant number of workers. Default is 0.
FragmentParallelismConstant int
Contributor:

Why not just use a single field to explicitly define the number of workers?

Contributor Author:

This is based on @dmanc's request to scale the number of workers based on the number of cores on the machine. The pattern here allows for the user to specify either a fixed number of threads or a number that varies with the number of cores.

If you think this is overcomplicated, I'm willing to go back to having a constant number of worker threads.


for _, fragment := range fragments {
fragmentCapture := fragment
s.pool.Submit(func() {
Contributor:

This pool is shared across all ongoing uploads and downloads. Which means when there are many uploads/downloads in flight, it can build backpressure. Is that problematic?

Contributor Author:

My previous iteration had a config parameter that allowed the client to have a configurably sized work queue (i.e., when I was using my own implementation of workerpool). Unfortunately, the workerpool library does not allow us to override the channel used to send data to the workers.

From the implementation:

workerQueue: make(chan func())

By default, channels have a buffer size of 0. This means that if the number of read/write tasks exceeds the number of available workers, the caller will block until a worker accepts the task. This provides backpressure if more work is scheduled than there are workers to handle it.

Are you ok with the way this is configured by default? If not, we will probably need our own implementation of workerpool.

@cody-littley cody-littley merged commit aca3040 into Layr-Labs:master Oct 30, 2024
5 of 6 checks passed
@cody-littley cody-littley deleted the s3-relay-interface branch October 30, 2024 17:11
cody-littley added a commit that referenced this pull request Oct 30, 2024