Documentation Idea: File Output Committer Algorithms #2

Open
kbendick opened this issue Nov 23, 2021 · 3 comments

@kbendick
Contributor

One idea for documentation (or other support) on making jobs faster: discuss the various file output committers.

Particularly on object storage, and as of Spark 3, using the default file output committer with s3a results in a double data write and sad times: the committer finalizes output by renaming it, and S3 has no real rename, so the data gets copied.

Even registering s3a and friends via the right configs is something that has come up for a number of users I've helped, but that might be a little out of scope for the project's users.
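As a rough illustration of the kind of "right configs" meant here, a minimal sketch that collects the settings people usually end up needing. The helper name `s3a_registration_conf` is hypothetical, the `hadoop-aws` version has to match the Hadoop build your Spark ships with, and `fs.s3a.impl` is often unnecessary on recent Hadoop releases, so treat this as a starting point rather than a recipe.

```python
# Hypothetical helper (not part of Spark): gathers the settings commonly
# used to make s3a:// paths resolvable from a Spark job.


def s3a_registration_conf(hadoop_aws_version, credentials_provider=None):
    """Return Spark conf entries for pulling in hadoop-aws and wiring up s3a.

    hadoop_aws_version: must match the Hadoop version bundled with Spark.
    credentials_provider: optional fully qualified class name of an AWS
        credentials provider; left unset, S3A uses its default chain.
    """
    conf = {
        # Fetch the S3A connector (and its AWS SDK dependency) at submit time.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # Explicit filesystem binding; frequently redundant on newer Hadoop.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    if credentials_provider is not None:
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = credentials_provider
    return conf
```

Each key/value pair can be passed as a `--conf key=value` argument to `spark-submit` or set on the session builder.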

I could try to contribute on the subject, or otherwise I'm just happy to throw it out there. Admittedly, 9 times out of 10 on S3, if you just stick to the new cloud committers you're good. 😛
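For reference, a minimal sketch of what sticking to the cloud committers can look like as Spark settings. The keys are the ones described in Spark's cloud-integration documentation as far as I know, and they assume the `spark-hadoop-cloud` module and a matching `hadoop-aws` are on the classpath. The helper names `s3a_committer_conf` and `as_submit_args` are made up for illustration; check the values against your own Spark and Hadoop versions.

```python
# Hypothetical helpers: build Spark settings that swap the default
# rename-based FileOutputCommitter for one of the S3A committers.
# Assumes spark-hadoop-cloud and hadoop-aws are on the classpath.

VALID_COMMITTERS = ("directory", "partitioned", "magic")


def s3a_committer_conf(committer="directory"):
    """Return a dict of Spark conf entries for the chosen S3A committer."""
    if committer not in VALID_COMMITTERS:
        raise ValueError(f"unknown S3A committer: {committer!r}")
    conf = {
        # Which S3A committer to use for s3a:// destinations.
        "spark.hadoop.fs.s3a.committer.name": committer,
        # Route committer creation for the s3a scheme to the S3A factory.
        "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
        # Make Spark SQL writes go through the path-based commit protocol.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        # Parquet needs a binding committer so it picks up the same factory.
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }
    if committer == "magic":
        conf["spark.hadoop.fs.s3a.committer.magic.enabled"] = "true"
    return conf


def as_submit_args(conf):
    """Render the settings as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

In PySpark the same dict can be applied by calling `SparkSession.builder.config(key, value)` for each entry before `getOrCreate()`.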

@kbendick
Contributor Author

kbendick commented Nov 23, 2021

In the context of the flow chart as it is, I'm not sure where this would fit.

Maybe a write-phase slowdown guidance box? This is a performance "regression" I've seen when helping people migrate from HDFS to S3, but I'm not sure it fits fully into the flow of the docs.

@holdenk
Owner

holdenk commented Nov 23, 2021

I think adding a node for slow writes, with a sub-node for S3, would be a great way to expose this. If you want to write the docs, I'm happy to integrate them into the flowchart (I know the flowchart syntax is a little funky).

@atbida
Collaborator

atbida commented Jun 7, 2022

@kbendick I started a PR for this issue; happy to add more details if you want to flesh out your idea.
#9
