Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved threading capabilities of S3+parquet #5451

Merged
merged 11 commits into from
May 8, 2024

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented May 2, 2024

@devinrsmith pointed out an issue with very large number of threads being spawned when reading partitioned parquet data from S3. This was happening because the codebase was creating a new instance of S3AsyncClient for each partition file discovered and each instance internally by default creates large number of threads.

As part of this PR,

  • We will share S3AsyncClient for all partition files for the same table.
  • We will share the underlying threads across S3AsyncClient instances instead of starting new threads for each instance.
  • Deleted a number of internal public methods from Parquet related classes which accepted Java File objects in favor of those accepting URIs

Documentation Update:
Added two new config parameters:

  • S3.numFutureCompletionThreads: The number of threads used to complete the futures returned by the async aws s3 client. By default, this is set as the number of processors on the system.
  • S3.numScheduledExecutorThreads: The number of threads used for scheduling tasks such as async retry attempts and timeout task with the aws s3 client. By default, this is set as 5.

@malhotrashivam malhotrashivam added parquet Related to the Parquet integration DocumentationNeeded ReleaseNotesNeeded Release notes are needed s3 labels May 2, 2024
@malhotrashivam malhotrashivam added this to the 3. May 2024 milestone May 2, 2024
@malhotrashivam malhotrashivam self-assigned this May 2, 2024
@malhotrashivam malhotrashivam added NoReleaseNotesNeeded No release notes are needed. and removed ReleaseNotesNeeded Release notes are needed labels May 7, 2024
@malhotrashivam malhotrashivam merged commit 2dbbf32 into deephaven:main May 8, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 8, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#209

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
DocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants