Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates readline logic for azure to match s3 #826

Merged
merged 3 commits into from
Sep 17, 2024

Conversation

quantumfusion
Copy link
Contributor

Title

Updates readline buffer management for azure, improving performance.

Motivation

Observed slow read times when iterating using readlines() from a 100MB csv file in Azure Blob Storage. Simple copies and reads were fine, slowness only observed with some code that was explicitly calling readlines().

Borrows from the logic found in the S3 implementation, which was updated to be more efficient and does not perform as many scans on the local chunk of the remote file.

Tests

Confirmed clean run with pytest -k test_azure

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.
Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.

Loosely copies the readline buffer management from s3 to azure,
improving performance.
Copy link
Contributor

@ddelange ddelange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ddelange
Copy link
Contributor

@mpenkov can you review and merge?

@ddelange
Copy link
Contributor

@quantumfusion can you add a test_readlines under this one based on it?

def test_read(self):

Did not need to add test, already exists further down in the file.

This reverts commit 35a8c5e.
@quantumfusion
Copy link
Contributor Author

quantumfusion commented Aug 24, 2024

@quantumfusion can you add a test_readlines under this one based on it?

def test_read(self):

@ddelange I added it briefly, but missed doing the linting which caught that test_azure.py already has a unittest for test_readline:
https://github.com/piskvorky/smart_open/blob/develop/smart_open/tests/test_azure.py#L480

@ddelange
Copy link
Contributor

ah my bad! @mpenkov can you run CI?

@ddelange
Copy link
Contributor

@mpenkov can you include this one in the next release?

@ddelange
Copy link
Contributor

@mpenkov kind reminder:)

@mpenkov mpenkov merged commit 3c610c3 into piskvorky:develop Sep 17, 2024
26 checks passed
@mpenkov
Copy link
Collaborator

mpenkov commented Sep 17, 2024

Thank you @quantumfusion and @ddelange .

Apologies for the delay, been a bit busy with real life recently :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants