Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File-like API is inconsistent with built-in file-like objects #673

Closed
adriangb opened this issue Dec 18, 2021 · 9 comments
Closed

File-like API is inconsistent with built-in file-like objects #673

adriangb opened this issue Dec 18, 2021 · 9 comments
Assignees
Labels
api: storage Issues related to the googleapis/python-storage API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@adriangb
Copy link

Python's file-like object protocol is not very well documented, and there are a lot of edge cases / undefined behavior that lead to arbitrary outcomes. It seems like in at least some of these scenarios, the implementation of file-like objects in this library does not conform with what file-like objects from the standard library do. Here is an example:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(...)

with open("test.txt", mode="rb") as f:
    f.seek(1, 1)
    assert f.tell() == 1

with bucket.blob("test.txt").open(mode="rb") as f:
    f.seek(1, 1)
    assert f.tell() == 0

It would be nice to get this inconsistencies aligned where possible. For S3 at least, it is possible to write a wrapper that uses range headers and such to get a 100% compatible API.

For a case like this, I think state machine testing works nicely.
Here's an implementation outline using hypothesis:

import functools
import io
from io import BytesIO, RawIOBase

from hypothesis import strategies
from hypothesis.stateful import (
    RuleBasedStateMachine,
    initialize,
    invariant,
    rule,
    run_state_machine_as_test,
)

from google.cloud import storage


class FileStateMachine(RuleBasedStateMachine):
    tested: RawIOBase
    true: RawIOBase

    def __init__(self, bucket: storage.Bucket) -> None:
        super().__init__()
        self.bucket = bucket

    @initialize(data=strategies.binary(max_size=2))
    def write_data(self, data: bytes) -> None:
        blob = self.bucket.blob("testkey")
        blob.upload_from_file(BytesIO(data))
        self.tested = blob.open("rb")
        self.true = BytesIO(data)

    @rule(size=strategies.integers(min_value=-1, max_value=3))
    def read(self, size: int):
        true = self.true.read(size)
        tested = self.tested.read(size)
        assert tested == true

    @rule(
        offset=strategies.integers(min_value=0, max_value=3),
        whence=strategies.sampled_from((io.SEEK_CUR, io.SEEK_END, io.SEEK_SET)),
    )
    def seek(self, offset: int, whence: int):
        true = self.true.seek(offset, whence)
        tested = self.tested.seek(offset, whence)
        assert tested == true

    @invariant()
    def tell(self):
        true = self.true.tell()
        tested = self.tested.tell()
        assert tested == true

run_state_machine_as_test(functools.partial(FileStateMachine, storage.Client().bucket("test-bucket")))
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label Dec 18, 2021
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Dec 19, 2021
@BrennaEpp BrennaEpp added priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Dec 20, 2021
@andrewsg
Copy link
Contributor

I can't reproduce the issue you describe with my own bucket.

>>> f = bu.blob("test.txt").open(mode="rb")
>>> f.seek(1, 1)
1
>>> f.tell()
1

Is test.txt of nonzero length?

@adriangb
Copy link
Author

adriangb commented Feb 15, 2022

Here's a full self contained example:

import io

from google.cloud import storage


bucket = "foobarbaz"

client = storage.Client()
bucket = client.bucket(bucket_name=bucket)

data = b""

# create file
with open("test.txt", mode="wb") as f:
    f.write(data)
bucket.blob("test.txt").upload_from_file(open("test.txt", mode="rb"))


# compare
with open("test.txt", mode="rb") as f:
    f.seek(1, io.SEEK_CUR)
    res = f.tell()
    assert f.tell() == 1, res

with bucket.blob("test.txt").open(mode="rb") as f:
    f.seek(1, io.SEEK_CUR)
    res = f.tell()
    assert f.tell() == 0, res  # passes but shouldn't
    assert f.tell() == 1, res  # fails

And indeed this particular discrepancy is only reproducible with an empty file.
But I remember there being several more edge cases when I ran the tests in #673 (comment), and this is pretty undocumented behavior for Python, hence why I suggested that style of testing instead of attempting to exhaustively enumerate the edge cases.

@adriangb
Copy link
Author

@andrewsg does that example reproduce for you?

@andrewsg
Copy link
Contributor

andrewsg commented Mar 2, 2022

Hi @adriangb, was on vacation previously, taking a look this week. Thanks!

@adriangb
Copy link
Author

adriangb commented Mar 2, 2022

No problem, but also just FYI this probably should be low priority, it's not impacting our work directly and I doubt anyone else is impacted by it in a way they can't easily work around.

@andrewsg
Copy link
Contributor

Got it. I believe resolving #462 should take care of a lot of the problems you encountered with the state machine test, though I'm not yet sure how it will impact the zero-length file issue. I think adding the state machine test or something like it is a good idea and will revisit it once that other bug is closed. Thank you!

@andrewsg andrewsg closed this as completed Apr 4, 2023
@funera1
Copy link

funera1 commented Sep 11, 2024

I have found a new inconsistent behavior between this library and the standard library. Here is an example:

import io

from google.cloud import storage


bucket = "foobarbaz"

client = storage.Client()
bucket = client.bucket(bucket_name=bucket)

data = b"01234"

# create file
with open("test.txt", mode="wb") as f:
    f.write(data)
bucket.blob("test.txt").upload_from_file(open("test.txt", mode="rb"))


# compare
with open("test.txt", mode="rb") as f:
    # access 
    try:
        f.seek(-1, io.SEEK_SET)
        raise OSError
    except OSError as err:
        assert err.errno == 22
        assert f.tell() == 0
        
    assert f.seek(10, io.SEEK_SET) == 10

with bucket.blob("test.txt").open(mode="rb") as f:
    try:
        f.seek(-1, io.SEEK_SET) # expect OSError, but pass
        raise OSError
    except OSError as err:
        assert err.errno == 22
        assert f.tell() == 0

    assert f.seek(10, io.SEEK_SET) == 10 # expect 10, but return 5

I too think that the behavior of objects should be consistent as much as possible.
What are your thoughts on this issue?
If this issue should be solved, I will create a PR that addresses this issue.

@andrewsg
Copy link
Contributor

@funera1 As OSErrors are raised by the OS, and there is normally no need for a Python program to intentionally raise an OSError, I think I will decline to change this behavior. If it were a ValueError, then there would be an issue with Python code behaving differently from other Python code. But an OSError simply reflects the reality that one type of file-like object is handled by the OS and one is handled in Python.

@funera1
Copy link

funera1 commented Oct 15, 2024

I understand. Thanks for the reply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the googleapis/python-storage API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

5 participants