Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_image - decode b64encode and encodebytes strings #30192

Merged
merged 2 commits into from
Apr 26, 2024

Conversation

amyeroberts
Copy link
Collaborator

What does this PR do?

Updates base64 decoding login to decode base64 string encoded with either base64.b64encode or base64.encodebytes

Fixes #30114

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@amyeroberts amyeroberts requested a review from ydshieh April 11, 2024 16:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -320,7 +320,7 @@ def load_image(image: Union[str, "PIL.Image.Image"], timeout: Optional[float] =

# Try to load as base64
try:
b64 = base64.b64decode(image, validate=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain a bit why b64decode here can't handle the data encoded by encodebytes?

@@ -320,7 +320,7 @@ def load_image(image: Union[str, "PIL.Image.Image"], timeout: Optional[float] =

# Try to load as base64
try:
b64 = base64.b64decode(image, validate=True)
b64 = base64.decodebytes(image.encode() if isinstance(image, str) else image)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the fix is to make it handle string?

Copy link
Collaborator

@ydshieh ydshieh Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But image is always str inside this if branch if isinstance(image, str):, so I feel something is strange here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sorry, this is a hangover from when I was testing with my own scripts. I've changed it so it's just image.encode()

Copy link
Collaborator

@ydshieh ydshieh Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see we have

image.encode() if isinstance(image, str) else image

??

Other than this, LGTM.

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 17, 2024

Hi @amyeroberts Let me know your thoughts on my comments whenever you get some bandwidth

@amyeroberts
Copy link
Collaborator Author

amyeroberts commented Apr 23, 2024

@ydshieh OK, so the solution I put in here was a hacky one I found worked without digging too much into it - sorry for the rushed solution.

So, base64.decodebytes can decode the bytes string from both base64.b64encode and base64.encodebytes, and base64.decode can decode the string of base64.b64encode and base64.encodebytes. However, this will fail if we have validate=True:

import base64
from io import BytesIO
from PIL import Image

import torch

buffered = BytesIO()
im = torch.rand((256, 256,3))
image = Image.fromarray(im.numpy().astype('uint8'), 'RGB')
image.save(buffered, format="JPEG")

base64.b64decode(base64.encodebytes(buffered.getvalue()), validate=True)

This is because encodebytes inserts newline characters into the encoded bytes string after every 76 bytes if output and ensures there's a trailing new line. This fails validation with Error: Non-base64 digit found

The solution to enable these images would be to remove validate=True from the current b64decode call. I'm not sure we should do this, as the bytes string from base64.encodebytes(s) can contain arbitrary binary data. WDYT?

@ydshieh ydshieh self-assigned this Apr 23, 2024
@ydshieh
Copy link
Collaborator

ydshieh commented Apr 26, 2024

The solution to enable these images would be to remove validate=True from the current b64decode call. I'm not sure we should do this, as the bytes string from base64.encodebytes(s) can contain arbitrary binary data. WDYT?

This is a good question. I tried to feed some bytes string without. Without adding extra arguments to load_image, not sure if we can keep an option about the check. Do think think it is worth a new argument?

@ydshieh
Copy link
Collaborator

ydshieh commented Apr 26, 2024

I tried to decode some arbitrary bytes and decodebytes could still fail. It will succeed only if the bytes is a base64 encoded data.
So I think we can safely use decodebytes as you do in this PR.

import base64


s = b'data to be encoded'
print(type(s))

data = base64.decodebytes(s)
print(data)

@amyeroberts
Copy link
Collaborator Author

So I think we can safely use decodebytes as you do in this PR.

Awesome. Thanks for looking into this! I'll tidy up the code and let you know when it's ready for re-review.

@amyeroberts amyeroberts force-pushed the decode-all-b64-input-strs branch from abe8906 to e21dd17 Compare April 26, 2024 15:58
@amyeroberts
Copy link
Collaborator Author

@ydshieh Ready for re-review 🤗

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix, the iterations and the explanations!

@amyeroberts amyeroberts merged commit c793b26 into huggingface:main Apr 26, 2024
21 checks passed
@amyeroberts amyeroberts deleted the decode-all-b64-input-strs branch April 26, 2024 17:21
itazap pushed a commit that referenced this pull request May 14, 2024
* Decode b64encode and encodebytes strings

* Remove conditional encode -- image is always a string
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Image classification pipeline fails with base64.encodebytes() input
3 participants