GZip Google Storages Stores With .csv.gz Ending #1454

Closed
thuibr opened this issue Sep 27, 2024 · 3 comments

thuibr commented Sep 27, 2024

Hello,

When the GCS backend GZIP-compresses a text/csv file, it saves it with Type text/csv and a .csv ending. As a result, the OS treats the downloaded file as a plain CSV when it is actually a gzipped CSV.

Current behavior:
Type text/csv
.csv ending

Desired behavior:
Type application/octet-stream
.csv.gz ending

How do I go about fixing this? I am willing to make a contribution if you'd like. I just need some guidance.

Thank you!

@thuibr thuibr changed the title GZip Google Storages Stores With .csv.gz Ending GZip Google Storages Stores With .csv Ending Sep 27, 2024
@thuibr thuibr changed the title GZip Google Storages Stores With .csv Ending GZip Google Storages Stores With .csv.gz Ending Sep 28, 2024

thuibr commented Sep 28, 2024

I think if I added force_gzip to object_parameters and then at https://github.com/jschneier/django-storages/blob/master/storages/backends/gcloud.py#L203-L205 I did:

            self.gzip
            and upload_params[CONTENT_TYPE] in self.gzip_content_types
            and (CONTENT_ENCODING not in blob_params or blob_params.get("force_gzip"))

I think that would do it.


thuibr commented Sep 28, 2024

I don't know if what I posted above is the exact correct solution. The root cause is https://github.com/jschneier/django-storages/blob/master/storages/backends/gcloud.py#L41 where

>>> import mimetypes
>>> mimetypes.guess_type("c.csv")
('text/csv', None)
>>> mimetypes.guess_type("c.csv.gz")
('text/csv', 'gzip')

even though c.csv.gz might not actually be gzipped yet. It's just the desired name on GCS.

That 'gzip' gets used as the content_encoding and prevents compression at https://github.com/jschneier/django-storages/blob/master/storages/backends/gcloud.py#L207.

The problem is that the desired name for my file on GCS is c.csv.gz, but I still want GoogleCloudStorage to do the heavy lifting of actually compressing the file. The content_encoding, though, is guessed as gzip from the name alone.

One more potential solution is to not guess based on the name but instead check the file's actual content_encoding by looking at its first few bytes. I think that python-magic does this.
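
For the gzip case specifically, the byte check can be sketched with the standard library alone, since gzip streams start with the two magic bytes 0x1f 0x8b. This is only an illustration of the idea: `looks_gzipped` is a hypothetical helper, not part of django-storages or python-magic, and it assumes a seekable binary file object.

```python
import gzip
import io

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(fileobj):
    """Return True if the stream starts with the gzip magic bytes.

    Hypothetical helper; assumes `fileobj` is a seekable binary stream.
    """
    pos = fileobj.tell()
    head = fileobj.read(2)
    fileobj.seek(pos)  # restore the position so the caller can still read the file
    return head == GZIP_MAGIC


plain = io.BytesIO(b"a,b\n1,2\n")
packed = io.BytesIO(gzip.compress(b"a,b\n1,2\n"))
print(looks_gzipped(plain))   # False
print(looks_gzipped(packed))  # True
```

A check like this looks at the content rather than the name, so a file called c.csv.gz that has not been compressed yet would be reported as not gzipped.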

Another potential solution is:

            self.gzip
            and upload_params[CONTENT_TYPE] in self.gzip_content_types
            and (CONTENT_ENCODING not in blob_params or blob_params[CONTENT_ENCODING] is None)
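
To see how that condition would behave, here is a standalone sketch of it as a function. `should_compress` is a hypothetical name, the value of `CONTENT_ENCODING` is assumed, and none of this is the backend's actual code; it only mirrors the three clauses above.

```python
# Assumed key name; the real constant lives in the backend module.
CONTENT_ENCODING = "content_encoding"


def should_compress(gzip_enabled, content_type, gzip_content_types, blob_params):
    """Hypothetical stand-in for the condition quoted above."""
    return bool(
        gzip_enabled
        and content_type in gzip_content_types
        # Same as: key missing, or key present with value None.
        and blob_params.get(CONTENT_ENCODING) is None
    )


types = ("text/csv",)
print(should_compress(True, "text/csv", types, {}))                          # True
print(should_compress(True, "text/csv", types, {CONTENT_ENCODING: None}))    # True
print(should_compress(True, "text/csv", types, {CONTENT_ENCODING: "gzip"}))  # False
```

The only difference from a plain `not in` check is the middle case: an encoding that is present but None no longer blocks compression.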


thuibr commented Sep 30, 2024

Wow, I was totally off base. The file was being stored gzipped in GCS all along. Per this SO thread, though, GCS automatically decompresses the object on download: https://stackoverflow.com/questions/67744979/how-to-prevent-gcs-from-automatically-decompressing-objects-when-using-python-sd.
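
One way to confirm this from Python is to request the stored bytes without expansion and look for the gzip magic bytes. The sketch below assumes `blob` is a `google.cloud.storage.Blob` (or any object with a compatible `download_as_bytes` method) and relies on its `raw_download=True` option; the helper names are hypothetical.

```python
# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def download_stored_bytes(blob):
    """Fetch the object as stored, asking the client not to expand it.

    Assumes `blob` behaves like google.cloud.storage.Blob.
    """
    return blob.download_as_bytes(raw_download=True)


def is_stored_gzipped(blob):
    """Return True if the stored object begins with the gzip magic bytes."""
    return download_stored_bytes(blob)[:2] == GZIP_MAGIC
```

With a real client this might be called as `is_stored_gzipped(bucket.blob("c.csv"))`, where the bucket and object name are placeholders.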
