In-memory to_csv compression #22555
Comments
Interesting proposal! As a (temporary) workaround, could you not save to disk and then read into memory by any chance? BTW, if you have ideas on how to implement in-memory compression, go for it!
Hey - thanks for the reply @gfyoung , and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (can't be sure what the end-user's local environment will look like). My thought was something like this as a workaround:
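One way to compress in memory without touching disk can be sketched with only the standard library's gzip module. This is an illustration under stated assumptions, not necessarily the snippet that was posted here; the helper name `df_to_gzipped_csv_bytes` and the sample frame are hypothetical.

```python
import gzip

import pandas as pd


def df_to_gzipped_csv_bytes(df: pd.DataFrame, **to_csv_kwargs) -> bytes:
    """Return the CSV representation of ``df`` as gzip-compressed bytes."""
    # With no path argument, to_csv returns the CSV text as a str.
    csv_text = df.to_csv(**to_csv_kwargs)
    # Encode to bytes and compress entirely in memory.
    return gzip.compress(csv_text.encode("utf-8"))


# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
payload = df_to_gzipped_csv_bytes(df, index=False)
```

The resulting `payload` is a plain bytes object, so a library can hand it to whatever uploader the end user configures. The trade-off is that the uncompressed CSV string and the compressed bytes both sit in memory at the same time.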
@ZaxR : Gotcha. Interesting...I think it would be a solid enhancement nonetheless. Would be open to proposals for implementation.
I agree that the compression argument should take effect when a file-like object is passed. This enhancement would likely include implementing zip writing support in _get_handle, which would address the frustration I had in #22011 (comment).
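Independently of how _get_handle ends up handling it, writing a zip archive into a file-like object is possible with the standard library's zipfile module. The sketch below is only an illustration of that idea; the helper name `df_to_zipped_csv_bytes` and the archive member name `data.csv` are assumptions, not pandas API.

```python
import io
import zipfile

import pandas as pd


def df_to_zipped_csv_bytes(df: pd.DataFrame, member_name: str = "data.csv") -> bytes:
    """Return a zip archive (as bytes) holding ``df`` as a single CSV member."""
    buffer = io.BytesIO()
    # ZipFile accepts any seekable binary file object, including BytesIO.
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        # writestr encodes str data as UTF-8 before storing it.
        archive.writestr(member_name, df.to_csv(index=False))
    # The archive's central directory is written when the with-block exits.
    return buffer.getvalue()


# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
payload = df_to_zipped_csv_bytes(df)
```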
For the 'gzip' compression, If
The 'bz2' compression fix is the same.
'xz' will not compress
There is too much logic in _get_handle, and it is called many times for reading and for writing. One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.
In order to actually call
For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is b/c
This is the test code I was using:
from io import BytesIO

# df is the DataFrame from the issue's code sample
hello = BytesIO()
test = df.to_csv(hello, compression='gzip')
print(hello.getvalue())
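For comparison, the pandas documentation for to_csv states that compression is supported for binary file objects from version 1.2.0 onward. Assuming such a version is installed, the test above can be turned into a round-trip check along these lines (the sample frame is hypothetical):

```python
import gzip
from io import BytesIO

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

buffer = BytesIO()
# Assumes pandas >= 1.2, where compression applies to binary file objects.
df.to_csv(buffer, index=False, compression="gzip")
compressed = buffer.getvalue()

# Decompress and parse again to confirm the bytes really are gzipped CSV.
restored = pd.read_csv(BytesIO(gzip.decompress(compressed)))
```

On the pandas version reported in this issue (0.23.4) the compression argument is not applied to the buffer, which is the behavior being discussed in this thread.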
I agree. Especially since there is no docstring to define what the function intends to support.
I agree it may be helpful to split read/write and 2/3. However, with 2019 only a couple months away, the purging of 2 from pandas is just around the corner. It seems like any major changes should plan for a 3-only codebase (and hence benefit from the great simplification)?
@dhimmel : I agree that the implementation should be 3-oriented. If it's 2-oriented as well, great! If not, I would still write it out but just hold off on the PR until the turn of the calendar year.
@ZaxR : Thanks for the ping! We're still in the process of releasing versions that are Python-2-compatible, so we might want to hold off on this a little longer. That being said, proposals for a pure Python-3-compatible implementation would be great 👍
I was also looking for the compression functionality in order to produce base64-encoded links.
import base64
import pandas as pd

# Defined as a method inside a class in the original code
@staticmethod
def _to_base64_encoded_link(data: pd.DataFrame):
    csv = data.to_csv(index=False)
    # str -> bytes for base64 encoding, then bytes -> str for embedding in HTML
    b64 = base64.b64encode(csv.encode()).decode()
    link = f'<a href="data:file/csv;base64,{b64}" download="data.csv">Download</a>'
    return link
Currently my dataframes are too big, so I would like to compress them.
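Until to_csv can compress into memory directly, the CSV text can be gzipped with the standard library before it is base64-encoded. The following is a minimal sketch of that variation, not an established recipe; the function name `to_base64_gzip_link`, the `application/gzip` media type in the data URI, and the default download name are assumptions.

```python
import base64
import gzip

import pandas as pd


def to_base64_gzip_link(data: pd.DataFrame, filename: str = "data.csv.gz") -> str:
    """Build an HTML download link whose payload is a gzip-compressed CSV."""
    csv_bytes = data.to_csv(index=False).encode("utf-8")
    # Compress in memory before base64 encoding to shrink the link.
    compressed = gzip.compress(csv_bytes)
    b64 = base64.b64encode(compressed).decode("ascii")
    return (
        f'<a href="data:application/gzip;base64,{b64}" '
        f'download="{filename}">Download</a>'
    )


# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
link = to_base64_gzip_link(df)
```

Base64 adds roughly one third to the size of whatever it encodes, so compressing first only pays off when the CSV compresses well, which is typical for large, repetitive tables.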
* Moves column from metadata to body
  Max metadata size is 8kb and became a problem when the dataset has lots of columns.
* Removes redundant parameter
  encoding='utf-8' is the default value
* New metadata storage solution
  MinIO/S3 object metadata has lots of problems: max 8kb size, allows only Dict[str, str], ... We had a problem saving a file with lots of columns (~1000) and we couldn't store feature types as metadata since it exceeded the 8kb limit. As a solution, now the columns are the 1st row in the csv file and the metadata is in a separate file encoded in JSON format. The unit test was modified to a large dataset 1000 cols, 1mi rows
* Removes compression, since it does not work yet on pandas
  See: pandas-dev/pandas#22555
* Fix list_datasets issue
  Two files are created for each dataset.
* Decreases mock_dataset size
  Causing 502 bad gateway on play.min.io
* Improves code when reading saved metrics
I modified some code which seems to produce the bz2 for me and appears to avoid the intermediate on-disk csv file. Does this help you guys? I'm not sure how efficient the memory allocation is or how in-memory it all is, so if there are optimizations, please advise! I'm a py newbie (first day), and this stuff seemed to be poorly documented, so thanks for making this thread and apologies for my ignorance.
#This code takes a pandas df and makes clickable link in your ipynb UI to download a bz2 compressed file
#Housekeeping - BEGIN
#Create test pandas dataframe from example in 22555, and add D col and data
#Note: this requires a pandas df as input
#Call the function with your pandas df
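On the question of memory use, one option is to let to_csv write through a bz2 text handle that wraps an in-memory buffer, so the full uncompressed CSV string is never built. This is a sketch under that assumption and not the code referred to above; the helper name `df_to_bz2_csv_bytes` and the sample frame are hypothetical.

```python
import bz2
import io

import pandas as pd


def df_to_bz2_csv_bytes(df: pd.DataFrame) -> bytes:
    """Stream ``df`` as CSV through a bz2 compressor into memory."""
    buffer = io.BytesIO()
    # bz2.open accepts an existing binary file object; mode "wt" adds a
    # text layer on top so to_csv can write str data to it.
    with bz2.open(buffer, mode="wt", encoding="utf-8", newline="") as text_handle:
        df.to_csv(text_handle, index=False)
    # Closing the bz2 handle finishes the stream but leaves the BytesIO open.
    return buffer.getvalue()


# Hypothetical sample data for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
payload = df_to_bz2_csv_bytes(df)
```

The returned bytes can then be base64-encoded into a download link in the same way as an uncompressed CSV.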
Code Sample, a copy-pastable example if possible
Problem description
I am trying to gzip compress a dataframe in memory (as opposed to directly to a named file location). The use case for this is (I imagine) similar to the reason why to_csv now allows omitting the path in other cases to create an in-memory representation. In my case specifically, I need to save the compressed df to a cloud location using a custom URI, and I'm temporarily keeping it in memory for that purpose.
Expected Output
I would expect the compression option to result in a compressed, bytes object (similar to the gzip library).
Thank you in advance for your help!
Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle), and thought it might have also been a fix, but looks like it just stopped a little short of fixing my issue as well.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None