Use Polars to read and write rather than Pandas #56
base: main
Polars requires a single-character delimiter, so first replace extended whitespace with a single whitespace character
…Pandas DataFrames and just converts to Polars
…slight efficiency improvements for that
What an awesome PR to wake up to! Will take a proper look once I'm up and running.
Hey @d-j-hatton, are you still interested in this? I missed this, but think it's a great idea. I've always been interested in polars, and from looking at the documentation maybe there is even a way to make the quotation code a bit saner. I've thought a bit about whether we should be cautious about adding new dependencies, since this is a helper package used in quite a few other projects and a new dependency always increases the risk of collisions. Is there an easy way to make this an optional dependency?
Yes, it will be possible to have it as an optional dependency by adding an optional dependency group so you can
to check for the presence of the package in the code and default to pandas if not present. I can make some changes for that and fix the conflicts that have cropped up.
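A minimal sketch of the presence check described in the comment above, assuming the check is done with the standard library's `importlib.util.find_spec`. The function name `pick_backend` and the `BACKEND` constant are hypothetical and are not taken from this PR.

```python
import importlib.util


def pick_backend(preferred: str = "polars", fallback: str = "pandas") -> str:
    """Return `preferred` if that package is importable, otherwise `fallback`.

    Hypothetical helper: the names here are illustrative, not the PR's code.
    """
    # find_spec locates a top-level package without importing it,
    # and returns None when the package is not installed.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# Decide once at import time which dataframe library to use.
BACKEND = pick_backend()
print(BACKEND)
```

With this shape the rest of the code can branch on `BACKEND` and only import polars when it is actually present.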
Thank you, that sounds great! Also happy to help, but won't touch anything unless you ask me to.
Using polars CSV read and write functionality rather than the equivalents in pandas can lead to significant speed ups. There is the added benefit that everything can be kept in polars for additional speed ups in dataframe manipulation later. To maintain the current behaviour, polars dataframes are converted to pandas before returning from starfile.read by default. The additional keyword argument polars=True can be specified to return a polars.DataFrame. starfile.write will accept data blocks that are either pandas or polars dataframes.

As polars will only accept a single-character separator, arbitrary whitespace has to be parsed when the star file is read. Some modifications have therefore been made to the line-by-line parsing for efficiency.

Attached are some very rough read and write benchmarks from an M1 MacBook Pro on particle star files of different sizes. For read, the bars are split into the time taken to perform read_csv and the rest. "Polars to pandas" and "Pure polars" refer to code modified as in this PR; "Pure pandas" is the existing implementation.

Notes:
polars as a dependency
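The whitespace handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `normalize_line` and `parse_rows` are hypothetical names, and the standard library `csv.reader` stands in for a reader that only accepts a single-character separator.

```python
import csv
import re

# One or more spaces or tabs in a row.
_WHITESPACE_RUN = re.compile(r"[ \t]+")


def normalize_line(line: str) -> str:
    """Strip the line and collapse each whitespace run to one space."""
    return _WHITESPACE_RUN.sub(" ", line.strip())


def parse_rows(lines):
    """Normalize non-empty lines, then split them on a single space."""
    normalized = [normalize_line(line) for line in lines if line.strip()]
    # csv.reader is a stand-in for any single-character-separator parser.
    return list(csv.reader(normalized, delimiter=" "))


rows = parse_rows(["  1.0   2.5\tabc  ", "3.0 4.5     def"])
print(rows)  # [['1.0', '2.5', 'abc'], ['3.0', '4.5', 'def']]
```

Note that this naive collapse would also change whitespace inside quoted values, so a real parser would need to treat quoted fields separately.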