slower deserialization than in python #23
You may be seeing some of the overhead of using DataStreams. I have been encountering problems myself with large feather files (about 5 GB). I had an earlier version of the package working; at present I am unable to get compatible versions of the packages involved.
Yeah, as an aside, I've also been having some rather strange compatibility issues (at least I think that's what it was). Right now when I do …
No, there is no overhead from DataStreams. My guess is that your file includes a String column that has no null values; this is currently returned in a DataFrame as a plain Vector{String}. Maybe for Strings we should always return a NullableVector{String} for now, due to the performance issues.
How would that help performance? Wouldn't you need to create those string objects anyway? Or do you mean using WeakRefStrings?
Yeah, with NullableArray, we can safely use WeakRefStrings.
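For readers unfamiliar with the idea: WeakRefStrings avoids allocating one heap-allocated String per element by keeping lightweight references into the raw data buffer. A rough Python analogy (purely illustrative; not the actual WeakRefStrings.jl implementation, and the buffer layout here is invented for the example):

```python
# Sketch of the weak-reference-string idea: instead of eagerly allocating
# one string object per element, keep (start, stop) views into the raw
# buffer and decode only on access.
buf = b"foo\x00bar\x00baz"           # hypothetical raw column data
offsets = [(0, 3), (4, 7), (8, 11)]  # (start, stop) for each element

# Eager: one allocation (and one UTF-8 decode) per element, up front.
eager = [buf[a:b].decode("utf-8") for a, b in offsets]

# Lazy: cheap views into the shared buffer; decode on demand.
views = [memoryview(buf)[a:b] for a, b in offsets]
second = bytes(views[1]).decode("utf-8")

print(eager)   # ['foo', 'bar', 'baz']
print(second)  # bar
```

The lazy variant only pays the per-element allocation cost for values that are actually touched, which is where the deserialization savings come from.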
You are correct, the dataframe did have String columns.
The issue doesn't appear to be strings. Here is the result of serializing and deserializing a …
Note that python was called from within Julia using PyCall.
Can you share the full script you're using for the perf testing? I'm seeing results slower than python, but not as slow as what you're seeing.
Here is a gist showing what I used for testing.
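The gist itself isn't reproduced here, but a minimal timing harness along these lines is usually enough for a fair comparison; taking the best of several runs filters out warm-up effects (JIT compilation on the Julia side, file-system caching, etc.). The `read_dataframe` call in the usage note is hypothetical, standing in for whatever deserialization call is being measured:

```python
import time

def bench(f, n=5):
    # Run f() n times and return the best wall-clock time in seconds.
    # The minimum is less noisy than the mean for short benchmarks.
    best = float("inf")
    for _ in range(n):
        t0 = time.perf_counter()
        f()
        best = min(best, time.perf_counter() - t0)
    return best

# Usage (hypothetical deserialization call):
#   bench(lambda: feather.read_dataframe("big.feather"))
```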
Pretty sure we are quite close to Python now, if not faster, in the best-case scenario (i.e. no strings). Haven't done a detailed set of benchmarks yet; if anyone wants to do some, it would certainly be welcome.
Here are some results on a 366 MB dataframe with mixed types...

In this case I am calling python from Julia using PyCall. I'm sure the python API has gotten more work and attention than this one, just because it is a more popular language, so I imagine this will be an ongoing thing. Is it known why this should be the case? I suppose it's possible that the Python implementation is almost entirely in C using Cython, so that might be hard to compete with without doing something analogous.