
slower deserialization than in python #23

Open
ExpandingMan opened this issue Sep 30, 2016 · 10 comments

@ExpandingMan
Collaborator

Here are some results on a 366MB dataframe with mixed types...

INFO: Opening feather with python...
  3.100129 seconds (65.20 k allocations: 2.547 MB)
INFO: Opening feather with julia...
  5.888453 seconds (51.48 M allocations: 2.937 GB, 42.68% gc time)

In this case I am calling Python from Julia using PyCall.

I'm sure the Python API has received more work and attention than this one, simply because Python is the more popular language, so I imagine this gap will be an ongoing thing. Is it known why this is the case? I suspect the Python implementation is almost entirely C reached through Cython, which might be hard to compete with without doing something analogous.
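For reference, a minimal sketch of how such a comparison can be run; the file path is hypothetical, the feather-format Python package is assumed to be installed, and each call should be run once beforehand so JIT compilation doesn't pollute the Julia timing:

    using Feather, PyCall

    feather_py = pyimport("feather")   # Python feather-format package
    path = "data.feather"              # hypothetical file

    @time feather_py[:read_dataframe](path)  # Python reader, called via PyCall
    @time Feather.read(path)                 # Julia reader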

@dmbates
Contributor

dmbates commented Sep 30, 2016

You may be seeing some of the overhead of using DataStreams when reading the Feather file in Julia.

I have been encountering problems myself with large feather files (about 5 GB). I had an earlier version of the Feather package that memory-mapped the columns and returned a structure with the memory-mapped arrays, so it was very fast. I may resurrect that as a separate package, because it depends on Cxx and I don't want everyone using Feather to need to load Cxx.

At present I am unable to get compatible versions of DataFrames, CategoricalArrays, NominalArrays, and Feather installed, so I may need to resurrect the earlier version sooner rather than later.
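For illustration, a minimal sketch of that memory-mapping approach (not the original Cxx-based code; the column offset and length are hypothetical and would come from the file's metadata in a real reader):

    using Mmap

    # Map one Float64 column directly from the file instead of copying it,
    # so "reading" the column is nearly free.
    function mmap_column(path::AbstractString, offset::Integer, nrows::Integer)
        open(path, "r") do io
            Mmap.mmap(io, Vector{Float64}, nrows, offset)  # zero-copy view
        end
    end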

@ExpandingMan
Collaborator Author

ExpandingMan commented Sep 30, 2016

Yeah, as an aside, I've also been having some rather strange compatibility issues (at least I think that's what they are). Right now when I do Pkg.add it points to an old tag (v0.1.5, not v0.1.6), and that version has a bug which prevents writing (which seems to be known). When I try to use a clone of v0.1.6 I get some errors. I can give details if you'd like, but since Pkg was pointing to an older version I figured the problems were already known.
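For reference, these are roughly the commands involved (old Pkg API from that era; the version behavior is as described above):

    Pkg.add("Feather")         # installs the latest registered tag (v0.1.5 at the time)
    Pkg.checkout("Feather")    # switch to master to get the v0.1.6 code
    Pkg.installed("Feather")   # check which version is actually active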

@quinnj
Member

quinnj commented Sep 30, 2016

No, there is no overhead from DataStreams. My guess is that your file includes a String column that has no null values. Such a column is currently returned in a DataFrame as a Vector{String}, and materializing all of those Strings in Julia is a known performance issue (see a lot of the discussion here).

Maybe for Strings, we should always return a NullableVector{String} for now, due to the performance issues.

@nalimilan
Member

> Maybe for Strings, we should always return a NullableVector{String} for now, due to the performance issues.

How would that help performance? Wouldn't you need to create these string objects anyway? Or do you mean using WeakRefStrings?

@quinnj
Member

quinnj commented Sep 30, 2016

Yeah, with NullableArray we can safely use WeakRefStrings.
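A minimal sketch of the idea behind WeakRefStrings (not the package's actual API): a string "view" that records a pointer and length into an existing byte buffer, so deserialization allocates no per-string copies.

    struct WeakStr
        ptr::Ptr{UInt8}
        len::Int
    end

    # Materialize a real String only when one is actually needed.
    materialize(s::WeakStr) = unsafe_string(s.ptr, s.len)

    buf = Vector{UInt8}("helloworld")
    GC.@preserve buf begin                # the buffer must outlive the view
        s = WeakStr(pointer(buf), 5)      # view of the first 5 bytes
        @assert materialize(s) == "hello"
    end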

@ExpandingMan
Collaborator Author

You are correct; the dataframe did have String columns.

@ExpandingMan
Collaborator Author

The issue doesn't appear to be strings. Here is the result of serializing and deserializing a DataFrame with three columns of 2*10^6 Float64 values:

INFO: Serializing...
  3.252102 seconds (2.69 M allocations: 177.047 MB, 5.10% gc time)
INFO: Serializing with Python...
  0.104558 seconds (46.84 k allocations: 1.839 MB)
INFO: Deserializing...
  0.427634 seconds (362.43 k allocations: 61.366 MB, 3.12% gc time)
INFO: Deserializing with Python...
  0.098412 seconds (46.36 k allocations: 1.812 MB)

Note that Python was called from within Julia using PyCall, and the Python serialization and deserialization was done to and from pandas DataFrames.
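Here is a minimal sketch of that round-trip comparison (not the actual script; file names and the DataFrame construction are illustrative, and the feather-format Python package is assumed to be importable):

    using Feather, DataFrames, PyCall

    feather_py = pyimport("feather")
    pd = pyimport("pandas")

    n = 2 * 10^6
    df  = DataFrame(a = rand(n), b = rand(n), c = rand(n))
    pdf = pd[:DataFrame](Dict("a" => rand(n), "b" => rand(n), "c" => rand(n)))

    @time Feather.write("jl.feather", df)                  # serialize (Julia)
    @time feather_py[:write_dataframe](pdf, "py.feather")  # serialize (Python)
    @time Feather.read("jl.feather")                       # deserialize (Julia)
    @time feather_py[:read_dataframe]("py.feather")        # deserialize (Python)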

@quinnj
Member

quinnj commented Oct 4, 2016

Can you share the full script you're using for the benchmarks? I'm seeing results slower than Python, but not as slow as what you're reporting.

@ExpandingMan
Collaborator Author

Here is a gist showing what I used for testing.

@ExpandingMan
Collaborator Author

Pretty sure we are now quite close to Python, if not faster, in the best-case scenario (i.e. no strings). I haven't done a detailed set of benchmarks yet; if anyone wants to run some, it would certainly be welcome.
