
slower deserialization than in python #23

Open
ExpandingMan opened this issue Sep 30, 2016 · 10 comments

@ExpandingMan
Collaborator

Here are some results on a 366MB dataframe with mixed types...

INFO: Opening feather with python...
  3.100129 seconds (65.20 k allocations: 2.547 MB)
INFO: Opening feather with julia...
  5.888453 seconds (51.48 M allocations: 2.937 GB, 42.68% gc time)

In this case I am calling Python from Julia using PyCall.

I'm sure the Python API has received more work and attention than this one, simply because Python is the more popular language, so I imagine this gap will be an ongoing thing. Is it known why this is the case? I suspect the Python implementation is almost entirely C reached through Cython, which might be hard to compete with without doing something analogous.
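For reference, a minimal sketch of how such a comparison can be run; the file path is hypothetical, the feather-format Python package is assumed to be installed, and each call should be run once beforehand so JIT compilation doesn't pollute the Julia timing:

    using Feather, PyCall

    feather_py = pyimport("feather")   # Python feather-format package
    path = "data.feather"              # hypothetical file

    @time feather_py[:read_dataframe](path)  # Python reader, called via PyCall
    @time Feather.read(path)                 # Julia reader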

@dmbates
Contributor

dmbates commented Sep 30, 2016

You may be seeing some of the overhead of using DataStreams when reading the Feather file in Julia.

I have been encountering problems myself with large feather files (about 5 GB). I had an earlier version of the Feather package that memory-mapped the columns and returned a structure with the memory-mapped arrays, so it was very fast. I may resurrect that as a separate package, because it depends on Cxx and I don't want everyone using Feather to need to load Cxx.

At present I am unable to get compatible versions of DataFrames, CategoricalArrays, NominalArrays, and Feather installed, so I may need to resurrect the earlier version sooner rather than later.
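For illustration, a minimal sketch of that memory-mapping approach (not the original Cxx-based code; the column offset and length are hypothetical and would come from the file's metadata in a real reader):

    using Mmap

    # Map one Float64 column directly from the file instead of copying it,
    # so "reading" the column is nearly free.
    function mmap_column(path::AbstractString, offset::Integer, nrows::Integer)
        open(path, "r") do io
            Mmap.mmap(io, Vector{Float64}, nrows, offset)  # zero-copy view
        end
    end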

@ExpandingMan
Collaborator Author

ExpandingMan commented Sep 30, 2016

Yeah, as an aside, I've also been having some rather strange compatibility issues (at least I think that's what they are). Right now when I do Pkg.add it points to an old tag (v0.1.5, not v0.1.6), and that version has a bug which prevents writing (which seems to be known). When I try to use a clone of v0.1.6 I get some errors. I can give details if you'd like, but since Pkg was pointing to an older version I figured the problems were already known.
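For reference, these are roughly the commands involved (old Pkg API from that era; the version behavior is as described above):

    Pkg.add("Feather")         # installs the latest registered tag (v0.1.5 at the time)
    Pkg.checkout("Feather")    # switch to master to get the v0.1.6 code
    Pkg.installed("Feather")   # check which version is actually active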

@quinnj
Member

quinnj commented Sep 30, 2016

No, there is no overhead from DataStreams. My guess is that your file includes a String column that has no null values. Such a column is currently returned in a DataFrame as a Vector{String}, and materializing all of those Strings in Julia is a known performance issue (see a lot of the discussion here).

Maybe for Strings, we should always return a NullableVector{String} for now, due to the performance issues.

@nalimilan
Member

> Maybe for Strings, we should always return a NullableVector{String} for now, due to the performance issues.

How would that help performance? Wouldn't you need to create these string objects anyway? Or do you mean using WeakRefStrings?

@quinnj
Member

quinnj commented Sep 30, 2016

Yeah, with NullableArray we can safely use WeakRefStrings.
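A minimal sketch of the idea behind WeakRefStrings (not the package's actual API): a string "view" that records a pointer and length into an existing byte buffer, so deserialization allocates no per-string copies.

    struct WeakStr
        ptr::Ptr{UInt8}
        len::Int
    end

    # Materialize a real String only when one is actually needed.
    materialize(s::WeakStr) = unsafe_string(s.ptr, s.len)

    buf = Vector{UInt8}("helloworld")
    GC.@preserve buf begin                # the buffer must outlive the view
        s = WeakStr(pointer(buf), 5)      # view of the first 5 bytes
        @assert materialize(s) == "hello"
    end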

@ExpandingMan
Collaborator Author

You are correct; the dataframe did have String columns.

@ExpandingMan
Collaborator Author

The issue doesn't appear to be strings. Here is the result of serializing and deserializing a DataFrame with three columns of 2*10^6 Float64 values:

INFO: Serializing...
  3.252102 seconds (2.69 M allocations: 177.047 MB, 5.10% gc time)
INFO: Serializing with Python...
  0.104558 seconds (46.84 k allocations: 1.839 MB)
INFO: Deserializing...
  0.427634 seconds (362.43 k allocations: 61.366 MB, 3.12% gc time)
INFO: Deserializing with Python...
  0.098412 seconds (46.36 k allocations: 1.812 MB)

Note that Python was called from within Julia using PyCall, and the Python serialization and deserialization was done to and from pandas DataFrames.
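Here is a minimal sketch of that round-trip comparison (not the actual script; file names and the DataFrame construction are illustrative, and the feather-format Python package is assumed to be importable):

    using Feather, DataFrames, PyCall

    feather_py = pyimport("feather")
    pd = pyimport("pandas")

    n = 2 * 10^6
    df  = DataFrame(a = rand(n), b = rand(n), c = rand(n))
    pdf = pd[:DataFrame](Dict("a" => rand(n), "b" => rand(n), "c" => rand(n)))

    @time Feather.write("jl.feather", df)                  # serialize (Julia)
    @time feather_py[:write_dataframe](pdf, "py.feather")  # serialize (Python)
    @time Feather.read("jl.feather")                       # deserialize (Julia)
    @time feather_py[:read_dataframe]("py.feather")        # deserialize (Python)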

@quinnj
Member

quinnj commented Oct 4, 2016

Can you share the full script you're using for the benchmarks? I'm seeing results slower than Python, but not as slow as what you're reporting.

@ExpandingMan
Collaborator Author

Here is a gist showing what I used for testing.

@ExpandingMan
Collaborator Author

Pretty sure we are now quite close to Python, if not faster, in the best-case scenario (i.e. no strings). I haven't done a detailed set of benchmarks yet; if anyone wants to run some, it would certainly be welcome.
