serialization / deserialization performance #7893
You don't want to use serialize for this. Try the HDF5 package. |
Yes, this is bad, but operating on something in parallel by copying the entire dataset around is not going to work anyway. Even if the serializer works at the speed of the network, you're not likely to recoup the cost. |
Looks like you can't save a DataFrame directly to HDF5; see JuliaData/DataFrames.jl#64. Seriously, guys, what is the point of having a serialization format that is hundreds of times slower than CSV? That's the real issue here. |
I'm pretty sure that is false. I have saved DataFrames to HDF5 plenty of times. Please have a look at the HDF5 package. The serializer is also not intended as a storage format and it is not stable between versions of julia. |
The point of it is to support arbitrary julia objects. When we find something that's very slow, we investigate and try to speed it up. Assuming that somebody maliciously intends for it to be slow is totally inappropriate and makes no sense. |
Yes, HDF5 (really JLD) even includes a test for DataFrames. (It only runs if DataFrames is installed; I just added a test/REQUIRE file to make sure Iain runs it.) That said, I certainly support improving serializer performance. |
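For reference, a minimal sketch of what saving and reloading a DataFrame through JLD looked like (assuming the JLD API of that era; file and variable names are illustrative):

using HDF5, JLD, DataFrames
df = DataFrame(id = 1:3, value = ["a", "b", "c"])
jldopen("df.jld", "w") do file
    write(file, "df", df)    # JLD records the full Julia type, so read returns a DataFrame
end
df2 = jldopen("df.jld", "r") do file
    read(file, "df")
end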
Also, it would help a lot to know what is in this DataFrame. Could we have the types of the columns? |
@jakebolewski, have you tried https://github.com/tanmaykm/ProtoBuf.jl? |
It was not my intent to be accusatory. It's just rather frustrating that, having tried a few different ways to read datasets into Julia, the fastest way seems to be reading the whole thing in as a CSV file. (@jakebolewski just tried the HDF5 option; while it works, the speed is comparable to the serialization mode. Thanks for the correction @Keno and @timholy.) @JeffBezanson the performance we're seeing doesn't seem to vary too much across the various tables in this dataset - they're all |
@quinnj I have not tried protobufs. I've only skimmed the package; is there a way to easily generate a .proto file from a DataFrame object? Some potential problems are handling missing data and re-encoding all string data to UTF-8; ODBC returns UTF-16-encoded strings by default. |
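That re-encoding step would look roughly like this (a sketch; utf8 is the Base conversion function of that era, and col16 is a hypothetical column of UTF16String values):

col8 = map(utf8, col16)   # transcode each UTF16String to a UTF8String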
@JeffBezanson the serialized dataframe files are in the folder |
Thanks, I will take a look. |
Ok, I have a strong lead here: reading ... So the entire problem appears to be strings. |
Makes sense that it is O(N^2).

julia> @time writetable("labevents.csv", labevents)
elapsed time: 382.450579558 seconds (5702565828 bytes allocated, 92.74% gc time)

shell> head labevents.csv
"subject_id","hadm_id","icustay_id","itemid","charttime","value","valuenum","flag","valueuom"
3,NA,NA,50112,"2682-08-22T16:24:00","121",121.0,"abnormal","mg/dL"
3,NA,NA,50428,"2682-08-22T16:24:00","261",261.0,"NA","K/uL"
3,NA,NA,50443,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50415,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50409,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50431,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50326,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50396,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50333,"2682-08-22T16:24:00","0.5",0.5,"NA","%"

julia> @time readtable("labevents.csv")
elapsed time: 46.019874084 seconds (5823425276 bytes allocated, 40.02% gc time)

Timing the same IO code in python using Pandas:

In [17]: %time df = pandas.io.parsers.read_csv("labevents.csv")
CPU times: user 5.88 s, sys: 430 ms, total: 6.31 s
Wall time: 6.33 s

In [18]: %time df.to_pickle("labevents.pkl")
CPU times: user 4.28 s, sys: 508 ms, total: 4.79 s
Wall time: 4.79 s

In [25]: %time pandas.read_pickle("labevents.pkl")
CPU times: user 4.02 s, sys: 740 ms, total: 4.76 s
Wall time: 4.76 s

# Using the pickler written in pure python (no c-extensions)
In [37]: %time pickle.dump(df, fh)
CPU times: user 1min 10s, sys: 1.97 s, total: 1min 12s
Wall time: 1min 12s

In [43]: %time df2 = pickle.load(fh)
CPU times: user 30.9 s, sys: 2.16 s, total: 33.1 s
Wall time: 33.1 s

Comparing file sizes:

-rw-r--r-- 1 jake jake 243M Aug  7 15:02 labevents.csv
-rw-r--r-- 1 jake jake 475M Aug  7 10:33 labevents.jld
-rw-rw-r-- 1 jake jake 259M Aug  7 15:19 labevents.pkl |
The original 8 GB file finished deserializing:

julia> @time chartevents = open("chartevents.jld", "r") do io
           deserialize(io)
       end
elapsed time: 20390.763780769 seconds (80046322888 bytes allocated, 95.33% gc time)

@timholy I'm finding saving some of the smaller dataframes to HDF5 takes much longer than just serializing the object directly to disk. |
It's quite possible---HDF5 has to write a lot of annotation to store the type in a "readable" way, whereas the serializer just uses a couple bytes to indicate type. In my experience JLD is ~5x faster than CSV. I bet it depends a lot on the contents, though. If you can post the dataframe somewhere, I can take a look. |
For numeric data, that would be expected since we write binary. Here, it's all about the strings. |
I've played around a little and it looks like some of the bottlenecks in |
I'm working on some more small but worthwhile improvements. This is a thorn in my side: that integer is totally redundant and AFAICT not used at all. It's obviously not a huge bottleneck, but it would be nice to eliminate, given that it's totally unused. However, it doesn't appear possible to remove it in a way that would allow old files to be read correctly. |
0.3 release seems like a reasonable time to update the serialize format. |
mostly this avoids memory allocation from Module fullname()s, and boxing Ints. this is backwards-compatible for *reading* old files, but written data will not be usable by older julias.
Dare I ask why it's there in the first place? |
I have no idea. It looks like @timholy added it over 2 years ago. |
@JeffBezanson When linking to lines, please make sure you include the commit hash. You can simply press y. |
I have no idea either. Looks like it was motivated by Matlab compatibility back when I was interested in using Julia via Matlab---perhaps because Julia has intrinsic information about types in Base, but Matlab doesn't? Not sure. Anyway, sorry about the thorn. Do what you gotta do. |
Here's where we are. Timings when this issue was filed:
After my improvements plus removing the extra integer:
However I feel bad about breaking read-compatibility. @timholy as somebody who uses JLD, do you really think this is ok? |
I think there is some confusion going on here. .jld is the extension of the HDF5 package's dumper, isn't it? |
If it's an immutable, we may want JuliaIO/HDF5.jl#27 sooner rather than later. CC @simonster. |
I'll work on that this weekend. |
That would be pretty awesome. |
Did something break? Just found this |
Yes, it looks like the performance changes were backwards incompatible. If I attempt to open any of the data from yesterday it does not work.

julia> open(deserialize, "a_chartdurations.jls")
ERROR: type: setfield!: expected Array{Any,1}, got Type{()}
 in deserialize at serialize.jl:565
 in deserialize at serialize.jl:520
 in handle_deserialize at serialize.jl:351
 in deserialize at serialize.jl:334
 in open at ./iostream.jl:137 |
@jakebolewski and I were playing around with this some more this afternoon, and reached the conclusion that the current state of garbage collection greatly magnifies the effect of any allocation in this benchmark. After the dataset is loaded into memory, each gc takes 2 seconds, as @timholy noted above. So given the current state of garbage collection, optimizing serialization/deserialization of that dataset comes down to minimizing gc. But I suspect these very long gc times will also make it hard to do anything with the data. With 100x longer gc pauses, even code that spent only 10% of its time in gc before the dataset was in memory will take 10x longer to run after the dataset is in memory. |
For some hard numbers: allocating 1,000,000 UTF-16 strings takes about 0.5 seconds on the julia test machine in a fresh environment. After loading in a dataframe like labevents, allocating the same 1,000,000 UTF-16 strings takes 5.07 seconds, and gc collection times are on the order of 1.6 seconds. This obviously gets much worse as you load in bigger datasets; labevents is 450 MB, containing almost 10,000,000 short UTF-16 strings. |
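A minimal version of that experiment (a sketch: utf16 is the Base conversion function of that era, and the file name is illustrative):

julia> @time [utf16("abc") for i in 1:1_000_000];        # fresh session: ~0.5 s

julia> labevents = open(deserialize, "labevents.jls");   # load ~450 MB of string data

julia> @time [utf16("abc") for i in 1:1_000_000];        # now ~5 s, mostly gc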
Yes, we know. |
Yeah, strings are a total disaster currently – that's why they're slated for an overhaul in 0.4. |
I tried to search the issues; how do you see the representation of strings changing? Allocating so many objects, be they strings, arrays, or immutables, gives the same behavior. I think this is more of a gc problem than a string problem. How will the new representation get around the issue that you are just allocating many small objects? |
We need to work both sides of the issue: better gc and fewer, smaller objects. |
@jakebolewski, here's a mockup of a |
Note that short strings can be stored inline and compared using Int128 operations. In my benchmarks of this prototype, sorting an array of inline strings is about 2x faster than sorting an array of slightly longer, non-inline strings, so there does seem to be some advantage to be had. |
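The rough idea behind such inline strings, as a hypothetical sketch (not the actual mockup, and glossing over byte-order details a real implementation would need):

immutable InlineString
    data::Int128   # up to 16 UTF-8 bytes packed into one word: no per-string heap allocation
    len::Int8
end

# Comparison reduces to a single integer comparison, which is why
# sorting arrays of inline strings can beat heap-allocated strings.
Base.isless(a::InlineString, b::InlineString) = a.data < b.data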
Also, usually, you don't need to work with millions of different string values: you have millions of values from a relatively small pool. In this case |
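Presumably this refers to something like DataArrays' PooledDataArray, which stores each distinct value once plus small integer references; a sketch (assuming its standard vector constructor):

using DataArrays
vals = ["NORMAL", "abnormal", "NORMAL", "NORMAL", "abnormal"]
pda = PooledDataArray(vals)   # pool of 2 strings + 5 integer refs, instead of 5 string objects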
The approach we're considering will basically make that optimization obsolete, for strings at least. We may also make it "autopooling" – i.e. whenever two strings are equal or can share memory, we may be able to transparently use a single object. |
Boy am I glad we're starting to work on big datasets :-) |
I don't know whether the optimization techniques described in the blog post are applicable here, but I'll share it for the purposes of discussion: |
It's not clear from this issue: has anything been done to specifically improve the performance of serialization/deserialization (besides the general gains from GC improvements)? Has the file extension been changed to not be .jld? Has file header information been added? Is the speed of UTF-16 conversions still a problem? |
I added a performance benchmark of this to nanosoldier, so we can track this going forward. |
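Such a benchmark essentially times a round trip through an in-memory buffer; a minimal sketch (not the actual nanosoldier definition):

function serialize_roundtrip(x)
    io = IOBuffer()
    serialize(io, x)      # write with the built-in serializer
    seekstart(io)
    deserialize(io)       # read it back
end

@time serialize_roundtrip(rand(10^6));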
I believe this is largely fixed. New issues can be opened if anybody encounters an especially bad case for serialize/deserialize. |
Julia's serialization / deserialization performance is pretty worrying. Deserializing a ~500 MB DataFrame object takes about 5-6 minutes. An 8 GB dataframe object takes 3+ hours (it is not finished yet). It looks like almost all time is spent in GC. This is making it impractical to do any sort of parallel operation on dataframe objects without dumping the dataframe out to a CSV file and reading the contents back into each process.
@jiahao said serializing the 8GB file took ~5 hours.
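The pattern being benchmarked is roughly the following (a sketch; file names are illustrative):

using DataFrames
labevents = readtable("labevents.csv")
open("labevents.jls", "w") do io
    serialize(io, labevents)   # built-in serializer; fast to call, but not a stable on-disk format
end
labevents2 = open(deserialize, "labevents.jls")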