Serialization file format for arbitrary julia objects #7909

Closed
jakebolewski opened this issue Aug 8, 2014 · 20 comments

@jakebolewski
Member

Python has pretty much standardized on the .pkl file name extension for serialized Python objects. What should we call Julia's serialized object dumps? .jld is already taken.

@jiahao
Member

jiahao commented Aug 8, 2014

.jls? (Julia serialized)

@jiahao
Member

jiahao commented Aug 8, 2014

.dat is always good for a round of Twenty Questions.

@Keno
Member

Keno commented Aug 8, 2014

I think .jls is fine. Otherwise how about .js (JuliaSerialized) ;)?

@JeffBezanson
Member

+1 for .jls
A related issue is that technically there is no API for saving a .jls "file", only a value at a time. I like the simplicity of this, but it leaves us nowhere to put a header.
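
To make the "one value at a time" point concrete, here is a minimal sketch of how a file-level header could be layered on top of the per-value API. The magic bytes, the version byte, and the names `save_values` and `load_values` are hypothetical, invented for this illustration; they are not part of Julia or of any agreed .jls format. It assumes a Julia version where `serialize` and `deserialize` live in the `Serialization` standard library.

```julia
using Serialization

# Hypothetical file header: 4 magic bytes plus a 1-byte format version.
# Neither value is defined by Julia; both are assumptions for this sketch.
const MAGIC = UInt8[0x4a, 0x4c, 0x53, 0x58]
const FORMAT_VERSION = UInt8(1)

# Write the header once, then serialize each value in order.
function save_values(path, values...)
    open(path, "w") do io
        write(io, MAGIC)
        write(io, FORMAT_VERSION)
        for v in values
            serialize(io, v)
        end
    end
end

# Check the header, then deserialize values until the stream is exhausted.
function load_values(path)
    open(path) do io
        read(io, length(MAGIC)) == MAGIC || error("not a recognized file")
        version = read(io, UInt8)
        version == FORMAT_VERSION || error("unsupported format version $version")
        out = Any[]
        while !eof(io)
            push!(out, deserialize(io))
        end
        out
    end
end
```

A call such as `save_values("state.jls", 1, "two", [3.0, 4.0])` followed by `load_values("state.jls")` would return the values in the order they were written. Reading back custom types would still require their definitions to be loaded first.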

@jakebolewski
Member Author

Allowing the user to define methods to serialize custom data types also kind of defeats the versioned file concept. The ability to deserialize depends on what is currently loaded into the environment, so the "version" of the format is kind of meaningless.

@JeffBezanson
Member

We could explore using the serializer in dump.c. That implements the non-overloadable, just-write-the-damn-data kind of serialization. It tends to change rarely, and might be faster too. However it will require modifications not to save the entire heap; e.g. the Base module should just be saved by reference instead of writing all its contents.

@Keno changed the title from "Serialization file format extension for julia." to "Serialization file format for arbitrary julia objects" on Aug 8, 2014
@Keno
Member

Keno commented Aug 8, 2014

Changed the title since the extension seems to be pretty much decided, so this issue now tracks the fact that there is no file format.

@timholy
Member

timholy commented Aug 8, 2014

I would also argue that if HDF5 can be made essentially as fast (which remains to be seen), we really shouldn't encourage this: there are many advantages in a stable, nearly-human-readable (thanks to h5dump and h5ls), widely-supported format. For that reason, rather than saying "wow, HDF5 is slow so let's find something else", it would be great if folks threw some effort into investigating and improving its performance.

@simonster
Member

Like @timholy I think that HDF5/JLD is the way to go unless there is reason to believe it can never be made fast enough.

@timholy
Member

timholy commented Aug 8, 2014

#7893 now suggests HDF5/JLD can be plenty fast.

@JeffBezanson
Member

I agree completely. Originally I didn't really want to use serialize/deserialize for anything except message passing, where it is needed to handle details like tracking RemoteRefs and efficiently sending closures around. HDF5/JLD would be much better for persisting data.

@jakebolewski
Member Author

How performant is HDF5 on Mac or Windows? Maybe I need to wean myself off past Python experience, but being able to quickly serialize whatever state you are working on to look at it later is useful. I agree that this is not a good format for "permanent" storage; it's more useful for "I'll look at this tomorrow" type storage. I guess if HDF5 is performant enough, and it is easy to install on all platforms, then it fills this role quite nicely.

@JeffBezanson
Member

We could perhaps add save to Base, and use serialize if the extension is jls.

@timholy
Member

timholy commented Aug 8, 2014

Dunno, I've only tested Linux. I agree with the convenience of serialization, but I've found that for my needs @save "mywholeworkspace.jld" works pretty well. (Usually I pick specific variables, however.) I basically think of .jld files as the equivalent of .mat files.

There are some limitations, though, like not saving functions.

@JeffBezanson
Member

Saving functions could probably be added by imitating what the code in Base does, but it seems like a fairly marginal need.

@simonster
Member

After working on JuliaIO/HDF5.jl#132, I'm not sure where to go next here. For @jakebolewski and @jiahao's cases, which involve large arrays of numbers, strings, and immutables, JLD can now beat serialize, but for serializing large arrays of Vector{Any} it's quite pitiful. The problem is that we have to save each object of unknown type as its own HDF5 dataset, which appears to be costly not only (or even mostly) because of disk access but also because of the massive amount of overhead in libhdf5 itself. To get good performance, we're going to have to avoid relying on libhdf5. The question is whether to design our own Julia-specific format or to continue using an HDF5-based format but reimplement the code to read/write it.

I am still of the opinion that the format used by serialize/deserialize is not good as a data storage format. In addition to the instability of the serialize/deserialize format and the fact that it is only meant to save one value at a time, a data storage format should be suitable for archival purposes: It should still be possible to read data from a file even if the types are missing. That doesn't mean serialize/deserialize couldn't be used as a basis for a data storage format, just that more work would be necessary.

@simonster self-assigned this on Jul 3, 2015
@Skylion007

I'd like to second simonster's point. Unless we fix the issues with JLD, it cannot be a replacement for JLS.

@simonster
Member

I fixed most of the issues with JLD in JLD2: https://github.com/simonster/JLD2.jl. It reads and writes valid HDF5 files in pure Julia (although for reading it only supports a subset of HDF5). It pretty much works, and it can be many orders of magnitude faster than JLD for the Vector{Any} cases. The caveats: I never tested on Windows; I never got around to implementing compression or groups; HDF5 has substantial file size overhead if you're writing a bunch of small objects (although the performance is typically no slower than 2x Base.serialize, and in some cases it can even be faster). Unfortunately I have to finish my PhD and won't have time to finish it in the near future.

@JeffBezanson
Member

For 1.0 we should probably add the ability to write a small header giving the format version (and probably an ABI specifier).

@JeffBezanson
Member

serialize(io, x) and serialize(filename, x) now write a header; I think this can be closed.
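
For readers landing here later, a minimal sketch of the two call forms mentioned above, assuming a Julia version in which they live in the `Serialization` standard library and accept a filename. The variable names and the sample data are illustrative only.

```julia
using Serialization

# Illustrative sample data; any serializable value would do.
state = Dict("weights" => [0.1, 0.2, 0.3], "label" => "run-1")

# Filename form: writes the header and the value to a .jls file.
path = tempname() * ".jls"
serialize(path, state)
restored_from_file = deserialize(path)

# Stream form: the same round trip through an in-memory buffer.
io = IOBuffer()
serialize(io, state)
seekstart(io)          # rewind before reading back
restored_from_io = deserialize(io)
```

As discussed earlier in the thread, this is best suited to short-lived "look at it tomorrow" storage; reading the file back still depends on the relevant types being defined in the session.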
