Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrames not threadsafe #2795

Closed
smjaberl opened this issue Jun 22, 2021 · 11 comments · Fixed by #2823
Closed

DataFrames not threadsafe #2795

smjaberl opened this issue Jun 22, 2021 · 11 comments · Fixed by #2823
Labels

Comments

@smjaberl
Copy link

Hi,

I have a case, where I use a push to a DataFrame on multiprocessing. When the rare case occurs, that two processes are pushing on the same time, I get this error:

┌ Error: Error adding value to column :t.
└ @ DataFrames ~/.julia/packages/DataFrames/nxjiD/src/dataframe/dataframe.jl:1644

Is this a known bug? I found no documentation about it.

regards
jan

@bkamins
Copy link
Member

bkamins commented Jun 22, 2021

This is not a bug, but a design decision in Julia Base, which is just propagated to DataFrames.jl. Operations like push! (I assume you are doing this) are not thread safe in Julia in general (not only in DataFrames.jl).

I will keep this issue open, as maybe in the future the design assumptions will change.

@smjaberl
Copy link
Author

Well, thanks for adding the feature label to the issue. I think it make sense to think about this design decision. A modern fast programming language like julia should be threadsafe in such elementary functions. On the other hand I know, it would be more complex.

@bkamins
Copy link
Member

bkamins commented Jun 23, 2021

I agree, but this is a request to Julia Base mainly. The issue is that adding thread safety to such elementary operations would make them slower.

See a comment by @StefanKarpinski here https://discourse.julialang.org/t/can-dicts-be-threadsafe/27172/6.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jun 23, 2021

A modern fast programming language like julia should be threadsafe in such elementary functions. On the other hand I know, it would be more complex.

It's not just a matter of complexity. For many data structures and operations, the threadsafe version is significantly slower that the threadunsafe version. So it is precisely because Julia is a fast programming language that it cannot simply make everything threadsafe. The core data structures and operations provided by the language are fast and threadunsafe (unless thread safety is necessary or can be provided without loss of performance) and the tools (like locks and atomics) are provided for packages to build thread safe algorithms and data structures on top of that.

@nalimilan
Copy link
Member

I agree that we can't ensure Julia is threadsafe everywhere without killing performance. However, given that appending new rows to a data frame is quite slower than appending entries to a single vector (due to type instability and column lookup), it might make sense to ensure thread safety if we can confirm the overhead is negligible.

@bkamins
Copy link
Member

bkamins commented Jul 7, 2021

We could ensure thread safety by:

  • having an internal lock (like in GroupedDataFrame)
  • using it in such operations

However, it would not have any performance benefit (i.e. the speed of threaded push! would be worse than speed of sequential push!) - the only benefit would be thread safety at the cost of non deterministic row order.

However, ensuring this thread safety is equally easy to do on user side. Essentially one needs to wrap push! in a lock (this is what essentially we would do internally). I assume that if someone uses multi-threading adding lock call:

lock(mylock) do
    push!(df, something)
end

should be easy enough to do.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 8, 2021

This is a good point. If you're going to lock on the entire data frame structure then there's zero benefit to using multiple threads: the locks force sequential execution anyway while adding significant overhead and causing operations to be done in a random (as in non-deterministic) order. Using threads and locking for something like this only makes sense if you can do it at a granularity smaller than the entire data structure.

@nalimilan
Copy link
Member

The main benefit would be to avoid corrupting the data frame (which leads to the reported error). But given its limited interest that safety check would only be acceptable if the overhead is negligible.

@bkamins
Copy link
Member

bkamins commented Jul 8, 2021

If you're going to lock on the entire data frame structure

You have to lock an entire data frame as push! mutates all columns in a data frame (and assumes it is an atomic operation).

The place, where we support multi-threading is in situations, where adding multi-threading allows us to process data column by column.

See the following example:

julia> using DataFrames

julia> Threads.nthreads()
1

julia> df = DataFrame(rand(10^6, 1000), :auto);

julia> @time copy(df);
  2.341119 seconds (75.97 k allocations: 7.455 GiB, 36.73% gc time, 1.51% compilation time)

julia> @time copy(df);
  2.063910 seconds (3.50 k allocations: 7.451 GiB, 30.87% gc time)

julia> @time copy(df);
  2.750718 seconds (3.50 k allocations: 7.451 GiB, 32.78% gc time)

vs

julia> using DataFrames

julia> Threads.nthreads()
4

julia> df = DataFrame(rand(10^6, 1000), :auto);

julia> @time copy(df);
  1.229864 seconds (89.47 k allocations: 7.456 GiB, 39.35% gc time, 2.79% compilation time)

julia> @time copy(df);
  0.738466 seconds (9.52 k allocations: 7.451 GiB, 19.72% gc time)

Here we use multi-threading, as copy can copy each column separately and it can be efficiently parallelized (in a lock-free manner, see

@sync for i in eachindex(columns)
).

@NowanIlfideme
Copy link

Perhaps the "fix" is adding a tutorial specifically covering multithreading? In your JuliaCon 2021 talk, you mentioned this column parallelism, but I didn't see it elsewhere in the docs (maybe I just missed it). And the above comment on using built-in lock mechanisms would work, too.

@bkamins
Copy link
Member

bkamins commented Jul 24, 2021

I will make a blog post about it + add to a documentation a description which operations are currently supporting multithreading (as currently it is spread over the documentation and incomplete).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants