DataFrames not threadsafe #2795
Comments
This is not a bug, but a design decision in Julia Base, which is just propagated to DataFrames.jl. Operations like `push!` are not thread-safe. I will keep this issue open, as the design assumptions may change in the future.
Well, thanks for adding the feature label to the issue. I think it makes sense to reconsider this design decision. A modern, fast programming language like Julia should be threadsafe in such elementary functions. On the other hand, I know it would be more complex.
I agree, but this is mainly a request for Julia Base. The issue is that adding thread safety to such elementary operations would make them slower. See a comment by @StefanKarpinski here https://discourse.julialang.org/t/can-dicts-be-threadsafe/27172/6.
It's not just a matter of complexity. For many data structures and operations, the threadsafe version is significantly slower than the threadunsafe version. So it is precisely because Julia is a fast programming language that it cannot simply make everything threadsafe. The core data structures and operations provided by the language are fast and threadunsafe (unless thread safety is necessary or can be provided without loss of performance), and the tools (like locks and atomics) are provided for packages to build thread safe algorithms and data structures on top of that.
I agree that we can't ensure Julia is threadsafe everywhere without killing performance. However, given that appending new rows to a data frame is considerably slower than appending entries to a single vector (due to type instability and column lookup), it might make sense to ensure thread safety if we can confirm the overhead is negligible.
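We could ensure thread safety by taking a lock on the data frame for the duration of the operation. However, it would not have any performance benefit (i.e. the threaded version would not be faster than the sequential one). Also, ensuring this thread safety is equally easy to do on the user side. Essentially one needs to wrap the `push!` call in a lock, which should be easy enough to do.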
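A minimal sketch of the user-side approach described above, assuming a single shared `ReentrantLock` guards every `push!` to the data frame. The names `df_lock` and `row` and the column layout are made up for illustration and are not from the thread:

```julia
using DataFrames
using Base.Threads

# Shared data frame; the column names are illustrative only.
df = DataFrame(t = Int[], x = Float64[])

# Hypothetical lock guarding all mutation of `df`.
df_lock = ReentrantLock()

@threads for i in 1:1000
    # Do the (potentially expensive) work outside the lock.
    row = (t = i, x = sqrt(i))
    # Only one task at a time may mutate the data frame.
    lock(df_lock) do
        push!(df, row)
    end
end

# Rows arrive in a non-deterministic order, so sort if order matters.
sort!(df, :t)
```

The lock serializes the `push!` calls, so this protects the data frame from corruption but does not make the appends themselves any faster.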
This is a good point. If you're going to lock on the entire data frame structure then there's zero benefit to using multiple threads: the locks force sequential execution anyway while adding significant overhead and causing operations to be done in a random (as in non-deterministic) order. Using threads and locking for something like this only makes sense if you can do it at a granularity smaller than the entire data structure.
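One way to avoid sharing the data structure at all, sketched here as an assumption rather than something proposed in this thread, is to let each task build its own data frame and combine the pieces at the end. The chunking scheme and the names `chunks`, `tasks` and `local_df` are illustrative:

```julia
using DataFrames
using Base.Threads

n = 1000
# Split the work into one contiguous chunk per available thread.
chunks = collect(Iterators.partition(1:n, cld(n, nthreads())))

tasks = map(chunks) do chunk
    Threads.@spawn begin
        # Each task owns its data frame, so no lock is needed here.
        local_df = DataFrame(t = Int[], x = Float64[])
        for i in chunk
            push!(local_df, (t = i, x = sqrt(i)))
        end
        local_df
    end
end

# Concatenate the per-task results in chunk order.
df = reduce(vcat, fetch.(tasks))
```

Because the chunks are fetched in order, the combined result is deterministic, unlike the locked version.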
The main benefit would be to avoid corrupting the data frame (which leads to the reported error). But given its limited interest that safety check would only be acceptable if the overhead is negligible.
You have to lock the entire data frame. The place where we support multi-threading is in situations where it allows us to process data column by column, as in DataFrames.jl/src/dataframe/dataframe.jl, line 200 at commit ab5ffd7.
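To illustrate what column-by-column parallelism can look like, here is a small sketch. It is not the DataFrames.jl code referenced above; the data and the per-column operation (`sum`) are assumptions chosen only to show the pattern of one task per column:

```julia
using DataFrames
using Base.Threads

df = DataFrame(a = rand(10^6), b = rand(10^6), c = rand(10^6))

# One result slot per column; each task writes only its own slot.
sums = Vector{Float64}(undef, ncol(df))

@sync for (i, col) in enumerate(eachcol(df))
    # Columns are independent vectors, so reading them in parallel is safe.
    Threads.@spawn sums[i] = sum(col)
end
```

No lock is needed here because no two tasks touch the same column or the same element of `sums`.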
Perhaps the "fix" is adding a tutorial specifically covering multithreading? In your JuliaCon 2021 talk, you mentioned this column parallelism, but I didn't see it elsewhere in the docs (maybe I just missed it). And the above comment on using built-in lock mechanisms would work, too.
I will write a blog post about it and add to the documentation a description of which operations currently support multithreading (at the moment this information is spread over the documentation and incomplete).
Hi,
I have a case where I push to a DataFrame while using multiprocessing. In the rare case that two processes push at the same time, I get this error:
┌ Error: Error adding value to column :t.
└ @ DataFrames ~/.julia/packages/DataFrames/nxjiD/src/dataframe/dataframe.jl:1644
Is this a known bug? I found no documentation about it.
regards
jan