Running the following code, which generates some data and then reads it back via `Arrow.Table`, shows a severe slowdown when Julia is started with multiple threads:
    using DataFrames, Dates, Arrow, StatsBase, Random, InlineStrings

    function generate_data(f)
        number_of_companies = 10000
        dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
        companyid = sample(100000:1000000, number_of_companies, replace = false)
        number_of_items = length(companyid)*length(dates)
        df = DataFrame(
            dates = repeat(dates, outer = number_of_companies),
            companyid = repeat(companyid, inner = length(dates)),
            item1 = rand(number_of_items),
            item2 = randn(number_of_items),
            item3 = rand(1:1000,number_of_items),
            item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],length(dates))
        )
        @info "Saving to $f"
        open(f, "w") do f
            # One record batch per date
            Arrow.write(f, Tables.partitioner(groupby(df,:dates)))
        end
    end

    f = "mytestdata.arrow"
    if !isfile(f)
        generate_data(f)
    end

    Arrow.Table(f)        # first call, includes compilation
    @time Arrow.Table(f)  # timed call
Results:
    ❯ julia arrowthreads.jl
      0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time)

    ❯ julia --project --threads=3 arrowthreads.jl
      6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time)
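To make the comparison less sensitive to one-off noise, the read can be timed several times per thread count and the best run reported. The helper below is only a sketch: `best_time` is a hypothetical name, not part of Arrow.jl, and the workload shown is a stand-in so the snippet runs without the data file.

```julia
# Hypothetical helper (not part of Arrow.jl): run `work` once to warm up,
# then return the fastest of `repeats` timed runs, in seconds.
function best_time(work; repeats::Int = 5)
    work()  # warm-up call so compilation is not measured
    return minimum(@elapsed(work()) for _ in 1:repeats)
end

# Stand-in workload so the sketch is self-contained.
t = best_time(() -> sum(rand(10^6)))
println("threads = ", Threads.nthreads(), ", best time = ", t, " s")
```

In the setup from this report, the workload would be `() -> Arrow.Table("mytestdata.arrow")`, run once with the default thread count and once with `--threads=3`.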
`Arrow.Table` spawns a task here https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528, and profiling shows that almost all of the time is spent waiting on the lock in https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108.
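For readers unfamiliar with this kind of synchronization, here is a minimal sketch of the general pattern: tasks are spawned in parallel, but each must wait for its turn before publishing its result. It uses only Base and is not the Arrow.jl or ConcurrentUtilities.jl code; `OrderedGate`, `in_order!`, and `ordered_collect` are hypothetical names, and the sketch makes no claim about why the wait becomes so expensive in the linked code.

```julia
# Hypothetical illustration using only Base; this is NOT the implementation
# in Arrow.jl or ConcurrentUtilities.jl.
mutable struct OrderedGate
    cond::Threads.Condition  # lock plus condition variable
    next::Int                # index of the task allowed to proceed
end
OrderedGate() = OrderedGate(Threads.Condition(), 1)

# Run `f()` only when it is turn `i`, then advance the turn and wake waiters.
function in_order!(f, gate::OrderedGate, i::Int)
    lock(gate.cond)
    try
        while gate.next != i
            wait(gate.cond)  # releases the lock while blocked
        end
        f()
        gate.next += 1
        notify(gate.cond)    # wake all waiters so the next one can proceed
    finally
        unlock(gate.cond)
    end
end

# Spawn `n` tasks that compute in parallel but append results in order.
function ordered_collect(n::Int)
    gate = OrderedGate()
    results = Int[]
    @sync for i in 1:n
        Threads.@spawn begin
            value = i^2  # stand-in for per-batch work
            in_order!(() -> push!(results, value), gate, i)
        end
    end
    return results
end

results = ordered_collect(100)
```

Every task that finishes out of turn blocks in `wait` and is woken on each `notify`, so with many short tasks the time spent around the lock can come to dominate the actual work.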