Severe negative scaling when running Julia threaded. #528

Open
KristofferC opened this issue Oct 14, 2024 · 0 comments

Comments

@KristofferC (Contributor)

Running the following code, which generates some data and then reads it back via Arrow.Table, shows a severe slowdown when Julia is started with multiple threads:

using DataFrames, Dates, Arrow, Tables, StatsBase, Random, InlineStrings

function generate_data(f)
    number_of_companies = 10000
    dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
    companyid = sample(100000:1000000, number_of_companies, replace = false)

    number_of_items = length(companyid)*length(dates)

    df = DataFrame(
            dates = repeat(dates, outer = number_of_companies),
            companyid = repeat(companyid, inner = length(dates)),
            item1 = rand(number_of_items),
            item2 = randn(number_of_items),
            item3 = rand(1:1000,number_of_items),
            item4 = repeat([String7(randstring(['a':'z'; 'A':'Z'], 5)) for _ in 1:number_of_companies], length(dates))
        )

    @info "Saving to $f"
    open(f, "w") do f
        Arrow.write(f, Tables.partitioner(groupby(df,:dates)))
    end
end

f = "mytestdata.arrow"

if !isfile(f)
    generate_data(f)
end

Arrow.Table(f)        # first call, includes compilation
@time Arrow.Table(f)  # time the second call

Results:

❯ julia arrowthreads.jl
  0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time)

❯ julia --project --threads=3 arrowthreads.jl
  6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time)
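
The slowdown can be narrowed down with the Profile stdlib. This is only a minimal sketch, assuming the same mytestdata.arrow file and a run started with --threads=3:

using Arrow, Profile

f = "mytestdata.arrow"
Arrow.Table(f)                 # warm up so compilation does not dominate the profile
@profile Arrow.Table(f)
Profile.print(mincount = 100)  # in the threaded run the hot frames are lock/wait calls, as described below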

We can see that Arrow.Table spawns a task here https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528 and profiling shows that almost all of the time is spent waiting on the lock in https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108.
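
For context, that lock appears to belong to OrderedSynchronizer from ConcurrentUtilities.jl, which is used to deliver results back in their original order. The following is only a sketch of that pattern, not Arrow's actual code (the batch count and the per-batch work are invented), but it shows how thousands of tiny partitions end up serialized behind one lock:

using ConcurrentUtilities: OrderedSynchronizer

# Hypothetical stand-in for per-record-batch work: many tiny tasks that all
# funnel their result through one shared OrderedSynchronizer.
function ordered_results(nbatches)
    sync = OrderedSynchronizer()             # starts expecting index 1
    results = Vector{Int}(undef, nbatches)
    @sync for i in 1:nbatches
        Threads.@spawn begin
            r = i * i                         # stand-in for parsing one small batch
            # put! blocks on the synchronizer's internal lock until it is batch i's
            # turn, so tasks with little real work spend most of their time here.
            put!(() -> results[i] = r, sync, i)
        end
    end
    return results
end

ordered_results(7_305)  # roughly one "batch" per date in the reproducer above

With one thread the tasks run back to back and the lock is essentially uncontended; with several threads and roughly 7300 date-sized partitions, each task does very little work before blocking in put!, which would be consistent with the timings above.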
