Speed of filter #3208

gugatr0n1c · 2022-10-26T08:40:47Z

Hello,

I am using Julia 1.8.2 and DataFrames 1.4.1.

Consider this code:

import Random
using BenchmarkTools
using DataFrames

sdf = DataFrame()
sdf[!, "SOME_STR"] = [Random.randstring() for i in 1:1_000_000]
sdf[!, "SOME_FLT"] = [Random.rand() for i in 1:1_000_000]

@btime filter(row -> occursin("E8", row.SOME_STR), sdf); # ---> 126.400 ms
@btime sdf[occursin.("E8", sdf[!, "SOME_STR"]), :]; # ---> 56.947 ms

@btime filter(row -> row.SOME_FLT > 0.5, sdf); # ---> 126.400 ms
@btime sdf[sdf[!, "SOME_FLT"] .> 0.5, :]; # --->  4.093 ms

I actually prefer the code that uses filter, but in some cases it is much slower. Am I doing something wrong? Or does filter do much more under the hood, and that causes the slowdown?

Cheers,
Lubo

bkamins · 2022-10-26T10:12:27Z

This is a standard way to do it using filter:

julia> @btime sdf[occursin.("E8", sdf[!, "SOME_STR"]), :];
  50.419 ms (2000021 allocations: 122.24 MiB)

julia> @btime filter("SOME_STR" => contains("E8"), sdf);
  50.332 ms (2000023 allocations: 122.24 MiB)

julia> @btime sdf[sdf[!, "SOME_FLT"] .> 0.5, :];
  5.831 ms (23 allocations: 11.58 MiB)

julia> @btime filter("SOME_FLT" => >(0.5), sdf);
  5.871 ms (26 allocations: 11.58 MiB)

The style of filter you use is accepted because it is convenient, but it is not type stable, so it is expected to be slower.
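
To make the difference concrete, here is a small sketch of the two styles side by side. The tiny table and its values are made up for illustration; only the column names come from the example above. As far as I understand it, the `cols => fun` form lets the predicate work on the column values directly, so the element type is known inside the filtering loop, whereas the `row -> ...` form goes through a `DataFrameRow` whose column types are not visible to the compiler.

```julia
using DataFrames

# Hypothetical four-row table reusing the column names from the issue.
df = DataFrame(SOME_STR = ["AE8x", "bcd", "E8", "xyz"],
               SOME_FLT = [0.9, 0.2, 0.7, 0.4])

# Row-wise style: the predicate receives a DataFrameRow, so the type of
# row.SOME_FLT cannot be inferred and each access is dispatched dynamically.
by_row = filter(row -> row.SOME_FLT > 0.5, df)

# Column-wise style: the predicate receives the values of a single column.
by_col = filter("SOME_FLT" => >(0.5), df)

# Several columns can be passed as a vector; the predicate then takes
# one positional argument per column.
both = filter(["SOME_STR", "SOME_FLT"] => (s, x) -> contains(s, "E8") && x > 0.5, df)
```

Both styles return the same rows, so switching to the `cols => fun` form should only affect speed, not the result.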

bkamins closed this as completed Oct 26, 2022
