Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump Polars 0.37 #861

Merged
merged 4 commits into from
Feb 25, 2024
Merged

Conversation

lkarthee
Copy link
Member

No description provided.

categories = categories |> distinct() |> cast(:category)
apply_series(series, :categorise, [categories])
end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved string series and list of strings categorise to here.

  • instead of throwing errors, it applies distinct to categories series
  • nil check is missing - should we add nil_count check ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of throwing errors, it applies distinct to categories series

I don't think we should do this because we are mapping indexes into the list. If you remove duplicates, the indexes are shifted, and the result changes. Is there a reason we removed the Rust code responsible for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

https://pola.rs/posts/polars-string-type/

This caused errors in Series.categorise and encoding of strings/binary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hit a wall with getting RevMapping to work in Series.Categorise - so tried if it can be fixed in elixir

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do it in Elixir, but we need to get the unique_count and raise if different than size, and the nil count and raise.

s.str()?.into_iter().map(|option| option.encode(env))
))
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not confident of generic_string_series_to_list and generic_binary_series_to_list. Feel free to make it better. I have a feeling that I messed up something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim this code ok ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea, we will need to wait for @philss' review. :)

Copy link
Member Author

@lkarthee lkarthee Feb 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about one thing performance - earlier binaries and strings were created from buffers using binary and subbinary. Now because of the new arrow implementation, they are exposing BinaryArrayView which gives values. So not sure if there is way to get earlier implementation working or new one is as good as old one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think code-wise it looks good! But I think we need a benchmark to be sure about performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! I will take care of it. Thanks, and safe travels!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Sorry for the delay. I could find something interesting about this.

For the test, I'm trying to compare how much time it takes to encode a binary series, and the MB of RAM it uses. I'm using a custom allocator called PeakAlloc for this.

The patch of the diffs I made to measure this is located in https://gist.github.com/philss/3f519ce2587461aa61d7fd9f77e3ea1f (only the Cargo.lock is out to avoid conflicts).

Timing and memory usage - main branch

Running on main without calling GC:

$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"] |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}") end)'

Results:

Begin: 30.768875 MB of RAM.
End: 30.769852 MB of RAM.
done in: 0.099455

Begin: 61.288383 MB of RAM.
End: 61.288383 MB of RAM.
done in: 0.101318

Begin: 3082.6296 MB of RAM.
End: 3082.6296 MB of RAM.
done in: 7.900625

Begin: 3082.6296 MB of RAM.
End: 3082.6296 MB of RAM.
done in: 2.738504

Begin: 61.288383 MB of RAM.
End: 61.288383 MB of RAM.
done in: 0.083659

We can see above that the memory grows, but soon it drops again when we try with a smaller file. It looks like the GC takes a while to run and drop the encoded content.

Timing and memory usage - this PR branch

$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium, :small, :small], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"] |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}"); :erlang.garbage_collect(); Process.sleep(1_000) end)'

Even with GC runs (I tried before without it), it does not drop after a while:

Begin: 39.60874 MB of RAM.
End: 39.60972 MB of RAM.
done in: 0.106566

Begin:  78.96812 MB of RAM.
End:  78.96812 MB of RAM.
done in: 0.130126

Begin:  4662.26 MB of RAM.
End:  4662.26 MB of RAM.
done in: 9.069385

Begin:  4701.618 MB of RAM.
End:  4701.618 MB of RAM.
done in: 0.042475

Begin:  4740.9766 MB of RAM.
End:  4740.9766 MB of RAM.
done in: 0.049214

Begin:  4741.016 MB of RAM.
End:  4741.016 MB of RAM.
done in: 0.323967

Begin:  4741.0557 MB of RAM.
End:  4741.0557 MB of RAM.
done in: 0.327083

And if we try to load the DF without encoding the series, it is dropped after a while. See below:

MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium, :small, :small], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"]; Explorer.Series.from_list([1, 2]) |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}"); Process.sleep(1_000) end)'
Begin: 39.609077 MB of RAM.
End: 39.610054 MB of RAM.
done in: 0.026808

Begin: 78.96885 MB of RAM.
End: 78.96885 MB of RAM.
done in: 0.02043

Begin: 4662.2607 MB of RAM.
End: 4662.2607 MB of RAM.
done in: 0.453696

Begin: 4701.6196 MB of RAM.
End: 4701.6196 MB of RAM.
done in: 0.012698

Begin: 4740.9785 MB of RAM.
End: 4740.9785 MB of RAM.
done in: 0.015301

Begin: 4741.0186 MB of RAM.
End: 4741.0186 MB of RAM.
done in: 3.74e-4

Begin: 0.2911501 MB of RAM.
End: 0.2911501 MB of RAM.
done in: 0.27713

So I suspect the new to_list/1 implementation for binary series is leaking something.
My idea is to investigate more the new APIs and see if the problem is in the Polars side or ours.

PS: The parquet files used in this experiment were made by creating a DF with a column called bins that contains random bytes (using :crypto.strong_rand_bytes(24)). The size of the files is the following:

-rw-r--r--. 1 philip philip 2.7G Feb 22 16:07 tmp/big.parquet (100_000_000 rows)
-rw-r--r--. 1 philip philip  27M Feb 22 15:49 tmp/medium.parquet (100_000 rows)
-rw-r--r--. 1 philip philip  28K Feb 22 15:48 tmp/small.parquet (1_000 rows)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm pretty sure that this is the expected behavior, since Polars changed its internal string/binary series representation. According to the blog post "Why we have rewritten the string data type", they may store binaries in memory for more time (even requiring GC), and they may require more space as well. So what we had previously is not possible to achieve in this new model, for what I could understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that resolves last pending item in the review ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would say it does. We are just deciding if we are going to release a version before merging this one.

@@ -602,7 +602,7 @@ pub fn expr_last(expr: ExExpr) -> ExExpr {

#[rustler::nif]
pub fn expr_format(exprs: Vec<ExExpr>) -> ExExpr {
ExExpr::new(concat_str(ex_expr_to_exprs(exprs), ""))
ExExpr::new(concat_str(ex_expr_to_exprs(exprs), "", true)) //TODO: ignore_nulls
Copy link
Member Author

@lkarthee lkarthee Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore_nulls need to be true is what I felt reading docs. let me know if it needs to be false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like @cigrainger and @billylanchantin opinion on this one (and we should write tests).

Copy link
Member Author

@lkarthee lkarthee Feb 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should we add new param ignore_nulls just like polars ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should we add new param ignore_nulls just like polars ?

I would be fine leaving this as a TODO for this PR. After reading:

I can see the value in having it. But the additional testing/documentation would necessitate a lot more work which we could instead do in a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @billylanchantin. We can do it in a follow-up.

@@ -149,4 +149,3 @@ sepal_length,sepal_width,petal_length,petal_width,species
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty line is causing an extra row to be added to data frame and its related tests are failing..

// }
// DataType::Binary => {
// generic_binary_series_to_list(&s.resource, s.binary()?.downcast_iter(), env)
// }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will delete comments after review..

@@ -1051,7 +1055,7 @@ pub fn expr_second(expr: ExExpr) -> ExExpr {
pub fn expr_join(expr: ExExpr, sep: String) -> ExExpr {
let expr = expr.clone_inner();

ExExpr::new(expr.list().join(sep.lit()))
ExExpr::new(expr.list().join(sep.lit(), true)) //TODO: ignore_nulls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like @cigrainger and @billylanchantin opinion on this one (and we should write tests). Should join discard nulls? And what does it mean to ignore? does it mean they are removed or that they are considered an empty string?

In Elixir:

iex(2)> Enum.join [1, nil, 3], ","
"1,,3"

Copy link
Member Author

@lkarthee lkarthee Feb 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.list.join.html

If there is a null value, the whole output becomes null.

Ignore null values (default).
If set to False, null values will be propagated. If the sub-list contains any null values, the output is None.

context:
pola-rs/polars#13701
pola-rs/polars#13877

pola-rs/polars#13877 (comment)
pola-rs/polars#13701 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I definitely prefer to ignore nulls then. Having the whole thing return null is surprising and hard to debug. So let's add tests and remove the comment and we are good to go on this one. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree 100%

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep 100%

@@ -259,18 +259,6 @@ pub fn s_slice(series: ExSeries, offset: i64, length: usize) -> Result<ExSeries,
Ok(ExSeries::new(series.slice(offset, length)))
}

#[rustler::nif(schedule = "DirtyCpu")]
Copy link
Member Author

@lkarthee lkarthee Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shifted to elixir side for eager series. To align with ignore_nulls.

@cigrainger cigrainger mentioned this pull request Feb 19, 2024
5 tasks
lib/explorer/series.ex Outdated Show resolved Hide resolved
Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After apply José's suggestions, 🚢

@josevalim josevalim merged commit 7f168f5 into elixir-explorer:main Feb 25, 2024
4 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@lkarthee lkarthee deleted the bump_polars_v0_37 branch February 25, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants