Bump Polars 0.37 #861
Conversation
Force-pushed from `64233ca` to `4023c2e`.
```elixir
    categories = categories |> distinct() |> cast(:category)
    apply_series(series, :categorise, [categories])
  end
```
Moved the string-series and list-of-strings `categorise` logic here.

- Instead of throwing errors, it applies `distinct` to the categories series.
- The nil check is missing. Should we add a `nil_count` check?
> Instead of throwing errors, it applies `distinct` to the categories series.

I don't think we should do this, because we are mapping indexes into the list. If you remove duplicates, the indexes shift and the result changes. Is there a reason we removed the Rust code responsible for this?
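To illustrate the concern with a hypothetical sketch (plain lists, not the actual Explorer internals): if categorical values are stored as indexes into the categories list, deduplicating that list shifts the indexes and changes what previously encoded values point to.

```elixir
categories = ["a", "b", "b", "c"]

# A value encoded as index 3 points at "c"...
Enum.at(categories, 3)
#=> "c"

# ...but after removing duplicates there is no index 3 anymore,
# and index 2 now resolves to "c" instead of "b".
Enum.uniq(categories)
#=> ["a", "b", "c"]
```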
OK.

https://pola.rs/posts/polars-string-type/

This change caused errors in `Series.categorise` and in the encoding of strings/binaries.
I hit a wall getting `RevMapping` to work in `Series.categorise`, so I tried to see if it could be fixed in Elixir.
We can do it in Elixir, but we need to get the unique count and raise if it differs from the size, and get the nil count and raise as well.
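A minimal Elixir sketch of those two checks (the function name and error messages are illustrative, not the actual Explorer code):

```elixir
# Raise on nils or duplicates instead of silently applying distinct.
defp validate_categories!(categories) do
  if Explorer.Series.nil_count(categories) > 0 do
    raise ArgumentError, "categories series must not contain nil values"
  end

  if Explorer.Series.n_distinct(categories) != Explorer.Series.size(categories) do
    raise ArgumentError, "categories series must not contain duplicated values"
  end

  categories
end
```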
```rust
        s.str()?.into_iter().map(|option| option.encode(env))
    ))
}
```
I am not confident about `generic_string_series_to_list` and `generic_binary_series_to_list`. Feel free to improve them; I have a feeling that I messed something up here.
@josevalim is this code OK?
I have no idea, we will need to wait for @philss' review. :)
I am not sure about one thing: performance. Earlier, binaries and strings were created from buffers using binary and sub-binary terms. Now, because of the new Arrow implementation, Polars exposes `BinaryArrayView`, which yields values directly. I am not sure whether there is a way to make the earlier implementation work, or whether the new one is as fast as the old one.
I think code-wise it looks good! But I think we need a benchmark to be sure about performance.
No problem! I will take care of it. Thanks, and safe travels!
Hey! Sorry for the delay. I found something interesting about this.

For the test, I compare how much time it takes to encode a binary series and how many MB of RAM it uses. I'm using a custom allocator called PeakAlloc for this.

The patch with the diffs I made to measure this is at https://gist.github.com/philss/3f519ce2587461aa61d7fd9f77e3ea1f (only the Cargo.lock is left out to avoid conflicts).
Timing and memory usage - main branch
Running on `main`, without calling GC:

```shell
$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"] |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}") end)'
```
Results:

```
Begin: 30.768875 MB of RAM.
End: 30.769852 MB of RAM.
done in: 0.099455
Begin: 61.288383 MB of RAM.
End: 61.288383 MB of RAM.
done in: 0.101318
Begin: 3082.6296 MB of RAM.
End: 3082.6296 MB of RAM.
done in: 7.900625
Begin: 3082.6296 MB of RAM.
End: 3082.6296 MB of RAM.
done in: 2.738504
Begin: 61.288383 MB of RAM.
End: 61.288383 MB of RAM.
done in: 0.083659
```
We can see above that the memory grows, but it soon drops again when we try a smaller file. It looks like the GC takes a while to run and release the encoded content.
Timing and memory usage - this PR branch
```shell
$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium, :small, :small], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"] |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}"); :erlang.garbage_collect(); Process.sleep(1_000) end)'
```
Even with GC runs (I tried before without it), it does not drop after a while:
```
Begin: 39.60874 MB of RAM.
End: 39.60972 MB of RAM.
done in: 0.106566
Begin: 78.96812 MB of RAM.
End: 78.96812 MB of RAM.
done in: 0.130126
Begin: 4662.26 MB of RAM.
End: 4662.26 MB of RAM.
done in: 9.069385
Begin: 4701.618 MB of RAM.
End: 4701.618 MB of RAM.
done in: 0.042475
Begin: 4740.9766 MB of RAM.
End: 4740.9766 MB of RAM.
done in: 0.049214
Begin: 4741.016 MB of RAM.
End: 4741.016 MB of RAM.
done in: 0.323967
Begin: 4741.0557 MB of RAM.
End: 4741.0557 MB of RAM.
done in: 0.327083
```
And if we try to load the DF without encoding the series, it is dropped after a while. See below:
```shell
$ MIX_ENV=prod mix run -e 'for i <- [:medium, :medium, :big, :medium, :medium, :small, :small], do: :timer.tc(fn -> Explorer.DataFrame.from_parquet!("./tmp/#{i}.parquet")["bins"]; Explorer.Series.from_list([1, 2]) |> Explorer.Series.to_list() end) |> tap(fn {time, _val} -> IO.puts("done in: #{time / 1_000_000}"); Process.sleep(1_000) end)'
```

```
Begin: 39.609077 MB of RAM.
End: 39.610054 MB of RAM.
done in: 0.026808
Begin: 78.96885 MB of RAM.
End: 78.96885 MB of RAM.
done in: 0.02043
Begin: 4662.2607 MB of RAM.
End: 4662.2607 MB of RAM.
done in: 0.453696
Begin: 4701.6196 MB of RAM.
End: 4701.6196 MB of RAM.
done in: 0.012698
Begin: 4740.9785 MB of RAM.
End: 4740.9785 MB of RAM.
done in: 0.015301
Begin: 4741.0186 MB of RAM.
End: 4741.0186 MB of RAM.
done in: 3.74e-4
Begin: 0.2911501 MB of RAM.
End: 0.2911501 MB of RAM.
done in: 0.27713
```
So I suspect the new `to_list/1` implementation for binary series is leaking something. My idea is to investigate the new APIs more and see if the problem is on the Polars side or on ours.
PS: The parquet files used in this experiment were made by creating a DF with a column called `bins` that contains random bytes (using `:crypto.strong_rand_bytes(24)`). The sizes of the files are the following:

```
-rw-r--r--. 1 philip philip 2.7G Feb 22 16:07 tmp/big.parquet    (100_000_000 rows)
-rw-r--r--. 1 philip philip  27M Feb 22 15:49 tmp/medium.parquet (100_000 rows)
-rw-r--r--. 1 philip philip  28K Feb 22 15:48 tmp/small.parquet  (1_000 rows)
```
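For reference, a sketch of how files like these could be generated (the row counts come from the listing above; the exact options, such as `dtype: :binary`, are my assumption, not the script actually used):

```elixir
# Build one parquet file per size with a "bins" column of random 24-byte blobs.
for {name, rows} <- [small: 1_000, medium: 100_000, big: 100_000_000] do
  bins =
    1..rows
    |> Enum.map(fn _ -> :crypto.strong_rand_bytes(24) end)
    |> Explorer.Series.from_list(dtype: :binary)

  Explorer.DataFrame.new(bins: bins)
  |> Explorer.DataFrame.to_parquet!("./tmp/#{name}.parquet")
end
```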
OK, I'm pretty sure this is the expected behavior, since Polars changed its internal string/binary series representation. According to the blog post "Why we have rewritten the string data type", binaries may now be kept in memory for longer (even requiring GC), and they may require more space as well. So, as far as I can tell, what we had previously is not achievable in this new model.
So that resolves the last pending item in the review?
Yeah, I would say it does. We are just deciding if we are going to release a version before merging this one.
native/explorer/src/expressions.rs (outdated)

```diff
@@ -602,7 +602,7 @@ pub fn expr_last(expr: ExExpr) -> ExExpr {
 #[rustler::nif]
 pub fn expr_format(exprs: Vec<ExExpr>) -> ExExpr {
-    ExExpr::new(concat_str(ex_expr_to_exprs(exprs), ""))
+    ExExpr::new(concat_str(ex_expr_to_exprs(exprs), "", true)) //TODO: ignore_nulls
```
My reading of the docs is that `ignore_nulls` needs to be `true`. Let me know if it should be `false`.
I would like @cigrainger and @billylanchantin opinion on this one (and we should write tests).
Also, should we add a new `ignore_nulls` parameter, just like Polars?
> Also, should we add a new `ignore_nulls` parameter, just like Polars?

I would be fine leaving this as a TODO for this PR. After reading, I can see the value in having it. But the additional testing/documentation would necessitate a lot more work, which we could instead do in a follow-up.
I agree with @billylanchantin. We can do it in a follow-up.
```diff
@@ -149,4 +149,3 @@ sepal_length,sepal_width,petal_length,petal_width,species
 6.5,3.0,5.2,2.0,Iris-virginica
 6.2,3.4,5.4,2.3,Iris-virginica
 5.9,3.0,5.1,1.8,Iris-virginica
-
```
The empty line was causing an extra row to be added to the data frame, and its related tests were failing.
native/explorer/src/encoding.rs (outdated)

```rust
// }
// DataType::Binary => {
//     generic_binary_series_to_list(&s.resource, s.binary()?.downcast_iter(), env)
// }
```
I will delete the commented-out code after review.
native/explorer/src/expressions.rs (outdated)

```diff
@@ -1051,7 +1055,7 @@ pub fn expr_second(expr: ExExpr) -> ExExpr {
 pub fn expr_join(expr: ExExpr, sep: String) -> ExExpr {
     let expr = expr.clone_inner();

-    ExExpr::new(expr.list().join(sep.lit()))
+    ExExpr::new(expr.list().join(sep.lit(), true)) //TODO: ignore_nulls
```
I would like @cigrainger's and @billylanchantin's opinion on this one (and we should write tests). Should `join` discard nulls? And what does it mean to ignore them: are they removed, or are they treated as an empty string?

In Elixir:

```elixir
iex(2)> Enum.join [1, nil, 3], ","
"1,,3"
```
From https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.list.join.html:

> If there is a `null` value, the whole output becomes `null`.

> Ignore null values (default). If set to `False`, null values will be propagated. If the sub-list contains any null values, the output is `None`.

Context:

- pola-rs/polars#13701
- pola-rs/polars#13877
- pola-rs/polars#13877 (comment)
- pola-rs/polars#13701 (comment)
Thank you. I definitely prefer to ignore nulls, then. Having the whole thing return null is surprising and hard to debug. So let's add tests, remove the comment, and we are good to go on this one. :)
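A sketch of the kind of test that could cover this (the expected values follow the ignore-nulls behavior discussed above, and the exact `Explorer.Series.join/2` signature is my assumption):

```elixir
test "join/2 skips nil values inside sub-lists" do
  # a list series containing a nil inside one of its sub-lists
  series = Explorer.Series.from_list([["a", nil, "c"], ["x", "y"]])
  joined = Explorer.Series.join(series, ",")

  # with ignore_nulls the nil is skipped instead of nulling the row
  assert Explorer.Series.to_list(joined) == ["a,c", "x,y"]
end
```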
Agree 100%
Yep 100%
```diff
@@ -259,18 +259,6 @@ pub fn s_slice(series: ExSeries, offset: i64, length: usize) -> Result<ExSeries,
     Ok(ExSeries::new(series.slice(offset, length)))
 }

-#[rustler::nif(schedule = "DirtyCpu")]
```
Shifted to the Elixir side for eager series, to align with `ignore_nulls`.
After applying José's suggestions, 🚢
Co-authored-by: José Valim <[email protected]>
💚 💙 💜 💛 ❤️
No description provided.