add Series.corr/3 and Series.cov/2 #630

anthony-khong · 2023-06-28T15:24:22Z

This is based on corr and cov. For corr, I did not implement the spearman_rank_corr function, because it needs the propagate_nans Polars feature. I wonder what's the appetite for adding new Polars features? Should we do it in this case?

philss

Awesome! 🚢

I think we could discuss the naming, but the features looks good to me!

philss · 2023-06-28T19:31:30Z

lib/explorer/backend/lazy_series.ex

+    corr: 3,
+    cov: 2,


I would call it correlation and covariance, since they are short names and we usually prefer to avoid abbreviations. But let's hear what José thinks :)

anthony-khong · 2023-06-29T05:38:33Z

Thank you for the feedback! I've changed the names to correlation and covariance 👍

anthony-khong · 2023-06-29T05:52:03Z

Oh, there was one issue I ran into when developing this. I don't think it's related to the features here, but I may as well report it before this gets merged.

This works:

DF.new(a: [1.0, 8.0, 3.0, nil], b: [4.0, 5.0, 2.0, nil])
|> DF.mutate(c: covariance(a, b), d: correlation(a, b))

# #Explorer.DataFrame<
#  Polars[4 x 4]
#  a float [1.0, 8.0, 3.0, nil]
#  b float [4.0, 5.0, 2.0, nil]
#  c float [3.0, 3.0, 3.0, 3.0]
#  d float [0.5447047794019223, 0.5447047794019223, 0.5447047794019223,
#   0.5447047794019223]
# >

but this panics:

DF.new(a: [1.0, 8.0, 3.0], b: [4.0, 5.0, 2.0])
|> DF.concat_rows(DF.new(a: [nil], b: [nil]))
|> DF.mutate(c: covariance(a, b), d: correlation(a, b))

# thread '<unnamed>' panicked at 'validity must be equal to the array's length', [...]/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.2/src/array/primitive/mod.rs:241:5
# ** (ErlangError) Erlang error: :nif_panicked
# ...

I would've thought that DF.new(a: [1.0, 8.0, 3.0, nil], b: [4.0, 5.0, 2.0, nil]) is equivalent to DF.new(a: [1.0, 8.0, 3.0], b: [4.0, 5.0, 2.0]) |> DF.concat_rows(DF.new(a: [nil], b: [nil]))? Is this a bug, or perhaps I'm understanding concat_rows wrong?

josevalim · 2023-06-29T09:13:08Z

You are using it correctly. This looks like a polars bug. It is failing to merge two dataframes when one has nils and the other does not. Can you isolate it on the Rust side?

josevalim · 2023-06-29T09:14:24Z

💚 💙 💜 💛 ❤️

anthony-khong · 2023-06-29T13:31:23Z

You are using it correctly. This looks like a polars bug. It is failing to merge two dataframes when one has nils and the other does not. Can you isolate it on the Rust side?

Honestly, I don't know enough about Polars to debug this, but changing the following stops the panic:

diff --git a/native/explorer/src/dataframe.rs b/native/explorer/src/dataframe.rs
index fff57c5..0fe9389 100644
--- a/native/explorer/src/dataframe.rs
+++ b/native/explorer/src/dataframe.rs
@@ -107,7 +107,7 @@ pub fn df_concat_rows(
         out_df.vstack_mut(&df)?;
     }
     // Follows recommendation from docs and rechunk after many vstacks.
-    out_df.rechunk();
+    out_df.as_single_chunk_par();
     Ok(ExDataFrame::new(out_df))
 }

I think it's something to do with the should_rechunk function, seeing that the rechunk function is defined as:

    pub fn rechunk(&mut self) -> &mut Self {
        if self.should_rechunk() {
            self.as_single_chunk_par()
        } else {
            self
        }
    }

What do you think? Would it be worth fixing the bug by skipping their should_rechunk function and rechunk every time concat_rows is called?

josevalim · 2023-06-29T13:35:26Z

I think so. By definition there is more than one chunk, so we can always force it.

anthony-khong added 5 commits June 28, 2023 16:53

corr and cov that pass Rust compilation

3cd4286

unhandled nils

60a14cc

resolved dataframe test

4e49bed

correctly encoded nans

88becf7

add documentation, type guards and more tests

900da5e

philss approved these changes Jun 28, 2023

View reviewed changes

anthony-khong added 2 commits June 29, 2023 12:35

renamed corr/cov to correlation/covariance

70bc724

fix typo for covariance/2 error message

3ebf99e

josevalim merged commit 78ceb27 into elixir-explorer:main Jun 29, 2023

anthony-khong deleted the corr-and-cov branch June 29, 2023 10:59

anthony-khong mentioned this pull request Jun 29, 2023

force chunk in concat_rows #632

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add Series.corr/3 and Series.cov/2 #630

add Series.corr/3 and Series.cov/2 #630

anthony-khong commented Jun 28, 2023

philss left a comment

philss Jun 28, 2023

josevalim Jun 28, 2023

anthony-khong commented Jun 29, 2023

anthony-khong commented Jun 29, 2023

josevalim commented Jun 29, 2023

josevalim commented Jun 29, 2023

anthony-khong commented Jun 29, 2023

josevalim commented Jun 29, 2023

add Series.corr/3 and Series.cov/2 #630

add Series.corr/3 and Series.cov/2 #630

Conversation

anthony-khong commented Jun 28, 2023

philss left a comment

Choose a reason for hiding this comment

philss Jun 28, 2023

Choose a reason for hiding this comment

josevalim Jun 28, 2023

Choose a reason for hiding this comment

anthony-khong commented Jun 29, 2023

anthony-khong commented Jun 29, 2023

josevalim commented Jun 29, 2023

josevalim commented Jun 29, 2023

anthony-khong commented Jun 29, 2023

josevalim commented Jun 29, 2023