Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Series.corr/3 and Series.cov/2 #630

Merged
merged 7 commits into from
Jun 29, 2023

Conversation

anthony-khong
Copy link
Contributor

This is based on corr and cov. For corr, I did not implement the spearman_rank_corr function, because it needs the propagate_nans Polars feature. I wonder what's the appetite for adding new Polars features? Should we do it in this case?

Copy link
Member

@philss philss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! 🚢

I think we could discuss the naming, but the features looks good to me!

Comment on lines 108 to 109
corr: 3,
cov: 2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it correlation and covariance, since they are short names and we usually prefer to avoid abbreviations. But let's hear what José thinks :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

@anthony-khong
Copy link
Contributor Author

Thank you for the feedback! I've changed the names to correlation and covariance 👍

@anthony-khong
Copy link
Contributor Author

Oh, there was one issue I ran into when developing this. I don't think it's related to the features here, but I may as well report it before this gets merged.

This works:

DF.new(a: [1.0, 8.0, 3.0, nil], b: [4.0, 5.0, 2.0, nil])
|> DF.mutate(c: covariance(a, b), d: correlation(a, b))

# #Explorer.DataFrame<
#  Polars[4 x 4]
#  a float [1.0, 8.0, 3.0, nil]
#  b float [4.0, 5.0, 2.0, nil]
#  c float [3.0, 3.0, 3.0, 3.0]
#  d float [0.5447047794019223, 0.5447047794019223, 0.5447047794019223,
#   0.5447047794019223]
# >

but this panics:

DF.new(a: [1.0, 8.0, 3.0], b: [4.0, 5.0, 2.0])
|> DF.concat_rows(DF.new(a: [nil], b: [nil]))
|> DF.mutate(c: covariance(a, b), d: correlation(a, b))

# thread '<unnamed>' panicked at 'validity must be equal to the array's length', [...]/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.2/src/array/primitive/mod.rs:241:5
# ** (ErlangError) Erlang error: :nif_panicked
# ...

I would've thought that DF.new(a: [1.0, 8.0, 3.0, nil], b: [4.0, 5.0, 2.0, nil]) is equivalent to DF.new(a: [1.0, 8.0, 3.0], b: [4.0, 5.0, 2.0]) |> DF.concat_rows(DF.new(a: [nil], b: [nil]))? Is this a bug, or perhaps I'm understanding concat_rows wrong?

@josevalim
Copy link
Member

You are using it correctly. This looks like a polars bug. It is failing to merge two dataframes when one has nils and the other does not. Can you isolate it on the Rust side?

@josevalim josevalim merged commit 78ceb27 into elixir-explorer:main Jun 29, 2023
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@anthony-khong anthony-khong deleted the corr-and-cov branch June 29, 2023 10:59
@anthony-khong
Copy link
Contributor Author

You are using it correctly. This looks like a polars bug. It is failing to merge two dataframes when one has nils and the other does not. Can you isolate it on the Rust side?

Honestly, I don't know enough about Polars to debug this, but changing the following stops the panic:

diff --git a/native/explorer/src/dataframe.rs b/native/explorer/src/dataframe.rs
index fff57c5..0fe9389 100644
--- a/native/explorer/src/dataframe.rs
+++ b/native/explorer/src/dataframe.rs
@@ -107,7 +107,7 @@ pub fn df_concat_rows(
         out_df.vstack_mut(&df)?;
     }
     // Follows recommendation from docs and rechunk after many vstacks.
-    out_df.rechunk();
+    out_df.as_single_chunk_par();
     Ok(ExDataFrame::new(out_df))
 }

I think it's something to do with the should_rechunk function, seeing that the rechunk function is defined as:

    pub fn rechunk(&mut self) -> &mut Self {
        if self.should_rechunk() {
            self.as_single_chunk_par()
        } else {
            self
        }
    }

What do you think? Would it be worth fixing the bug by skipping their should_rechunk function and rechunk every time concat_rows is called?

@josevalim
Copy link
Member

I think so. By definition there is more than one chunk, so we can always force it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants