Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper usage of 2nd data input, y in correlate() #115

Open
thisisdaryn opened this issue Aug 17, 2020 · 1 comment
Open

Proper usage of 2nd data input, y in correlate() #115

thisisdaryn opened this issue Aug 17, 2020 · 1 comment
Labels
bug an unexpected problem or unintended behavior

Comments

@thisisdaryn
Copy link
Collaborator

correlate() allows for a 2nd data input, y. This raises the question of whether correlate is ever intended to be used with y differing from x.

It would be good to get some clarity/confirmation on how y was intended to be used.

I do not believe that y was intended to be used to supply data different to x for the 2 reasons that follow:

  1. if y has the same number of columns as x but different names, correlate incorrectly assigns the wrong row names to the correlation data frame output. Specifically, it sets the row names to be the same as the column names (which come from the names of y). The column names should come from y and the rownames should come from x.
> correlate(mtcars[,1:2], mtcars[,3:4])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

# A tibble: 2 x 3
  rowname   disp     hp
  <chr>    <dbl>  <dbl>
1 disp    NA     -0.776
2 hp       0.902 NA    

We can verify that the correlations shown are not correct for mtcars (in addition to the fact that the matrix is not symmetric)

> cor(mtcars$hp, mtcars$disp)
[1] 0.7909486

We can see how the rows should be labeled using stats::cor(), which is what correlate() is using internally

> cor(mtcars[,1:2], mtcars[,3:4])
          disp         hp
mpg -0.8475514 -0.7761684
cyl  0.9020329  0.8324475
> 
  1. if x and y have different numbers of columns, correlate fails as it is expecting a square matrix of correlations to be cast to cor_df
> correlate(mtcars[1:3], mtcars[4:5])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

 Error in as_cordf(x, diagonal = diagonal) : 
  Input object x is not a square. The number of columns must be equal to the number of rows. 

Incidentally, stats::cor does allow x and y to have different numbers of variables:

> cor(mtcars[1:3], mtcars[4:5])
             hp       drat
mpg  -0.7761684  0.6811719
cyl   0.8324475 -0.6999381
disp  0.7909486 -0.7102139
@drsimonj
Copy link
Collaborator

This outcome was not anticipated. The original intention for correlate was to wrap cor(), providing the same correlation functionality but then convert the results to a data frame. I had not fully explored the outcomes of using y in various ways. I suppose, given that it's taken until now for someone to report the problem, using y is a rare thing in general. Regarldess, it's definitely a bug!

Options I can think of:

  • Adjust correlate() to handle y properly. Though, will result in having to be able to return an object that is NOT cor_df.
  • Remove support for y. It feels like the cost of keeping/fixing for it outweighs the benefits. However, this will be a breaking change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants