Proper usage of 2nd data input, y in correlate() #115

thisisdaryn · 2020-08-17T20:30:10Z

correlate() allows for a 2nd data input, y. This raises the question of whether correlate is ever intended to be used with y differing from x.

It would be good to get some clarity/confirmation on how y was intended to be used.

I do not believe y was intended to supply data different from x, for the two reasons that follow:

If y has the same number of columns as x but different names, correlate() assigns the wrong row names to the output correlation data frame. Specifically, it sets the row names to be the same as the column names (which come from the names of y). The column names should come from y and the row names should come from x.

> correlate(mtcars[,1:2], mtcars[,3:4])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

# A tibble: 2 x 3
  rowname   disp     hp
  <chr>    <dbl>  <dbl>
1 disp    NA     -0.776
2 hp       0.902 NA

We can verify that the correlations, as labeled, are not correct for mtcars (in addition to the fact that the matrix is not symmetric):

> cor(mtcars$hp, mtcars$disp)
[1] 0.7909486

We can see how the rows should be labeled by calling stats::cor(), which correlate() uses internally:

> cor(mtcars[,1:2], mtcars[,3:4])
          disp         hp
mpg -0.8475514 -0.7761684
cyl  0.9020329  0.8324475
>
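For illustration, here is a minimal base-R sketch of the labeling one would expect, taking row names from x and column names from y. It is not corrr's implementation; the variable names `m` and `out` are only placeholders.

```r
# Sketch only: build a data frame from cor(x, y) with rows labeled by x
# and columns labeled by y. This is not how corrr does it internally.
x <- mtcars[, 1:2]  # mpg, cyl
y <- mtcars[, 3:4]  # disp, hp

m <- cor(x, y, use = "pairwise.complete.obs")

out <- data.frame(
  rowname = rownames(m),      # row labels come from the columns of x
  as.data.frame(m),           # column labels come from the columns of y
  stringsAsFactors = FALSE,
  check.names = FALSE
)
rownames(out) <- NULL         # keep the labels only in the rowname column

print(out)
```

With this labeling, the -0.776 value sits in the mpg row and hp column, matching the cor() output above.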

If x and y have different numbers of columns, correlate() fails because it expects a square matrix of correlations to cast to cor_df:

> correlate(mtcars[1:3], mtcars[4:5])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

 Error in as_cordf(x, diagonal = diagonal) : 
  Input object x is not a square. The number of columns must be equal to the number of rows.

Incidentally, stats::cor does allow x and y to have different numbers of variables:

> cor(mtcars[1:3], mtcars[4:5])
             hp       drat
mpg  -0.7761684  0.6811719
cyl   0.8324475 -0.6999381
disp  0.7909486 -0.7102139
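As a possible workaround for the non-square case, the rectangular matrix from cor() can be reshaped into a long data frame in base R, which does not need the result to be square. This is only a sketch; the column names `x`, `y`, and `r` are my own choice and not anything corrr defines.

```r
# Sketch only: reshape a rectangular cor(x, y) matrix into long format,
# one row per (x variable, y variable) pair.
m <- cor(mtcars[1:3], mtcars[4:5], use = "pairwise.complete.obs")

long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("x", "y", "r")  # hypothetical column names

print(long)
```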


drsimonj · 2020-08-26T06:17:14Z

This outcome was not anticipated. The original intention for correlate() was to wrap cor(), providing the same correlation functionality and then converting the results to a data frame. I had not fully explored the outcomes of using y in various ways. I suppose, given that it's taken until now for someone to report the problem, using y is rare in general. Regardless, it's definitely a bug!

Options I can think of:

Adjust correlate() to handle y properly. However, this would mean having to return an object that is NOT a cor_df.
Remove support for y. It feels like the cost of keeping and fixing it outweighs the benefits. However, this would be a breaking change.
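To make the trade-off in the first option concrete, here is a rough sketch of a wrapper whose return value is a plain data frame whenever y is supplied. `correlate_sketch()` is a hypothetical name, not a corrr function, and the sketch leaves out the cor_df class and diagonal handling entirely.

```r
# Hypothetical helper, not part of corrr: wraps stats::cor() and labels
# rows from x and columns from y (or from x when y is NULL).
correlate_sketch <- function(x, y = NULL, method = "pearson",
                             use = "pairwise.complete.obs") {
  if (!is.null(y)) {
    # The result may be non-square, so it cannot be treated as a cor_df.
    message("`y` supplied: returning a plain data frame, not a cor_df")
  }
  m <- cor(x, y, method = method, use = use)
  out <- data.frame(
    rowname = rownames(m),
    as.data.frame(m),
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
  rownames(out) <- NULL
  out
}

res <- correlate_sketch(mtcars[1:3], mtcars[4:5])
print(res)
```

The second option would instead amount to dropping the `y` argument (or raising an error when it is supplied), which keeps the return type consistent at the cost of breaking existing calls.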

juliasilge mentioned this issue Oct 30, 2020

Improve network_plot and correlate for 1- and 2-column data.frames #122

Merged

thisisdaryn mentioned this issue Oct 31, 2020

Correlate is incorrect with numeric vectors as x,y #120

Closed

juliasilge added the "bug" label (an unexpected problem or unintended behavior) Nov 2, 2020
