correlate() allows for a 2nd data input, y. This raises the question of whether correlate is ever intended to be used with y differing from x.
It would be good to get some clarity/confirmation on how y was intended to be used.
I do not believe that y was intended to be used to supply data different to x for the 2 reasons that follow:
If y has the same number of columns as x but different names, correlate() assigns the wrong row names to the correlation data frame it returns. Specifically, it sets the row names to be the same as the column names (which come from the names of y). The column names should come from y and the row names should come from x.
> correlate(mtcars[,1:2], mtcars[,3:4])
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 x 3
rowname disp hp
<chr> <dbl> <dbl>
1 disp NA -0.776
2 hp 0.902 NA
We can verify that the correlations shown are not correct for the labeled variables in mtcars (in addition to the fact that the matrix is not symmetric):
> cor(mtcars$hp, mtcars$disp)
[1] 0.7909486
We can see how the rows should be labeled using stats::cor(), which is what correlate() is using internally
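A minimal sketch of that check, assuming the same column subsets as the correlate() call above and the defaults printed in its messages (pearson, pairwise.complete.obs):

```r
# Same inputs as the correlate() call above
x <- mtcars[, 1:2]  # mpg, cyl
y <- mtcars[, 3:4]  # disp, hp

# stats::cor() labels rows from x and columns from y
m <- cor(x, y, use = "pairwise.complete.obs", method = "pearson")

rownames(m)  # names of x: "mpg" "cyl"
colnames(m)  # names of y: "disp" "hp"
```

In this matrix, the 0.902 and -0.776 shown in the tibble sit at m["cyl", "disp"] and m["mpg", "hp"], so the rows that correlate() labels disp and hp are really mpg and cyl.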
If x and y have different numbers of columns, correlate() fails because it expects a square matrix of correlations that can be cast to cor_df:
> correlate(mtcars[1:3], mtcars[4:5])
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
Error in as_cordf(x, diagonal = diagonal) :
Input object x is not a square. The number of columns must be equal to the number of rows.
Incidentally, stats::cor does allow x and y to have different numbers of variables:
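A minimal sketch of such a call, assuming the same column subsets as the failing correlate() example above:

```r
x <- mtcars[1:3]  # mpg, cyl, disp
y <- mtcars[4:5]  # hp, drat

# cor() returns a rectangular matrix: one row per column of x,
# one column per column of y
m <- cor(x, y, use = "pairwise.complete.obs")

dim(m)       # 3 rows (from x) by 2 columns (from y)
dimnames(m)  # row names from x, column names from y
```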
This outcome was not anticipated. The original intention for correlate() was to wrap cor(), providing the same correlation functionality but then converting the results to a data frame. I had not fully explored the outcomes of using y in various ways. I suppose, given that it's taken until now for someone to report the problem, using y is a rare thing in general. Regardless, it's definitely a bug!
Options I can think of:
Adjust correlate() to handle y properly. However, this would mean being able to return an object that is NOT a cor_df.
Remove support for y. It feels like the cost of keeping and fixing it outweighs the benefits. However, this will be a breaking change.
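To make the first option concrete, here is a rough sketch of what a non-square return value could look like. correlate_xy is a hypothetical name, not an existing function in the package. It returns a plain data frame rather than a cor_df and leaves every cell intact, since a rectangular result has no diagonal to set to NA.

```r
# Hypothetical helper (not part of the package): wraps stats::cor() for
# x and y and keeps row labels from x and column labels from y.
correlate_xy <- function(x, y,
                         use = "pairwise.complete.obs",
                         method = "pearson") {
  m <- stats::cor(x, y, use = use, method = method)

  # Move the matrix row names into a leading "rowname" column
  out <- data.frame(rowname = rownames(m), m,
                    check.names = FALSE, stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

res <- correlate_xy(mtcars[1:3], mtcars[4:5])
names(res)   # "rowname" "hp" "drat"
res$rowname  # "mpg" "cyl" "disp"
```

Whether downstream functions that expect a cor_df could accept such an object is a separate question this sketch does not address.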