What's the best method of getting correlation matrix? #3070

vopani · 2021-07-15T10:50:00Z

I'm looking for an equivalent of, or a close alternative to, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html, which returns the correlation matrix of a frame.

I see that the current corr implementation handles only two columns, so I can build the matrix manually in a loop, something like:

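import datatable as dt

DT = dt.Frame(a=[1, 2, 3, 4], b=[3.1, 5.2, 7, 2.3], c=['x', 'y', 'x', 'z'], d=[2, None, 3, 1])

# Keep only the numeric columns; int columns are selected first, then float (a, d, b)
numeric_DT = DT[:, [int, float]]
numeric_ncols = numeric_DT.ncols
numeric_names = list(numeric_DT.names)

# Empty result frame: one label column followed by one column per numeric column
corr_matrix = dt.Frame([[None] * numeric_ncols] * (numeric_ncols + 1), names=['Columns'] + numeric_names)
corr_matrix[:, 0] = dt.Frame(numeric_names)

# Fill every cell with the pairwise correlation of columns i and j
for i in range(numeric_ncols):
    for j in range(numeric_ncols):
        corr_matrix[i, j+1] = numeric_DT[:, dt.corr(dt.f[i], dt.f[j])]

corr_matrix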

Columns  a           d          b
a        1           -0.327327  -0.0365148
d        -0.327327   1          0.934533
b        -0.0365148  0.934533   1

but I was looking for a native or better implementation of the same.

Answered by oleksiyskononenko

oleksiyskononenko · 2021-07-15T17:43:16Z

We could implement a correlation matrix capability. However, I'd guess that for big data the performance of a native implementation would not be very different from that of your code, since the correlation reducer is already parallelized internally. By the way, you could improve performance by a factor of two by calculating only one half of the symmetric matrix.
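
The half-matrix suggestion above can be sketched roughly as follows. This is only an illustration in plain Python, not datatable's API: `pearson` and `corr_matrix_half` are hypothetical helper names, and `pearson` stands in for the pairwise `dt.corr(dt.f[i], dt.f[j])` call from the question. It assumes that rows with a missing value in either column are skipped, which is consistent with the output shown in the question.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over the rows where both values are present.

    Hypothetical stand-in for the pairwise dt.corr(...) call in the question.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    return sxy / sqrt(sxx * syy)


def corr_matrix_half(columns, corr=pearson):
    """Build a symmetric correlation matrix from the upper triangle only.

    `columns` maps a column name to a list of values. Each pair (i, j)
    with j >= i is computed once and mirrored into (j, i).
    """
    names = list(columns)
    matrix = {name: {} for name in names}
    for i, row in enumerate(names):
        for col in names[i:]:  # diagonal and upper triangle only
            value = corr(columns[row], columns[col])
            matrix[row][col] = value
            matrix[col][row] = value  # mirror into the lower triangle
    return matrix


# Same numeric data as in the question, in the same column order (a, d, b)
data = {"a": [1, 2, 3, 4], "d": [2, None, 3, 1], "b": [3.1, 5.2, 7, 2.3]}

calls = []


def counting_corr(x, y):
    # Record each pairwise computation so the saving is visible
    calls.append(1)
    return pearson(x, y)


result = corr_matrix_half(data, corr=counting_corr)
```

For the three numeric columns this performs 6 pairwise computations instead of 9; in general it is n(n+1)/2 instead of n², which approaches the factor of two mentioned above as the number of columns grows.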

vopani Jul 15, 2021
Author

It would be useful to have this capability built in; it's a fairly common use case. Even if the implementation is similar, it would be simpler to use and showcase.

You're right on calculating only half.
Thanks!
