What's the best method of getting a correlation matrix? #3070
I'm looking for an equivalent of, or a close alternative to, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html, which returns the correlation matrix of a frame. My current approach is:

    import datatable as dt

    DT = dt.Frame(a=[1, 2, 3, 4], b=[3.1, 5.2, 7, 2.3], c=['x', 'y', 'x', 'z'], d=[2, None, 3, 1])
    # Keep only the numeric (int and float) columns
    numeric_DT = DT[:, [int, float]]
    numeric_ncols = numeric_DT.ncols
    numeric_names = list(numeric_DT.names)
    # Empty result frame: a label column followed by one column per numeric column
    corr_matrix = dt.Frame([[None] * numeric_ncols] * (numeric_ncols + 1), names=['Columns'] + numeric_names)
    corr_matrix[:, 0] = dt.Frame(numeric_names)
    # Fill every cell with the pairwise correlation of columns i and j
    for i in range(numeric_DT.ncols):
        for j in range(numeric_DT.ncols):
            corr_matrix[i, j+1] = numeric_DT[:, dt.corr(dt.f[i], dt.f[j])]
    corr_matrix

but I was looking for a native or better implementation of the same.
Replies: 1 comment 1 reply
We could implement a correlation matrix capability. However, I would guess that for large data the performance of a native implementation would not differ much from your code, since the correlation reducer is already parallel internally. By the way, you could roughly halve the running time by computing only one half of the symmetric matrix.
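The half-matrix suggestion can be sketched as follows. This is a minimal illustration in plain Python so that it runs without datatable; `pearson` and `symmetric_corr_matrix` are hypothetical helper names, not part of any library, and the pairwise computation is passed in as a callable so that a different backend can be swapped in. It assumes rows with a missing value should be skipped for that pair.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain-Python Pearson correlation; rows with a missing value are skipped."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    return sxy / sqrt(sxx * syy)

def symmetric_corr_matrix(names, pair_corr):
    """Call pair_corr only for j >= i and mirror each result across the diagonal."""
    n = len(names)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = pair_corr(names[i], names[j])
            matrix[i][j] = value
            matrix[j][i] = value  # corr(x, y) == corr(y, x)
    return matrix

# The numeric columns from the question, as plain lists
columns = {"a": [1, 2, 3, 4], "b": [3.1, 5.2, 7, 2.3], "d": [2, None, 3, 1]}
names = list(columns)

calls = []  # records which pairs were actually computed

def counted_corr(a, b):
    calls.append((a, b))
    return pearson(columns[a], columns[b])

matrix = symmetric_corr_matrix(names, counted_corr)
```

For n columns this makes n(n+1)/2 pairwise calls instead of n², so the saving approaches a factor of two as n grows. With datatable, the callable could presumably be something like `lambda a, b: numeric_DT[:, dt.corr(dt.f[a], dt.f[b])][0, 0]`, assuming `numeric_DT` is the numeric frame from the question.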