diff --git a/docs/reference/dcov.html b/docs/reference/dcov.html
index 1d1aef3..7adeb93 100644
--- a/docs/reference/dcov.html
+++ b/docs/reference/dcov.html
@@ -1,215 +1,214 @@
dcov.Rd
Computes distance covariance and distance correlation statistics, - which are multivariate measures of dependence.
-dcov(x, y, index = 1.0)
-dcor(x, y, index = 1.0)
data or distances of first sample
data or distances of second sample
exponent on Euclidean distance, in (0,2]
dcov
and dcor
compute distance
- covariance and distance correlation statistics.
The sample sizes (number of rows) of the two samples must - agree, and samples must not contain missing values.
-The index
is an optional exponent on Euclidean distance.
-Valid exponents for energy statistics are in the open interval (0, 2); the exponent 2 is excluded because it does not characterize independence.
Argument types supported are -numeric data matrix, data.frame, or tibble, with observations in rows; -numeric vector; ordered or unordered factors. In the case of unordered factors, -a 0-1 distance matrix is computed.
-Optionally, pre-computed distances can be input as class "dist" objects or as distance matrices. - For data arguments, distance matrices are computed internally.
-Distance correlation is a new measure of dependence between random -vectors introduced by Szekely, Rizzo, and Bakirov (2007). -For all distributions with finite first moments, distance -correlation \(\mathcal R\) generalizes the idea of correlation in two -fundamental ways: - (1) \(\mathcal R(X,Y)\) is defined for \(X\) and \(Y\) in arbitrary dimension. - (2) \(\mathcal R(X,Y)=0\) characterizes independence of \(X\) and - \(Y\).
-Distance correlation satisfies \(0 \le \mathcal R \le 1\), and -\(\mathcal R = 0\) only if \(X\) and \(Y\) are independent. Distance -covariance \(\mathcal V\) provides a new approach to the problem of -testing the joint independence of random vectors. The formal -definitions of the population coefficients \(\mathcal V\) and -\(\mathcal R\) are given in (SRB 2007). The definitions of the -empirical coefficients are as follows.
-The empirical distance covariance \(\mathcal{V}_n(\mathbf{X,Y})\)
-with index 1 is
-the nonnegative number defined by
-$$
- \mathcal{V}^2_n (\mathbf{X,Y}) = \frac{1}{n^2} \sum_{k,\,l=1}^n
- A_{kl}B_{kl}
- $$
- where \(A_{kl}\) and \(B_{kl}\) are
- $$
-A_{kl} = a_{kl}-\bar a_{k.}- \bar a_{.l} + \bar a_{..}
-$$
-$$
- B_{kl} = b_{kl}-\bar b_{k.}- \bar b_{.l} + \bar b_{..}.
- $$
-Here
-$$
-a_{kl} = \|X_k - X_l\|_p, \quad b_{kl} = \|Y_k - Y_l\|_q, \quad
-k,l=1,\dots,n,
-$$
-and the subscript .
denotes that the mean is computed for the
-index that it replaces. Similarly,
-\(\mathcal{V}_n(\mathbf{X})\) is the nonnegative number defined by
-$$
- \mathcal{V}^2_n (\mathbf{X}) = \mathcal{V}^2_n (\mathbf{X,X}) =
- \frac{1}{n^2} \sum_{k,\,l=1}^n
- A_{kl}^2.
- $$
The empirical distance correlation \(\mathcal{R}_n(\mathbf{X,Y})\) is
-the square root of
-$$
- \mathcal{R}^2_n(\mathbf{X,Y})=
- \frac {\mathcal{V}^2_n(\mathbf{X,Y})}
- {\sqrt{ \mathcal{V}^2_n (\mathbf{X}) \mathcal{V}^2_n(\mathbf{Y})}}.
-$$
-See dcov.test
for a test of multivariate independence
-based on the distance covariance statistic.
dcov
returns the sample distance covariance and
-dcor
returns the sample distance correlation.
Note that it is inefficient to compute dCor as
-dcov(x,y)/sqrt(dcov(x,x)*dcov(y,y))
because the individual
-calls to dcov
involve unnecessary repetition of calculations.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
- Measuring and Testing Dependence by Correlation of Distances,
- Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
-
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
- Brownian Distance Covariance,
- Annals of Applied Statistics,
- Vol. 3, No. 4, 1236-1265.
-
doi:10.1214/09-AOAS312
Szekely, G.J. and Rizzo, M.L. (2009), - Rejoinder: Brownian Distance Covariance, - Annals of Applied Statistics, Vol. 3, No. 4, 1303-1308.
-dcov.Rd
Computes distance covariance and distance correlation statistics, + which are multivariate measures of dependence.
+dcov(x, y, index = 1.0)
+dcor(x, y, index = 1.0)
dcov
and dcor
compute distance
+ covariance and distance correlation statistics.
The sample sizes (number of rows) of the two samples must + agree, and samples must not contain missing values.
+The index
is an optional exponent on Euclidean distance.
+Valid exponents for energy statistics are in the open interval (0, 2); the exponent 2 is excluded because it does not characterize independence.
Argument types supported are +numeric data matrix, data.frame, or tibble, with observations in rows; +numeric vector; ordered or unordered factors. In the case of unordered factors, +a 0-1 distance matrix is computed.
+Optionally, pre-computed distances can be input as class "dist" objects or as distance matrices. + For data arguments, distance matrices are computed internally.
+Distance correlation is a new measure of dependence between random +vectors introduced by Szekely, Rizzo, and Bakirov (2007). +For all distributions with finite first moments, distance +correlation \(\mathcal R\) generalizes the idea of correlation in two +fundamental ways: + (1) \(\mathcal R(X,Y)\) is defined for \(X\) and \(Y\) in arbitrary dimension. + (2) \(\mathcal R(X,Y)=0\) characterizes independence of \(X\) and + \(Y\).
+Distance correlation satisfies \(0 \le \mathcal R \le 1\), and +\(\mathcal R = 0\) only if \(X\) and \(Y\) are independent. Distance +covariance \(\mathcal V\) provides a new approach to the problem of +testing the joint independence of random vectors. The formal +definitions of the population coefficients \(\mathcal V\) and +\(\mathcal R\) are given in (SRB 2007). The definitions of the +empirical coefficients are as follows.
+The empirical distance covariance \(\mathcal{V}_n(\mathbf{X,Y})\)
+with index 1 is
+the nonnegative number defined by
+$$
+ \mathcal{V}^2_n (\mathbf{X,Y}) = \frac{1}{n^2} \sum_{k,\,l=1}^n
+ A_{kl}B_{kl}
+ $$
+ where \(A_{kl}\) and \(B_{kl}\) are
+ $$
+A_{kl} = a_{kl}-\bar a_{k.}- \bar a_{.l} + \bar a_{..}
+$$
+$$
+ B_{kl} = b_{kl}-\bar b_{k.}- \bar b_{.l} + \bar b_{..}.
+ $$
+Here
+$$
+a_{kl} = \|X_k - X_l\|_p, \quad b_{kl} = \|Y_k - Y_l\|_q, \quad
+k,l=1,\dots,n,
+$$
+and the subscript .
denotes that the mean is computed for the
+index that it replaces. Similarly,
+\(\mathcal{V}_n(\mathbf{X})\) is the nonnegative number defined by
+$$
+ \mathcal{V}^2_n (\mathbf{X}) = \mathcal{V}^2_n (\mathbf{X,X}) =
+ \frac{1}{n^2} \sum_{k,\,l=1}^n
+ A_{kl}^2.
+ $$
The empirical distance correlation \(\mathcal{R}_n(\mathbf{X,Y})\) is
+the square root of
+$$
+ \mathcal{R}^2_n(\mathbf{X,Y})=
+ \frac {\mathcal{V}^2_n(\mathbf{X,Y})}
+ {\sqrt{ \mathcal{V}^2_n (\mathbf{X}) \mathcal{V}^2_n(\mathbf{Y})}}.
+$$
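These empirical definitions can be reproduced directly in base R. The following is an illustrative sketch, not part of the package examples; all variable names are arbitrary:
 ## sketch: empirical dCov with index 1 from the definition
 n <- 30
 x <- matrix(rnorm(2 * n), nrow = n)
 y <- matrix(rnorm(2 * n), nrow = n)
 a <- as.matrix(dist(x))   # a_kl
 b <- as.matrix(dist(y))   # b_kl
 A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
 B <- sweep(sweep(b, 1, rowMeans(b)), 2, colMeans(b)) + mean(b)
 sqrt(mean(A * B))         # should agree with dcov(x, y)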
+See dcov.test
for a test of multivariate independence
+based on the distance covariance statistic.
dcov
returns the sample distance covariance and
+dcor
returns the sample distance correlation.
Note that it is inefficient to compute dCor as
+dcov(x,y)/sqrt(dcov(x,x)*dcov(y,y))
because the individual
+calls to dcov
involve unnecessary repetition of calculations.
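For example (an illustrative sketch), the single call on the third line below returns the same value as the three-call version on the last line, without recomputing the centered distance matrices:
 x <- matrix(rnorm(60), nrow = 30)
 y <- matrix(rnorm(60), nrow = 30)
 dcor(x, y)
 dcov(x, y) / sqrt(dcov(x, x) * dcov(y, y))   # same value, repeated work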
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
+ Brownian Distance Covariance,
+ Annals of Applied Statistics,
+ Vol. 3, No. 4, 1236-1265.
+
doi:10.1214/09-AOAS312
Szekely, G.J. and Rizzo, M.L. (2009), + Rejoinder: Brownian Distance Covariance, + Annals of Applied Statistics, Vol. 3, No. 4, 1303-1308.
+dcov.test.Rd
Distance covariance test and distance correlation test of multivariate independence. - Distance covariance and distance correlation are - multivariate measures of dependence.
-dcov.test(x, y, index = 1.0, R = NULL)
-dcor.test(x, y, index = 1.0, R)
data or distances of first sample
data or distances of second sample
number of replicates
exponent on Euclidean distance, in (0,2]
dcov.test
and dcor.test
are nonparametric
- tests of multivariate independence. The test decision is
- obtained via permutation bootstrap, with R
replicates.
The sample sizes (number of rows) of the two samples must - agree, and samples must not contain missing values.
-The index
is an optional exponent on Euclidean distance.
-Valid exponents for energy statistics are in the open interval (0, 2); the exponent 2 is excluded because it does not characterize independence.
Argument types supported are -numeric data matrix, data.frame, or tibble, with observations in rows; -numeric vector; ordered or unordered factors. In the case of unordered factors, -a 0-1 distance matrix is computed.
-Optionally, pre-computed distances can be input as class "dist" objects or as distance matrices. -For data arguments, -distance matrices are computed internally.
-The dcov
test statistic is
- \(n \mathcal V_n^2\) where
- \(\mathcal V_n(x,y)\) = dcov(x,y),
- which is based on interpoint Euclidean distances
- \(\|x_{i}-x_{j}\|\). The index
- is an optional exponent on Euclidean distance.
Similarly, the dcor
test statistic is based on the normalized
-coefficient, the distance correlation. (See the manual page for dcor
.)
Distance correlation is a new measure of dependence between random -vectors introduced by Szekely, Rizzo, and Bakirov (2007). -For all distributions with finite first moments, distance -correlation \(\mathcal R\) generalizes the idea of correlation in two -fundamental ways:
-(1) \(\mathcal R(X,Y)\) is defined for \(X\) and \(Y\) in arbitrary dimension.
-(2) \(\mathcal R(X,Y)=0\) characterizes independence of \(X\) and - \(Y\).
-Characterization (2) also holds for powers of Euclidean distance \(\|x_i-x_j\|^s\), where \(0<s<2\), but (2) does not hold when \(s=2\).
-Distance correlation satisfies \(0 \le \mathcal R \le 1\), and
-\(\mathcal R = 0\) only if \(X\) and \(Y\) are independent. Distance
-covariance \(\mathcal V\) provides a new approach to the problem of
-testing the joint independence of random vectors. The formal
-definitions of the population coefficients \(\mathcal V\) and
-\(\mathcal R\) are given in (SRB 2007). The definitions of the
-empirical coefficients are given in the energy
-dcov
topic.
For all values of the index in (0,2), under independence -the asymptotic distribution of \(n\mathcal V_n^2\) -is a quadratic form of centered Gaussian random variables, -with coefficients that depend on the distributions of \(X\) and \(Y\). For the general problem of testing independence when the distributions of \(X\) and \(Y\) are unknown, the test based on \(n\mathcal V^2_n\) can be implemented as a permutation test. See (SRB 2007) for -theoretical properties of the test, including statistical consistency.
-dcov.test
or dcor.test
returns a list with class htest
containing
description of test
observed value of the test statistic
dCov(x,y) or dCor(x,y)
a vector: [dCov(x,y), dCor(x,y), dVar(x), dVar(y)]
logical, permutation test applied
replicates of the test statistic
approximate p-value of the test
sample size
description of data
For the dcov test of independence, -the distance covariance test statistic is the V-statistic -\(\mathrm{n\, dCov^2} = n \mathcal{V}_n^2\) (not dCov).
-Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
- Measuring and Testing Dependence by Correlation of Distances,
- Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
-
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
- Brownian Distance Covariance,
- Annals of Applied Statistics,
- Vol. 3, No. 4, 1236-1265.
-
doi:10.1214/09-AOAS312
Szekely, G.J. and Rizzo, M.L. (2009), - Rejoinder: Brownian Distance Covariance, - Annals of Applied Statistics, Vol. 3, No. 4, 1303-1308.
- x <- iris[1:50, 1:4]
- y <- iris[51:100, 1:4]
- set.seed(1)
- dcor.test(dist(x), dist(y), R=199)
-#>
-#> dCor independence test (permutation test)
-#>
-#> data: index 1, replicates 199
-#> dCor = 0.30605, p-value = 0.955
-#> sample estimates:
-#> dCov dCor dVar(X) dVar(Y)
-#> 0.1025087 0.3060479 0.2712927 0.4135274
-#>
- set.seed(1)
- dcov.test(x, y, R=199)
-#>
-#> dCov independence test (permutation test)
-#>
-#> data: index 1, replicates 199
-#> nV^2 = 0.5254, p-value = 0.955
-#> sample estimates:
-#> dCov
-#> 0.1025087
-#>
-
dcov.test.Rd
Distance covariance test and distance correlation test of multivariate independence. + Distance covariance and distance correlation are + multivariate measures of dependence.
+dcov.test(x, y, index = 1.0, R = NULL)
+dcor.test(x, y, index = 1.0, R)
dcov.test
and dcor.test
are nonparametric
+ tests of multivariate independence. The test decision is
+ obtained via permutation bootstrap, with R
replicates.
The sample sizes (number of rows) of the two samples must + agree, and samples must not contain missing values.
+The index
is an optional exponent on Euclidean distance.
+Valid exponents for energy statistics are in the open interval (0, 2); the exponent 2 is excluded because it does not characterize independence.
Argument types supported are +numeric data matrix, data.frame, or tibble, with observations in rows; +numeric vector; ordered or unordered factors. In the case of unordered factors, +a 0-1 distance matrix is computed.
+Optionally, pre-computed distances can be input as class "dist" objects or as distance matrices. +For data arguments, +distance matrices are computed internally.
+The dcov
test statistic is
+ \(n \mathcal V_n^2\) where
+ \(\mathcal V_n(x,y)\) = dcov(x,y),
+ which is based on interpoint Euclidean distances
+ \(\|x_{i}-x_{j}\|\). The index
+ is an optional exponent on Euclidean distance.
Similarly, the dcor
test statistic is based on the normalized
+coefficient, the distance correlation. (See the manual page for dcor
.)
Distance correlation is a new measure of dependence between random +vectors introduced by Szekely, Rizzo, and Bakirov (2007). +For all distributions with finite first moments, distance +correlation \(\mathcal R\) generalizes the idea of correlation in two +fundamental ways:
+(1) \(\mathcal R(X,Y)\) is defined for \(X\) and \(Y\) in arbitrary dimension.
+(2) \(\mathcal R(X,Y)=0\) characterizes independence of \(X\) and + \(Y\).
+Characterization (2) also holds for powers of Euclidean distance \(\|x_i-x_j\|^s\), where \(0<s<2\), but (2) does not hold when \(s=2\).
+Distance correlation satisfies \(0 \le \mathcal R \le 1\), and
+\(\mathcal R = 0\) only if \(X\) and \(Y\) are independent. Distance
+covariance \(\mathcal V\) provides a new approach to the problem of
+testing the joint independence of random vectors. The formal
+definitions of the population coefficients \(\mathcal V\) and
+\(\mathcal R\) are given in (SRB 2007). The definitions of the
+empirical coefficients are given in the energy
+dcov
topic.
For all values of the index in (0,2), under independence +the asymptotic distribution of \(n\mathcal V_n^2\) +is a quadratic form of centered Gaussian random variables, +with coefficients that depend on the distributions of \(X\) and \(Y\). For the general problem of testing independence when the distributions of \(X\) and \(Y\) are unknown, the test based on \(n\mathcal V^2_n\) can be implemented as a permutation test. See (SRB 2007) for +theoretical properties of the test, including statistical consistency.
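A hand-rolled version of this permutation scheme might look like the following sketch (illustrative only; dcov.test implements the test in compiled code and should be preferred):
 ## sketch: permutation test of independence based on n V_n^2
 set.seed(1)
 n <- 50
 x <- rnorm(n)
 y <- x + rnorm(n)
 T0 <- n * dcov(x, y)^2                             # observed statistic
 Tperm <- replicate(199, n * dcov(x, sample(y))^2)  # permute y to break dependence
 mean(c(T0, Tperm) >= T0)                           # approximate p-value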
+dcov.test
or dcor.test
returns a list with class htest
containing
description of test
observed value of the test statistic
dCov(x,y) or dCor(x,y)
a vector: [dCov(x,y), dCor(x,y), dVar(x), dVar(y)]
logical, permutation test applied
replicates of the test statistic
approximate p-value of the test
sample size
description of data
For the dcov test of independence, +the distance covariance test statistic is the V-statistic +\(\mathrm{n\, dCov^2} = n \mathcal{V}_n^2\) (not dCov).
+Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
+ Brownian Distance Covariance,
+ Annals of Applied Statistics,
+ Vol. 3, No. 4, 1236-1265.
+
doi:10.1214/09-AOAS312
Szekely, G.J. and Rizzo, M.L. (2009), + Rejoinder: Brownian Distance Covariance, + Annals of Applied Statistics, Vol. 3, No. 4, 1303-1308.
+ x <- iris[1:50, 1:4]
+ y <- iris[51:100, 1:4]
+ set.seed(1)
+ dcor.test(dist(x), dist(y), R=199)
+#>
+#> dCor independence test (permutation test)
+#>
+#> data: index 1, replicates 199
+#> dCor = 0.30605, p-value = 0.955
+#> sample estimates:
+#> dCov dCor dVar(X) dVar(Y)
+#> 0.1025087 0.3060479 0.2712927 0.4135274
+#>
+ set.seed(1)
+ dcov.test(x, y, R=199)
+#>
+#> dCov independence test (permutation test)
+#>
+#> data: index 1, replicates 199
+#> nV^2 = 0.5254, p-value = 0.955
+#> sample estimates:
+#> dCov
+#> 0.1025087
+#>
+
dcov2d.Rd
For bivariate data only, these are fast O(n log n) implementations of distance -correlation and distance covariance statistics. The U-statistic for dcov^2 is unbiased; -the V-statistic is the original definition in SRB 2007. These algorithms do not -store the distance matrices, so they are suitable for large samples.
-numeric vector
numeric vector
"V" or "U", for V- or U-statistics
logical
The unbiased (squared) dcov is documented in dcovU
, for multivariate data in arbitrary, not necessarily equal dimensions. dcov2d
and dcor2d
provide a faster O(n log n) algorithm for bivariate (x, y) only (X and Y are real-valued random vectors). The O(n log n) algorithm was proposed by Huo and Szekely (2016). The algorithm is faster above a certain sample size n. It does not store the distance matrix so the sample size can be very large.
By default, dcov2d
returns the V-statistic \(V_n = dCov_n^2(x, y)\), and if type="U", it returns the U-statistic, unbiased for \(dCov^2(X, Y)\). The argument all.stats=TRUE is used internally when the function is called from dcor2d
.
By default, dcor2d
returns \(dCor_n^2(x, y)\), and if type="U", it returns a bias-corrected estimator of squared dcor equivalent to bcdcor
.
These functions do not store the distance matrices so they are helpful when sample size is large and the data is bivariate.
-The U-statistic \(U_n\) can be negative in the lower tail so
-the square root of the U-statistic is not applied.
-Similarly, dcor2d(x, y, "U")
is bias-corrected and can be
-negative in the lower tail, so we do not take the
-square root. The original definitions of dCov and dCor
-(SRB2007, SR2009) were based on V-statistics, which are non-negative,
-and defined using the square root of V-statistics.
It has been suggested that, instead of taking the square root of the U-statistic directly, one could apply the sign of \(U_n\) to the square root of \(|U_n|\); however, that introduces more bias than the original dCor and should never be used.
-Huo, X. and Szekely, G.J. (2016). Fast computing for -distance covariance. Technometrics, 58(4), 435-447.
-Szekely, G.J. and Rizzo, M.L. (2014), - Partial Distance Correlation with Methods for Dissimilarities. - Annals of Statistics, Vol. 42 No. 6, 2382-2412.
-Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
- Measuring and Testing Dependence by Correlation of Distances,
- Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
-
doi:10.1214/009053607000000505
# \donttest{
- ## these are equivalent, but 2d is faster for n > 50
- n <- 100
- x <- rnorm(100)
- y <- rnorm(100)
- all.equal(dcov(x, y)^2, dcov2d(x, y), check.attributes = FALSE)
-#> [1] TRUE
- all.equal(bcdcor(x, y), dcor2d(x, y, "U"), check.attributes = FALSE)
-#> [1] TRUE
-
- x <- rlnorm(400)
- y <- rexp(400)
- dcov.test(x, y, R=199) #permutation test
-#>
-#> dCov independence test (permutation test)
-#>
-#> data: index 1, replicates 199
-#> nV^2 = 1.3947, p-value = 0.48
-#> sample estimates:
-#> dCov
-#> 0.05904902
-#>
- dcor.test(x, y, R=199)
-#>
-#> dCor independence test (permutation test)
-#>
-#> data: index 1, replicates 199
-#> dCor = 0.084338, p-value = 0.455
-#> sample estimates:
-#> dCov dCor dVar(X) dVar(Y)
-#> 0.05904902 0.08433776 0.82428775 0.59470610
-#>
- # }
-
dcov2d.Rd
For bivariate data only, these are fast O(n log n) implementations of distance +correlation and distance covariance statistics. The U-statistic for dcov^2 is unbiased; +the V-statistic is the original definition in SRB 2007. These algorithms do not +store the distance matrices, so they are suitable for large samples.
+The unbiased (squared) dcov is documented in dcovU
, for multivariate data in arbitrary, not necessarily equal dimensions. dcov2d
and dcor2d
provide a faster O(n log n) algorithm for bivariate (x, y) only (X and Y are real-valued random vectors). The O(n log n) algorithm was proposed by Huo and Szekely (2016). The algorithm is faster above a certain sample size n. It does not store the distance matrix so the sample size can be very large.
By default, dcov2d
returns the V-statistic \(V_n = dCov_n^2(x, y)\), and if type="U", it returns the U-statistic, unbiased for \(dCov^2(X, Y)\). The argument all.stats=TRUE is used internally when the function is called from dcor2d
.
By default, dcor2d
returns \(dCor_n^2(x, y)\), and if type="U", it returns a bias-corrected estimator of squared dcor equivalent to bcdcor
.
These functions do not store the distance matrices so they are helpful when sample size is large and the data is bivariate.
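A sketch of typical large-sample usage (the sample size here is illustrative):
 ## sketch: bivariate statistics for samples too large to store dist(x)
 x <- rnorm(1e5)
 y <- rnorm(1e5)
 dcov2d(x, y)               # V-statistic (original definition)
 dcov2d(x, y, type = "U")   # unbiased U-statistic; may be negative
 dcor2d(x, y, type = "U")   # bias-corrected squared dcor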
+The U-statistic \(U_n\) can be negative in the lower tail so
+the square root of the U-statistic is not applied.
+Similarly, dcor2d(x, y, "U")
is bias-corrected and can be
+negative in the lower tail, so we do not take the
+square root. The original definitions of dCov and dCor
+(SRB2007, SR2009) were based on V-statistics, which are non-negative,
+and defined using the square root of V-statistics.
It has been suggested that, instead of taking the square root of the U-statistic directly, one could apply the sign of \(U_n\) to the square root of \(|U_n|\); however, that introduces more bias than the original dCor and should never be used.
+Huo, X. and Szekely, G.J. (2016). Fast computing for +distance covariance. Technometrics, 58(4), 435-447.
+Szekely, G.J. and Rizzo, M.L. (2014), + Partial Distance Correlation with Methods for Dissimilarities. + Annals of Statistics, Vol. 42 No. 6, 2382-2412.
+Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
# \donttest{
+ ## these are equivalent, but 2d is faster for n > 50
+ n <- 100
+ x <- rnorm(100)
+ y <- rnorm(100)
+ all.equal(dcov(x, y)^2, dcov2d(x, y), check.attributes = FALSE)
+#> [1] TRUE
+ all.equal(bcdcor(x, y), dcor2d(x, y, "U"), check.attributes = FALSE)
+#> [1] TRUE
+
+ x <- rlnorm(400)
+ y <- rexp(400)
+ dcov.test(x, y, R=199) #permutation test
+#>
+#> dCov independence test (permutation test)
+#>
+#> data: index 1, replicates 199
+#> nV^2 = 1.3947, p-value = 0.48
+#> sample estimates:
+#> dCov
+#> 0.05904902
+#>
+ dcor.test(x, y, R=199)
+#>
+#> dCor independence test (permutation test)
+#>
+#> data: index 1, replicates 199
+#> dCor = 0.084338, p-value = 0.455
+#> sample estimates:
+#> dCov dCor dVar(X) dVar(Y)
+#> 0.05904902 0.08433776 0.82428775 0.59470610
+#>
+ # }
+
dcovU_stats.Rd
This function computes unbiased estimators of squared distance - covariance, distance variance, and a bias-corrected estimator of - (squared) distance correlation.
-dcovU_stats(Dx, Dy)
distance matrix of first sample
distance matrix of second sample
The unbiased (squared) dcov is the inner product definition of - dCov, in the Hilbert space of U-centered distance matrices.
-The sample sizes (number of rows) of the two samples must - agree, and samples must not contain missing values. The - arguments must be square symmetric matrices.
-dcovU_stats
returns a vector of the components of bias-corrected
-dcor: [dCovU, bcdcor, dVarXU, dVarYU].
Unbiased distance covariance (SR2014) corresponds to the biased
-(original) \(\mathrm{dCov^2}\). Since dcovU
is an
-unbiased statistic, it is signed and we do not take the square root.
-For the original distance covariance test of independence (SRB2007,
-SR2009), the distance covariance test statistic is the V-statistic
-\(\mathrm{n\, dCov^2} = n \mathcal{V}_n^2\) (not dCov).
-Similarly, bcdcor
is bias-corrected, so we do not take the
-square root as with dCor.
Szekely, G.J. and Rizzo, M.L. (2014), - Partial Distance Correlation with Methods for Dissimilarities. - Annals of Statistics, Vol. 42 No. 6, 2382-2412.
-Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
- Measuring and Testing Dependence by Correlation of Distances,
- Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
-
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
- Brownian Distance Covariance,
- Annals of Applied Statistics,
- Vol. 3, No. 4, 1236-1265.
-
doi:10.1214/09-AOAS312
dcovU_stats.Rd
This function computes unbiased estimators of squared distance + covariance, distance variance, and a bias-corrected estimator of + (squared) distance correlation.
+dcovU_stats(Dx, Dy)
The unbiased (squared) dcov is the inner product definition of + dCov, in the Hilbert space of U-centered distance matrices.
+The sample sizes (number of rows) of the two samples must + agree, and samples must not contain missing values. The + arguments must be square symmetric matrices.
+dcovU_stats
returns a vector of the components of bias-corrected
+dcor: [dCovU, bcdcor, dVarXU, dVarYU].
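For example (a sketch using the same iris samples as in the dcovU examples):
 Dx <- as.matrix(dist(iris[1:50, 1:4]))
 Dy <- as.matrix(dist(iris[51:100, 1:4]))
 dcovU_stats(Dx, Dy)   # [dCovU, bcdcor, dVarXU, dVarYU]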
Unbiased distance covariance (SR2014) corresponds to the biased
+(original) \(\mathrm{dCov^2}\). Since dcovU
is an
+unbiased statistic, it is signed and we do not take the square root.
+For the original distance covariance test of independence (SRB2007,
+SR2009), the distance covariance test statistic is the V-statistic
+\(\mathrm{n\, dCov^2} = n \mathcal{V}_n^2\) (not dCov).
+Similarly, bcdcor
is bias-corrected, so we do not take the
+square root as with dCor.
Szekely, G.J. and Rizzo, M.L. (2014), + Partial Distance Correlation with Methods for Dissimilarities. + Annals of Statistics, Vol. 42 No. 6, 2382-2412.
+Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
+ Brownian Distance Covariance,
+ Annals of Applied Statistics,
+ Vol. 3, No. 4, 1236-1265.
+
doi:10.1214/09-AOAS312
dcovu.Rd
These functions compute unbiased estimators of squared distance - covariance and a bias-corrected estimator of - (squared) distance correlation.
-bcdcor(x, y)
-dcovU(x, y)
data or dist object of first sample
data or dist object of second sample
The unbiased (squared) dcov is the inner product definition of - dCov, in the Hilbert space of U-centered distance matrices.
-The sample sizes (number of rows) of the two samples must - agree, and samples must not contain missing values.
-Argument types supported are -numeric data matrix, data.frame, or tibble, with observations in rows; -numeric vector; ordered or unordered factors. In the case of unordered factors, -a 0-1 distance matrix is computed.
-dcovU
returns the unbiased estimator of squared dcov.
-bcdcor
returns a bias-corrected estimator of squared dcor.
Unbiased distance covariance (SR2014) corresponds to the biased
-(original) \(\mathrm{dCov^2}\). Since dcovU
is an
-unbiased statistic, it is signed and we do not take the square root.
-For the original distance covariance test of independence (SRB2007,
-SR2009), the distance covariance test statistic is the V-statistic
-\(\mathrm{n\, dCov^2} = n \mathcal{V}_n^2\) (not dCov).
-Similarly, bcdcor
is bias-corrected, so we do not take the
-square root as with dCor.
Szekely, G.J. and Rizzo, M.L. (2014), - Partial Distance Correlation with Methods for Dissimilarities. - Annals of Statistics, Vol. 42 No. 6, 2382-2412.
-Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
- Measuring and Testing Dependence by Correlation of Distances,
- Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
-
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
- Brownian Distance Covariance,
- Annals of Applied Statistics,
- Vol. 3, No. 4, 1236-1265.
-
doi:10.1214/09-AOAS312
x <- iris[1:50, 1:4]
- y <- iris[51:100, 1:4]
- dcovU(x, y)
-#> dCovU
-#> -0.002748351
- bcdcor(x, y)
-#> bcdcor
-#> -0.0271709
-
dcovu.Rd
These functions compute unbiased estimators of squared distance + covariance and a bias-corrected estimator of + (squared) distance correlation.
+bcdcor(x, y)
+dcovU(x, y)
The unbiased (squared) dcov is the inner product definition of + dCov, in the Hilbert space of U-centered distance matrices.
+The sample sizes (number of rows) of the two samples must + agree, and samples must not contain missing values.
+Argument types supported are +numeric data matrix, data.frame, or tibble, with observations in rows; +numeric vector; ordered or unordered factors. In the case of unordered factors, +a 0-1 distance matrix is computed.
+dcovU
returns the unbiased estimator of squared dcov.
+bcdcor
returns a bias-corrected estimator of squared dcor.
Unbiased distance covariance (SR2014) corresponds to the biased
+(original) \(\mathrm{dCov^2}\). Since dcovU
is an
+unbiased statistic, it is signed and we do not take the square root.
+For the original distance covariance test of independence (SRB2007,
+SR2009), the distance covariance test statistic is the V-statistic
+\(\mathrm{n\, dCov^2} = n \mathcal{V}_n^2\) (not dCov).
+Similarly, bcdcor
is bias-corrected, so we do not take the
+square root as with dCor.
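Assuming the components returned by dcovU_stats are named as on its manual page, bcdcor is the unbiased dcov normalized by the unbiased dVar terms (an illustrative sketch):
 s <- dcovU_stats(as.matrix(dist(iris[1:50, 1:4])),
                  as.matrix(dist(iris[51:100, 1:4])))
 all.equal(unname(s["bcdcor"]),
           unname(s["dCovU"] / sqrt(s["dVarXU"] * s["dVarYU"])))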
Szekely, G.J. and Rizzo, M.L. (2014), + Partial Distance Correlation with Methods for Dissimilarities. + Annals of Statistics, Vol. 42 No. 6, 2382-2412.
+Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
+ Brownian Distance Covariance,
+ Annals of Applied Statistics,
+ Vol. 3, No. 4, 1236-1265.
+
doi:10.1214/09-AOAS312
x <- iris[1:50, 1:4]
+ y <- iris[51:100, 1:4]
+ dcovU(x, y)
+#> dCovU
+#> -0.002748351
+ bcdcor(x, y)
+#> bcdcor
+#> -0.0271709
+
disco.Rd
E-statistics DIStance COmponents and tests, analogous to variance components - and anova.
-disco(x, factors, distance, index=1.0, R, method=c("disco","discoB","discoF"))
-disco.between(x, factors, distance, index=1.0, R)
data matrix or distance matrix or dist object
matrix or data frame of factor labels or integers (not design matrix)
logical, TRUE if x is distance matrix
exponent on Euclidean distance in (0,2]
number of replicates for a permutation test
test statistic
disco
calculates the distance components decomposition of
- total dispersion and, if R > 0, tests for significance using the
- disco "F" ratio test statistic (default method="disco"
),
- or using the between component statistic (method="discoB"
),
- each implemented by permutation test.
If x
is a dist
object, argument distance
is
- ignored. If x
is a distance matrix, set distance=TRUE
.
In the current release disco
computes the decomposition for one-way models
- only.
When method="discoF"
, disco
returns a list similar to the
- return value from anova.lm
, and the print.disco
method is
- provided to format the output into a similar table. Details:
disco
returns a class disco
object, which is a list containing
call
method
vector of observed statistics
vector of p-values
number of factors
number of observations
between-sample distance components
one-way within-sample distance components
within-sample distance component
total dispersion
degrees of freedom for treatments
degrees of freedom for error
index (exponent on distance)
factor names
factor levels
sample sizes
matrix containing decomposition
When method="discoB"
, disco
passes the arguments to
-disco.between
, which returns a class htest
object.
disco.between
returns a class htest
object, where the test
-statistic is the between-sample statistic (proportional to the numerator of the F ratio
-of the disco
test).
M. L. Rizzo and G. J. Szekely (2010).
-DISCO Analysis: A Nonparametric Extension of
-Analysis of Variance, Annals of Applied Statistics,
-Vol. 4, No. 2, 1034-1055.
-
doi:10.1214/09-AOAS245
The current version does all calculations via matrix arithmetic and -the boot function. Support for more general additive models -and a formula interface is under development.
-disco
methods have been added to the cluster distance summary
-function edist
, and energy tests for equality of distribution
-(see eqdist.etest
).
edist
- eqdist.e
- eqdist.etest
- ksample.e
## warpbreaks one-way decompositions
- data(warpbreaks)
- attach(warpbreaks)
-#> The following objects are masked from warpbreaks (pos = 3):
-#>
-#> breaks, tension, wool
- disco(breaks, factors=wool, R=99)
-#> disco(x = breaks, factors = wool, R = 99)
-#>
-#> Distance Components: index 1.00
-#> Source Df Sum Dist Mean Dist F-ratio p-value
-#> factors 1 10.77778 10.77778 1.542 0.21
-#> Within 52 363.55556 6.99145
-#> Total 53 374.33333
-
- ## warpbreaks two-way wool+tension
- disco(breaks, factors=data.frame(wool, tension), R=0)
-#> disco(x = breaks, factors = data.frame(wool, tension), R = 0)
-#>
-#> Distance Components: index 1.00
-#> Source Df Sum Dist Mean Dist F-ratio p-value
-#> wool 1 10.77778 10.77778 1.542 NA
-#> tension 2 47.00000 23.50000 3.661 NA
-#> Within 50 316.55556 6.33111
-#> Total 53 374.33333
-
- ## warpbreaks two-way wool*tension
- disco(breaks, factors=data.frame(wool, tension, wool:tension), R=0)
-#> disco(x = breaks, factors = data.frame(wool, tension, wool:tension),
-#> R = 0)
-#>
-#> Distance Components: index 1.00
-#> Source Df Sum Dist Mean Dist F-ratio p-value
-#> wool 1 10.77778 10.77778 1.542 NA
-#> tension 2 47.00000 23.50000 3.661 NA
-#> wool.tension 5 85.00000 17.00000 2.820 NA
-#> Within 45 231.55556 5.14568
-#> Total 53 374.33333
-
- ## When index=2 for univariate data, we get ANOVA decomposition
- disco(breaks, factors=tension, index=2.0, R=99)
-#> disco(x = breaks, factors = tension, index = 2, R = 99)
-#>
-#> Distance Components: index 2.00
-#> Source Df Sum Dist Mean Dist F-ratio p-value
-#> factors 2 2034.25926 1017.12963 7.206 0.01
-#> Within 51 7198.55556 141.14815
-#> Total 53 9232.81481
- aov(breaks ~ tension)
-#> Call:
-#> aov(formula = breaks ~ tension)
-#>
-#> Terms:
-#> tension Residuals
-#> Sum of Squares 2034.259 7198.556
-#> Deg. of Freedom 2 51
-#>
-#> Residual standard error: 11.88058
-#> Estimated effects may be unbalanced
-
- ## Multivariate response
- ## Example on producing plastic film from Krzanowski (1998, p. 381)
- tear <- c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3,
- 6.7, 6.6, 7.2, 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
- gloss <- c(9.5, 9.9, 9.6, 9.6, 9.2, 9.1, 10.0, 9.9, 9.5, 9.4,
- 9.1, 9.3, 8.3, 8.4, 8.5, 9.2, 8.8, 9.7, 10.1, 9.2)
- opacity <- c(4.4, 6.4, 3.0, 4.1, 0.8, 5.7, 2.0, 3.9, 1.9, 5.7,
- 2.8, 4.1, 3.8, 1.6, 3.4, 8.4, 5.2, 6.9, 2.7, 1.9)
- Y <- cbind(tear, gloss, opacity)
- rate <- factor(gl(2,10), labels=c("Low", "High"))
-
- ## test for equal distributions by rate
- disco(Y, factors=rate, R=99)
-#> disco(x = Y, factors = rate, R = 99)
-#>
-#> Distance Components: index 1.00
-#> Source Df Sum Dist Mean Dist F-ratio p-value
-#> factors 1 1.27003 1.27003 0.981 0.38
-#> Within 18 23.30105 1.29450
-#> Total 19 24.57108
- disco(Y, factors=rate, R=99, method="discoB")
-#>
-#> DISCO (Between-sample)
-#>
-#> data: x
-#> DISCO between statistic = 1.27, p-value = 0.3535
-#>
-
- ## Just extract the decomposition table
- disco(Y, factors=rate, R=0)$stats
-#> Trt Within df1 df2 Stat p-value
-#> [1,] 1.270028 23.30105 1 18 0.9810934 NA
-
- ## Compare eqdist.e methods for rate
- ## disco between stat is half of original when sample sizes equal
- eqdist.e(Y, sizes=c(10, 10), method="original")
-#> E-statistic
-#> 2.540056
- eqdist.e(Y, sizes=c(10, 10), method="discoB")
-#> [1] 1.270028
-
- ## The between-sample distance component
- disco.between(Y, factors=rate, R=0)
-#> [1] 1.270028
-
disco.Rd
E-statistics DIStance COmponents and tests, analogous to variance components + and anova.
+disco(x, factors, distance, index=1.0, R, method=c("disco","discoB","discoF"))
+disco.between(x, factors, distance, index=1.0, R)
data matrix or distance matrix or dist object
matrix or data frame of factor labels or integers (not design matrix)
logical, TRUE if x is distance matrix
exponent on Euclidean distance in (0,2]
number of replicates for a permutation test
test statistic
disco
calculates the distance components decomposition of
+ total dispersion and, if R > 0, tests for significance using the
+ disco "F" ratio test statistic (default method="disco"
),
+ or using the between component statistic (method="discoB"
),
+ each implemented by permutation test.
If x
is a dist
object, argument distance
is
+ ignored. If x
is a distance matrix, set distance=TRUE
.
In the current release disco
computes the decomposition for one-way models
+ only.
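A sketch of the dist-object and distance-matrix interfaces (the data and number of replicates are illustrative):
 d <- dist(iris[, 1:4])
 disco(d, factors = iris$Species, R = 99)   # dist object: distance ignored
 disco(as.matrix(d), factors = iris$Species,
       distance = TRUE, R = 99)             # pre-computed distance matrix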
When method="discoF"
, disco
returns a list similar to the
+ return value from anova.lm
, and the print.disco
method is
+ provided to format the output into a similar table. Details:
disco
returns a class disco
object, which is a list containing
call
method
vector of observed statistics
vector of p-values
number of factors
number of observations
between-sample distance components
one-way within-sample distance components
within-sample distance component
total dispersion
degrees of freedom for treatments
degrees of freedom for error
index (exponent on distance)
factor names
factor levels
sample sizes
matrix containing decomposition
When method="discoB"
, disco
passes the arguments to
+disco.between
, which returns a class htest
object.
disco.between
returns a class htest
object, where the test
+statistic is the between-sample statistic (proportional to the numerator of the F ratio
+of the disco
test).
M. L. Rizzo and G. J. Szekely (2010).
+DISCO Analysis: A Nonparametric Extension of
+Analysis of Variance, Annals of Applied Statistics,
+Vol. 4, No. 2, 1034-1055.
+
doi:10.1214/09-AOAS245
The current version does all calculations via matrix arithmetic and +the boot function. Support for more general additive models +and a formula interface is under development.
+disco
methods have been added to the cluster distance summary
+function edist
, and energy tests for equality of distribution
+(see eqdist.etest
).
edist
+ eqdist.e
+ eqdist.etest
+ ksample.e
## warpbreaks one-way decompositions
+ data(warpbreaks)
+ attach(warpbreaks)
+ disco(breaks, factors=wool, R=99)
+#> disco(x = breaks, factors = wool, R = 99)
+#>
+#> Distance Components: index 1.00
+#> Source Df Sum Dist Mean Dist F-ratio p-value
+#> factors 1 10.77778 10.77778 1.542 0.21
+#> Within 52 363.55556 6.99145
+#> Total 53 374.33333
+
+ ## warpbreaks two-way wool+tension
+ disco(breaks, factors=data.frame(wool, tension), R=0)
+#> disco(x = breaks, factors = data.frame(wool, tension), R = 0)
+#>
+#> Distance Components: index 1.00
+#> Source Df Sum Dist Mean Dist F-ratio p-value
+#> wool 1 10.77778 10.77778 1.542 NA
+#> tension 2 47.00000 23.50000 3.661 NA
+#> Within 50 316.55556 6.33111
+#> Total 53 374.33333
+
+ ## warpbreaks two-way wool*tension
+ disco(breaks, factors=data.frame(wool, tension, wool:tension), R=0)
+#> disco(x = breaks, factors = data.frame(wool, tension, wool:tension),
+#> R = 0)
+#>
+#> Distance Components: index 1.00
+#> Source Df Sum Dist Mean Dist F-ratio p-value
+#> wool 1 10.77778 10.77778 1.542 NA
+#> tension 2 47.00000 23.50000 3.661 NA
+#> wool.tension 5 85.00000 17.00000 2.820 NA
+#> Within 45 231.55556 5.14568
+#> Total 53 374.33333
+
+ ## When index=2 for univariate data, we get ANOVA decomposition
+ disco(breaks, factors=tension, index=2.0, R=99)
+#> disco(x = breaks, factors = tension, index = 2, R = 99)
+#>
+#> Distance Components: index 2.00
+#> Source Df Sum Dist Mean Dist F-ratio p-value
+#> factors 2 2034.25926 1017.12963 7.206 0.01
+#> Within 51 7198.55556 141.14815
+#> Total 53 9232.81481
+ aov(breaks ~ tension)
+#> Call:
+#> aov(formula = breaks ~ tension)
+#>
+#> Terms:
+#> tension Residuals
+#> Sum of Squares 2034.259 7198.556
+#> Deg. of Freedom 2 51
+#>
+#> Residual standard error: 11.88058
+#> Estimated effects may be unbalanced
+
+ ## Multivariate response
+ ## Example on producing plastic film from Krzanowski (1998, p. 381)
+ tear <- c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3,
+ 6.7, 6.6, 7.2, 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
+ gloss <- c(9.5, 9.9, 9.6, 9.6, 9.2, 9.1, 10.0, 9.9, 9.5, 9.4,
+ 9.1, 9.3, 8.3, 8.4, 8.5, 9.2, 8.8, 9.7, 10.1, 9.2)
+ opacity <- c(4.4, 6.4, 3.0, 4.1, 0.8, 5.7, 2.0, 3.9, 1.9, 5.7,
+ 2.8, 4.1, 3.8, 1.6, 3.4, 8.4, 5.2, 6.9, 2.7, 1.9)
+ Y <- cbind(tear, gloss, opacity)
+ rate <- factor(gl(2,10), labels=c("Low", "High"))
+
+ ## test for equal distributions by rate
+ disco(Y, factors=rate, R=99)
+#> disco(x = Y, factors = rate, R = 99)
+#>
+#> Distance Components: index 1.00
+#> Source Df Sum Dist Mean Dist F-ratio p-value
+#> factors 1 1.27003 1.27003 0.981 0.38
+#> Within 18 23.30105 1.29450
+#> Total 19 24.57108
+ disco(Y, factors=rate, R=99, method="discoB")
+#>
+#> DISCO (Between-sample)
+#>
+#> data: x
+#> DISCO between statistic = 1.27, p-value = 0.36
+#>
+
+ ## Just extract the decomposition table
+ disco(Y, factors=rate, R=0)$stats
+#> Trt Within df1 df2 Stat p-value
+#> [1,] 1.270028 23.30105 1 18 0.9810934 NA
+
+ ## Compare eqdist.e methods for rate
+ ## disco between stat is half of original when sample sizes equal
+ eqdist.e(Y, sizes=c(10, 10), method="original")
+#> E-statistic
+#> 2.540056
+ eqdist.e(Y, sizes=c(10, 10), method="discoB")
+#> [1] 1.270028
+
+ ## The between-sample distance component
+ disco.between(Y, factors=rate, R=0)
+#> [1] 1.270028
+
dmatrix.Rd
Utilities for working with distance matrices.
+is.dmatrix
is a utility that checks whether the argument is a distance or dissimilarity matrix: square, symmetric, non-negative, with zero diagonal. calc_dist
computes a distance matrix directly from a data matrix.
is.dmatrix(x, tol = 100 * .Machine$double.eps)
+calc_dist(x)
Energy functions work with the distance matrices of samples. The is.dmatrix
function is used internally when converting arguments to distance matrices. The default tol
is the same as default tolerance of isSymmetric
.
calc_dist
is an exported Rcpp function that returns a Euclidean distance matrix from the input data matrix.
is.dmatrix
returns TRUE if (within tolerance) x
is a distance/dissimilarity matrix; otherwise FALSE. It will return FALSE if x
is a class dist
object.
calc_dist
returns the Euclidean distance matrix for the data matrix x
, which has observations in rows.
In practice, if dist(x)
is not yet computed, calc_dist(x)
will be faster than as.matrix(dist(x))
.
On working with non-Euclidean dissimilarities, see the references.
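A small sketch of both utilities (the data here is arbitrary):
 x <- matrix(rnorm(20), nrow = 10)
 D <- calc_dist(x)
 is.dmatrix(D)         # TRUE: square, symmetric, zero diagonal
 is.dmatrix(dist(x))   # FALSE: a dist object is not a matrix
 all.equal(D, as.matrix(dist(x)), check.attributes = FALSE)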
+Szekely, G.J. and Rizzo, M.L. (2014), + Partial Distance Correlation with Methods for Dissimilarities. + Annals of Statistics, Vol. 42 No. 6, 2382-2412.
+edist.Rd
Returns the E-distances (energy statistics) between clusters.
-data matrix of pooled sample or Euclidean distances
vector of sample sizes
logical: if TRUE, x is a distance matrix
a permutation of the row indices of x
distance exponent in (0,2]
how to weight the statistics
A vector containing the pairwise two-sample multivariate
- \(\mathcal{E}\)-statistics for comparing clusters or samples is returned.
- The e-distance between clusters is computed from the original pooled data,
- stacked in matrix x
where each row is a multivariate observation, or
- from the distance matrix x
of the original data, or distance object
- returned by dist
. The first sizes[1]
rows of the original data
- matrix are the first sample, the next sizes[2]
rows are the second
- sample, etc. The permutation vector ix
may be used to obtain
- e-distances corresponding to a clustering solution at a given level in
- the hierarchy.
The default method cluster
summarizes the e-distances between
- clusters in a table.
- The e-distance between two clusters \(C_i, C_j\)
- of size \(n_i, n_j\)
- proposed by Szekely and Rizzo (2005)
- is the e-distance \(e(C_i,C_j)\), defined by
- $$e(C_i,C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],
- $$
- where
- $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j}
- \|X_{ip}-X_{jq}\|^\alpha,$$
- \(\|\cdot\|\) denotes Euclidean norm, \(\alpha=\)
- alpha
, and \(X_{ip}\) denotes the p-th observation in the i-th cluster. The
- exponent alpha
should be in the interval (0,2].
The coefficient \(\frac{n_i n_j}{n_i+n_j}\)
- is one-half of the harmonic mean of the sample sizes. The
- discoB
method is related but with
- different ways of summarizing the pairwise differences between samples.
- The disco
methods apply the coefficient
- \(\frac{n_i n_j}{2N}\) where N is the total number
- of observations. This weights each (i,j) statistic by sample size
- relative to N. See the disco
topic for more details.
An object of class dist
containing the lower triangle of the
- e-distance matrix of cluster distances corresponding to the permutation
- of indices ix
is returned. The method
attribute of the
- distance object is assigned a value of type, index.
Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering
- via Joint Between-Within Distances: Extending Ward's Minimum
- Variance Method, Journal of Classification 22(2) 151-183.
-
doi:10.1007/s00357-005-0012-9
M. L. Rizzo and G. J. Szekely (2010).
-DISCO Analysis: A Nonparametric Extension of
-Analysis of Variance, Annals of Applied Statistics,
-Vol. 4, No. 2, 1034-1055.
-
doi:10.1214/09-AOAS245
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal - Distributions in High Dimension, InterStat, November (5).
-Szekely, G. J. (2000) Technical Report 03-05, - \(\mathcal{E}\)-statistics: Energy of - Statistical Samples, Department of Mathematics and Statistics, - Bowling Green State University.
- ## compute cluster e-distances for 3 samples of iris data
- data(iris)
- edist(iris[,1:4], c(50,50,50))
-#> 1 2
-#> 2 123.55381
-#> 3 195.30396 38.85415
-
- ## pairwise disco statistics
- edist(iris[,1:4], c(50,50,50), method="discoB")
-#> 1 2
-#> 2 41.18460
-#> 3 65.10132 12.95138
-
- ## compute e-distances from a distance object
- data(iris)
- edist(dist(iris[,1:4]), c(50, 50, 50), distance=TRUE, alpha = 1)
-#> 1 2
-#> 2 123.55381
-#> 3 195.30396 38.85415
-
- ## compute e-distances from a distance matrix
- data(iris)
- d <- as.matrix(dist(iris[,1:4]))
- edist(d, c(50, 50, 50), distance=TRUE, alpha = 1)
-#> 1 2
-#> 2 123.55381
-#> 3 195.30396 38.85415
-
-
-
edist.Rd
Returns the E-distances (energy statistics) between clusters.
+A vector containing the pairwise two-sample multivariate
+ \(\mathcal{E}\)-statistics for comparing clusters or samples is returned.
+ The e-distance between clusters is computed from the original pooled data,
+ stacked in matrix x
where each row is a multivariate observation, or
+ from the distance matrix x
of the original data, or distance object
+ returned by dist
. The first sizes[1]
rows of the original data
+ matrix are the first sample, the next sizes[2]
rows are the second
+ sample, etc. The permutation vector ix
may be used to obtain
+ e-distances corresponding to a clustering solution at a given level in
+ the hierarchy.
The default method cluster
summarizes the e-distances between
+ clusters in a table.
+ The e-distance between two clusters \(C_i, C_j\)
+ of size \(n_i, n_j\)
+ proposed by Szekely and Rizzo (2005)
+ is the e-distance \(e(C_i,C_j)\), defined by
+ $$e(C_i,C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],
+ $$
+ where
+ $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j}
+ \|X_{ip}-X_{jq}\|^\alpha,$$
+ \(\|\cdot\|\) denotes Euclidean norm, \(\alpha=\)
+ alpha
, and \(X_{ip}\) denotes the p-th observation in the i-th cluster. The
+ exponent alpha
should be in the interval (0,2].
The coefficient \(\frac{n_i n_j}{n_i+n_j}\)
+ is one-half of the harmonic mean of the sample sizes. The
+ discoB
method is related but with
+ different ways of summarizing the pairwise differences between samples.
+ The disco
methods apply the coefficient
+ \(\frac{n_i n_j}{2N}\) where N is the total number
+ of observations. This weights each (i,j) statistic by sample size
+ relative to N. See the disco
topic for more details.
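The definition can be checked directly against edist (an illustrative sketch with two iris species and alpha = 1):
 x1 <- iris[1:50, 1:4]
 x2 <- iris[51:100, 1:4]
 D <- as.matrix(dist(rbind(x1, x2)))
 M12 <- mean(D[1:50, 51:100])       # between-cluster mean distance
 M11 <- mean(D[1:50, 1:50])         # within cluster 1 (zero diagonal included)
 M22 <- mean(D[51:100, 51:100])     # within cluster 2
 (50 * 50 / 100) * (2 * M12 - M11 - M22)   # should match edist(rbind(x1, x2), c(50, 50))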
An object of class dist
containing the lower triangle of the
+ e-distance matrix of cluster distances corresponding to the permutation
+ of indices ix
is returned. The method
attribute of the
+ distance object is assigned a value of type, index.
Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering
+ via Joint Between-Within Distances: Extending Ward's Minimum
+ Variance Method, Journal of Classification 22(2) 151-183.
+
doi:10.1007/s00357-005-0012-9
M. L. Rizzo and G. J. Szekely (2010).
+DISCO Analysis: A Nonparametric Extension of
+Analysis of Variance, Annals of Applied Statistics,
+Vol. 4, No. 2, 1034-1055.
+
doi:10.1214/09-AOAS245
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal + Distributions in High Dimension, InterStat, November (5).
+Szekely, G. J. (2000) Technical Report 03-05, + \(\mathcal{E}\)-statistics: Energy of + Statistical Samples, Department of Mathematics and Statistics, + Bowling Green State University.
+ ## compute cluster e-distances for 3 samples of iris data
+ data(iris)
+ edist(iris[,1:4], c(50,50,50))
+#> 1 2
+#> 2 123.55381
+#> 3 195.30396 38.85415
+
+ ## pairwise disco statistics
+ edist(iris[,1:4], c(50,50,50), method="discoB")
+#> 1 2
+#> 2 41.18460
+#> 3 65.10132 12.95138
+
+ ## compute e-distances from a distance object
+ data(iris)
+ edist(dist(iris[,1:4]), c(50, 50, 50), distance=TRUE, alpha = 1)
+#> 1 2
+#> 2 123.55381
+#> 3 195.30396 38.85415
+
+ ## compute e-distances from a distance matrix
+ data(iris)
+ d <- as.matrix(dist(iris[,1:4]))
+ edist(d, c(50, 50, 50), distance=TRUE, alpha = 1)
+#> 1 2
+#> 2 123.55381
+#> 3 195.30396 38.85415
+
+
+
eigen.Rd
Pre-computed eigenvalues corresponding to the asymptotic sampling - distribution of the energy test statistic for univariate - normality, under the null hypothesis. Four Cases are computed:
Simple hypothesis, known parameters.
Estimated mean, known variance.
Known mean, estimated variance.
Composite hypothesis, estimated parameters.
Case 4 eigenvalues are used in the test function normal.test
-when method=="limit"
.
data(EVnormal)
Numeric matrix with 125 rows and 5 columns; - column 1 is the index, and columns 2-5 are - the eigenvalues of Cases 1-4.
-Computed
-Szekely, G. J. and Rizzo, M. L. (2005) A New Test for - Multivariate Normality, Journal of Multivariate Analysis, - 93/1, 58-80, - doi:10.1016/j.jmva.2003.12.002 -.
-eigen.Rd
Pre-computed eigenvalues corresponding to the asymptotic sampling + distribution of the energy test statistic for univariate + normality, under the null hypothesis. Four Cases are computed:
Simple hypothesis, known parameters.
Estimated mean, known variance.
Known mean, estimated variance.
Composite hypothesis, estimated parameters.
Case 4 eigenvalues are used in the test function normal.test
+when method=="limit"
.
data(EVnormal)
Numeric matrix with 125 rows and 5 columns; + column 1 is the index, and columns 2-5 are + the eigenvalues of Cases 1-4.
+Computed
+Szekely, G. J. and Rizzo, M. L. (2005) A New Test for + Multivariate Normality, Journal of Multivariate Analysis, + 93/1, 58-80, + doi:10.1016/j.jmva.2003.12.002 +.
+energy-deprecated.Rd
These deprecated functions have been replaced by revised functions and will be removed in future releases of the energy package.
+DCOR(x, y, index=1.0)
DCOR is an R-only implementation that has been replaced by faster compiled code.
+energy.hclust.Rd
Performs hierarchical clustering by minimum (energy) E-distance method.
-energy.hclust(dst, alpha = 1)
dist
object
distance exponent
Dissimilarities are \(d(x,y) = \|x-y\|^\alpha\), - where the exponent \(\alpha\) is in the interval (0,2]. - This function performs agglomerative hierarchical clustering. - Initially, each of the n singletons is a cluster. At each of n-1 steps, the - procedure merges the pair of clusters with minimum e-distance. - The e-distance between two clusters \(C_i, C_j\) of sizes \(n_i, n_j\) - is given by - $$e(C_i, C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}], - $$ - where - $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} - \|X_{ip}-X_{jq}\|^\alpha,$$ - \(\|\cdot\|\) denotes Euclidean norm, and \(X_{ip}\) denotes the p-th observation in the i-th cluster.
-The return value is an object of class hclust
, so hclust
- methods such as print or plot methods, plclust
, and cutree
- are available. See the documentation for hclust
.
The e-distance measures both the heterogeneity between clusters and the - homogeneity within clusters. \(\mathcal E\)-clustering - (\(\alpha=1\)) is particularly effective in - high dimension, and is more effective than some standard hierarchical - methods when clusters have equal means (see example below). - For other advantages see the references.
-edist
computes the energy distances for the result (or any partition)
- and returns the cluster distances in a dist
object. See the edist
- examples.
An object of class hclust
which describes the tree produced by
- the clustering process. The object is a list with components:
an n-1 by 2 matrix, where row i of merge
describes the
- merging of clusters at step i of the clustering. If an element j in the
- row is negative, then observation -j was merged at this
- stage. If j is positive then the merge was with the cluster
- formed at the (earlier) stage j of the algorithm.
the clustering height: a vector of n-1 non-decreasing - real numbers (the e-distance between merging clusters)
a vector giving a permutation of the indices of
- original observations suitable for plotting, in the sense that a
- cluster plot using this ordering and matrix merge
will not have
- crossings of the branches.
labels for each of the objects being clustered.
the call which produced the result.
the cluster method that has been used (e-distance).
the distance that has been used to create dst
.
Currently stats::hclust
implements Ward's method by method="ward.D2"
,
-which applies the squared distances. That method was previously "ward"
.
-Because both hclust
and energy use the same type of Lance-Williams recursive formula to update cluster distances, now with the additional option method="ward.D"
in hclust
, the
-energy distance method is easily implemented by hclust
. (Some "Ward" algorithms do not use Lance-Williams, however). Energy clustering (with alpha=1
) and "ward.D" now return the same result, except that the cluster heights of energy hierarchical clustering with alpha=1
are two times the heights from hclust
. In order to ensure compatibility with hclust methods, energy.hclust
now passes arguments through to hclust
after possibly applying the optional exponent to distance.
Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering
- via Joint Between-Within Distances: Extending Ward's Minimum
- Variance Method, Journal of Classification 22(2) 151-183.
-
doi:10.1007/s00357-005-0012-9
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal - Distributions in High Dimension, InterStat, November (5).
-Szekely, G. J. (2000) Technical Report 03-05: - \(\mathcal{E}\)-statistics: Energy of - Statistical Samples, Department of Mathematics and Statistics, Bowling - Green State University.
-edist
ksample.e
eqdist.etest
hclust
if (FALSE) {
- library(cluster)
- data(animals)
- plot(energy.hclust(dist(animals)))
-
- data(USArrests)
- ecl <- energy.hclust(dist(USArrests))
- print(ecl)
- plot(ecl)
- cutree(ecl, k=3)
- cutree(ecl, h=150)
-
- ## compare performance of e-clustering, Ward's method, group average method
- ## when sampled populations have equal means: n=200, d=5, two groups
- z <- rbind(matrix(rnorm(1000), nrow=200), matrix(rnorm(1000, 0, 5), nrow=200))
- g <- c(rep(1, 200), rep(2, 200))
- d <- dist(z)
- e <- energy.hclust(d)
- a <- hclust(d, method="average")
- w <- hclust(d^2, method="ward.D2")
- list("E" = table(cutree(e, k=2) == g), "Ward" = table(cutree(w, k=2) == g),
- "Avg" = table(cutree(a, k=2) == g))
- }
-
-
energy.hclust.Rd
Performs hierarchical clustering by minimum (energy) E-distance method.
+energy.hclust(dst, alpha = 1)
Dissimilarities are \(d(x,y) = \|x-y\|^\alpha\), + where the exponent \(\alpha\) is in the interval (0,2]. + This function performs agglomerative hierarchical clustering. + Initially, each of the n singletons is a cluster. At each of n-1 steps, the + procedure merges the pair of clusters with minimum e-distance. + The e-distance between two clusters \(C_i, C_j\) of sizes \(n_i, n_j\) + is given by + $$e(C_i, C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}], + $$ + where + $$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} + \|X_{ip}-X_{jq}\|^\alpha,$$ + \(\|\cdot\|\) denotes Euclidean norm, and \(X_{ip}\) denotes the p-th observation in the i-th cluster.
+The return value is an object of class hclust
, so hclust
+ methods such as print or plot methods, plclust
, and cutree
+ are available. See the documentation for hclust
.
The e-distance measures both the heterogeneity between clusters and the + homogeneity within clusters. \(\mathcal E\)-clustering + (\(\alpha=1\)) is particularly effective in + high dimension, and is more effective than some standard hierarchical + methods when clusters have equal means (see example below). + For other advantages see the references.
+edist
computes the energy distances for the result (or any partition)
+ and returns the cluster distances in a dist
object. See the edist
+ examples.
An object of class hclust
which describes the tree produced by
+ the clustering process. The object is a list with components:
an n-1 by 2 matrix, where row i of merge
describes the
+ merging of clusters at step i of the clustering. If an element j in the
+ row is negative, then observation -j was merged at this
+ stage. If j is positive then the merge was with the cluster
+ formed at the (earlier) stage j of the algorithm.
the clustering height: a vector of n-1 non-decreasing real numbers (the e-distance between merging clusters)
a vector giving a permutation of the indices of
+ original observations suitable for plotting, in the sense that a
+ cluster plot using this ordering and matrix merge
will not have
+ crossings of the branches.
labels for each of the objects being clustered.
the call which produced the result.
the cluster method that has been used (e-distance).
the distance that has been used to create dst
.
Currently stats::hclust implements Ward's method by method="ward.D2", which operates on the squared distances; that method was previously named "ward". Because both hclust and energy use the same type of Lance-Williams recursive formula to update cluster distances, and hclust now offers the additional option method="ward.D", the energy distance method is easily implemented by hclust. (Some "Ward" algorithms do not use Lance-Williams, however.) Energy clustering (with alpha=1) and "ward.D" now return the same result, except that the cluster heights of energy hierarchical clustering with alpha=1 are two times the heights from hclust. To ensure compatibility with hclust methods, energy.hclust now passes arguments through to hclust after applying the optional exponent (if specified) to the distance.
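The factor-of-two relationship can be checked directly; a short sketch (merge order could differ if distance ties occur):
d <- dist(USArrests)
e <- energy.hclust(d)
w <- hclust(d, method = "ward.D")
all.equal(e$height, 2 * w$height)   # expected TRUE up to numerical error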
Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering
+ via Joint Between-Within Distances: Extending Ward's Minimum
+ Variance Method, Journal of Classification 22(2) 151-183.
+
doi:10.1007/s00357-005-0012-9
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).
+Szekely, G. J. (2000) Technical Report 03-05: \(\mathcal{E}\)-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.
+edist
ksample.e
eqdist.etest
hclust
if (FALSE) { # \dontrun{
+ library(cluster)
+ data(animals)
+ plot(energy.hclust(dist(animals)))
+
+ data(USArrests)
+ ecl <- energy.hclust(dist(USArrests))
+ print(ecl)
+ plot(ecl)
+ cutree(ecl, k=3)
+ cutree(ecl, h=150)
+
+ ## compare performance of e-clustering, Ward's method, group average method
+ ## when sampled populations have equal means: n=200, d=5, two groups
+ z <- rbind(matrix(rnorm(1000), nrow=200), matrix(rnorm(1000, 0, 5), nrow=200))
+ g <- c(rep(1, 200), rep(2, 200))
+ d <- dist(z)
+ e <- energy.hclust(d)
+ a <- hclust(d, method="average")
+ w <- hclust(d^2, method="ward.D2")
+ list("E" = table(cutree(e, k=2) == g), "Ward" = table(cutree(w, k=2) == g),
+ "Avg" = table(cutree(a, k=2) == g))
+ } # }
+
+
eqdist.etest.Rd
Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.
-data matrix of pooled sample
vector of sample sizes
logical: if TRUE, first argument is a distance matrix
use original (default) or distance components (discoB, discoF)
number of bootstrap replicates
a permutation of the row indices of x
The k-sample multivariate \(\mathcal{E}\)-test of equal distributions
- is performed. The statistic is computed from the original
- pooled samples, stacked in matrix x
where each row
- is a multivariate observation, or the corresponding distance matrix. The
- first sizes[1]
rows of x
are the first sample, the next
- sizes[2]
rows of x
are the second sample, etc.
The test is implemented by nonparametric bootstrap, an approximate
- permutation test with R
replicates.
The function eqdist.e
returns the test statistic only; it simply
- passes the arguments through to eqdist.etest
with R = 0
.
The k-sample multivariate \(\mathcal{E}\)-statistic for testing equal distributions
- is returned. The statistic is computed from the original pooled samples, stacked in
- matrix x
where each row is a multivariate observation, or from the distance
- matrix x
of the original data. The
- first sizes[1]
rows of x
are the first sample, the next
- sizes[2]
rows of x
are the second sample, etc.
The two-sample \(\mathcal{E}\)-statistic proposed by Szekely and Rizzo (2004) is the e-distance \(e(S_i,S_j)\), defined for two samples \(S_i, S_j\) of size \(n_i, n_j\) by
$$e(S_i,S_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],$$
where
$$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|,$$
\(\|\cdot\|\) denotes Euclidean norm, and \(X_{ip}\) denotes the p-th observation in the i-th sample.
The original (default method) k-sample \(\mathcal{E}\)-statistic is defined by summing the pairwise e-distances over all \(k(k-1)/2\) pairs of samples:
$$\mathcal{E}=\sum_{1 \leq i < j \leq k} e(S_i,S_j).$$
Large values of \(\mathcal{E}\) are significant.
-The discoB
method computes the between-sample disco statistic.
- For a one-way analysis, it is related to the original statistic as follows.
- In the above equation, the weights \(\frac{n_i n_j}{n_i+n_j}\)
- are replaced with
- $$\frac{n_i + n_j}{2N}\frac{n_i n_j}{n_i+n_j} =
- \frac{n_i n_j}{2N}$$
- where N is the total number of observations: \(N=n_1+...+n_k\).
The discoF
method is based on the disco F ratio, while the discoB
- method is based on the between sample component.
Also see disco
and disco.between
functions.
A list with class htest
containing
description of test
observed value of the test statistic
approximate p-value of the test
description of data
eqdist.e
returns test statistic only.
The pairwise e-distances between samples can be conveniently
-computed by the edist
function, which returns a dist
object.
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).
-M. L. Rizzo and G. J. Szekely (2010).
- DISCO Analysis: A Nonparametric Extension of
- Analysis of Variance, Annals of Applied Statistics,
- Vol. 4, No. 2, 1034-1055.
-
doi:10.1214/09-AOAS245
Szekely, G. J. (2000) Technical Report 03-05: \(\mathcal{E}\)-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.
-ksample.e
,
- edist
,
- disco
,
- disco.between
,
- energy.hclust
.
data(iris)
-
- ## test if the 3 varieties of iris data (d=4) have equal distributions
- eqdist.etest(iris[,1:4], c(50,50,50), R = 199)
-#>
-#> Multivariate 3-sample E-test of equal distributions
-#>
-#> data: sample sizes 50 50 50, replicates 199
-#> E-statistic = 357.71, p-value = 0.005
-#>
-
- ## example that uses method="disco"
- x <- matrix(rnorm(100), nrow=20)
- y <- matrix(rnorm(100), nrow=20)
- X <- rbind(x, y)
- d <- dist(X)
-
- # should match edist default statistic
- set.seed(1234)
- eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)
-#>
-#> 2-sample E-test of equal distributions
-#>
-#> data: sample sizes 20 20, replicates 199
-#> E-statistic = 1.9307, p-value = 0.93
-#>
-
- # comparison with edist
- edist(d, sizes=c(20, 10), distance=TRUE)
-#> 1
-#> 2 1.954117
-
- # for comparison
- g <- as.factor(rep(1:2, c(20, 20)))
- set.seed(1234)
- disco(d, factors=g, distance=TRUE, R=199)
-#> disco(x = d, factors = g, distance = TRUE, R = 199)
-#>
-#> Distance Components: index 1.00
-#> Source Df Sum Dist Mean Dist F-ratio p-value
-#> factors 1 0.96533 0.96533 0.625 0.93
-#> Within 38 58.67770 1.54415
-#> Total 39 59.64303
-
- # should match statistic in edist method="discoB", above
- set.seed(1234)
- disco.between(d, factors=g, distance=TRUE, R=199)
-#>
-#> DISCO (Between-sample)
-#>
-#> data: d
-#> DISCO between statistic = 0.96533, p-value = 0.9296
-#>
-
eqdist.etest.Rd
Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.
+The k-sample multivariate \(\mathcal{E}\)-test of equal distributions
+ is performed. The statistic is computed from the original
+ pooled samples, stacked in matrix x
where each row
+ is a multivariate observation, or the corresponding distance matrix. The
+ first sizes[1]
rows of x
are the first sample, the next
+ sizes[2]
rows of x
are the second sample, etc.
The test is implemented by nonparametric bootstrap, an approximate
+ permutation test with R
replicates.
The function eqdist.e
returns the test statistic only; it simply
+ passes the arguments through to eqdist.etest
with R = 0
.
The k-sample multivariate \(\mathcal{E}\)-statistic for testing equal distributions
+ is returned. The statistic is computed from the original pooled samples, stacked in
+ matrix x
where each row is a multivariate observation, or from the distance
+ matrix x
of the original data. The
+ first sizes[1]
rows of x
are the first sample, the next
+ sizes[2]
rows of x
are the second sample, etc.
The two-sample \(\mathcal{E}\)-statistic proposed by Szekely and Rizzo (2004) is the e-distance \(e(S_i,S_j)\), defined for two samples \(S_i, S_j\) of size \(n_i, n_j\) by
$$e(S_i,S_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],$$
where
$$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|,$$
\(\|\cdot\|\) denotes Euclidean norm, and \(X_{ip}\) denotes the p-th observation in the i-th sample.
The original (default method) k-sample \(\mathcal{E}\)-statistic is defined by summing the pairwise e-distances over all \(k(k-1)/2\) pairs of samples:
$$\mathcal{E}=\sum_{1 \leq i < j \leq k} e(S_i,S_j).$$
Large values of \(\mathcal{E}\) are significant.
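For k = 2 the statistic is the single e-distance e(S_1, S_2), which can be computed by hand from the M_ij means of pairwise distances; a sketch:
set.seed(2)
x <- matrix(rnorm(40), nrow = 20)
y <- matrix(rnorm(40), nrow = 20)
z <- rbind(x, y)
D <- as.matrix(dist(z))
i <- 1:20; j <- 21:40
(20 * 20 / 40) * (2 * mean(D[i, j]) - mean(D[i, i]) - mean(D[j, j]))
eqdist.e(z, sizes = c(20, 20))   # should return the same value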
+The discoB
method computes the between-sample disco statistic.
+ For a one-way analysis, it is related to the original statistic as follows.
+ In the above equation, the weights \(\frac{n_i n_j}{n_i+n_j}\)
+ are replaced with
+ $$\frac{n_i + n_j}{2N}\frac{n_i n_j}{n_i+n_j} =
+ \frac{n_i n_j}{2N}$$
+ where N is the total number of observations: \(N=n_1+...+n_k\).
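Equivalently, each pairwise e-distance e(S_i, S_j) is rescaled by (n_i + n_j)/(2N). A sketch of this relationship for k = 2 (the statistic comparison is the point; the p-value is random):
set.seed(3)
z <- rbind(matrix(rnorm(60), nrow = 30), matrix(rnorm(60), nrow = 30))
n <- c(30, 30); N <- sum(n)
e12 <- as.numeric(edist(dist(z), sizes = n, distance = TRUE))
e12 * (n[1] + n[2]) / (2 * N)   # discoB weighting applied to e(S_1, S_2)
eqdist.etest(dist(z), sizes = n, distance = TRUE,
             method = "discoB", R = 199)$statistic   # should match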
The discoF
method is based on the disco F ratio, while the discoB
+ method is based on the between sample component.
Also see disco
and disco.between
functions.
A list with class htest
containing
description of test
observed value of the test statistic
approximate p-value of the test
description of data
eqdist.e
returns test statistic only.
The pairwise e-distances between samples can be conveniently
+computed by the edist
function, which returns a dist
object.
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal + Distributions in High Dimension, InterStat, November (5).
+M. L. Rizzo and G. J. Szekely (2010).
+ DISCO Analysis: A Nonparametric Extension of
+ Analysis of Variance, Annals of Applied Statistics,
+ Vol. 4, No. 2, 1034-1055.
+
doi:10.1214/09-AOAS245
Szekely, G. J. (2000) Technical Report 03-05: + \(\mathcal{E}\)-statistics: Energy of + Statistical Samples, Department of Mathematics and Statistics, Bowling + Green State University.
+ksample.e
,
+ edist
,
+ disco
,
+ disco.between
,
+ energy.hclust
.
data(iris)
+
+ ## test if the 3 varieties of iris data (d=4) have equal distributions
+ eqdist.etest(iris[,1:4], c(50,50,50), R = 199)
+#>
+#> Multivariate 3-sample E-test of equal distributions
+#>
+#> data: sample sizes 50 50 50, replicates 199
+#> E-statistic = 357.71, p-value = 0.005
+#>
+
+ ## example that uses method="disco"
+ x <- matrix(rnorm(100), nrow=20)
+ y <- matrix(rnorm(100), nrow=20)
+ X <- rbind(x, y)
+ d <- dist(X)
+
+ # should match edist default statistic
+ set.seed(1234)
+ eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)
+#>
+#> 2-sample E-test of equal distributions
+#>
+#> data: sample sizes 20 20, replicates 199
+#> E-statistic = 1.9307, p-value = 0.93
+#>
+
+ # comparison with edist
+ edist(d, sizes=c(20, 10), distance=TRUE)
+#> 1
+#> 2 1.954117
+
+ # for comparison
+ g <- as.factor(rep(1:2, c(20, 20)))
+ set.seed(1234)
+ disco(d, factors=g, distance=TRUE, R=199)
+#> disco(x = d, factors = g, distance = TRUE, R = 199)
+#>
+#> Distance Components: index 1.00
+#> Source Df Sum Dist Mean Dist F-ratio p-value
+#> factors 1 0.96533 0.96533 0.625 0.93
+#> Within 38 58.67770 1.54415
+#> Total 39 59.64303
+
+ # should match statistic in edist method="discoB", above
+ set.seed(1234)
+ disco.between(d, factors=g, distance=TRUE, R=199)
+#>
+#> DISCO (Between-sample)
+#>
+#> data: d
+#> DISCO between statistic = 0.96533, p-value = 0.93
+#>
+
indep-deprecated.Rd
Computes a multivariate nonparametric test of independence.
+ The default method implements the distance covariance test
+ dcov.test
.
indep.test(x, y, method = c("dcov","mvI"), index = 1, R)
indep.test
with the default method = "dcov"
computes
+ the distance
+ covariance test of independence. index
is an exponent on
+ the Euclidean distances. Valid choices for index
are in (0,2],
+ with default value 1 (Euclidean distance). The arguments are passed
+ to the dcov.test
function. See the help topic dcov.test
for
+ its description and documentation, and also see the references below.
indep.test
with method = "mvI"
+ computes the coefficient \(\mathcal I_n\) and performs a nonparametric
+ \(\mathcal E\)-test of independence. The arguments are passed to
+ mvI.test
. The
+ index
argument is ignored (index = 1
is applied).
+ See the help topic mvI.test
and also
+ see the reference (2006) below for details.
The test decision is obtained via
+ bootstrap, with R
replicates.
+ The sample sizes (number of rows) of the two samples must agree, and
+ samples must not contain missing values.
These energy tests of independence are based on related theoretical
+ results, but different test statistics.
+ The dcov method is faster than the mvI method by approximately a factor of O(n).
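A rough timing sketch (results are machine-dependent; a small R keeps it quick):
set.seed(5)
x <- matrix(rnorm(150), nrow = 50)
y <- matrix(rnorm(150), nrow = 50)
system.time(indep.test(x, y, method = "dcov", R = 19))
system.time(indep.test(x, y, method = "mvI", R = 19))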
indep.test
returns a list with class
+ htest
containing
description of test
observed value of the test statistic \(n \mathcal V_n^2\) or \(n \mathcal I_n^2\)
\(\mathcal V_n\) or \(\mathcal I_n\)
a vector [dCov(x,y), dCor(x,y), dVar(x), dVar(y)] (method dcov)
replicates of the test statistic
approximate p-value of the test
description of data
As of energy-1.1-0,
+indep.etest
is deprecated and replaced by indep.test
, which
+has methods for two different energy tests of independence. indep.test
applies
+the distance covariance test (see dcov.test
) by default (method = "dcov"
).
+The original indep.etest
applied the independence coefficient
+\(\mathcal I_n\), which is now obtained by method = "mvI"
.
Szekely, G.J. and Rizzo, M.L. (2009),
+ Brownian Distance Covariance,
+ Annals of Applied Statistics, Vol. 3 No. 4, pp.
+ 1236-1265. (Also see discussion and rejoinder.)
+
doi:10.1214/09-AOAS312
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
Bakirov, N.K., Rizzo, M.L., and Szekely, G.J. (2006), A Multivariate
+ Nonparametric Test of Independence, Journal of Multivariate Analysis
+ 93/1, 58-80,
doi:10.1016/j.jmva.2005.10.005
# \donttest{
+ ## independent multivariate data
+ x <- matrix(rnorm(60), nrow=20, ncol=3)
+ y <- matrix(rnorm(40), nrow=20, ncol=2)
+ indep.test(x, y, method = "dcov", R = 99)
+#>
+#> dCov independence test (permutation test)
+#>
+#> data: index 1, replicates 99
+#> nV^2 = 3.2897, p-value = 0.79
+#> sample estimates:
+#> dCov
+#> 0.4055658
+#>
+ indep.test(x, y, method = "mvI", R = 99)
+#>
+#> mvI energy test of independence
+#>
+#> data: x (20 by 3), y(20 by 2), replicates 99
+#> n I^2 = 1.0105, p-value = 0.61
+#> sample estimates:
+#> I
+#> 0.2247749
+#>
+
+ ## dependent multivariate data
+ if (require(MASS)) {
+ Sigma <- matrix(c(1, .1, 0, 0, 1, 0, 0, .1, 1), 3, 3)
+ x <- mvrnorm(30, c(0, 0, 0), diag(3))
+ y <- mvrnorm(30, c(0, 0, 0), Sigma) * x
+ indep.test(x, y, R = 99) #dcov method
+ indep.test(x, y, method = "mvI", R = 99)
+ }
+#> Loading required package: MASS
+#>
+#> mvI energy test of independence
+#>
+#> data: x (30 by 3), y(30 by 3), replicates 99
+#> n I^2 = 1.1769, p-value = 0.04
+#> sample estimates:
+#> I
+#> 0.1980682
+#>
+ # }
+
kgroups.Rd
Perform k-groups clustering by energy distance.
-kgroups(x, k, iter.max = 10, nstart = 1, cluster = NULL)
Data frame or data matrix or distance object
number of clusters
maximum number of iterations
number of restarts
initial clustering vector
K-groups is based on the multisample energy distance for comparing distributions. Based on the disco decomposition of total dispersion (a Gini-type mean distance), the objective is either to maximize the total between-cluster energy distance or, equivalently, to minimize the total within-cluster energy distance. Minimizing the within-cluster distances is more computationally efficient, which makes it possible to use a modified version of the Hartigan-Wong (1979) algorithm to implement K-groups clustering.
-The within cluster Gini mean distance is
-$$G(C_j) = \frac{1}{n_j^2} \sum_{i,m=1}^{n_j} |x_{i,j} - x_{m,j}|$$
-and the K-groups within cluster distance is
-$$W_j = \frac{n_j}{2}G(C_j) = \frac{1}{2 n_j} \sum_{i,m=1}^{n_j} |x_{i,j} - x_{m,j}|.$$
-If z is the data matrix for cluster \(C_j\), then \(W_j\) could be computed as
-sum(dist(z)) / nrow(z)
.
If cluster is not NULL, the clusters are initialized by this vector (which can be a factor or integer vector). Otherwise clusters are initialized with random labels in k clusters of approximately equal size.
-If x
is not a distance object (class(x) == "dist") then x
is converted to a data matrix for analysis.
Run up to iter.max
complete passes through the data set until a local min is reached. If nstart > 1
, on second and later starts, clusters are initialized at random, and the best result is returned.
An object of class kgroups
containing the components
the function call
vector of cluster indices
cluster sizes
vector of Gini within cluster distances
sum of within cluster distances
number of moves
number of iterations
number of clusters
cluster
is a vector containing the group labels, 1 to k. print.kgroups
prints some of the components of the kgroups object.
-Expect that count is 0 if the algorithm converged to a local min (that is, 0 moves happened on the last iteration). If iterations equals iter.max and count is positive, then the algorithm did not converge to a local min.
-Li, Songzi (2015). "K-groups: A Generalization of K-means by Energy Distance." Ph.D. thesis, Bowling Green State University.
-Li, S. and Rizzo, M. L. (2017). "K-groups: A Generalization of K-means Clustering". ArXiv e-print 1711.04359. https://arxiv.org/abs/1711.04359
-Szekely, G. J., and M. L. Rizzo. "Testing for equal distributions in high dimension." InterStat 5, no. 16.10 (2004).
-Rizzo, M. L., and G. J. Szekely. "Disco analysis: A nonparametric extension of analysis of variance." The Annals of Applied Statistics (2010): 1034-1055.
-Hartigan, J. A. and Wong, M. A. (1979). "Algorithm AS 136: A K-means clustering algorithm." Applied Statistics, 28, 100-108. doi: 10.2307/2346830.
- x <- as.matrix(iris[ ,1:4])
- set.seed(123)
- kg <- kgroups(x, k = 3, iter.max = 5, nstart = 2)
- kg
-#>
-#> kgroups(x = x, k = 3, iter.max = 5, nstart = 2)
-#>
-#> K-groups cluster analysis
-#> 3 groups of size 50 38 62
-#> Within cluster distances:
-#> 17.07201 18.92376 31.53301
-#> Iterations: 3 Count: 0
- fitted(kg)
-#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
-#> [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
-#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
-#> [149] 2 3
-
- # \donttest{
- d <- dist(x)
- set.seed(123)
- kg <- kgroups(d, k = 3, iter.max = 5, nstart = 2)
- kg
-#>
-#> kgroups(x = d, k = 3, iter.max = 5, nstart = 2)
-#>
-#> K-groups cluster analysis
-#> 3 groups of size 50 38 62
-#> Within cluster distances:
-#> 17.07201 18.92376 31.53301
-#> Iterations: 3 Count: 0
-
- kg$cluster
-#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
-#> [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
-#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
-#> [149] 2 3
-
- fitted(kg)
-#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
-#> [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
-#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
-#> [149] 2 3
- fitted(kg, method = "groups")
-#> [[1]]
-#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
-#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
-#>
-#> [[2]]
-#> [1] 53 78 101 103 104 105 106 108 109 110 111 112 113 116 117 118 119 121 123
-#> [20] 125 126 129 130 131 132 133 135 136 137 138 140 141 142 144 145 146 148 149
-#>
-#> [[3]]
-#> [1] 51 52 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
-#> [20] 71 72 73 74 75 76 77 79 80 81 82 83 84 85 86 87 88 89 90
-#> [39] 91 92 93 94 95 96 97 98 99 100 102 107 114 115 120 122 124 127 128
-#> [58] 134 139 143 147 150
-#>
- # }
-
kgroups.Rd
Perform k-groups clustering by energy distance.
+kgroups(x, k, iter.max = 10, nstart = 1, cluster = NULL)
K-groups is based on the multisample energy distance for comparing distributions. Based on the disco decomposition of total dispersion (a Gini-type mean distance), the objective is either to maximize the total between-cluster energy distance or, equivalently, to minimize the total within-cluster energy distance. Minimizing the within-cluster distances is more computationally efficient, which makes it possible to use a modified version of the Hartigan-Wong (1979) algorithm to implement K-groups clustering.
+The within cluster Gini mean distance is
+$$G(C_j) = \frac{1}{n_j^2} \sum_{i,m=1}^{n_j} |x_{i,j} - x_{m,j}|$$
+and the K-groups within cluster distance is
+$$W_j = \frac{n_j}{2}G(C_j) = \frac{1}{2 n_j} \sum_{i,m=1}^{n_j} |x_{i,j} - x_{m,j}|.$$
+If z is the data matrix for cluster \(C_j\), then \(W_j\) could be computed as
+sum(dist(z)) / nrow(z)
.
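A sketch that recomputes the printed within-cluster distances of a fit from this closed form:
x <- as.matrix(iris[, 1:4])
set.seed(123)
kg <- kgroups(x, k = 3, iter.max = 5, nstart = 2)
sapply(1:3, function(j) {
  z <- x[kg$cluster == j, , drop = FALSE]
  sum(dist(z)) / nrow(z)
})   # should reproduce the "Within cluster distances" printed for kg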
If cluster is not NULL, the clusters are initialized by this vector (which can be a factor or integer vector). Otherwise clusters are initialized with random labels in k clusters of approximately equal size.
+If x
is not a distance object (class(x) == "dist") then x
is converted to a data matrix for analysis.
Run up to iter.max
complete passes through the data set until a local min is reached. If nstart > 1
, on second and later starts, clusters are initialized at random, and the best result is returned.
An object of class kgroups
containing the components
the function call
vector of cluster indices
cluster sizes
vector of Gini within cluster distances
sum of within cluster distances
number of moves
number of iterations
number of clusters
cluster
is a vector containing the group labels, 1 to k. print.kgroups
+prints some of the components of the kgroups object.
Expect that count is 0 if the algorithm converged to a local min (that is, 0 moves happened on the last iteration). If iterations equals iter.max and count is positive, then the algorithm did not converge to a local min.
+Li, Songzi (2015). "K-groups: A Generalization of K-means by Energy Distance." Ph.D. thesis, Bowling Green State University.
+Li, S. and Rizzo, M. L. (2017). "K-groups: A Generalization of K-means Clustering". ArXiv e-print 1711.04359. https://arxiv.org/abs/1711.04359
+Szekely, G. J., and M. L. Rizzo. "Testing for equal distributions in high dimension." InterStat 5, no. 16.10 (2004).
+Rizzo, M. L., and G. J. Szekely. "Disco analysis: A nonparametric extension of analysis of variance." The Annals of Applied Statistics (2010): 1034-1055.
+Hartigan, J. A. and Wong, M. A. (1979). "Algorithm AS 136: A K-means clustering algorithm." Applied Statistics, 28, 100-108. doi: 10.2307/2346830.
+ x <- as.matrix(iris[ ,1:4])
+ set.seed(123)
+ kg <- kgroups(x, k = 3, iter.max = 5, nstart = 2)
+ kg
+#>
+#> kgroups(x = x, k = 3, iter.max = 5, nstart = 2)
+#>
+#> K-groups cluster analysis
+#> 3 groups of size 50 38 62
+#> Within cluster distances:
+#> 17.07201 18.92376 31.53301
+#> Iterations: 3 Count: 0
+ fitted(kg)
+#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
+#> [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
+#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
+#> [149] 2 3
+
+ # \donttest{
+ d <- dist(x)
+ set.seed(123)
+ kg <- kgroups(d, k = 3, iter.max = 5, nstart = 2)
+ kg
+#>
+#> kgroups(x = d, k = 3, iter.max = 5, nstart = 2)
+#>
+#> K-groups cluster analysis
+#> 3 groups of size 50 38 62
+#> Within cluster distances:
+#> 17.07201 18.92376 31.53301
+#> Iterations: 3 Count: 0
+
+ kg$cluster
+#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
+#> [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
+#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
+#> [149] 2 3
+
+ fitted(kg)
+#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
+#> [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
+#> [112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
+#> [149] 2 3
+ fitted(kg, method = "groups")
+#> [[1]]
+#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
+#>
+#> [[2]]
+#> [1] 53 78 101 103 104 105 106 108 109 110 111 112 113 116 117 118 119 121 123
+#> [20] 125 126 129 130 131 132 133 135 136 137 138 140 141 142 144 145 146 148 149
+#>
+#> [[3]]
+#> [1] 51 52 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
+#> [20] 71 72 73 74 75 76 77 79 80 81 82 83 84 85 86 87 88 89 90
+#> [39] 91 92 93 94 95 96 97 98 99 100 102 107 114 115 120 122 124 127 128
+#> [58] 134 139 143 147 150
+#>
+ # }
+
mutualIndep.Rd
The test statistic is the sum of d-1 bias-corrected squared dcor statistics, where d is the number of variables. Implementation is by permutation test.
+mutualIndep.test(x, R)
A population coefficient for mutual independence of d random variables, \(d \geq 2\), is
$$\sum_{k=1}^{d-1} \mathcal R^2(X_k, [X_{k+1},\dots,X_d]),$$
which is non-negative and equals zero if and only if mutual independence holds. For example, if d=4 the population coefficient is
$$\mathcal R^2(X_1, [X_2,X_3,X_4]) + \mathcal R^2(X_2, [X_3,X_4]) + \mathcal R^2(X_3, X_4).$$
A permutation test is implemented based on the corresponding sample coefficient. To test mutual independence of \(X_1,\dots,X_d\), the test statistic is the sum of the d-1 bias-corrected \(dcor^2\) statistics:
$$\sum_{k=1}^{d-1} \mathcal R_n^*(X_k, [X_{k+1},\dots,X_d]).$$
+mutualIndep.test
returns an object of class power.htest
.
See Szekely and Rizzo (2014) for details on unbiased \(dCov^2\) and bias-corrected \(dCor^2\) (bcdcor
) statistics.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
+ Measuring and Testing Dependence by Correlation of Distances,
+ Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
+
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
+x <- matrix(rnorm(100), nrow=20, ncol=5)
+mutualIndep.test(x, 199)
+#>
+#> Energy Test of Mutual Independence
+#>
+#> statistic = -0.09018846
+#> p.value = 0.66
+#> call = mutualIndep.test(x = x, R = 199)
+#> data.name = x dim 20,5
+#> estimate = -0.060, -0.025, -0.024, 0.019
+#>
+#> NOTE: statistic=sum(bcdcor); permutation test
+#>
+
Historically this is the first energy test of independence. The
distance covariance test dcov.test
, distance correlation dcor
, and related methods are more recent (2007, 2009).
The distance covariance test dcov.test
and distance correlation test dcor.test
are much faster and have different properties than mvI.test
. All are based on a population independence coefficient that characterizes independence, and all of these tests are statistically consistent. However, dCor is scale invariant while \(\mathcal I_n\) is not. In applications dcor.test
or dcov.test
are the recommended tests.
The distance covariance test dcov.test
and distance correlation test dcor.test
are much faster and have different properties than mvI.test
. All are based on a population independence coefficient that characterizes independence and all of these tests are statistically consistent. However, dCor is scale invariant while \(\mathcal I_n\) is not. In applications dcor.test
or dcov.test
are the recommended tests.
Computing formula from Bakirov, Rizzo, and Szekely (2006), equation (2):
Suppose the two samples are \(X_1,\dots,X_n \in R^p\) and \(Y_1,\dots,Y_n \in R^q\). Define \(Z_{kl} = (X_k, Y_l) \in R^{p+q}.\)
The independence coefficient \(\mathcal I_n\) is defined in equation (2) of that reference.
\(\mathcal I_n\) is invariant to shifts and orthogonal transformations of X and Y.
\(\sqrt{n} \, \mathcal I_n\) determines a statistically consistent test of independence against all fixed dependent alternatives (Corollary 1).
The population independence coefficient \(\mathcal I\) is a normalized distance between the joint characteristic function and the product of the marginal characteristic functions. \(\mathcal I_n\) converges almost surely to \(\mathcal I\) as \(n \to \infty\). X and Y are independent if and only if \(\mathcal I(X, Y) = 0\). See the reference below for more details.
dcor2d
mvnorm-test.Rd
Performs the E-statistic (energy) test of multivariate or univariate normality.
-mvnorm.test(x, R)
-mvnorm.etest(x, R)
-mvnorm.e(x)
data matrix of multivariate sample, or univariate data vector
number of bootstrap replicates
If x
is a matrix, each row is a multivariate observation. The
- data will be standardized to zero mean and identity covariance matrix
- using the sample mean vector and sample covariance matrix. If x
- is a vector, mvnorm.e
returns the univariate statistic
- normal.e(x)
.
- If the data contains missing values or the sample covariance matrix is
- singular, mvnorm.e
returns NA.
The \(\mathcal{E}\)-test of multivariate normality was proposed and implemented by Szekely and Rizzo (2005). The test statistic for d-variate normality is given by
$$\mathcal{E} = n \left(\frac{2}{n} \sum_{i=1}^n E\|y_i-Z\| - E\|Z-Z'\| - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \|y_i-y_j\|\right),$$
where \(y_1,\ldots,y_n\) is the standardized sample, \(Z, Z'\) are iid standard d-variate normal, and \(\| \cdot \|\) denotes Euclidean norm.
-The \(\mathcal{E}\)-test of multivariate (univariate) normality
- is implemented by parametric bootstrap with R
replicates.
The value of the \(\mathcal{E}\)-statistic for multivariate
- normality is returned by mvnorm.e
.
mvnorm.test
returns a list with class htest
containing
description of test
observed value of the test statistic
approximate p-value of the test
description of data
mvnorm.etest
is replaced by mvnorm.test
.
normal.test
for the energy test of univariate
-normality and normal.e
for the statistic.
If the data is univariate, the test statistic is formally
-the same as the multivariate case, but a more efficient computational
-formula is applied in normal.e
.
normal.test
also provides an optional method for the
-test based on the asymptotic sampling distribution of the test
-statistic.
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
-Mori, T. F., Szekely, G. J. and Rizzo, M. L. "On energy tests of normality." Journal of Statistical Planning and Inference 213 (2021): 1-15.
-Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.
-Szekely, G. J. (1989) Potential and Kinetic Energy in Statistics, Lecture Notes, Budapest Institute of Technology (Technical University).
- ## compute normality test statistic for iris Setosa data
- data(iris)
- mvnorm.e(iris[1:50, 1:4])
-#> [1] 1.203397
-
- ## test if the iris Setosa data has multivariate normal distribution
- mvnorm.test(iris[1:50,1:4], R = 199)
-#>
-#> Energy test of multivariate normality: estimated parameters
-#>
-#> data: x, sample size 50, dimension 4, replicates 199
-#> E-statistic = 1.2034, p-value = 0.02513
-#>
-
mvnorm-test.Rd
Performs the E-statistic (energy) test of multivariate or univariate normality.
+mvnorm.test(x, R)
+mvnorm.etest(x, R)
+mvnorm.e(x)
If x
is a matrix, each row is a multivariate observation. The
+ data will be standardized to zero mean and identity covariance matrix
+ using the sample mean vector and sample covariance matrix. If x
+ is a vector, mvnorm.e
returns the univariate statistic
+ normal.e(x)
.
+ If the data contains missing values or the sample covariance matrix is
+ singular, mvnorm.e
returns NA.
The \(\mathcal{E}\)-test of multivariate normality was proposed and implemented by Szekely and Rizzo (2005). The test statistic for d-variate normality is given by
$$\mathcal{E} = n \left(\frac{2}{n} \sum_{i=1}^n E\|y_i-Z\| - E\|Z-Z'\| - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \|y_i-y_j\|\right),$$
where \(y_1,\ldots,y_n\) is the standardized sample, \(Z, Z'\) are iid standard d-variate normal, and \(\| \cdot \|\) denotes Euclidean norm.
+The \(\mathcal{E}\)-test of multivariate (univariate) normality
+ is implemented by parametric bootstrap with R
replicates.
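One consequence of the internal standardization is that mvnorm.e is invariant (up to numerical error) under full-rank affine transformations of the data; A and b below are arbitrary illustrative choices:
x <- as.matrix(iris[1:50, 1:4])
A <- diag(c(2, 1, 3, 0.5))   # any full-rank matrix
b <- c(1, -1, 0, 2)
mvnorm.e(x)
mvnorm.e(sweep(x %*% A, 2, b, "+"))   # expected to match mvnorm.e(x)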
The value of the \(\mathcal{E}\)-statistic for multivariate
+ normality is returned by mvnorm.e
.
mvnorm.test
returns a list with class htest
containing
description of test
observed value of the test statistic
approximate p-value of the test
description of data
mvnorm.etest
is replaced by mvnorm.test
.
normal.test
for the energy test of univariate
+normality and normal.e
for the statistic.
If the data is univariate, the test statistic is formally
+the same as the multivariate case, but a more efficient computational
+formula is applied in normal.e
.
normal.test
also provides an optional method for the
+test based on the asymptotic sampling distribution of the test
+statistic.
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
+Mori, T. F., Szekely, G. J. and Rizzo, M. L. "On energy tests of normality." Journal of Statistical Planning and Inference 213 (2021): 1-15.
+Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.
+Szekely, G. J. (1989) Potential and Kinetic Energy in Statistics, Lecture Notes, Budapest Institute of Technology (Technical University).
+ ## compute normality test statistic for iris Setosa data
+ data(iris)
+ mvnorm.e(iris[1:50, 1:4])
+#> [1] 1.203397
+
+ ## test if the iris Setosa data has multivariate normal distribution
+ mvnorm.test(iris[1:50,1:4], R = 199)
+#>
+#> Energy test of multivariate normality: estimated parameters
+#>
+#> data: x, sample size 50, dimension 4, replicates 199
+#> E-statistic = 1.2034, p-value = 0.01005
+#>
+
normalGOF.Rd
Performs the energy test of univariate normality for the composite hypothesis Case 4, estimated parameters.
-normal.test(x, method=c("mc","limit"), R)
-normal.e(x)
univariate data vector
method for p-value
number of replications if Monte Carlo method
If method="mc"
this test function applies the parametric
-bootstrap method implemented in mvnorm.test
.
If method="limit"
, the p-value of the test is computed from
-the asymptotic distribution of the test statistic under the null
-hypothesis. The asymptotic
-distribution is a quadratic form of centered Gaussian random variables,
-which has the form
-$$\sum_{k=1}^\infty \lambda_k Z_k^2,$$
-where \(\lambda_k\) are positive constants (eigenvalues) and
-\(Z_k\) are iid standard normal variables. Eigenvalues are
-pre-computed and stored internally.
-A p-value is computed using Imhof's method as implemented in the
-CompQuadForm package.
Note that the "limit" method is intended for moderately large samples because it applies the asymptotic distribution.
-The energy test of normality was proposed
- and implemented by Szekely and Rizzo (2005).
- See mvnorm.test
- for more details.
normal.e
returns the energy goodness-of-fit statistic for
-a univariate sample.
normal.test
returns a list with class htest
containing
observed value of the test statistic
p-value of the test
sample estimates: mean, sd
description of data
mvnorm.test
and mvnorm.e
for the
- energy test of multivariate normality and the test statistic
- for multivariate samples.
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
-Mori, T. F., Szekely, G. J. and Rizzo, M. L. "On energy tests of normality." Journal of Statistical Planning and Inference 213 (2021): 1-15.
-Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.
-J. P. Imhof (1961). Computing the Distribution of Quadratic Forms in Normal Variables, Biometrika, Volume 48, Issue 3/4, 419-426.
- x <- iris[1:50, 1]
- normal.e(x)
-#> [1] 0.4650295
- normal.test(x, R=199)
-#>
-#> Energy test of normality: estimated parameters
-#>
-#> data: x, sample size 50, dimension 1, replicates 199
-#> E-statistic = 0.46503, p-value = 0.3518
-#> sample estimates:
-#> mean sd
-#> 5.0060000 0.3524897
-#>
- normal.test(x, method="limit")
-#>
-#> Energy test of normality: limit distribution
-#>
-#> data: Case 4: composite hypothesis, estimated parameters
-#> statistic = 0.46503, p-value = 0.2869
-#> sample estimates:
-#> mean sd
-#> 5.0060000 0.3524897
-#>
-
normalGOF.Rd
Performs the energy test of univariate normality for the composite hypothesis Case 4, estimated parameters.
+normal.test(x, method=c("mc","limit"), R)
+normal.e(x)
If method="mc"
this test function applies the parametric
+bootstrap method implemented in mvnorm.test
.
If method="limit"
, the p-value of the test is computed from
+the asymptotic distribution of the test statistic under the null
+hypothesis. The asymptotic
+distribution is a quadratic form of centered Gaussian random variables,
+which has the form
+$$\sum_{k=1}^\infty \lambda_k Z_k^2,$$
+where \(\lambda_k\) are positive constants (eigenvalues) and
+\(Z_k\) are iid standard normal variables. Eigenvalues are
+pre-computed and stored internally.
+A p-value is computed using Imhof's method as implemented in the
+CompQuadForm package.
Note that the "limit" method is intended for moderately large samples because it applies the asymptotic distribution.
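A sketch of the limit computation with CompQuadForm; the eigenvalues below are placeholders, not the values pre-computed and stored internally by the package:
library(CompQuadForm)
lambda <- c(0.5, 0.25, 0.125, 0.0625)   # hypothetical truncated eigenvalue sequence
stat <- 0.465                           # an observed statistic
imhof(stat, lambda)$Qq                  # approximates P(sum_k lambda_k Z_k^2 > stat)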
+The energy test of normality was proposed
+ and implemented by Szekely and Rizzo (2005).
+ See mvnorm.test
+ for more details.
normal.e
returns the energy goodness-of-fit statistic for
+a univariate sample.
normal.test
returns a list with class htest
containing
observed value of the test statistic
p-value of the test
sample estimates: mean, sd
description of data
mvnorm.test
and mvnorm.e
for the
+ energy test of multivariate normality and the test statistic
+ for multivariate samples.
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
+Mori, T. F., Szekely, G. J. and Rizzo, M. L. "On energy tests of normality." Journal of Statistical Planning and Inference 213 (2021): 1-15.
+Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.
+J. P. Imhof (1961). Computing the Distribution of Quadratic Forms in Normal Variables, Biometrika, Volume 48, Issue 3/4, 419-426.
+ x <- iris[1:50, 1]
+ normal.e(x)
+#> [1] 0.4650295
+ normal.test(x, R=199)
+#>
+#> Energy test of normality: estimated parameters
+#>
+#> data: x, sample size 50, dimension 1, replicates 199
+#> E-statistic = 0.46503, p-value = 0.2915
+#> sample estimates:
+#> mean sd
+#> 5.0060000 0.3524897
+#>
+ normal.test(x, method="limit")
+#>
+#> Energy test of normality: limit distribution
+#>
+#> data: Case 4: composite hypothesis, estimated parameters
+#> statistic = 0.46503, p-value = 0.2869
+#> sample estimates:
+#> mean sd
+#> 5.0060000 0.3524897
+#>
+
pdcor.Rd
Partial distance correlation pdcor, pdcov, and tests.
-pdcov.test(x, y, z, R)
- pdcor.test(x, y, z, R)
- pdcor(x, y, z)
- pdcov(x, y, z)
data or dist object of first sample
data or dist object of second sample
data or dist object of third sample
replicates for permutation test
pdcor(x, y, z)
and pdcov(x, y, z)
compute the partial distance
-correlation and partial distance covariance, respectively,
-of x and y removing z.
A test for zero partial distance correlation (or zero partial distance covariance) is implemented in pdcor.test
and pdcov.test
.
Argument types supported are numeric data matrix, data.frame, tibble, numeric vector, class "dist" object, or factor. For unordered factors a 0-1 distance matrix is computed.
-Each test returns an object of class htest
.
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
- n = 30
- R <- 199
-
- ## mutually independent standard normal vectors
- x <- rnorm(n)
- y <- rnorm(n)
- z <- rnorm(n)
-
- pdcor(x, y, z)
-#> pdcor
-#> 0.03256314
- pdcov(x, y, z)
-#> [1] 0.01237857
- set.seed(1)
- pdcov.test(x, y, z, R=R)
-#>
-#> pdcov test
-#>
-#> data: replicates 199
-#> n V^* = 0.37136, p-value = 0.105
-#> sample estimates:
-#> pdcor
-#> 0.03256314
-#>
- set.seed(1)
- pdcor.test(x, y, z, R=R)
-#>
-#> pdcor test
-#>
-#> data: replicates 199
-#> pdcor = 0.032563, p-value = 0.105
-#> sample estimates:
-#> pdcor
-#> 0.03256314
-#>
-
-# \donttest{
- if (require(MASS)) {
- p = 4
- mu <- rep(0, p)
- Sigma <- diag(p)
-
- ## linear dependence
- y <- mvrnorm(n, mu, Sigma) + x
- print(pdcov.test(x, y, z, R=R))
-
- ## non-linear dependence
- y <- mvrnorm(n, mu, Sigma) * x
- print(pdcov.test(x, y, z, R=R))
- }
-#>
-#> pdcov test
-#>
-#> data: replicates 199
-#> n V^* = 18.664, p-value = 0.005
-#> sample estimates:
-#> pdcor
-#> 0.7661325
-#>
-#>
-#> pdcov test
-#>
-#> data: replicates 199
-#> n V^* = 0.44957, p-value = 0.165
-#> sample estimates:
-#> pdcor
-#> 0.04511353
-#>
- # }
-
pdcor.Rd
Partial distance correlation pdcor, pdcov, and tests.
+pdcov.test(x, y, z, R)
+ pdcor.test(x, y, z, R)
+ pdcor(x, y, z)
+ pdcov(x, y, z)
pdcor(x, y, z)
and pdcov(x, y, z)
compute the partial distance
+correlation and partial distance covariance, respectively,
+of x and y removing z.
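Following Szekely and Rizzo (2014), pdcor has the same algebraic form as partial correlation, computed from bias-corrected distance correlations (valid when the denominator is nonzero); a sketch:
set.seed(8)
x <- rnorm(30); y <- rnorm(30); z <- rnorm(30)
rxy <- bcdcor(x, y); rxz <- bcdcor(x, z); ryz <- bcdcor(y, z)
(rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
pdcor(x, y, z)   # should agree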
A test for zero partial distance correlation (or zero partial distance covariance) is implemented in pdcor.test
and pdcov.test
.
Argument types supported are numeric data matrix, data.frame, tibble, numeric vector, class "dist" object, or factor. For unordered factors a 0-1 distance matrix is computed.
+Each test returns an object of class htest
.
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
+ n = 30
+ R <- 199
+
+ ## mutually independent standard normal vectors
+ x <- rnorm(n)
+ y <- rnorm(n)
+ z <- rnorm(n)
+
+ pdcor(x, y, z)
+#> pdcor
+#> -0.04653524
+ pdcov(x, y, z)
+#> [1] -0.01763282
+ set.seed(1)
+ pdcov.test(x, y, z, R=R)
+#>
+#> pdcov test
+#>
+#> data: replicates 199
+#> n V^* = -0.52898, p-value = 0.85
+#> sample estimates:
+#> pdcor
+#> -0.04653524
+#>
+ set.seed(1)
+ pdcor.test(x, y, z, R=R)
+#>
+#> pdcor test
+#>
+#> data: replicates 199
+#> pdcor = -0.046535, p-value = 0.85
+#> sample estimates:
+#> pdcor
+#> -0.04653524
+#>
+
+# \donttest{
+ if (require(MASS)) {
+ p = 4
+ mu <- rep(0, p)
+ Sigma <- diag(p)
+
+ ## linear dependence
+ y <- mvrnorm(n, mu, Sigma) + x
+ print(pdcov.test(x, y, z, R=R))
+
+ ## non-linear dependence
+ y <- mvrnorm(n, mu, Sigma) * x
+ print(pdcov.test(x, y, z, R=R))
+ }
+#>
+#> pdcov test
+#>
+#> data: replicates 199
+#> n V^* = 12.29, p-value = 0.005
+#> sample estimates:
+#> pdcor
+#> 0.6850001
+#>
+#>
+#> pdcov test
+#>
+#> data: replicates 199
+#> n V^* = 0.45494, p-value = 0.105
+#> sample estimates:
+#> pdcor
+#> 0.05892834
+#>
+ # }
+
poisson.Rd
Performs the mean distance goodness-of-fit test and the energy goodness-of-fit test of Poisson distribution with unknown parameter.
-poisson.e(x)
-poisson.m(x)
-poisson.etest(x, R)
-poisson.mtest(x, R)
-poisson.tests(x, R, test="all")
vector of nonnegative integers, the sample data
number of bootstrap replicates
name of test(s)
Two distance-based tests of Poissonity are applied in poisson.tests
, "M" and "E". The default is to
-do all tests and return results in a data frame.
-Valid choices for test
are "M", "E", or "all" with
-default "all".
If "all" tests, all tests are performed by a single parametric bootstrap computing all test statistics on each sample.
-The "M" choice is two tests, one based on a Cramer-von Mises distance and the other an Anderson-Darling distance. The "E" choice is the energy goodness-of-fit test.
-R
must be a positive integer for a test. If R
is missing or 0, a warning is printed but test statistics are computed (without testing).
The mean distance test of Poissonity (M-test) is based on the result that the sequence
- of expected values E|X-j|, j=0,1,2,... characterizes the distribution of
- the random variable X. As an application of this characterization one can
- get an estimator \(\hat F(j)\) of the CDF. The test statistic
- (see poisson.m
) is a Cramer-von Mises type of distance, with
- M-estimates replacing the usual EDF estimates of the CDF:
- $$M_n = n\sum_{j=0}^\infty (\hat F(j) - F(j\;; \hat \lambda))^2
- f(j\;; \hat \lambda).$$
In poisson.tests
, an Anderson-Darling type of weight is also applied when test="M"
or test="all"
.
The tests are implemented by parametric bootstrap with
- R
replicates.
An energy goodness-of-fit test (E) is based on the test statistic
$$Q_n = n \left(\frac{2}{n} \sum_{i=1}^n E|x_i - X| - E|X-X'| - \frac{1}{n^2} \sum_{i,j=1}^n |x_i - x_j|\right),$$
where X and X' are iid with the hypothesized null distribution. For a test of H: X ~ Poisson(\(\lambda\)), we can express E|X-X'| in terms of Bessel functions, and E|x_i - X| in terms of the CDF of Poisson(\(\lambda\)).
-If test=="all" or not specified, all tests are run with a single parametric bootstrap. poisson.mtest
implements only the Poisson M-test with Cramer-von Mises type distance. poisson.etest
implements only the Poisson energy test.
The functions poisson.m
and poisson.e
return the test statistics. The functions
-poisson.mtest
and poisson.etest
each return an htest
object containing
Description of test
observed value of the test statistic
approximate p-value of the test
replicates R
sample mean
poisson.tests
returns "M-CvM test", "M-AD test" and "Energy test" results in a data frame with columns
sample mean
observed value of the test statistic
approximate p-value of the test
Description of test
which can be coerced to a tibble
.
The running time of the M test is much shorter than that of the E-test.
-Szekely, G. J. and Rizzo, M. L. (2004) Mean Distance Test of Poisson Distribution, Statistics and Probability Letters, 67/3, 241-247. doi:10.1016/j.spl.2004.01.005.
-Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
- x <- rpois(50, 2)
- poisson.m(x)
-#> M-CvM M-AD
-#> 0.07368603 0.42332826
- poisson.e(x)
-#> E
-#> 0.6370008
- # \donttest{
- poisson.etest(x, R=199)
-#>
-#> Poisson E-test
-#>
-#> data: replicates: 199
-#> E = 0.637, p-value = 0.4623
-#> sample estimates:
-#> [1] 2.06
-#>
- poisson.mtest(x, R=199)
-#>
-#> Poisson M-test
-#>
-#> data: x replicates: 199
-#> M-CvM = 0.073686, p-value = 0.4422
-#> sample estimates:
-#> [1] 2.06
-#>
- poisson.tests(x, R=199)
-#> estimate statistic p.value method
-#> M-CvM 2.06 0.07368603 0.4773869 M-CvM test
-#> M-AD 2.06 0.42332826 0.4673367 M-AD test
-#> E 2.06 0.63700084 0.4673367 Energy test
- # }
-
poisson.Rd
Performs the mean distance goodness-of-fit test and the energy goodness-of-fit test of Poisson distribution with unknown parameter.
+poisson.e(x)
+poisson.m(x)
+poisson.etest(x, R)
+poisson.mtest(x, R)
+poisson.tests(x, R, test="all")
Two distance-based tests of Poissonity are applied in poisson.tests
, "M" and "E". The default is to
+do all tests and return results in a data frame.
+Valid choices for test
are "M", "E", or "all" with
+default "all".
If "all" tests, all tests are performed by a single parametric bootstrap computing all test statistics on each sample.
+The "M" choice is two tests, one based on a Cramer-von Mises distance and the other an Anderson-Darling distance. The "E" choice is the energy goodness-of-fit test.
+R
must be a positive integer for a test. If R
is missing or 0, a warning is printed but test statistics are computed (without testing).
The mean distance test of Poissonity (M-test) is based on the result that the sequence
+ of expected values E|X-j|, j=0,1,2,... characterizes the distribution of
+ the random variable X. As an application of this characterization one can
+ get an estimator \(\hat F(j)\) of the CDF. The test statistic
+ (see poisson.m
) is a Cramer-von Mises type of distance, with
+ M-estimates replacing the usual EDF estimates of the CDF:
+ $$M_n = n\sum_{j=0}^\infty (\hat F(j) - F(j\;; \hat \lambda))^2
+ f(j\;; \hat \lambda).$$
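One way to invert the characterization (a sketch of the derivation, not necessarily the package's internal estimator): for integer-valued X, E|X-j| - E|X-(j+1)| = 1 - 2F(j), so F(j) can be estimated by (1 - mean|x-j| + mean|x-(j+1)|)/2.
set.seed(7)
x <- rpois(5000, 2)
Fhat <- function(j) (1 - mean(abs(x - j)) + mean(abs(x - (j + 1)))) / 2
rbind(Fhat = sapply(0:5, Fhat),
      pois = ppois(0:5, lambda = mean(x)))   # rows agree closely for large n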
In poisson.tests
, an Anderson-Darling type of weight is also applied when test="M"
or test="all"
.
The tests are implemented by parametric bootstrap with
+ R
replicates.
An energy goodness-of-fit test (E) is based on the test statistic
$$Q_n = n \left(\frac{2}{n} \sum_{i=1}^n E|x_i - X| - E|X-X'| - \frac{1}{n^2} \sum_{i,j=1}^n |x_i - x_j|\right),$$
where X and X' are iid with the hypothesized null distribution. For a test of H: X ~ Poisson(\(\lambda\)), we can express E|X-X'| in terms of Bessel functions, and E|x_i - X| in terms of the CDF of Poisson(\(\lambda\)).
+If test=="all" or not specified, all tests are run with a single parametric bootstrap. poisson.mtest
implements only the Poisson M-test with Cramer-von Mises type distance. poisson.etest
implements only the Poisson energy test.
The functions poisson.m
and poisson.e
return the test statistics. The functions
+poisson.mtest
and poisson.etest
each return an htest
object containing
Description of test
observed value of the test statistic
approximate p-value of the test
replicates R
sample mean
poisson.tests
returns "M-CvM test", "M-AD test" and "Energy test" results in a data frame with columns
sample mean
observed value of the test statistic
approximate p-value of the test
Description of test
which can be coerced to a tibble
.
The running time of the M test is much shorter than that of the E-test.
+Szekely, G. J. and Rizzo, M. L. (2004) Mean Distance Test of Poisson Distribution, Statistics and Probability Letters, 67/3, 241-247. doi:10.1016/j.spl.2004.01.005.
+Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
+ x <- rpois(50, 2)
+ poisson.m(x)
+#> M-CvM M-AD
+#> 0.07368603 0.42332826
+ poisson.e(x)
+#> E
+#> 0.6370008
+ # \donttest{
+ poisson.etest(x, R=199)
+#>
+#> Poisson E-test
+#>
+#> data: replicates: 199
+#> E = 0.637, p-value = 0.4623
+#> sample estimates:
+#> [1] 2.06
+#>
+ poisson.mtest(x, R=199)
+#>
+#> Poisson M-test
+#>
+#> data: x replicates: 199
+#> M-CvM = 0.073686, p-value = 0.4422
+#> sample estimates:
+#> [1] 2.06
+#>
+ poisson.tests(x, R=199)
+#> estimate statistic p.value method
+#> M-CvM 2.06 0.07368603 0.4773869 M-CvM test
+#> M-AD 2.06 0.42332826 0.4673367 M-AD test
+#> E 2.06 0.63700084 0.4673367 Energy test
+ # }
+
sortrank.Rd
A utility that returns a list with components equivalent to sort(x), order(x), and rank(x, ties.method = "first").
-sortrank(x)
vector compatible with sort(x)
This utility exists to save a little time on large vectors when two or all three of the sort(), order(), rank() results are required. In case of ties, the ranks component matches rank(x, ties.method = "first")
.
A list with components
-the sorted input vector x
the permutation = order(x) which rearranges x into ascending order
the ranks of x
In benchmarks this function was faster than the combined calls to sort
and rank
.
See sort
.
sortrank(rnorm(5))
-#> $x
-#> [1] -0.5785381 -0.4833321 0.6799946 0.8886331 1.6365181
-#>
-#> $ix
-#> [1] 5 1 3 4 2
-#>
-#> $r
-#> [1] 2 5 3 4 1
-#>
-
sortrank.Rd
A utility that returns a list with components equivalent to sort(x), order(x), and rank(x, ties.method = "first").
+sortrank(x)
This utility exists to save a little time on large vectors when two or all three of the sort(), order(), rank() results are required. In case of ties, the ranks component matches rank(x, ties.method = "first")
.
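A quick equivalence check (component names $x, $ix, $r as in the example below):
x <- c(3, 1, 4, 1, 5)
sr <- sortrank(x)
all.equal(sr$x, sort(x))
all.equal(sr$ix, order(x))
all.equal(sr$r, rank(x, ties.method = "first"))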
A list with components
+the sorted input vector x
the permutation = order(x) which rearranges x into ascending order
the ranks of x
In benchmarks this function was faster than the combined calls to sort
and rank
.
See sort
.
sortrank(rnorm(5))
+#> $x
+#> [1] -0.5785381 -0.4833321 0.6799946 0.8886331 1.6365181
+#>
+#> $ix
+#> [1] 5 1 3 4 2
+#>
+#> $r
+#> [1] 2 5 3 4 1
+#>
+