Tests the significance of each monothetic clustering split using permutation
methods. The "simple-withhold" method ("sw") shuffles the observations
between two groups without the splitting variable. The other two methods
shuffle the values in the splitting variable to create a new data set, then
either split again on that variable ("resplit-limit", "rl") or use all
variables as the splitting candidates ("resplit-nolimit", "rn").
    perm.test(
      object,
      data,
      auto.pick = FALSE,
      sig.val = 0.05,
      method = c("sw", "rl", "rn"),
      rep = 1000L,
      stat = c("f", "aw"),
      bon.adj = TRUE,
      ncores = 1L
    )
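Argument | Description |
---|---|
object | The MonoClust object. |
data | The data set which is being clustered. |
auto.pick | Whether the algorithm stops when the p-value becomes larger than sig.val. |
sig.val | Significance value used to decide when to stop splitting. This option is ignored if auto.pick = FALSE. |
method | Can be chosen between "sw" (simple-withhold), "rl" (resplit-limit), and "rn" (resplit-nolimit). |
rep | Number of permutations used to build the reference distribution of the test statistic. |
stat | Statistic to use. Choose between "f" (Calinski-Harabasz pseudo-F; Calinski and Harabasz, 1974) and "aw" (average silhouette width; Rousseeuw, 1987). |
bon.adj | Whether to adjust for the multiple testing problem using the Bonferroni correction. |
ncores | Number of CPU cores to use on the current host. When set to NULL, all available cores are used. |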
The same MonoClust object with an extra column (p-value), as well as the
numofclusters object if auto.pick = TRUE.
In the simple-withhold method ("sw", Method 1), the stat values calculated
from the shuffles form the reference distribution used to find the p-value.
Because the splitting variable that was chosen is already the best in terms
of reduction of inertia, that variable is withheld from the distance matrix
used in the permutation test.
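The simple-withhold idea can be sketched in base R. This is a minimal illustration, not the package's implementation: `pseudo_f` and `perm_p_value` are hypothetical helpers, the statistic is computed from raw coordinates rather than from a distance matrix, the toy data are simulated, and the `+ 1` correction in the p-value is a common convention assumed here.

```r
# Hypothetical helper: Calinski-Harabasz pseudo-F for a grouping,
# computed from squared Euclidean distances to the group centroids.
pseudo_f <- function(x, groups) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(groups))
  total_ss <- sum(sweep(x, 2, colMeans(x))^2)
  within_ss <- sum(vapply(split(seq_len(n), groups), function(idx) {
    xi <- x[idx, , drop = FALSE]
    sum(sweep(xi, 2, colMeans(xi))^2)
  }, numeric(1)))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

# Hypothetical helper: shuffle the group labels `rep` times and compare
# the permuted statistics with the observed one.
perm_p_value <- function(x, groups, rep = 999L) {
  observed <- pseudo_f(x, groups)
  permuted <- replicate(rep, pseudo_f(x, sample(groups)))
  # Assumed convention: count the observed statistic as one permutation.
  (sum(permuted >= observed) + 1) / (rep + 1)
}

set.seed(1)
toy <- data.frame(x = c(rnorm(20, 0), rnorm(20, 5)),
                  y = c(rnorm(20, 0), rnorm(20, 5)))
groups <- toy$x < 2.5                  # the split under test, made on x
withheld <- toy[, "y", drop = FALSE]   # x is withheld from the test data
p <- perm_p_value(withheld, groups)
p
```

Withholding `x` matters because the split was chosen to separate the groups on `x`; testing on the remaining variables asks whether the two groups also differ elsewhere.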
The resplit-limit method ("rl", Method 2) shuffles the values of the
splitting variable while keeping the other variables fixed to create a new
data set; the chosen stat is then calculated for each rep and compared with
the observed stat.
The resplit-nolimit method ("rn", Method 3) is similar to Method 2 ("rl"), but all variables are splitting candidates.
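The two resplit variants can be sketched together in base R. This is an illustration under stated assumptions rather than the package's code: `split_inertia`, `best_split`, and `resplit_p` are hypothetical helpers, and the within-group inertia of the best split stands in for the "f" and "aw" statistics (so smaller values are more extreme). The `candidates` argument switches between re-splitting on the shuffled variable only and re-splitting on every variable.

```r
# Hypothetical helper: within-group sum of squares of a two-group split.
split_inertia <- function(x, left) {
  x <- as.matrix(x)
  ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)
  ss(x[left, , drop = FALSE]) + ss(x[!left, , drop = FALSE])
}

# Hypothetical helper: smallest inertia over all cut points of one variable.
best_split <- function(x, var) {
  v <- sort(unique(x[[var]]))
  cuts <- (head(v, -1) + tail(v, -1)) / 2   # midpoints between values
  min(vapply(cuts, function(cut) split_inertia(x, x[[var]] < cut), numeric(1)))
}

# Shuffle `var`, re-split on `candidates`, and compare with the observed split.
resplit_p <- function(x, var, candidates = var, rep = 199L) {
  observed <- best_split(x, var)
  permuted <- replicate(rep, {
    shuffled <- x
    shuffled[[var]] <- sample(shuffled[[var]])  # other variables stay fixed
    min(vapply(candidates, function(v) best_split(shuffled, v), numeric(1)))
  })
  # Assumed convention: count the observed statistic as one permutation.
  (sum(permuted <= observed) + 1) / (rep + 1)
}

set.seed(2)
toy <- data.frame(x = c(rnorm(15, 0), rnorm(15, 6)),
                  y = c(rnorm(15, 0), rnorm(15, 6)))
p_limit   <- resplit_p(toy, "x")                            # re-split on x only
p_nolimit <- resplit_p(toy, "x", candidates = names(toy))   # re-split on x or y
c(limit = p_limit, nolimit = p_nolimit)
```

Shuffling a single column breaks its association with the other variables, so a split that is only as good as the shuffled ones gives little evidence of real cluster structure.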
A hypothesis test that occurs lower in the monothetic clustering tree can have its p-value corrected for the multiple tests that had to happen before it in order to reach that node. The formula is $$adj.p = unadj.p \times depth,$$ where \(depth\) is 1 at the root node.
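As a small worked example of this formula, the sketch below applies the adjustment to a few made-up p-values. The helpers `node_depth` and `adjust_p` are hypothetical. The sketch assumes nodes are numbered as in a binary tree where node k has children 2k and 2k + 1 (which matches the node numbers in the printed tree), and it caps the result at 1, which is an assumption not stated in the formula.

```r
# Hypothetical helper: depth of a node from its number (root = 1).
node_depth <- function(node) floor(log2(node)) + 1

# Hypothetical helper: Bonferroni-style adjustment by depth, capped at 1.
adjust_p <- function(unadj_p, node) pmin(1, unadj_p * node_depth(node))

# Made-up unadjusted p-values at nodes 1, 2, 6, and 12 (depths 1 to 4).
adj <- adjust_p(c(0.001, 0.004, 0.2, 0.4), node = c(1, 2, 6, 12))
adj
```

Here the depths are 1, 2, 3, and 4, so the adjusted values are 0.001, 0.008, 0.6, and 1 (the last one is 1.6 before capping).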
This function uses foreach::foreach() to facilitate parallel processing. It
distributes reps to processes.
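The pattern of distributing permutations with foreach might look like the sketch below. It is not taken from the package: it assumes the foreach and doParallel packages are installed, uses doParallel as the backend (an assumption), and uses a stand-in statistic (absolute difference in group means) on simulated data.

```r
library(foreach)
library(doParallel)

# Assumed backend: a local cluster with two worker processes.
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

x <- c(rnorm(20, 0), rnorm(20, 1))          # toy data
groups <- rep(c(TRUE, FALSE), each = 20)    # the grouping under test
observed <- abs(mean(x[groups]) - mean(x[!groups]))

# Each iteration is one permutation; foreach hands iterations to the workers
# and combines the returned statistics into a single numeric vector.
perm_stats <- foreach(i = seq_len(200), .combine = c) %dopar% {
  shuffled <- sample(groups)
  abs(mean(x[shuffled]) - mean(x[!shuffled]))
}

parallel::stopCluster(cl)

p <- (sum(perm_stats >= observed) + 1) / (length(perm_stats) + 1)
p
```

Because each permutation is independent of the others, the work splits cleanly across processes; reproducible random numbers across workers would need extra care that this sketch omits.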
Calinski, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1-27. doi: 10.1080/03610927408827101.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53-65. ISSN: 03770427. doi: 10.1016/0377-0427(87)90125-7.
    library(cluster)
    data(ruspini)

    ruspini6sol <- MonoClust(ruspini, nclusters = 6)
    ruspini6.p_value <- perm.test(ruspini6sol, data = ruspini, method = "sw",
                                  rep = 1000)
    ruspini6.p_value
    #> n = 75
    #>
    #> Node) Split, N, Cluster Inertia, Proportion Inertia Explained, Bonferroni adj. p-value
    #>       * denotes terminal node
    #>
    #> 1) root 75 244373.900 0.6344215 0.003
    #>   2) y < 91 35 43328.460 0.9472896 0.003
    #>     4) x < 37 20 3689.500 0.003 *
    #>     5) x >= 37 15 1456.533 0.003 *
    #>   3) y >= 91 40 46009.380 0.7910436 0.003
    #>     6) x < 63.5 23 3176.783 0.9648762 0.003
    #>       12) x < 45 13 600.000 0.003 *
    #>       13) x >= 45 10 1033.400 0.003 *
    #>     7) x >= 63.5 17 4558.235 0.9585605 0.003
    #>       14) x < 85.5 4 381.750 0.003 *
    #>       15) x >= 85.5 13 1422.154 0.003 *
    #>
    #> Note: One or more of the splits chosen had an alternative split that reduced inertia by the same amount. See "alt" column of "frame" object for details.