Package weka.clusterers
Class EM
- All Implemented Interfaces:
public class EM
extends RandomizableDensityBasedClusterer
implements NumberOfClustersRequestable, WeightedInstancesHandler
Simple EM (expectation maximisation) class.
EM assigns a probability distribution to each instance, indicating the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross-validation, or you may specify a priori how many clusters to generate.
The cross-validation performed to determine the number of clusters is done in the following steps:
1. The number of clusters is set to 1.
2. The training set is split randomly into 10 folds.
3. EM is performed 10 times using the 10 folds in the usual cross-validation way.
4. The log likelihood is averaged over all 10 results.
5. If the log likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2.
The number of folds is fixed at 10, as long as the number of instances in the training set is not smaller than 10. If it is smaller, the number of folds is set equal to the number of instances.
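The selection loop described in the steps above can be sketched in plain Java. This is a minimal illustration, not Weka's implementation: the class `ClusterCountSelection`, its method names, and the `cvLogLikelihood` callback (standing in for "run EM over the folds and average the log likelihood") are all hypothetical. The `minImprovement` and `maxClusters` parameters are assumed to behave like the `-ll-cv` and `-max` options.

```java
import java.util.function.IntToDoubleFunction;

public class ClusterCountSelection {

    /** Number of CV folds: 10, or the number of instances if there are fewer than 10. */
    public static int numFolds(int numInstances) {
        return Math.min(10, numInstances);
    }

    /**
     * Starts at one cluster and adds a cluster while the cross-validated
     * log likelihood improves by more than minImprovement.
     * A negative maxClusters means "no upper limit".
     * cvLogLikelihood is a hypothetical stand-in for the averaged CV score.
     */
    public static int selectNumClusters(IntToDoubleFunction cvLogLikelihood,
                                        double minImprovement, int maxClusters) {
        int k = 1;
        double best = cvLogLikelihood.applyAsDouble(k);
        while (maxClusters < 0 || k < maxClusters) {
            double next = cvLogLikelihood.applyAsDouble(k + 1);
            if (next - best <= minImprovement) {
                break; // no sufficient improvement: keep the current k
            }
            best = next;
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        // Toy scores standing in for averaged log likelihoods at k = 1..5.
        double[] scores = {-520.0, -410.0, -395.0, -396.5, -390.0};
        IntToDoubleFunction toy = k -> scores[k - 1];
        System.out.println(selectNumClusters(toy, 1e-6, -1)); // prints 3
        System.out.println(numFolds(7));                      // prints 7
    }
}
```

Note that the loop stops at the first non-improving step, so a later improvement (k = 5 in the toy scores) is never examined.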
Missing values are globally replaced with ReplaceMissingValues. Valid options are:
-N <num> number of clusters. If omitted or -1 specified, then cross validation is used to select the number of clusters.
-X <num> Number of folds to use when cross-validating to find the best number of clusters.
-K <num> Number of runs of k-means to perform. (default 10)
-max <num> Maximum number of clusters to consider during cross-validation. If omitted or -1 specified, then there is no upper limit on the number of clusters.
-ll-cv <num> Minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters. (default 1e-6)
-I <num> max iterations. (default 100)
-ll-iter <num> Minimum improvement in log likelihood required to perform another iteration of the E and M steps. (default 1e-6)
-V verbose.
-M <num> minimum allowable standard deviation for normal density computation (default 1e-6)
-O Display model in old format (good when there are many clusters)
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 100)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
- Version:
- $Revision: 15519 $
- Author:
- Mark Hall, Eibe Frank
- See Also:
Constructor Summary
Constructors
EM(): Constructor.
Method Summary
Modifier and Type Method: Description
void buildClusterer(Instances data): Generates a clusterer.
double[] clusterPriors(): Returns the cluster priors.
String debugTipText(): Returns the tip text for this property.
String displayModelInOldFormatTipText(): Returns the tip text for this property.
Capabilities getCapabilities(): Returns default capabilities of the clusterer (i.e., the ones of SimpleKMeans).
double[][][] getClusterModelsNumericAtts(): Return the normal distributions for the cluster models.
double[] getClusterPriors(): Return the priors for the clusters.
boolean getDebug(): Get debug mode.
boolean getDisplayModelInOldFormat(): Get whether to display model output in the old, original format.
int getMaximumNumberOfClusters(): Get the maximum number of clusters to consider when cross-validating.
int getMaxIterations(): Get the maximum number of iterations.
double getMinLogLikelihoodImprovementCV(): Get the minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters when cross-validating to find the best number of clusters.
double getMinLogLikelihoodImprovementIterating(): Get the minimum improvement in log likelihood necessary to perform another iteration of the E and M steps.
double getMinStdDev(): Get the minimum allowable standard deviation.
int getNumClusters(): Get the number of clusters.
int getNumExecutionSlots(): Get the degree of parallelism to use.
int getNumFolds(): Get the number of folds to use when cross-validating to find the best number of clusters.
int getNumKMeansRuns(): Returns the number of runs of k-means to perform.
String[] getOptions(): Gets the current settings of EM.
String getRevision(): Returns the revision string.
String globalInfo(): Returns a string describing this clusterer.
Enumeration<Option> listOptions(): Returns an enumeration describing the available options.
double[] logDensityPerClusterForInstance(Instance inst): Computes the log of the conditional density (per cluster) for a given instance.
static void main(String[] argv): Main method for testing this class.
String maximumNumberOfClustersTipText(): Returns the tip text for this property.
String maxIterationsTipText(): Returns the tip text for this property.
String minLogLikelihoodImprovementCVTipText(): Returns the tip text for this property.
String minLogLikelihoodImprovementIteratingTipText(): Returns the tip text for this property.
String minStdDevTipText(): Returns the tip text for this property.
int numberOfClusters(): Returns the number of clusters.
String numClustersTipText(): Returns the tip text for this property.
String numExecutionSlotsTipText(): Returns the tip text for this property.
String numFoldsTipText(): Returns the tip text for this property.
String numKMeansRunsTipText(): Returns the tip text for this property.
void setDebug(boolean v): Set debug mode - verbose output.
void setDisplayModelInOldFormat(boolean d): Set whether to display model output in the old, original format.
void setMaximumNumberOfClusters(int n): Set the maximum number of clusters to consider when cross-validating.
void setMaxIterations(int i): Set the maximum number of iterations to perform.
void setMinLogLikelihoodImprovementCV(double min): Set the minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters when cross-validating to find the best number of clusters.
void setMinLogLikelihoodImprovementIterating(double min): Set the minimum improvement in log likelihood necessary to perform another iteration of the E and M steps.
void setMinStdDev(double m): Set the minimum value for standard deviation when calculating normal density.
void setMinStdDevPerAtt(double[] m)
void setNumClusters(int n): Set the number of clusters (-1 to select by CV).
void setNumExecutionSlots(int slots): Set the degree of parallelism to use.
void setNumFolds(int folds): Set the number of folds to use when cross-validating to find the best number of clusters.
void setNumKMeansRuns(int intValue): Set the number of runs of SimpleKMeans to perform.
void setOptions(String[] options): Parses a given list of options.
String toString(): Outputs the generated clusters into a string.
Methods inherited from class weka.clusterers.RandomizableDensityBasedClusterer
getSeed, seedTipText, setSeed
Methods inherited from class weka.clusterers.AbstractDensityBasedClusterer
distributionForInstance, logDensityForInstance, logJointDensitiesForInstance, makeCopies
Methods inherited from class weka.clusterers.AbstractClusterer
clusterInstance, doNotCheckCapabilitiesTipText, forName, getDoNotCheckCapabilities, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDoNotCheckCapabilities
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface weka.clusterers.Clusterer
Constructor Details
public EM()
Constructor.
Method Details
public String globalInfo()
Returns a string describing this clusterer.
- Returns:
- a description of the clusterer suitable for displaying in the explorer/experimenter GUI
public Enumeration<Option> listOptions()
Returns an enumeration describing the available options.
- Specified by:
- listOptions in interface OptionHandler
- Overrides:
- listOptions in class RandomizableDensityBasedClusterer
- Returns:
- an enumeration of all the available options.
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:
-N <num> number of clusters. If omitted or -1 specified, then cross validation is used to select the number of clusters.
-X <num> Number of folds to use when cross-validating to find the best number of clusters.
-K <num> Number of runs of k-means to perform. (default 10)
-max <num> Maximum number of clusters to consider during cross-validation. If omitted or -1 specified, then there is no upper limit on the number of clusters.
-ll-cv <num> Minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters. (default 1e-6)
-I <num> max iterations. (default 100)
-ll-iter <num> Minimum improvement in log likelihood required to perform another iteration of the E and M steps. (default 1e-6)
-V verbose.
-M <num> minimum allowable standard deviation for normal density computation (default 1e-6)
-O Display model in old format (good when there are many clusters)
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 100)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
- Specified by:
- setOptions in interface OptionHandler
- Overrides:
- setOptions in class RandomizableDensityBasedClusterer
- Parameters:
- options - the list of options as an array of strings
- Throws:
- Exception - if an option is not supported
public String numKMeansRunsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public int getNumKMeansRuns()
Returns the number of runs of k-means to perform.
- Returns:
- the number of runs
public void setNumKMeansRuns(int intValue)
Set the number of runs of SimpleKMeans to perform.
- Parameters:
- intValue - the number of runs
public String numFoldsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setNumFolds(int folds)
Set the number of folds to use when cross-validating to find the best number of clusters.
- Parameters:
- folds - the number of folds to use
public int getNumFolds()
Get the number of folds to use when cross-validating to find the best number of clusters.
- Returns:
- the number of folds to use
public String minLogLikelihoodImprovementCVTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setMinLogLikelihoodImprovementCV(double min)
Set the minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters when cross-validating to find the best number of clusters.
- Parameters:
- min - the minimum improvement in log likelihood
public double getMinLogLikelihoodImprovementCV()
Get the minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters when cross-validating to find the best number of clusters.
- Returns:
- the minimum improvement in log likelihood
public String minLogLikelihoodImprovementIteratingTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setMinLogLikelihoodImprovementIterating(double min)
Set the minimum improvement in log likelihood necessary to perform another iteration of the E and M steps.
- Parameters:
- min - the minimum improvement in log likelihood
public double getMinLogLikelihoodImprovementIterating()
Get the minimum improvement in log likelihood necessary to perform another iteration of the E and M steps.
- Returns:
- the minimum improvement in log likelihood
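To illustrate how a maximum iteration count and a minimum log-likelihood improvement interact as stopping rules, here is a self-contained sketch of EM for a two-component, one-dimensional Gaussian mixture. It is not Weka's code: the class `EmIterationSketch`, the fixed two components, the min/max initialisation (Weka initialises from k-means runs), and the use of the average log likelihood are all assumptions made for brevity.

```java
public class EmIterationSketch {

    /** Log of the normal density N(x; mean, sd). */
    public static double logNormal(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    /**
     * Fits a two-component 1-D Gaussian mixture.
     * Stops after maxIterations, or earlier when the average log likelihood
     * improves by less than minImprovement.
     * Returns {mean of component 0, mean of component 1, iterations used}.
     */
    public static double[] fit(double[] x, int maxIterations,
                               double minImprovement, double minStdDev) {
        int n = x.length;
        double lo = x[0], hi = x[0];
        for (double v : x) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
        double[] prior = {0.5, 0.5};
        double[] mean = {lo, hi};          // crude initialisation (assumption)
        double[] sd = {1.0, 1.0};
        double[][] resp = new double[n][2];
        double previous = Double.NEGATIVE_INFINITY;
        int iterations = 0;
        while (iterations < maxIterations) {
            iterations++;
            // E step: responsibilities and log likelihood, via log-sum-exp.
            double logLik = 0.0;
            for (int i = 0; i < n; i++) {
                double a = Math.log(prior[0]) + logNormal(x[i], mean[0], sd[0]);
                double b = Math.log(prior[1]) + logNormal(x[i], mean[1], sd[1]);
                double m = Math.max(a, b);
                double logSum = m + Math.log(Math.exp(a - m) + Math.exp(b - m));
                resp[i][0] = Math.exp(a - logSum);
                resp[i][1] = Math.exp(b - logSum);
                logLik += logSum;
            }
            logLik /= n;
            if (logLik - previous < minImprovement) {
                break; // improvement too small: stop iterating
            }
            previous = logLik;
            // M step: re-estimate priors, means and standard deviations.
            for (int c = 0; c < 2; c++) {
                double weight = 0.0, sum = 0.0;
                for (int i = 0; i < n; i++) {
                    weight += resp[i][c];
                    sum += resp[i][c] * x[i];
                }
                if (weight <= 0.0) continue; // empty component: leave unchanged
                mean[c] = sum / weight;
                double variance = 0.0;
                for (int i = 0; i < n; i++) {
                    double d = x[i] - mean[c];
                    variance += resp[i][c] * d * d;
                }
                sd[c] = Math.max(Math.sqrt(variance / weight), minStdDev);
                prior[c] = weight / n;
            }
        }
        return new double[] {mean[0], mean[1], iterations};
    }

    public static void main(String[] args) {
        double[] data = {0.9, 1.0, 1.1, 9.0, 10.0, 11.0};
        double[] result = fit(data, 100, 1e-6, 1e-6);
        System.out.printf("means: %.3f, %.3f after %d iterations%n",
                result[0], result[1], (int) result[2]);
    }
}
```

On well-separated data such as the toy array in `main`, the improvement threshold ends the loop long before the iteration cap is reached.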
public String numExecutionSlotsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setNumExecutionSlots(int slots)
Set the degree of parallelism to use.
- Parameters:
- slots - the number of tasks to run in parallel
public int getNumExecutionSlots()
Get the degree of parallelism to use.
- Returns:
- the number of tasks to run in parallel
public String displayModelInOldFormatTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setDisplayModelInOldFormat(boolean d)
Set whether to display model output in the old, original format.
- Parameters:
- d - true if model output is to be shown in the old format
public boolean getDisplayModelInOldFormat()
Get whether to display model output in the old, original format.
- Returns:
- true if model output is to be shown in the old format
public String minStdDevTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setMinStdDev(double m)
Set the minimum value for standard deviation when calculating normal density. Raising this value can help prevent arithmetic overflow resulting from multiplying large densities (arising from small standard deviations) when there are many singleton or near singleton values.
- Parameters:
- m - minimum value for standard deviation
public void setMinStdDevPerAtt(double[] m)
public double getMinStdDev()
Get the minimum allowable standard deviation.
- Returns:
- the minimum allowable standard deviation
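The effect of a standard deviation floor on the normal density can be shown with a few lines of Java. This is a sketch under assumptions, not the code Weka uses: the class `MinStdDevSketch` and its method are hypothetical, and the floor is applied at evaluation time purely for illustration.

```java
public class MinStdDevSketch {

    /**
     * Log of the normal density with the standard deviation clamped from
     * below, so a zero or near-zero estimate cannot yield an infinite density.
     */
    public static double logNormalDensity(double x, double mean,
                                          double stdDev, double minStdDev) {
        double sd = Math.max(stdDev, minStdDev);
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    public static void main(String[] args) {
        // A cluster of identical values has an estimated standard deviation of 0.
        System.out.println(logNormalDensity(5.0, 5.0, 0.0, 1e-6)); // finite
        System.out.println(logNormalDensity(5.0, 5.0, 0.0, 0.0));  // Infinity
    }
}
```

With the floor at 1e-6 the peak log density is capped at roughly 12.9; without it the density at the mean is unbounded.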
public String numClustersTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setNumClusters(int n) throws Exception
Set the number of clusters (-1 to select by CV).
- Specified by:
- setNumClusters in interface NumberOfClustersRequestable
- Parameters:
- n - the number of clusters
- Throws:
- Exception - if n is 0
public int getNumClusters()
Get the number of clusters.
- Returns:
- the number of clusters.
public void setMaximumNumberOfClusters(int n)
Set the maximum number of clusters to consider when cross-validating.
- Parameters:
- n - the maximum number of clusters to consider
public int getMaximumNumberOfClusters()
Get the maximum number of clusters to consider when cross-validating.
- Returns:
- the maximum number of clusters to consider
public String maximumNumberOfClustersTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public String maxIterationsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setMaxIterations(int i) throws Exception
Set the maximum number of iterations to perform.
- Parameters:
- i - the number of iterations
- Throws:
- Exception - if i is less than 1
public int getMaxIterations()
Get the maximum number of iterations.
- Returns:
- the number of iterations
public String debugTipText()
Returns the tip text for this property.
- Overrides:
- debugTipText in class AbstractClusterer
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
public void setDebug(boolean v)
Set debug mode - verbose output.
- Overrides:
- setDebug in class AbstractClusterer
- Parameters:
- v - true for verbose output
public boolean getDebug()
Get debug mode.
- Overrides:
- getDebug in class AbstractClusterer
- Returns:
- true if debug mode is set
public String[] getOptions()
Gets the current settings of EM.
- Specified by:
- getOptions in interface OptionHandler
- Overrides:
- getOptions in class RandomizableDensityBasedClusterer
- Returns:
- an array of strings suitable for passing to setOptions()
public double[][][] getClusterModelsNumericAtts()
Return the normal distributions for the cluster models.
- Returns:
- a double[][][] value
public double[] getClusterPriors()
Return the priors for the clusters.
- Returns:
- a double[] value
public String toString()
Outputs the generated clusters into a string.
public int numberOfClusters() throws Exception
Returns the number of clusters.
- Specified by:
- numberOfClusters in interface Clusterer
- Specified by:
- numberOfClusters in class AbstractClusterer
- Returns:
- the number of clusters generated for a training dataset.
- Throws:
- Exception - if number of clusters could not be returned successfully
public Capabilities getCapabilities()
Returns default capabilities of the clusterer (i.e., the ones of SimpleKMeans).
- Specified by:
- getCapabilities in interface CapabilitiesHandler
- Specified by:
- getCapabilities in interface Clusterer
- Overrides:
- getCapabilities in class AbstractClusterer
- Returns:
- the capabilities of this clusterer
- See Also:
public void buildClusterer(Instances data) throws Exception
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
- Specified by:
- buildClusterer in interface Clusterer
- Specified by:
- buildClusterer in class AbstractClusterer
- Parameters:
- data - set of instances serving as training data
- Throws:
- Exception - if the clusterer has not been generated successfully
public double[] clusterPriors()
Returns the cluster priors.
- Specified by:
- clusterPriors in interface DensityBasedClusterer
- Specified by:
- clusterPriors in class AbstractDensityBasedClusterer
- Returns:
- the cluster priors
public double[] logDensityPerClusterForInstance(Instance inst) throws Exception
Computes the log of the conditional density (per cluster) for a given instance.
- Specified by:
- logDensityPerClusterForInstance in interface DensityBasedClusterer
- Specified by:
- logDensityPerClusterForInstance in class AbstractDensityBasedClusterer
- Parameters:
- inst - the instance to compute the density for
- Returns:
- an array containing the estimated densities
- Throws:
- Exception - if the density could not be computed successfully
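The cluster priors and the per-cluster log densities are the two ingredients of the per-instance membership distribution mentioned in the class description. The sketch below shows one standard way to combine them (Bayes' rule evaluated in log space); the class `PosteriorSketch` and its method are hypothetical, and the code makes no claim about how the inherited distributionForInstance is actually implemented.

```java
public class PosteriorSketch {

    /**
     * Combines cluster priors with per-cluster log densities into
     * membership probabilities that sum to 1. Subtracting the maximum
     * log joint before exponentiating avoids underflow.
     */
    public static double[] posterior(double[] priors, double[] logDensityPerCluster) {
        int k = priors.length;
        double[] logJoint = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < k; c++) {
            logJoint[c] = Math.log(priors[c]) + logDensityPerCluster[c];
            max = Math.max(max, logJoint[c]);
        }
        double[] probs = new double[k];
        double sum = 0.0;
        for (int c = 0; c < k; c++) {
            probs[c] = Math.exp(logJoint[c] - max);
            sum += probs[c];
        }
        for (int c = 0; c < k; c++) {
            probs[c] /= sum;
        }
        return probs;
    }

    public static void main(String[] args) {
        double[] priors = {0.5, 0.5};
        double[] logDensities = {Math.log(0.3), Math.log(0.1)};
        double[] p = posterior(priors, logDensities);
        System.out.printf("%.2f %.2f%n", p[0], p[1]);
    }
}
```

Working in log space matters here because products of many per-attribute densities can be far too small to represent directly as a double.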
public String getRevision()
Returns the revision string.
- Specified by:
- getRevision in interface RevisionHandler
- Overrides:
- getRevision in class AbstractClusterer
- Returns:
- the revision
public static void main(String[] argv)
Main method for testing this class.
- Parameters:
- argv - should contain the following arguments: -t training file [-T test file] [-N number of clusters] [-S random seed]