Package weka.clusterers
Class SimpleKMeans
java.lang.Object
weka.clusterers.AbstractClusterer
weka.clusterers.RandomizableClusterer
weka.clusterers.SimpleKMeans
- All Implemented Interfaces:
Serializable, Cloneable, Clusterer, NumberOfClustersRequestable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler, WeightedInstancesHandler
public class SimpleKMeans
extends RandomizableClusterer
implements NumberOfClustersRequestable, WeightedInstancesHandler, TechnicalInformationHandler
Cluster data using the k-means algorithm. Can use
either the Euclidean distance (default) or the Manhattan distance. If the
Manhattan distance is used, then centroids are computed as the component-wise
median rather than mean. For more information see:
D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
BibTeX:
@inproceedings{Arthur2007,
  author = {D. Arthur and S. Vassilvitskii},
  booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
  pages = {1027-1035},
  title = {k-means++: the advantages of careful seeding},
  year = {2007}
}
Valid options are:
-N <num> Number of clusters. (default 2).
-init Initialization method to use. 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first. (default = 0)
-C Use canopies to reduce the number of distance calculations.
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, plus data characteristics, will determine how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies when using canopy clustering. (default = every 10,000 training instances)
-min-density Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use when using canopy clustering. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. (default = -1.0)
-t1 The T1 distance to use when using canopy clustering. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-V Display std. deviations for centroids.
-M Don't replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
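The class description above states that centroids are the component-wise mean under the Euclidean distance and the component-wise median under the Manhattan distance. The following is a minimal, library-free sketch of those two update rules. The class name `CentroidSketch`, its method names, and the choice to average the two middle values for even-sized clusters are assumptions made for illustration, not WEKA's implementation.

```java
import java.util.Arrays;

// Illustrative sketch only; names and tie handling are assumptions, not WEKA code.
final class CentroidSketch {

    // Component-wise mean: the centroid update that pairs with Euclidean distance.
    static double[] meanCentroid(double[][] points) {
        int dims = points[0].length;
        double[] centroid = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                centroid[d] += p[d];
            }
        }
        for (int d = 0; d < dims; d++) {
            centroid[d] /= points.length;
        }
        return centroid;
    }

    // Component-wise median: the centroid update that pairs with Manhattan distance.
    static double[] medianCentroid(double[][] points) {
        int dims = points[0].length;
        int n = points.length;
        double[] centroid = new double[dims];
        double[] column = new double[n];
        for (int d = 0; d < dims; d++) {
            for (int i = 0; i < n; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            // Assumption: for an even count, average the two middle values.
            centroid[d] = (n % 2 == 1)
                ? column[n / 2]
                : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return centroid;
    }

    public static void main(String[] args) {
        double[][] cluster = { {1, 10}, {2, 20}, {9, 30} };
        System.out.println(Arrays.toString(meanCentroid(cluster)));
        System.out.println(Arrays.toString(medianCentroid(cluster)));
    }
}
```

The median is less sensitive to the outlying first coordinate (9) than the mean, which is the usual motivation for pairing it with the Manhattan distance.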
- Version:
- $Revision: 15519 $
- Author:
- Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
-
Field Summary
Modifier and Type, Field, Description
static final int CANOPY
static final int FARTHEST_FIRST
static final int KMEANS_PLUS_PLUS
static final int RANDOM
static final Tag[] TAGS_SELECTION Initialization methods
-
Constructor Summary
SimpleKMeans() The default constructor.
-
Method Summary
Modifier and Type, Method, Description
void buildClusterer(Instances data) Generates a clusterer.
String canopyMaxNumCanopiesToHoldInMemoryTipText() Returns the tip text for this property.
String canopyMinimumCanopyDensityTipText() Returns the tip text for this property.
String canopyPeriodicPruningRateTipText() Returns the tip text for this property.
String canopyT1TipText() Tip text for this property.
String canopyT2TipText() Tip text for this property.
int clusterInstance(Instance instance) Assigns a given instance to a cluster.
String displayStdDevsTipText() Returns the tip text for this property.
String distanceFunctionTipText() Returns the tip text for this property.
String dontReplaceMissingValuesTipText() Returns the tip text for this property.
String fastDistanceCalcTipText() Returns the tip text for this property.
int[] getAssignments() Gets the assignments for each instance.
int getCanopyMaxNumCanopiesToHoldInMemory() Get the maximum number of candidate canopies to retain in memory during training.
double getCanopyMinimumCanopyDensity() Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
int getCanopyPeriodicPruningRate() Get how often to prune low density canopies during training (if using canopy clustering).
double getCanopyT1() Get the T1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
double getCanopyT2() Get the T2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
Capabilities getCapabilities() Returns default capabilities of the clusterer.
Instances getClusterCentroids() Gets the cluster centroids.
double[][][] getClusterNominalCounts() Returns for each cluster the weighted frequency counts for the values of each nominal attribute.
double[] getClusterSizes() Gets the sum of weights for all the instances in each cluster.
Instances getClusterStandardDevs() Gets the standard deviations of the numeric attributes in each cluster.
boolean getDisplayStdDevs() Gets whether standard deviations and nominal counts should be displayed in the clustering output.
DistanceFunction getDistanceFunction() Returns the distance function currently in use.
boolean getDontReplaceMissingValues() Gets whether replacement of missing values with the mean/mode is disabled.
boolean getFastDistanceCalc() Gets whether to use faster distance calculation.
SelectedTag getInitializationMethod() Get the initialization method to use.
int getMaxIterations() Gets the maximum number of iterations to be executed.
int getNumClusters() Gets the number of clusters to generate.
int getNumExecutionSlots() Get the degree of parallelism to use.
String[] getOptions() Gets the current settings of SimpleKMeans.
boolean getPreserveInstancesOrder() Gets whether order of instances must be preserved.
boolean getReduceNumberOfDistanceCalcsViaCanopies() Get whether to use canopies to reduce the number of distance computations required.
String getRevision() Returns the revision string.
double getSquaredError() Gets the squared error for all clusters.
TechnicalInformation getTechnicalInformation() Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
String globalInfo() Returns a string describing this clusterer.
String initializationMethodTipText() Returns the tip text for this property.
Enumeration<Option> listOptions() Returns an enumeration describing the available options.
static void main(String[] args) Main method for executing this class.
String maxIterationsTipText() Returns the tip text for this property.
int numberOfClusters() Returns the number of clusters.
String numClustersTipText() Returns the tip text for this property.
String numExecutionSlotsTipText() Returns the tip text for this property.
String preserveInstancesOrderTipText() Returns the tip text for this property.
String reduceNumberOfDistanceCalcsViaCanopiesTipText() Returns the tip text for this property.
void setCanopyMaxNumCanopiesToHoldInMemory(int max) Set the maximum number of candidate canopies to retain in memory during training.
void setCanopyMinimumCanopyDensity(double dens) Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
void setCanopyPeriodicPruningRate(int p) Set how often to prune low density canopies during training (if using canopy clustering).
void setCanopyT1(double t1) Set the T1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
void setCanopyT2(double t2) Set the T2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
void setDisplayStdDevs(boolean stdD) Sets whether standard deviations and nominal counts should be displayed in the clustering output.
void setDistanceFunction(DistanceFunction df) Sets the distance function to use for instance comparison.
void setDontReplaceMissingValues(boolean r) Sets whether replacement of missing values with the mean/mode is disabled.
void setFastDistanceCalc(boolean value) Sets whether to use faster distance calculation.
void setInitializationMethod(SelectedTag method) Set the initialization method to use.
void setMaxIterations(int n) Sets the maximum number of iterations to be executed.
void setNumClusters(int n) Sets the number of clusters to generate.
void setNumExecutionSlots(int slots) Set the degree of parallelism to use.
void setOptions(String[] options) Parses a given list of options.
void setPreserveInstancesOrder(boolean r) Sets whether order of instances must be preserved.
void setReduceNumberOfDistanceCalcsViaCanopies(boolean c) Set whether to use canopies to reduce the number of distance computations required.
String toString() Returns a string describing this clusterer.
Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed
Methods inherited from class weka.clusterers.AbstractClusterer
debugTipText, distributionForInstance, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities
-
Field Details
-
RANDOM
public static final int RANDOM
-
KMEANS_PLUS_PLUS
public static final int KMEANS_PLUS_PLUS
-
CANOPY
public static final int CANOPY
-
FARTHEST_FIRST
public static final int FARTHEST_FIRST
-
TAGS_SELECTION
public static final Tag[] TAGS_SELECTION
Initialization methods
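Of the initialization methods listed here, k-means++ is the one the cited Arthur and Vassilvitskii paper describes: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The sketch below illustrates that seeding rule on plain arrays. The class `KMeansPlusPlusSketch`, its method names, and the fallback for degenerate data are hypothetical and are not taken from WEKA's source.

```java
import java.util.Random;

// Illustrative k-means++ seeding on plain arrays; not WEKA's implementation.
final class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the indices of k points chosen as initial centers.
    static int[] chooseSeeds(double[][] points, int k, long seed) {
        Random rng = new Random(seed);
        int n = points.length;
        int[] seeds = new int[k];
        // First center: uniformly at random.
        seeds[0] = rng.nextInt(n);
        // minDist[i] = squared distance from point i to its nearest chosen center.
        double[] minDist = new double[n];
        for (int i = 0; i < n; i++) {
            minDist[i] = squaredDistance(points[i], points[seeds[0]]);
        }
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double d : minDist) {
                total += d;
            }
            // Sample an index with probability proportional to minDist[i].
            double r = rng.nextDouble() * total;
            double cumulative = 0.0;
            int chosen = -1;
            for (int i = 0; i < n; i++) {
                if (minDist[i] <= 0.0) {
                    continue; // already a center, or a duplicate of one
                }
                chosen = i;
                cumulative += minDist[i];
                if (cumulative > r) {
                    break;
                }
            }
            if (chosen < 0) {
                chosen = rng.nextInt(n); // assumption: all points coincide with a center
            }
            seeds[c] = chosen;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points[i], points[chosen]));
            }
        }
        return seeds;
    }
}
```

Because a chosen point's distance to the nearest center becomes zero, it cannot be drawn again as long as the data contains at least k distinct points.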
-
-
Constructor Details
-
SimpleKMeans
public SimpleKMeans()
The default constructor.
-
-
Method Details
-
getTechnicalInformation
Description copied from interface: TechnicalInformationHandler
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
- Specified by:
getTechnicalInformation in interface TechnicalInformationHandler
- Returns:
- the technical information about this class
-
globalInfo
Returns a string describing this clusterer.
- Returns:
- a description of the clusterer suitable for displaying in the explorer/experimenter gui
-
getCapabilities
Returns default capabilities of the clusterer.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Specified by:
getCapabilities in interface Clusterer
- Overrides:
getCapabilities in class AbstractClusterer
- Returns:
- the capabilities of this clusterer
-
buildClusterer
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
- Specified by:
buildClusterer in interface Clusterer
- Specified by:
buildClusterer in class AbstractClusterer
- Parameters:
data - set of instances serving as training data
- Throws:
Exception - if the clusterer has not been generated successfully
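For readers unfamiliar with what building a k-means clusterer involves, the core of the algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid, stopping when assignments no longer change or an iteration limit is reached. The following sketch shows that loop on plain arrays with the squared Euclidean distance. `LloydSketch` and its handling of empty clusters are assumptions for illustration; WEKA's actual `buildClusterer` additionally deals with instance weights, nominal attributes, missing values, and the options documented on this page.

```java
import java.util.Arrays;

// Illustrative sketch of the k-means (Lloyd) iteration; not WEKA's code.
final class LloydSketch {

    // Index of the centroid closest to the point (squared Euclidean distance).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Updates centroids in place and returns the final cluster index of each point.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int k = centroids.length;
        int dims = points[0].length;
        int[] assignments = new int[points.length];
        Arrays.fill(assignments, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centroids);
                if (c != assignments[i]) {
                    assignments[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point moved to another cluster
            }
            // Update step: recompute each centroid as the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignments[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignments[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // sketch choice: an empty cluster keeps its old centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignments;
    }
}
```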
-
clusterInstance
Assigns a given instance to a cluster.
- Specified by:
clusterInstance in interface Clusterer
- Overrides:
clusterInstance in class AbstractClusterer
- Parameters:
instance - the instance to be assigned to a cluster
- Returns:
- the number of the assigned cluster as an integer
- Throws:
Exception - if instance could not be clustered successfully
-
numberOfClusters
Returns the number of clusters.
- Specified by:
numberOfClusters in interface Clusterer
- Specified by:
numberOfClusters in class AbstractClusterer
- Returns:
- the number of clusters generated for a training dataset.
- Throws:
Exception - if number of clusters could not be returned successfully
-
listOptions
Returns an enumeration describing the available options.
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class RandomizableClusterer
- Returns:
- an enumeration of all the available options.
-
numClustersTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setNumClusters
Sets the number of clusters to generate.
- Specified by:
setNumClusters in interface NumberOfClustersRequestable
- Parameters:
n - the number of clusters to generate
- Throws:
Exception - if number of clusters is negative
-
getNumClusters
public int getNumClusters()
Gets the number of clusters to generate.
- Returns:
- the number of clusters to generate
-
initializationMethodTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setInitializationMethod
Set the initialization method to use.
- Parameters:
method - the initialization method to use
-
getInitializationMethod
Get the initialization method to use.
- Returns:
- the initialization method to use
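One of the selectable initialization methods is farthest first. As a rough illustration of the general idea (farthest-first traversal), the sketch below starts from a given point and repeatedly picks the point whose distance to its nearest already-chosen center is largest. `FarthestFirstSketch`, the explicit `firstIndex` parameter, and the tie-breaking by lowest index are assumptions for this example; how WEKA selects the first point and breaks ties is not stated on this page.

```java
// Illustrative farthest-first traversal on plain arrays; not WEKA's implementation.
final class FarthestFirstSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of k initial centers, starting from firstIndex.
    static int[] chooseSeeds(double[][] points, int k, int firstIndex) {
        int n = points.length;
        int[] seeds = new int[k];
        seeds[0] = firstIndex;
        // minDist[i] = distance from point i to its nearest chosen center.
        double[] minDist = new double[n];
        for (int i = 0; i < n; i++) {
            minDist[i] = distance(points[i], points[firstIndex]);
        }
        for (int c = 1; c < k; c++) {
            // Pick the point farthest from all centers chosen so far.
            int farthest = 0;
            for (int i = 1; i < n; i++) {
                if (minDist[i] > minDist[farthest]) {
                    farthest = i; // ties keep the lowest index (sketch choice)
                }
            }
            seeds[c] = farthest;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], distance(points[i], points[farthest]));
            }
        }
        return seeds;
    }
}
```

Unlike k-means++, this rule is deterministic once the first center is fixed, which makes it sensitive to outliers.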
-
reduceNumberOfDistanceCalcsViaCanopiesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setReduceNumberOfDistanceCalcsViaCanopies
public void setReduceNumberOfDistanceCalcsViaCanopies(boolean c)
Set whether to use canopies to reduce the number of distance computations required.
- Parameters:
c - true if canopies are to be used to reduce the number of distance computations
-
getReduceNumberOfDistanceCalcsViaCanopies
public boolean getReduceNumberOfDistanceCalcsViaCanopies()
Get whether to use canopies to reduce the number of distance computations required.
- Returns:
- true if canopies are to be used to reduce the number of distance computations
-
canopyPeriodicPruningRateTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setCanopyPeriodicPruningRate
public void setCanopyPeriodicPruningRate(int p)
Set how often to prune low density canopies during training (if using canopy clustering).
- Parameters:
p - how often (every p instances) to prune low density canopies
-
getCanopyPeriodicPruningRate
public int getCanopyPeriodicPruningRate()
Get how often to prune low density canopies during training (if using canopy clustering).
- Returns:
- how often (every p instances) to prune low density canopies
-
canopyMinimumCanopyDensityTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setCanopyMinimumCanopyDensity
public void setCanopyMinimumCanopyDensity(double dens)
Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
- Parameters:
dens - the minimum canopy density
-
getCanopyMinimumCanopyDensity
public double getCanopyMinimumCanopyDensity()
Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
- Returns:
- the minimum canopy density
-
canopyMaxNumCanopiesToHoldInMemoryTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setCanopyMaxNumCanopiesToHoldInMemory
public void setCanopyMaxNumCanopiesToHoldInMemory(int max)
Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
- Parameters:
max - the maximum number of candidate canopies to retain in memory during training
-
getCanopyMaxNumCanopiesToHoldInMemory
public int getCanopyMaxNumCanopiesToHoldInMemory()
Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
- Returns:
- the maximum number of candidate canopies to retain in memory during training
-
canopyT2TipText
Tip text for this property.
- Returns:
- the tip text for this property
-
setCanopyT2
public void setCanopyT2(double t2)
Set the T2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
- Parameters:
t2 - the T2 radius to use
-
getCanopyT2
public double getCanopyT2()
Get the T2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
- Returns:
- the T2 radius to use
-
canopyT1TipText
Tip text for this property.
- Returns:
- the tip text for this property
-
setCanopyT1
public void setCanopyT1(double t1)
Set the T1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
- Parameters:
t1 - the T1 radius to use
-
getCanopyT1
public double getCanopyT1()
Get the T1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations.
- Returns:
- the T1 radius to use
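The option description for -t1 says that a value below zero is taken as a positive multiplier for T2. A small sketch of that rule follows, assuming T2 has already been resolved to a positive distance (the heuristic used when T2 itself is negative is not shown). `CanopyRadiusSketch` and `effectiveT1` are hypothetical names used only for illustration.

```java
// Illustrative reading of the documented -t1 rule; not WEKA's implementation.
final class CanopyRadiusSketch {

    // A negative t1 means "|t1| times t2"; a non-negative t1 is used as given.
    // Assumes t2 is already a concrete, positive distance.
    static double effectiveT1(double t1, double t2) {
        return t1 < 0 ? -t1 * t2 : t1;
    }

    public static void main(String[] args) {
        // With the documented default t1 = -1.5, T1 is one and a half times T2.
        System.out.println(effectiveT1(-1.5, 2.0));
    }
}
```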
-
maxIterationsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMaxIterations
Sets the maximum number of iterations to be executed.
- Parameters:
n - the maximum number of iterations
- Throws:
Exception - if maximum number of iterations is smaller than 1
-
getMaxIterations
public int getMaxIterations()
Gets the maximum number of iterations to be executed.
- Returns:
- the maximum number of iterations
-
displayStdDevsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setDisplayStdDevs
public void setDisplayStdDevs(boolean stdD)
Sets whether standard deviations and nominal counts should be displayed in the clustering output.
- Parameters:
stdD - true if std. devs and counts should be displayed
-
getDisplayStdDevs
public boolean getDisplayStdDevs()
Gets whether standard deviations and nominal counts should be displayed in the clustering output.
- Returns:
- true if std. devs and counts should be displayed
-
dontReplaceMissingValuesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setDontReplaceMissingValues
public void setDontReplaceMissingValues(boolean r)
Sets whether replacement of missing values with the mean/mode is disabled.
- Parameters:
r - true if missing values are not to be replaced
-
getDontReplaceMissingValues
public boolean getDontReplaceMissingValues()
Gets whether replacement of missing values with the mean/mode is disabled.
- Returns:
- true if missing values are not to be replaced
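Unless this flag is set, missing values are replaced with the mean (numeric attributes) or mode (nominal attributes), per the -M option description. The sketch below illustrates the numeric case only, on a plain array where `Double.NaN` stands in for a missing value. Both the NaN convention and the class `MissingValueSketch` are assumptions for this example and do not reflect how WEKA represents or filters instances.

```java
// Illustrative mean imputation for numeric columns; not WEKA's implementation.
final class MissingValueSketch {

    // Returns a copy of data in which each NaN is replaced by its column mean.
    static double[][] replaceMissingWithMean(double[][] data) {
        int rows = data.length;
        int cols = data[0].length;
        double[][] out = new double[rows][];
        for (int r = 0; r < rows; r++) {
            out[r] = data[r].clone();
        }
        for (int c = 0; c < cols; c++) {
            // Mean over the observed (non-missing) values of this column.
            double sum = 0.0;
            int count = 0;
            for (int r = 0; r < rows; r++) {
                if (!Double.isNaN(data[r][c])) {
                    sum += data[r][c];
                    count++;
                }
            }
            if (count == 0) {
                continue; // nothing observed: leave the column untouched
            }
            double mean = sum / count;
            for (int r = 0; r < rows; r++) {
                if (Double.isNaN(data[r][c])) {
                    out[r][c] = mean;
                }
            }
        }
        return out;
    }
}
```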
-
distanceFunctionTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getDistanceFunction
Returns the distance function currently in use.
- Returns:
- the distance function
-
setDistanceFunction
Sets the distance function to use for instance comparison.
- Parameters:
df - the new distance function to use
- Throws:
Exception - if instances cannot be processed
-
preserveInstancesOrderTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setPreserveInstancesOrder
public void setPreserveInstancesOrder(boolean r)
Sets whether order of instances must be preserved.
- Parameters:
r - true if the order of instances is to be preserved
-
getPreserveInstancesOrder
public boolean getPreserveInstancesOrder()
Gets whether order of instances must be preserved.
- Returns:
- true if the order of instances is to be preserved
-
fastDistanceCalcTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setFastDistanceCalc
public void setFastDistanceCalc(boolean value)
Sets whether to use faster distance calculation.
- Parameters:
value - true if faster calculation is to be used
-
getFastDistanceCalc
public boolean getFastDistanceCalc()
Gets whether to use faster distance calculation.
- Returns:
- true if faster calculation is used
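The -fast option is described as using cut-off values and as disabling the output of squared errors. One common way such a cut-off works is to abandon a distance computation as soon as the running sum exceeds the best distance found so far; the abandoned candidates then have no exact distance, which is consistent with errors not being reported. The sketch below shows that idea under this assumption. `CutoffDistanceSketch` is a hypothetical class and does not reproduce WEKA's distance functions.

```java
// Illustrative early-abandon distance search; an assumption about how a cut-off may work.
final class CutoffDistanceSketch {

    // Returns the squared distance, or POSITIVE_INFINITY once the running sum exceeds cutoff.
    static double squaredDistance(double[] a, double[] b, double cutoff) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
            if (sum > cutoff) {
                return Double.POSITIVE_INFINITY; // cannot beat the current best; stop early
            }
        }
        return sum;
    }

    // Index of the nearest centroid, using the best distance so far as the cut-off.
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c], bestDist);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

The saving grows with the number of attributes, since far-away centroids are rejected after only a few dimensions.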
-
numExecutionSlotsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setNumExecutionSlots
public void setNumExecutionSlots(int slots)
Set the degree of parallelism to use.
- Parameters:
slots - the number of execution slots, i.e. tasks to run in parallel (1 means no parallelism)
-
getNumExecutionSlots
public int getNumExecutionSlots()
Get the degree of parallelism to use.
- Returns:
- the number of execution slots, i.e. tasks to run in parallel (1 means no parallelism)
-
setOptions
Parses a given list of options. Valid options are:
-N <num> Number of clusters. (default 2).
-init Initialization method to use. 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first. (default = 0)
-C Use canopies to reduce the number of distance calculations.
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, plus data characteristics, will determine how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies when using canopy clustering. (default = every 10,000 training instances)
-min-density Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use when using canopy clustering. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. (default = -1.0)
-t1 The T1 distance to use when using canopy clustering. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-V Display std. deviations for centroids.
-M Don't replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class RandomizableClusterer
- Parameters:
options - the list of options as an array of strings
- Throws:
Exception - if an option is not supported
-
getOptions
Gets the current settings of SimpleKMeans.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class RandomizableClusterer
- Returns:
- an array of strings suitable for passing to setOptions()
-
toString
Returns a string describing this clusterer.
-
getClusterCentroids
Gets the cluster centroids.
- Returns:
- the cluster centroids
-
getClusterStandardDevs
Gets the standard deviations of the numeric attributes in each cluster.
- Returns:
- the standard deviations of the numeric attributes in each cluster
-
getClusterNominalCounts
public double[][][] getClusterNominalCounts()
Returns for each cluster the weighted frequency counts for the values of each nominal attribute.
- Returns:
- the counts
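To make the shape of a three-dimensional count array concrete, the sketch below accumulates weighted frequency counts indexed as [cluster][attribute][value]. That index order, the integer encoding of nominal values, and the class `NominalCountsSketch` are assumptions made for this illustration; the page itself only states that counts are per cluster, per nominal attribute, per value.

```java
// Illustrative weighted counting of nominal values per cluster; not WEKA's implementation.
final class NominalCountsSketch {

    // values[i][a] is the value index of attribute a for instance i (assumed encoding).
    static double[][][] count(int[][] values, double[] weights, int[] assignments,
                              int numClusters, int[] numValuesPerAttribute) {
        int numAttributes = numValuesPerAttribute.length;
        double[][][] counts = new double[numClusters][numAttributes][];
        for (int c = 0; c < numClusters; c++) {
            for (int a = 0; a < numAttributes; a++) {
                counts[c][a] = new double[numValuesPerAttribute[a]];
            }
        }
        for (int i = 0; i < values.length; i++) {
            for (int a = 0; a < numAttributes; a++) {
                // Each instance contributes its weight, not a plain count of 1.
                counts[assignments[i]][a][values[i][a]] += weights[i];
            }
        }
        return counts;
    }
}
```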
-
getSquaredError
public double getSquaredError()
Gets the squared error for all clusters.
- Returns:
- the squared error, NaN if fast distance calculation is used
- See Also:
- m_FastDistanceCalc
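A common definition of the squared error for a clustering is the within-cluster sum of squared distances from each point to its assigned centroid. The sketch below computes that quantity on plain arrays. Treat it as an assumption about what is being measured: WEKA's distance functions may normalize attributes, so values produced by `SquaredErrorSketch` (a hypothetical class) need not match `getSquaredError()` numerically.

```java
// Illustrative within-cluster sum of squared errors; not WEKA's implementation.
final class SquaredErrorSketch {

    static double squaredError(double[][] points, double[][] centroids, int[] assignments) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            double[] centroid = centroids[assignments[i]];
            for (int d = 0; d < points[i].length; d++) {
                double diff = points[i][d] - centroid[d];
                total += diff * diff;
            }
        }
        return total;
    }
}
```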
-
getClusterSizes
public double[] getClusterSizes()
Gets the sum of weights for all the instances in each cluster.
- Returns:
- the sum of instance weights in each cluster
-
getAssignments
Gets the assignments for each instance.
- Returns:
- array of indexes of the centroid assigned to each instance
- Throws:
Exception - if order of instances wasn't preserved or no assignments were made
-
getRevision
Returns the revision string.
- Specified by:
getRevision in interface RevisionHandler
- Overrides:
getRevision in class AbstractClusterer
- Returns:
- the revision
-
main
Main method for executing this class.
- Parameters:
args - use -h to list all parameters
-