Class SimpleKMeans

All Implemented Interfaces:
Serializable, Cloneable, Clusterer, NumberOfClustersRequestable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler, WeightedInstancesHandler

Cluster data using the k means algorithm. Can use either the Euclidean distance (default) or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than mean. For more information see:

D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.

BibTeX:

 @inproceedings{Arthur2007,
    author = {D. Arthur and S. Vassilvitskii},
    booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
    pages = {1027-1035},
    title = {k-means++: the advantages of careful seeding},
    year = {2007}
 }
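
The difference between the two centroid rules described above (component-wise mean for Euclidean distance, component-wise median for Manhattan distance) can be sketched in plain Java. This is a minimal illustration, not Weka's implementation: the class and method names are hypothetical, instances are plain double arrays, and averaging the two middle values for an even-sized cluster is an assumption.

```java
import java.util.Arrays;

/** Illustrative only: not part of Weka. */
class CentroidSketch {

    /** Component-wise mean of the rows (the Euclidean-distance case). */
    static double[] meanCentroid(double[][] points) {
        int dims = points[0].length;
        double[] centroid = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                centroid[d] += p[d];
            }
        }
        for (int d = 0; d < dims; d++) {
            centroid[d] /= points.length;
        }
        return centroid;
    }

    /** Component-wise median of the rows (the Manhattan-distance case). */
    static double[] medianCentroid(double[][] points) {
        int dims = points[0].length;
        int n = points.length;
        double[] centroid = new double[dims];
        double[] column = new double[n];
        for (int d = 0; d < dims; d++) {
            for (int i = 0; i < n; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            // Assumption: average the two middle values when n is even.
            centroid[d] = (n % 2 == 1)
                ? column[n / 2]
                : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return centroid;
    }

    public static void main(String[] args) {
        double[][] cluster = { {1, 10}, {2, 20}, {9, 60} };
        System.out.println(Arrays.toString(meanCentroid(cluster)));
        System.out.println(Arrays.toString(medianCentroid(cluster)));
    }
}
```

In the sample cluster the outlying third point pulls the mean away from the other two points, while the median stays on the middle point.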
 

Valid options are:

 -N <num>
  Number of clusters.
  (default 2).
 
 -init
  Initialization method to use.
  0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
  (default = 0)
 
 -C
  Use canopies to reduce the number of distance calculations.
 
 -max-candidates <num>
  Maximum number of candidate canopies to retain in memory
  at any one time when using canopy clustering.
  T2 distance, plus data characteristics,
  will determine how many candidate canopies are formed before
  periodic and final pruning are performed, which might result
  in excess memory consumption. This setting avoids large numbers
  of candidate canopies consuming memory. (default = 100)
 
 -periodic-pruning <num>
  How often to prune low density canopies when using canopy clustering. 
  (default = every 10,000 training instances)
 
 -min-density
  Minimum canopy density, when using canopy clustering, below which
   a canopy will be pruned during periodic pruning. (default = 2 instances)
 
 -t2
  The T2 distance to use when using canopy clustering. Values < 0 indicate that
  a heuristic based on attribute std. deviation should be used to set this.
  (default = -1.0)
 
 -t1
  The T1 distance to use when using canopy clustering. A value < 0 is taken as a
  positive multiplier for T2. (default = -1.5)
 
 -V
  Display std. deviations for centroids.
 
 -M
  Don't replace missing values with mean/mode.
 
 -A <classname and options>
  Distance function to use.
  (default: weka.core.EuclideanDistance)
 
 -I <num>
  Maximum number of iterations.
 
 -O
  Preserve order of instances.
 
 -fast
  Enables faster distance calculations, using cut-off values.
  Disables the calculation/output of squared errors/distances.
 
 -num-slots <num>
  Number of execution slots.
  (default 1 - i.e. no parallelism)
 
 -S <num>
  Random number seed.
  (default 10)
 
 -output-debug-info
  If set, clusterer is run in debug mode and
  may output additional info to the console
 
 -do-not-check-capabilities
  If set, clusterer capabilities are not checked before clusterer is built
  (use with caution).
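
Initialization method 1 (k-means++) refers to the seeding scheme of the Arthur and Vassilvitskii paper cited above: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The following is a minimal sketch of that scheme under stated assumptions: instances are plain double arrays, distance is squared Euclidean, and all names are hypothetical. It is not Weka's code.

```java
import java.util.Random;

/** Illustrative only: not Weka code. */
class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the indices of k points chosen as initial centers (k >= 1). */
    static int[] chooseSeeds(double[][] points, int k, long seed) {
        Random rnd = new Random(seed);
        int n = points.length;
        int[] chosen = new int[k];
        double[] minSq = new double[n]; // squared distance to nearest chosen center

        chosen[0] = rnd.nextInt(n);
        for (int i = 0; i < n; i++) {
            minSq[i] = squaredDistance(points[i], points[chosen[0]]);
        }

        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double v : minSq) {
                total += v;
            }
            double r = rnd.nextDouble() * total;
            int pick = -1;
            double cumulative = 0.0;
            for (int i = 0; i < n; i++) {
                if (minSq[i] <= 0.0) {
                    continue; // already a center, or a duplicate of one
                }
                cumulative += minSq[i];
                pick = i; // last eligible point doubles as a rounding fallback
                if (cumulative > r) {
                    break;
                }
            }
            if (pick < 0) {
                pick = rnd.nextInt(n); // every point coincides with a center
            }
            chosen[c] = pick;
            for (int i = 0; i < n; i++) {
                minSq[i] = Math.min(minSq[i], squaredDistance(points[i], points[pick]));
            }
        }
        return chosen;
    }
}
```

As with the -S option, a fixed seed makes the selection repeatable: calling chooseSeeds twice with the same data and seed returns the same indices.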
 
Version:
$Revision: 15519 $
Author:
Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
  • Field Details

  • Constructor Details

    • SimpleKMeans

      public SimpleKMeans()
      The default constructor.
  • Method Details

    • getTechnicalInformation

      public TechnicalInformation getTechnicalInformation()
      Description copied from interface: TechnicalInformationHandler
      Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
      Specified by:
      getTechnicalInformation in interface TechnicalInformationHandler
      Returns:
      the technical information about this class
    • globalInfo

      public String globalInfo()
      Returns a string describing this clusterer.
      Returns:
      a description of the clusterer suitable for displaying in the explorer/experimenter gui
    • getCapabilities

      public Capabilities getCapabilities()
      Returns default capabilities of the clusterer.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Specified by:
      getCapabilities in interface Clusterer
      Overrides:
      getCapabilities in class AbstractClusterer
      Returns:
      the capabilities of this clusterer
    • buildClusterer

      public void buildClusterer(Instances data) throws Exception
      Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
      Specified by:
      buildClusterer in interface Clusterer
      Specified by:
      buildClusterer in class AbstractClusterer
      Parameters:
      data - set of instances serving as training data
      Throws:
      Exception - if the clusterer has not been generated successfully
    • clusterInstance

      public int clusterInstance(Instance instance) throws Exception
      Assigns a given instance to a cluster.
      Specified by:
      clusterInstance in interface Clusterer
      Overrides:
      clusterInstance in class AbstractClusterer
      Parameters:
      instance - the instance to be assigned to a cluster
      Returns:
      the index of the assigned cluster, as an integer
      Throws:
      Exception - if the instance could not be assigned to a cluster
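
Conceptually, assigning an instance to a cluster means returning the index of the nearest centroid. A minimal sketch, assuming plain double arrays and squared Euclidean distance (the class and method names are hypothetical, and none of Weka's attribute handling is reproduced):

```java
/** Illustrative only: not Weka code. */
class NearestCentroidSketch {

    /** Returns the index of the centroid closest to the given point. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = -1;
        double bestSq = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sq = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                sq += diff * diff;
            }
            // Strict comparison: ties go to the lower cluster index.
            if (sq < bestSq) {
                bestSq = sq;
                best = c;
            }
        }
        return best;
    }
}
```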
    • numberOfClusters

      public int numberOfClusters() throws Exception
      Returns the number of clusters.
      Specified by:
      numberOfClusters in interface Clusterer
      Specified by:
      numberOfClusters in class AbstractClusterer
      Returns:
      the number of clusters generated for a training dataset.
      Throws:
      Exception - if number of clusters could not be returned successfully
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class RandomizableClusterer
      Returns:
      an enumeration of all the available options.
    • numClustersTipText

      public String numClustersTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNumClusters

      public void setNumClusters(int n) throws Exception
      Sets the number of clusters to generate.
      Specified by:
      setNumClusters in interface NumberOfClustersRequestable
      Parameters:
      n - the number of clusters to generate
      Throws:
      Exception - if number of clusters is negative
    • getNumClusters

      public int getNumClusters()
      Gets the number of clusters to generate.
      Returns:
      the number of clusters to generate
    • initializationMethodTipText

      public String initializationMethodTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setInitializationMethod

      public void setInitializationMethod(SelectedTag method)
      Set the initialization method to use
      Parameters:
      method - the initialization method to use
    • getInitializationMethod

      public SelectedTag getInitializationMethod()
      Get the initialization method to use
      Returns:
      the initialization method to use
    • reduceNumberOfDistanceCalcsViaCanopiesTipText

      public String reduceNumberOfDistanceCalcsViaCanopiesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setReduceNumberOfDistanceCalcsViaCanopies

      public void setReduceNumberOfDistanceCalcsViaCanopies(boolean c)
      Set whether to use canopies to reduce the number of distance computations required
      Parameters:
      c - true if canopies are to be used to reduce the number of distance computations
    • getReduceNumberOfDistanceCalcsViaCanopies

      public boolean getReduceNumberOfDistanceCalcsViaCanopies()
      Get whether to use canopies to reduce the number of distance computations required
      Returns:
      true if canopies are to be used to reduce the number of distance computations
    • canopyPeriodicPruningRateTipText

      public String canopyPeriodicPruningRateTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setCanopyPeriodicPruningRate

      public void setCanopyPeriodicPruningRate(int p)
      Set how often to prune low density canopies during training (if using canopy clustering)
      Parameters:
      p - how often (every p instances) to prune low density canopies
    • getCanopyPeriodicPruningRate

      public int getCanopyPeriodicPruningRate()
      Get how often to prune low density canopies during training (if using canopy clustering)
      Returns:
      how often (every p instances) to prune low density canopies
    • canopyMinimumCanopyDensityTipText

      public String canopyMinimumCanopyDensityTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setCanopyMinimumCanopyDensity

      public void setCanopyMinimumCanopyDensity(double dens)
      Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
      Parameters:
      dens - the minimum canopy density
    • getCanopyMinimumCanopyDensity

      public double getCanopyMinimumCanopyDensity()
      Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
      Returns:
      the minimum canopy density
    • canopyMaxNumCanopiesToHoldInMemoryTipText

      public String canopyMaxNumCanopiesToHoldInMemoryTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setCanopyMaxNumCanopiesToHoldInMemory

      public void setCanopyMaxNumCanopiesToHoldInMemory(int max)
      Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
      Parameters:
      max - the maximum number of candidate canopies to retain in memory during training
    • getCanopyMaxNumCanopiesToHoldInMemory

      public int getCanopyMaxNumCanopiesToHoldInMemory()
      Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
      Returns:
      the maximum number of candidate canopies to retain in memory during training
    • canopyT2TipText

      public String canopyT2TipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setCanopyT2

      public void setCanopyT2(double t2)
      Set the T2 radius to use when canopy clustering supplies the start points and/or is used to reduce the number of distance calculations
      Parameters:
      t2 - the t2 radius to use
    • getCanopyT2

      public double getCanopyT2()
      Get the T2 radius to use when canopy clustering supplies the start points and/or is used to reduce the number of distance calculations
      Returns:
      the t2 radius to use
    • canopyT1TipText

      public String canopyT1TipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setCanopyT1

      public void setCanopyT1(double t1)
      Set the T1 radius to use when canopy clustering supplies the start points and/or is used to reduce the number of distance calculations
      Parameters:
      t1 - the t1 radius to use
    • getCanopyT1

      public double getCanopyT1()
      Get the T1 radius to use when canopy clustering supplies the start points and/or is used to reduce the number of distance calculations
      Returns:
      the t1 radius to use
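
The -t1 option description states that a negative T1 is taken as a positive multiplier for T2, so the default of -1.5 means "1.5 times T2". A minimal sketch of that rule, assuming T2 has already been resolved to a positive radius (the heuristic used when T2 itself is negative is not reproduced here, and the names are hypothetical):

```java
/** Illustrative only: not Weka code. */
class CanopyRadiiSketch {

    /**
     * Resolves the T1 radius. A negative t1 is read as a multiplier of t2;
     * a non-negative t1 is used as given. Assumes t2 is already positive.
     */
    static double effectiveT1(double t1, double t2) {
        return (t1 < 0) ? -t1 * t2 : t1;
    }

    public static void main(String[] args) {
        System.out.println(effectiveT1(-1.5, 2.0)); // multiplier form
        System.out.println(effectiveT1(4.0, 2.0));  // explicit radius
    }
}
```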
    • maxIterationsTipText

      public String maxIterationsTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMaxIterations

      public void setMaxIterations(int n) throws Exception
      Sets the maximum number of iterations to be executed.
      Parameters:
      n - the maximum number of iterations
      Throws:
      Exception - if the maximum number of iterations is smaller than 1
    • getMaxIterations

      public int getMaxIterations()
      Gets the maximum number of iterations to be executed.
      Returns:
      the maximum number of iterations
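
The role of the iteration limit can be illustrated with a generic k-means loop that stops either when no assignment changes or when the limit is reached. This is a sketch under stated assumptions (plain double arrays, squared Euclidean distance, mean centroids, hypothetical names); how Weka counts iterations internally is not claimed here.

```java
import java.util.Arrays;

/** Illustrative only: not Weka code. */
class LloydLoopSketch {

    static int nearest(double[] p, double[][] centroids) {
        int best = -1;
        double bestSq = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sq = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                sq += diff * diff;
            }
            if (sq < bestSq) {
                bestSq = sq;
                best = c;
            }
        }
        return best;
    }

    /** Updates centroids in place; returns the number of rounds performed. */
    static int run(double[][] points, double[][] centroids, int maxIterations) {
        int k = centroids.length;
        int dims = centroids[0].length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        int iterations = 0;
        while (iterations < maxIterations) {
            iterations++;
            boolean changed = false;
            // Assignment step: move each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged before hitting the limit
            }
            // Update step: recompute each non-empty cluster's mean.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dims; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return iterations;
    }
}
```

With a limit of 1 the loop performs a single assignment and update, so the centroids may still be far from their converged positions.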
    • displayStdDevsTipText

      public String displayStdDevsTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setDisplayStdDevs

      public void setDisplayStdDevs(boolean stdD)
      Sets whether standard deviations and nominal counts should be displayed in the clustering output.
      Parameters:
      stdD - true if std. devs and counts should be displayed
    • getDisplayStdDevs

      public boolean getDisplayStdDevs()
      Gets whether standard deviations and nominal counts should be displayed in the clustering output.
      Returns:
      true if std. devs and counts should be displayed
    • dontReplaceMissingValuesTipText

      public String dontReplaceMissingValuesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setDontReplaceMissingValues

      public void setDontReplaceMissingValues(boolean r)
      Sets whether replacement of missing values with the mean/mode is disabled.
      Parameters:
      r - true if missing values are not to be replaced
    • getDontReplaceMissingValues

      public boolean getDontReplaceMissingValues()
      Gets whether replacement of missing values with the mean/mode is disabled.
      Returns:
      true if missing values are not to be replaced
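
According to the -M option description, missing values are replaced with the mean/mode unless this property is set. The numeric (mean) case can be sketched as follows. This is an illustration only: representing a missing value as NaN and the names used are assumptions, and nominal attributes, which would use the mode, are not covered.

```java
/** Illustrative only: not Weka code. */
class MeanImputationSketch {

    /**
     * Returns a copy of the column in which every NaN (treated here as
     * "missing") is replaced by the mean of the values that are present.
     */
    static double[] replaceMissingWithMean(double[] column) {
        double sum = 0.0;
        int present = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                present++;
            }
        }
        double[] out = column.clone();
        if (present == 0) {
            return out; // nothing to compute a mean from
        }
        double mean = sum / present;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }
}
```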
    • distanceFunctionTipText

      public String distanceFunctionTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getDistanceFunction

      public DistanceFunction getDistanceFunction()
      Returns the distance function currently in use.
      Returns:
      the distance function
    • setDistanceFunction

      public void setDistanceFunction(DistanceFunction df) throws Exception
      Sets the distance function to use for instance comparison.
      Parameters:
      df - the new distance function to use
      Throws:
      Exception - if instances cannot be processed
    • preserveInstancesOrderTipText

      public String preserveInstancesOrderTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setPreserveInstancesOrder

      public void setPreserveInstancesOrder(boolean r)
      Sets whether order of instances must be preserved.
      Parameters:
      r - true if the order of instances is to be preserved
    • getPreserveInstancesOrder

      public boolean getPreserveInstancesOrder()
      Gets whether order of instances must be preserved.
      Returns:
      true if the order of instances is to be preserved
    • fastDistanceCalcTipText

      public String fastDistanceCalcTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setFastDistanceCalc

      public void setFastDistanceCalc(boolean value)
      Sets whether to use faster distance calculation.
      Parameters:
      value - true if faster calculation to be used
    • getFastDistanceCalc

      public boolean getFastDistanceCalc()
      Gets whether to use faster distance calculation.
      Returns:
      true if faster calculation is used
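
The -fast option description mentions distance calculations "using cut-off values". One common way to use a cut-off, shown here purely as an assumption about the general technique rather than as Weka's implementation, is to abandon a distance computation as soon as the partial sum exceeds the best distance found so far. All names are hypothetical.

```java
/** Illustrative only: not Weka code. */
class FastDistanceSketch {

    /**
     * Squared Euclidean distance that gives up early: once the running sum
     * exceeds the cutoff, positive infinity is returned instead of the
     * exact value.
     */
    static double squaredDistanceWithCutoff(double[] a, double[] b, double cutoff) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
            if (sum > cutoff) {
                return Double.POSITIVE_INFINITY;
            }
        }
        return sum;
    }

    /** Nearest-centroid search that passes the best distance so far as the cutoff. */
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestSq = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sq = squaredDistanceWithCutoff(point, centroids[c], bestSq);
            if (sq < bestSq) {
                bestSq = sq;
                best = c;
            }
        }
        return best;
    }
}
```

The saving grows with the number of attributes, because distant centroids are rejected after only a few components have been compared.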
    • numExecutionSlotsTipText

      public String numExecutionSlotsTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNumExecutionSlots

      public void setNumExecutionSlots(int slots)
      Set the degree of parallelism to use.
      Parameters:
      slots - the number of execution slots, i.e. tasks to run in parallel (1 means no parallelism)
    • getNumExecutionSlots

      public int getNumExecutionSlots()
      Get the degree of parallelism to use.
      Returns:
      the number of execution slots, i.e. tasks to run in parallel (1 means no parallelism)
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class RandomizableClusterer
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
    • getOptions

      public String[] getOptions()
      Gets the current settings of SimpleKMeans.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class RandomizableClusterer
      Returns:
      an array of strings suitable for passing to setOptions()
    • toString

      public String toString()
      Returns a string describing this clusterer.
      Overrides:
      toString in class Object
      Returns:
      a description of the clusterer as a string
    • getClusterCentroids

      public Instances getClusterCentroids()
      Gets the cluster centroids.
      Returns:
      the cluster centroids
    • getClusterStandardDevs

      public Instances getClusterStandardDevs()
      Gets the standard deviations of the numeric attributes in each cluster.
      Returns:
      the standard deviations of the numeric attributes in each cluster
    • getClusterNominalCounts

      public double[][][] getClusterNominalCounts()
      Returns for each cluster the weighted frequency counts for the values of each nominal attribute.
      Returns:
      the counts
    • getSquaredError

      public double getSquaredError()
      Gets the squared error for all clusters.
      Returns:
      the squared error, NaN if fast distance calculation is used
      See Also:
      • m_FastDistanceCalc
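
As a rough illustration of what a squared error over all clusters measures, the sketch below sums the squared Euclidean distance from each instance to the centroid of its assigned cluster. It assumes plain double arrays and unweighted instances, uses hypothetical names, and does not reproduce any attribute normalization that a Weka distance function may apply.

```java
/** Illustrative only: not Weka code. */
class SquaredErrorSketch {

    /** Sum over all points of the squared distance to the assigned centroid. */
    static double squaredError(double[][] points, double[][] centroids, int[] assignments) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            double[] centroid = centroids[assignments[i]];
            for (int d = 0; d < points[i].length; d++) {
                double diff = points[i][d] - centroid[d];
                total += diff * diff;
            }
        }
        return total;
    }
}
```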
    • getClusterSizes

      public double[] getClusterSizes()
      Gets the sum of weights for all the instances in each cluster.
      Returns:
      the sum of instance weights in each cluster (equal to the number of instances when every weight is 1)
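
Because the sizes are sums of instance weights, they are returned as doubles and need not be whole numbers. A minimal sketch of that bookkeeping, assuming an assignment array and a parallel weight array (names are hypothetical, not Weka's):

```java
/** Illustrative only: not Weka code. */
class ClusterSizeSketch {

    /** Adds each instance's weight to the total of its assigned cluster. */
    static double[] clusterSizes(int[] assignments, double[] weights, int numClusters) {
        double[] sizes = new double[numClusters];
        for (int i = 0; i < assignments.length; i++) {
            sizes[assignments[i]] += weights[i];
        }
        return sizes;
    }
}
```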
    • getAssignments

      public int[] getAssignments() throws Exception
      Gets the assignments for each instance.
      Returns:
      Array of indexes of the centroid assigned to each instance
      Throws:
      Exception - if order of instances wasn't preserved or no assignments were made
    • getRevision

      public String getRevision()
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Overrides:
      getRevision in class AbstractClusterer
      Returns:
      the revision
    • main

      public static void main(String[] args)
      Main method for executing this class.
      Parameters:
      args - use -h to list all parameters