weka.clusterers.Canopy

All Implemented Interfaces:: Serializable, Cloneable, Clusterer, NumberOfClustersRequestable, UpdateableClusterer, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler

public class Canopy extends RandomizableClusterer implements UpdateableClusterer, NumberOfClustersRequestable, OptionHandler, TechnicalInformationHandler

Cluster data using the capopy clustering algorithm, which requires just one pass over the data. Can run in eitherbatch or incremental mode. Results are generally not as good when running incrementally as the min/max for each numeric attribute is not known in advance. Has a heuristic (based on attribute std. deviations), that can be used in batch mode, for setting the T2 distance. The T2 distance determines how many canopies (clusters) are formed. When the user specifies a specific number (N) of clusters to generate, the algorithm will return the top N canopies (as determined by T2 density) when N < number of canopies (this applies to both batch and incremental learning); when N > number of canopies, the difference is made up by selecting training instances randomly (this can only be done when batch training). For more information see:

A. McCallum, K. Nigam, L.H. Ungar: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms, 169-178, 2000.

BibTeX:

 @inproceedings{McCallum2000,
    author = {A. McCallum and K. Nigam and L.H. Ungar},
    booktitle = {Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms},
    pages = {169-178},
    title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching},
    year = {2000}
 }

Valid options are:

 -N <num>
  Number of clusters.
  (default 2).

 -max-candidates <num>
  Maximum number of candidate canopies to retain in memory
  at any one time. T2 distance plus, data characteristics,
  will determine how many candidate canopies are formed before
  periodic and final pruning are performed, which might result
  in exceess memory consumption. This setting avoids large numbers
  of candidate canopies consuming memory. (default = 100)

 -periodic-pruning <num>
  How often to prune low density canopies. 
  (default = every 10,000 training instances)

 -min-density
  Minimum canopy density, below which a canopy will be pruned
  during periodic pruning. (default = 2 instances)

 -t2
  The T2 distance to use. Values < 0 indicate that
  a heuristic based on attribute std. deviation should be used to set this.
  Note that this heuristic can only be used when batch training
  (default = -1.0)

 -t1
  The T1 distance to use. A value < 0 is taken as a
  positive multiplier for T2. (default = -1.5)

 -M
  Don't replace missing values with mean/mode when running in batch mode.

 -S <num>
  Random number seed.
  (default 1)

 -output-debug-info
  If set, clusterer is run in debug mode and
  may output additional info to the console

 -do-not-check-capabilities
  If set, clusterer capabilities are not checked before clusterer is built
  (use with caution).

Version:

$Revision: 11012 $

Author:

Mark Hall (mhall{[at]}pentaho{[dot]}com)

See Also:

Serialized Form

Field Summary

Fields

Modifier and Type

Field

Description

static final double

DEFAULT_T1

static final double

DEFAULT_T2
Constructor Summary

Constructors

Constructor

Description

Canopy()
Method Summary

Modifier and Type

Method

Description

static Canopy

aggregateCanopies(List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)

Aggregate the canopies from a list of Canopy clusterers together into one final model.

long[]

assignCanopies(Instance inst)

Uses T1 distance to assign canopies to the supplied instance.

void

buildClusterer(Instances data)

Generates a clusterer.

void

cleanUp()

Save memory

double[]

distributionForInstance(Instance instance)

Predicts the cluster memberships for a given instance.

String

dontReplaceMissingValuesTipText()

Returns the tip text for this property.

double

getActualT1()

Get the actual value of T1 (which may be different from the initial value if the heuristic is used)

double

getActualT2()

Get the actual value of T2 (which may be different from the initial value if the heuristic is used)

Instances

getCanopies()

Get the canopies (cluster centers).

Capabilities

getCapabilities()

Returns default capabilities of the clusterer.

List<long[]>

getClusterCanopyAssignments()

Get the canopies that each canopy (cluster center) is within T1 distance of

boolean

getDontReplaceMissingValues()

Gets whether missing values are to be replaced.

int

getMaxNumCandidateCanopiesToHoldInMemory()

Get the maximum number of candidate canopies to retain in memory during training.

double

getMinimumCanopyDensity()

Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.

int

getNumClusters()

Get the number of clusters to generate

String[]

getOptions()

Gets the current settings of Canopy.

int

getPeriodicPruningRate()

Get the how often to prune low density canopies during training

double

getT1()

Get the T1 distance.

double

getT2()

Get the T2 distance to use.

TechnicalInformation

getTechnicalInformation()

Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

String

globalInfo()

Returns a string describing this clusterer.

void

initializeDistanceFunction(Instances init)

Initialize the distance function (i.e set min/max values for numeric attributes) with the supplied instances.

Enumeration<Option>

listOptions()

Returns an enumeration describing the available options.

static void

main(String[] args)

String

maxNumCandidateCanopiesToHoldInMemory()

Returns the tip text for this property.

String

minimumCanopyDensityTipText()

Returns the tip text for this property.

static boolean

nonEmptyCanopySetIntersection(long[] first, long[] second)

Tests if two sets of canopies have a non-empty intersection

int

numberOfClusters()

Returns the number of clusters.

String

numClustersTipText()

Returns the tip text for this property.

String

periodicPruningRateTipText()

Returns the tip text for this property.

static String

printCanopyAssignments(Instances dataPoints, List<long[]> canopyAssignments)

Print the supplied instances and their canopies

static String

printSingleAssignment(long[] assignments)

void

setCanopies(Instances canopies)

Set the canopies to use (replaces any learned by this clusterer already)

void

setClusterCanopyAssignments(List<long[]> clusterCanopies)

Set the canopies that each canopy (cluster center) is within T1 distance of

void

setDontReplaceMissingValues(boolean r)

Sets whether missing values are to be replaced.

void

setMaxNumCandidateCanopiesToHoldInMemory(int max)

Set the maximum number of candidate canopies to retain in memory during training.

void

setMinimumCanopyDensity(double dens)

Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.

void

setMissingValuesReplacer(Filter missingReplacer)

Set a ready-to-use missing values replacement filter

void

setNumClusters(int numClusters)

Set the number of clusters to generate

void

setOptions(String[] options)

Parses a given list of options.

void

setPeriodicPruningRate(int p)

Set the how often to prune low density canopies during training

void

setT1(double t1)

Set the T1 distance.

void

setT2(double t2)

Set the T2 distance to use.

String

t1TipText()

Tip text for this property

String

t2TipText()

Tip text for this property

String

toString()

String

toString(boolean header)

Return a textual description of this clusterer

void

updateClusterer(Instance newInstance)

Adds an instance to the clusterer.

void

updateFinished()

Signals the end of the updating.

Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed

Methods inherited from class weka.clusterers.AbstractClusterer
clusterInstance, debugTipText, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, getRevision, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Details
- DEFAULT_T2
  
  public static final double DEFAULT_T2
  See Also:
  
  Constant Field Values
- DEFAULT_T1
  
  public static final double DEFAULT_T1
  See Also:
  
  Constant Field Values
Constructor Details
- Canopy
  
  public Canopy()
Method Details
- globalInfo
  
  public String globalInfo()
  
  Returns a string describing this clusterer.
  
  Returns:
  
  a description of the evaluator suitable for displaying in the explorer/experimenter gui
- getTechnicalInformation
  
  public TechnicalInformation getTechnicalInformation()
  
  Description copied from interface: TechnicalInformationHandler
  
  Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
  
  Specified by:
  
  getTechnicalInformation in interface TechnicalInformationHandler
  
  Returns:
  
  the technical information about this class
- getCapabilities
  
  public Capabilities getCapabilities()
  
  Returns default capabilities of the clusterer.
  Specified by:
  
  getCapabilities in interface CapabilitiesHandler
  
  Specified by:
  
  getCapabilities in interface Clusterer
  
  Overrides:
  
  getCapabilities in class AbstractClusterer
  
  Returns:
  
  the capabilities of this clusterer
  
  See Also:
  
  Capabilities
- listOptions
  
  public Enumeration<Option> listOptions()
  
  Returns an enumeration describing the available options.
  
  Specified by:
  
  listOptions in interface OptionHandler
  
  Overrides:
  
  listOptions in class RandomizableClusterer
  
  Returns:
  
  an enumeration of all the available options.
- setOptions
  
  public void setOptions(String[] options) throws Exception
  Parses a given list of options.
  Valid options are:
  
  -N <num> Number of clusters. (default 2).
  
  -max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. T2 distance plus, data characteristics, will determine how many candidate canopies are formed before periodic and final pruning are performed, which might result in exceess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
  
  -periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
  
  -min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
  
  -t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
  
  -t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
  
  -M Don't replace missing values with mean/mode when running in batch mode.
  
  -S <num> Random number seed. (default 1)
  
  -output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
  
  -do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
  Specified by:
  
  setOptions in interface OptionHandler
  
  Overrides:
  
  setOptions in class RandomizableClusterer
  
  Parameters:
  
  options - the list of options as an array of strings throws Exception if an option is not supported
  
  Throws:
  
  Exception - if an option is not supported
- getOptions
  
  public String[] getOptions()
  
  Gets the current settings of Canopy.
  
  Specified by:
  
  getOptions in interface OptionHandler
  
  Overrides:
  
  getOptions in class RandomizableClusterer
  
  Returns:
  
  an array of strings suitable for passing to setOptions()
- nonEmptyCanopySetIntersection
  
  public static boolean nonEmptyCanopySetIntersection(long[] first, long[] second) throws Exception
  
  Tests if two sets of canopies have a non-empty intersection
  
  Parameters:
  
  first - the first canopy set
  
  second - the second canopy set
  
  Returns:
  
  true if the intersection is non-empty
  
  Throws:
  
  Exception - if a problem occurs
- assignCanopies
  
  public long[] assignCanopies(Instance inst) throws Exception
  
  Uses T1 distance to assign canopies to the supplied instance. If the instance does not fall within T1 distance of any canopies then the instance has the closest canopy assigned to it.
  
  Parameters:
  
  inst - the instance to find covering canopies for
  
  Returns:
  
  a set of canopies that contain this instance according to T1 distance
  
  Throws:
  
  Exception - if a problem occurs
- updateClusterer
  
  public void updateClusterer(Instance newInstance) throws Exception
  
  Description copied from interface: UpdateableClusterer
  
  Adds an instance to the clusterer.
  
  Specified by:
  
  updateClusterer in interface UpdateableClusterer
  
  Parameters:
  
  newInstance - the instance to be added
  
  Throws:
  
  Exception - if something goes wrong
- distributionForInstance
  
  public double[] distributionForInstance(Instance instance) throws Exception
  
  Description copied from class: AbstractClusterer
  
  Predicts the cluster memberships for a given instance. Either this or clusterInstance() needs to be implemented by subclasses.
  
  Specified by:
  
  distributionForInstance in interface Clusterer
  
  Overrides:
  
  distributionForInstance in class AbstractClusterer
  
  Parameters:
  
  instance - the instance to be assigned a cluster.
  
  Returns:
  
  an array containing the estimated membership probabilities of the test instance in each cluster (this should sum to at most 1)
  
  Throws:
  
  Exception - if distribution could not be computed successfully
- updateFinished
  
  public void updateFinished()
  
  Description copied from interface: UpdateableClusterer
  
  Signals the end of the updating.
  
  Specified by:
  
  updateFinished in interface UpdateableClusterer
- initializeDistanceFunction
  
  public void initializeDistanceFunction(Instances init) throws Exception
  
  Initialize the distance function (i.e set min/max values for numeric attributes) with the supplied instances.
  
  Parameters:
  
  init - the instances to initialize with
  
  Throws:
  
  Exception - if a problem occurs
- buildClusterer
  
  public void buildClusterer(Instances data) throws Exception
  
  Description copied from class: AbstractClusterer
  
  Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
  
  Specified by:
  
  buildClusterer in interface Clusterer
  
  Specified by:
  
  buildClusterer in class AbstractClusterer
  
  Parameters:
  
  data - set of instances serving as training data
  
  Throws:
  
  Exception - if the clusterer has not been generated successfully
- numberOfClusters
  
  public int numberOfClusters() throws Exception
  
  Description copied from class: AbstractClusterer
  
  Returns the number of clusters.
  
  Specified by:
  
  numberOfClusters in interface Clusterer
  
  Specified by:
  
  numberOfClusters in class AbstractClusterer
  
  Returns:
  
  the number of clusters generated for a training dataset.
  
  Throws:
  
  Exception - if number of clusters could not be returned successfully
- setMissingValuesReplacer
  
  public void setMissingValuesReplacer(Filter missingReplacer)
  
  Set a ready-to-use missing values replacement filter
  
  Parameters:
  
  missingReplacer - the missing values replacement filter to use
- getCanopies
  
  public Instances getCanopies()
  
  Get the canopies (cluster centers).
  
  Returns:
  
  the canopies
- setCanopies
  
  public void setCanopies(Instances canopies)
  
  Set the canopies to use (replaces any learned by this clusterer already)
  
  Parameters:
  
  canopies - the canopies to use
- getClusterCanopyAssignments
  
  public List<long[]> getClusterCanopyAssignments()
  
  Get the canopies that each canopy (cluster center) is within T1 distance of
  
  Returns:
  
  a list of canopies for each cluster center
- setClusterCanopyAssignments
  
  public void setClusterCanopyAssignments(List<long[]> clusterCanopies)
  
  Set the canopies that each canopy (cluster center) is within T1 distance of
  
  Parameters:
  
  clusterCanopies - the list canopies for each cluster center
- getActualT2
  
  public double getActualT2()
  
  Get the actual value of T2 (which may be different from the initial value if the heuristic is used)
  
  Returns:
  
  the actual value of T2
- getActualT1
  
  public double getActualT1()
  
  Get the actual value of T1 (which may be different from the initial value if the heuristic is used)
  
  Returns:
  
  the actual value of T1
- t1TipText
  
  public String t1TipText()
  
  Tip text for this property
  
  Returns:
  
  the tip text for this property
- setT1
  
  public void setT1(double t1)
  
  Set the T1 distance. Values < 0 are taken as a positive multiplier for the T2 distance - e.g. T1_actual = Math.abs(t1) * t2;
  
  Parameters:
  
  t1 - the T1 distance to use
- getT1
  
  public double getT1()
  
  Get the T1 distance. Values < 0 are taken as a positive multiplier for the T2 distance - e.g. T1_actual = Math.abs(t1) * t2;
  
  Returns:
  
  the T1 distance to use
- t2TipText
  
  public String t2TipText()
  
  Tip text for this property
  
  Returns:
  
  the tip text for this property
- setT2
  
  public void setT2(double t2)
  
  Set the T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this (note that the heuristic is only applicable when batch training).
  
  Parameters:
  
  t2 - the T2 distance to use
- getT2
  
  public double getT2()
  
  Get the T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this (note that the heuristic is only applicable when batch training).
  
  Returns:
  
  the T2 distance to use
- numClustersTipText
  
  public String numClustersTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setNumClusters
  
  public void setNumClusters(int numClusters) throws Exception
  
  Description copied from interface: NumberOfClustersRequestable
  
  Set the number of clusters to generate
  
  Specified by:
  
  setNumClusters in interface NumberOfClustersRequestable
  
  Parameters:
  
  numClusters - the number of clusters to generate
  
  Throws:
  
  Exception - if the requested number of clusters in inapropriate
- getNumClusters
  
  public int getNumClusters()
  
  Get the number of clusters to generate
  
  Returns:
  
  the number of clusters to generate
- periodicPruningRateTipText
  
  public String periodicPruningRateTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setPeriodicPruningRate
  
  public void setPeriodicPruningRate(int p)
  
  Set the how often to prune low density canopies during training
  
  Parameters:
  
  p - how often (every p instances) to prune low density canopies
- getPeriodicPruningRate
  
  public int getPeriodicPruningRate()
  
  Get the how often to prune low density canopies during training
  
  Returns:
  
  how often (every p instances) to prune low density canopies
- minimumCanopyDensityTipText
  
  public String minimumCanopyDensityTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setMinimumCanopyDensity
  
  public void setMinimumCanopyDensity(double dens)
  
  Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
  
  Parameters:
  
  dens - the minimum canopy density
- getMinimumCanopyDensity
  
  public double getMinimumCanopyDensity()
  
  Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
  
  Returns:
  
  the minimum canopy density
- maxNumCandidateCanopiesToHoldInMemory
  
  public String maxNumCandidateCanopiesToHoldInMemory()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setMaxNumCandidateCanopiesToHoldInMemory
  
  public void setMaxNumCandidateCanopiesToHoldInMemory(int max)
  
  Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
  
  Parameters:
  
  max - the maximum number of candidate canopies to retain in memory during training
- getMaxNumCandidateCanopiesToHoldInMemory
  
  public int getMaxNumCandidateCanopiesToHoldInMemory()
  
  Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
  
  Returns:
  
  the maximum number of candidate canopies to retain in memory during training
- dontReplaceMissingValuesTipText
  
  public String dontReplaceMissingValuesTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setDontReplaceMissingValues
  
  public void setDontReplaceMissingValues(boolean r)
  
  Sets whether missing values are to be replaced.
  
  Parameters:
  
  r - true if missing values are to be replaced
- getDontReplaceMissingValues
  
  public boolean getDontReplaceMissingValues()
  
  Gets whether missing values are to be replaced.
  
  Returns:
  
  true if missing values are to be replaced
- printSingleAssignment
  
  public static String printSingleAssignment(long[] assignments)
- printCanopyAssignments
  
  public static String printCanopyAssignments(Instances dataPoints, List<long[]> canopyAssignments)
  
  Print the supplied instances and their canopies
  
  Parameters:
  
  dataPoints - the instances to print
  
  canopyAssignments - the canopy assignments, one assignment array for each instance
  
  Returns:
  
  a string containing the printed assignments
- toString
  
  public String toString(boolean header)
  
  Return a textual description of this clusterer
  
  Parameters:
  
  header - true if the header should be printed
  
  Returns:
  
  a string describing the result of the clustering
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- cleanUp
  
  public void cleanUp()
  
  Save memory
- aggregateCanopies
  
  public static Canopy aggregateCanopies(List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)
  
  Aggregate the canopies from a list of Canopy clusterers together into one final model.
  
  Parameters:
  
  canopies - the list of Canopy clusterers to aggregate
  
  aggregationT1 - the T1 distance to use for the aggregated classifier
  
  aggregationT2 - the T2 distance to use when aggregating canopies
  
  finalDistanceFunction - the distance function to use with the final Canopy clusterer
  
  missingValuesReplacer - the missing value replacement filter to use with the final clusterer (can be null for no missing value replacement)
  
  finalNumCanopies - the final number of canopies
  
  Returns:
  
  a Canopy clusterer that aggregates all the canopies
- main
  
  public static void main(String[] args)

Class Canopy

Field Summary

Constructor Summary

Method Summary

Methods inherited from class weka.clusterers.RandomizableClusterer

Methods inherited from class weka.clusterers.AbstractClusterer

Methods inherited from class java.lang.Object

Field Details

DEFAULT_T2

DEFAULT_T1

Constructor Details

Canopy

Method Details

globalInfo

getTechnicalInformation

getCapabilities

listOptions

setOptions

getOptions

nonEmptyCanopySetIntersection

assignCanopies

updateClusterer

distributionForInstance

updateFinished

initializeDistanceFunction

buildClusterer

numberOfClusters

setMissingValuesReplacer

getCanopies

setCanopies

getClusterCanopyAssignments

setClusterCanopyAssignments

getActualT2

getActualT1

t1TipText

setT1

getT1

t2TipText

setT2

getT2

numClustersTipText

setNumClusters

getNumClusters

periodicPruningRateTipText

setPeriodicPruningRate

getPeriodicPruningRate

minimumCanopyDensityTipText

setMinimumCanopyDensity

getMinimumCanopyDensity

maxNumCandidateCanopiesToHoldInMemory

setMaxNumCandidateCanopiesToHoldInMemory

getMaxNumCandidateCanopiesToHoldInMemory

dontReplaceMissingValuesTipText

setDontReplaceMissingValues

getDontReplaceMissingValues

printSingleAssignment

printCanopyAssignments

toString

toString

cleanUp

aggregateCanopies

main