Package weka.clusterers
Class Canopy
java.lang.Object
weka.clusterers.AbstractClusterer
weka.clusterers.RandomizableClusterer
weka.clusterers.Canopy
- All Implemented Interfaces:
Serializable
,Cloneable
,Clusterer
,NumberOfClustersRequestable
,UpdateableClusterer
,CapabilitiesHandler
,CapabilitiesIgnorer
,CommandlineRunnable
,OptionHandler
,Randomizable
,RevisionHandler
,TechnicalInformationHandler
public class Canopy
extends RandomizableClusterer
implements UpdateableClusterer, NumberOfClustersRequestable, OptionHandler, TechnicalInformationHandler
Cluster data using the capopy clustering algorithm, which requires just one pass over the data. Can run in eitherbatch or incremental mode. Results are generally not as good when running incrementally as the min/max for each numeric attribute is not known in advance. Has a heuristic (based on attribute std. deviations), that can be used in batch mode, for setting the T2 distance. The T2 distance determines how many canopies (clusters) are formed. When the user specifies a specific number (N) of clusters to generate, the algorithm will return the top N canopies (as determined by T2 density) when N < number of canopies (this applies to both batch and incremental learning); when N > number of canopies, the difference is made up by selecting training instances randomly (this can only be done when batch training). For more information see:
A. McCallum, K. Nigam, L.H. Ungar: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms, 169-178, 2000. BibTeX:
A. McCallum, K. Nigam, L.H. Ungar: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms, 169-178, 2000. BibTeX:
@inproceedings{McCallum2000, author = {A. McCallum and K. Nigam and L.H. Ungar}, booktitle = {Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms}, pages = {169-178}, title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching}, year = {2000} }Valid options are:
-N <num> Number of clusters. (default 2).
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. T2 distance plus, data characteristics, will determine how many candidate canopies are formed before periodic and final pruning are performed, which might result in exceess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
-min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
- Version:
- $Revision: 11012 $
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com)
- See Also:
-
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and TypeMethodDescriptionstatic Canopy
aggregateCanopies
(List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies) Aggregate the canopies from a list of Canopy clusterers together into one final model.long[]
assignCanopies
(Instance inst) Uses T1 distance to assign canopies to the supplied instance.void
buildClusterer
(Instances data) Generates a clusterer.void
cleanUp()
Save memorydouble[]
distributionForInstance
(Instance instance) Predicts the cluster memberships for a given instance.Returns the tip text for this property.double
Get the actual value of T1 (which may be different from the initial value if the heuristic is used)double
Get the actual value of T2 (which may be different from the initial value if the heuristic is used)Get the canopies (cluster centers).Returns default capabilities of the clusterer.List<long[]>
Get the canopies that each canopy (cluster center) is within T1 distance ofboolean
Gets whether missing values are to be replaced.int
Get the maximum number of candidate canopies to retain in memory during training.double
Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.int
Get the number of clusters to generateString[]
Gets the current settings of Canopy.int
Get the how often to prune low density canopies during trainingdouble
getT1()
Get the T1 distance.double
getT2()
Get the T2 distance to use.Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.Returns a string describing this clusterer.void
Initialize the distance function (i.e set min/max values for numeric attributes) with the supplied instances.Returns an enumeration describing the available options.static void
Returns the tip text for this property.Returns the tip text for this property.static boolean
nonEmptyCanopySetIntersection
(long[] first, long[] second) Tests if two sets of canopies have a non-empty intersectionint
Returns the number of clusters.Returns the tip text for this property.Returns the tip text for this property.static String
printCanopyAssignments
(Instances dataPoints, List<long[]> canopyAssignments) Print the supplied instances and their canopiesstatic String
printSingleAssignment
(long[] assignments) void
setCanopies
(Instances canopies) Set the canopies to use (replaces any learned by this clusterer already)void
setClusterCanopyAssignments
(List<long[]> clusterCanopies) Set the canopies that each canopy (cluster center) is within T1 distance ofvoid
setDontReplaceMissingValues
(boolean r) Sets whether missing values are to be replaced.void
Set the maximum number of candidate canopies to retain in memory during training.void
setMinimumCanopyDensity
(double dens) Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.void
setMissingValuesReplacer
(Filter missingReplacer) Set a ready-to-use missing values replacement filtervoid
setNumClusters
(int numClusters) Set the number of clusters to generatevoid
setOptions
(String[] options) Parses a given list of options.void
setPeriodicPruningRate
(int p) Set the how often to prune low density canopies during trainingvoid
setT1
(double t1) Set the T1 distance.void
setT2
(double t2) Set the T2 distance to use.Tip text for this propertyTip text for this propertytoString()
toString
(boolean header) Return a textual description of this clusterervoid
updateClusterer
(Instance newInstance) Adds an instance to the clusterer.void
Signals the end of the updating.Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed
Methods inherited from class weka.clusterers.AbstractClusterer
clusterInstance, debugTipText, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, getRevision, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities
-
Field Details
-
DEFAULT_T2
public static final double DEFAULT_T2- See Also:
-
DEFAULT_T1
public static final double DEFAULT_T1- See Also:
-
-
Constructor Details
-
Canopy
public Canopy()
-
-
Method Details
-
globalInfo
Returns a string describing this clusterer.- Returns:
- a description of the evaluator suitable for displaying in the explorer/experimenter gui
-
getTechnicalInformation
Description copied from interface:TechnicalInformationHandler
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.- Specified by:
getTechnicalInformation
in interfaceTechnicalInformationHandler
- Returns:
- the technical information about this class
-
getCapabilities
Returns default capabilities of the clusterer.- Specified by:
getCapabilities
in interfaceCapabilitiesHandler
- Specified by:
getCapabilities
in interfaceClusterer
- Overrides:
getCapabilities
in classAbstractClusterer
- Returns:
- the capabilities of this clusterer
- See Also:
-
listOptions
Returns an enumeration describing the available options.- Specified by:
listOptions
in interfaceOptionHandler
- Overrides:
listOptions
in classRandomizableClusterer
- Returns:
- an enumeration of all the available options.
-
setOptions
Parses a given list of options. Valid options are:-N <num> Number of clusters. (default 2).
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. T2 distance plus, data characteristics, will determine how many candidate canopies are formed before periodic and final pruning are performed, which might result in exceess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
-min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
- Specified by:
setOptions
in interfaceOptionHandler
- Overrides:
setOptions
in classRandomizableClusterer
- Parameters:
options
- the list of options as an array of strings throws Exception if an option is not supported- Throws:
Exception
- if an option is not supported
-
getOptions
Gets the current settings of Canopy.- Specified by:
getOptions
in interfaceOptionHandler
- Overrides:
getOptions
in classRandomizableClusterer
- Returns:
- an array of strings suitable for passing to setOptions()
-
nonEmptyCanopySetIntersection
Tests if two sets of canopies have a non-empty intersection- Parameters:
first
- the first canopy setsecond
- the second canopy set- Returns:
- true if the intersection is non-empty
- Throws:
Exception
- if a problem occurs
-
assignCanopies
Uses T1 distance to assign canopies to the supplied instance. If the instance does not fall within T1 distance of any canopies then the instance has the closest canopy assigned to it.- Parameters:
inst
- the instance to find covering canopies for- Returns:
- a set of canopies that contain this instance according to T1 distance
- Throws:
Exception
- if a problem occurs
-
updateClusterer
Description copied from interface:UpdateableClusterer
Adds an instance to the clusterer.- Specified by:
updateClusterer
in interfaceUpdateableClusterer
- Parameters:
newInstance
- the instance to be added- Throws:
Exception
- if something goes wrong
-
distributionForInstance
Description copied from class:AbstractClusterer
Predicts the cluster memberships for a given instance. Either this or clusterInstance() needs to be implemented by subclasses.- Specified by:
distributionForInstance
in interfaceClusterer
- Overrides:
distributionForInstance
in classAbstractClusterer
- Parameters:
instance
- the instance to be assigned a cluster.- Returns:
- an array containing the estimated membership probabilities of the test instance in each cluster (this should sum to at most 1)
- Throws:
Exception
- if distribution could not be computed successfully
-
updateFinished
public void updateFinished()Description copied from interface:UpdateableClusterer
Signals the end of the updating.- Specified by:
updateFinished
in interfaceUpdateableClusterer
-
initializeDistanceFunction
Initialize the distance function (i.e set min/max values for numeric attributes) with the supplied instances.- Parameters:
init
- the instances to initialize with- Throws:
Exception
- if a problem occurs
-
buildClusterer
Description copied from class:AbstractClusterer
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.- Specified by:
buildClusterer
in interfaceClusterer
- Specified by:
buildClusterer
in classAbstractClusterer
- Parameters:
data
- set of instances serving as training data- Throws:
Exception
- if the clusterer has not been generated successfully
-
numberOfClusters
Description copied from class:AbstractClusterer
Returns the number of clusters.- Specified by:
numberOfClusters
in interfaceClusterer
- Specified by:
numberOfClusters
in classAbstractClusterer
- Returns:
- the number of clusters generated for a training dataset.
- Throws:
Exception
- if number of clusters could not be returned successfully
-
setMissingValuesReplacer
Set a ready-to-use missing values replacement filter- Parameters:
missingReplacer
- the missing values replacement filter to use
-
getCanopies
Get the canopies (cluster centers).- Returns:
- the canopies
-
setCanopies
Set the canopies to use (replaces any learned by this clusterer already)- Parameters:
canopies
- the canopies to use
-
getClusterCanopyAssignments
Get the canopies that each canopy (cluster center) is within T1 distance of- Returns:
- a list of canopies for each cluster center
-
setClusterCanopyAssignments
Set the canopies that each canopy (cluster center) is within T1 distance of- Parameters:
clusterCanopies
- the list canopies for each cluster center
-
getActualT2
public double getActualT2()Get the actual value of T2 (which may be different from the initial value if the heuristic is used)- Returns:
- the actual value of T2
-
getActualT1
public double getActualT1()Get the actual value of T1 (which may be different from the initial value if the heuristic is used)- Returns:
- the actual value of T1
-
t1TipText
Tip text for this property- Returns:
- the tip text for this property
-
setT1
public void setT1(double t1) Set the T1 distance. Values < 0 are taken as a positive multiplier for the T2 distance - e.g. T1_actual = Math.abs(t1) * t2;- Parameters:
t1
- the T1 distance to use
-
getT1
public double getT1()Get the T1 distance. Values < 0 are taken as a positive multiplier for the T2 distance - e.g. T1_actual = Math.abs(t1) * t2;- Returns:
- the T1 distance to use
-
t2TipText
Tip text for this property- Returns:
- the tip text for this property
-
setT2
public void setT2(double t2) Set the T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this (note that the heuristic is only applicable when batch training).- Parameters:
t2
- the T2 distance to use
-
getT2
public double getT2()Get the T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this (note that the heuristic is only applicable when batch training).- Returns:
- the T2 distance to use
-
numClustersTipText
Returns the tip text for this property.- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setNumClusters
Description copied from interface:NumberOfClustersRequestable
Set the number of clusters to generate- Specified by:
setNumClusters
in interfaceNumberOfClustersRequestable
- Parameters:
numClusters
- the number of clusters to generate- Throws:
Exception
- if the requested number of clusters in inapropriate
-
getNumClusters
public int getNumClusters()Get the number of clusters to generate- Returns:
- the number of clusters to generate
-
periodicPruningRateTipText
Returns the tip text for this property.- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setPeriodicPruningRate
public void setPeriodicPruningRate(int p) Set the how often to prune low density canopies during training- Parameters:
p
- how often (every p instances) to prune low density canopies
-
getPeriodicPruningRate
public int getPeriodicPruningRate()Get the how often to prune low density canopies during training- Returns:
- how often (every p instances) to prune low density canopies
-
minimumCanopyDensityTipText
Returns the tip text for this property.- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMinimumCanopyDensity
public void setMinimumCanopyDensity(double dens) Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.- Parameters:
dens
- the minimum canopy density
-
getMinimumCanopyDensity
public double getMinimumCanopyDensity()Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.- Returns:
- the minimum canopy density
-
maxNumCandidateCanopiesToHoldInMemory
Returns the tip text for this property.- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMaxNumCandidateCanopiesToHoldInMemory
public void setMaxNumCandidateCanopiesToHoldInMemory(int max) Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.- Parameters:
max
- the maximum number of candidate canopies to retain in memory during training
-
getMaxNumCandidateCanopiesToHoldInMemory
public int getMaxNumCandidateCanopiesToHoldInMemory()Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.- Returns:
- the maximum number of candidate canopies to retain in memory during training
-
dontReplaceMissingValuesTipText
Returns the tip text for this property.- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setDontReplaceMissingValues
public void setDontReplaceMissingValues(boolean r) Sets whether missing values are to be replaced.- Parameters:
r
- true if missing values are to be replaced
-
getDontReplaceMissingValues
public boolean getDontReplaceMissingValues()Gets whether missing values are to be replaced.- Returns:
- true if missing values are to be replaced
-
printSingleAssignment
-
printCanopyAssignments
Print the supplied instances and their canopies- Parameters:
dataPoints
- the instances to printcanopyAssignments
- the canopy assignments, one assignment array for each instance- Returns:
- a string containing the printed assignments
-
toString
Return a textual description of this clusterer- Parameters:
header
- true if the header should be printed- Returns:
- a string describing the result of the clustering
-
toString
-
cleanUp
public void cleanUp()Save memory -
aggregateCanopies
public static Canopy aggregateCanopies(List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies) Aggregate the canopies from a list of Canopy clusterers together into one final model.- Parameters:
canopies
- the list of Canopy clusterers to aggregateaggregationT1
- the T1 distance to use for the aggregated classifieraggregationT2
- the T2 distance to use when aggregating canopiesfinalDistanceFunction
- the distance function to use with the final Canopy clusterermissingValuesReplacer
- the missing value replacement filter to use with the final clusterer (can be null for no missing value replacement)finalNumCanopies
- the final number of canopies- Returns:
- a Canopy clusterer that aggregates all the canopies
-
main
-