Package weka.classifiers.rules
Class JRip
java.lang.Object
weka.classifiers.AbstractClassifier
weka.classifiers.rules.JRip
- All Implemented Interfaces:
Serializable, Cloneable, Classifier, AdditionalMeasureProducer, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, TechnicalInformationHandler, WeightedInstancesHandler
public class JRip
extends AbstractClassifier
implements AdditionalMeasureProducer, WeightedInstancesHandler, TechnicalInformationHandler
This class implements a propositional rule learner,
Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which was
proposed by William W. Cohen as an optimized version of IREP.
The algorithm is briefly described as follows:
Initialize RS = {}, and for each class, from the least prevalent to the most frequent, DO:
1. Building stage:
Repeat 1.1 and 1.2 until the description length (DL) of the ruleset and examples is 64 bits greater than the smallest DL found so far, there are no positive examples left, or the error rate >= 50%.
1.1. Grow phase:
Grow one rule by greedily adding antecedents (conditions) to the rule until the rule is perfect (i.e. 100% accurate). The procedure tries every possible value of each attribute and selects the condition with the highest information gain: p(log(p/t)-log(P/T)).
1.2. Prune phase:
Incrementally prune each rule, allowing the pruning of any final sequence of antecedents. The pruning metric is (p-n)/(p+n) -- but since that equals 2p/(p+n) - 1, this implementation simply uses p/(p+n) (actually (p+1)/(p+n+2), so that the metric is 0.5 when p+n is 0).
2. Optimization stage:
After generating the initial ruleset {Ri}, generate and prune two variants of each rule Ri from randomized data using procedures 1.1 and 1.2. One variant is grown from an empty rule, while the other is produced by greedily adding antecedents to the original rule. The pruning metric used here is (TP+TN)/(P+N). The smallest possible DL for each variant and for the original rule is then computed, and the candidate with the minimal DL is selected as the final representative of Ri in the ruleset. After all the rules in {Ri} have been examined, if residual positive examples remain, more rules are generated from them using the Building stage again.
3. Delete any rule from the ruleset whose presence would increase the DL of the whole ruleset, and add the resultant ruleset to RS.
ENDDO
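The grow-phase and prune-phase metrics above can be sketched in plain Java. This is an illustrative stand-in, not Weka's actual code: p and n are the positive and negative examples covered by the candidate rule (t = p + n), and P, T are the corresponding counts before the new condition is added.

```java
// Illustrative sketch of RIPPER's grow/prune metrics (not Weka's implementation).
public class RipperMetrics {

    // Grow-phase information gain: p * (log2(p/t) - log2(P/T)).
    static double infoGain(double p, double t, double P, double T) {
        return p * (log2(p / t) - log2(P / T));
    }

    // Prune-phase metric with the smoothing described above: (p+1)/(p+n+2),
    // which evaluates to 0.5 when the rule covers no pruning-set examples.
    static double pruneMetric(double p, double n) {
        return (p + 1) / (p + n + 2);
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        // A condition narrowing coverage from 50/100 positives to 20/22:
        System.out.println(infoGain(20, 22, 50, 100)); // positive: purity rose
        System.out.println(pruneMetric(20, 2));        // (20+1)/(20+2+2) = 0.875
        System.out.println(pruneMetric(0, 0));         // empty coverage -> 0.5
    }
}
```

Note the +1/+2 smoothing exists only to make the metric well-defined on rules that cover nothing on the pruning set.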
Note that there appear to be two bugs in the original RIPPER program that slightly affect ruleset size and accuracy. This implementation avoids these bugs and thus differs a little from Cohen's original implementation. Even with the bugs fixed, because the order of classes with equal frequency is not defined in RIPPER, there may still be trivial differences between this implementation and the original, especially for the audiology data in the UCI repository, which has many classes with few instances.
For details, please see:
William W. Cohen: Fast Effective Rule Induction. In: Twelfth International Conference on Machine Learning, 115-123, 1995.
PS. We have compared this implementation with the original RIPPER implementation with respect to accuracy, ruleset size, and running time on both the artificial data "ab+bcd+defg" and UCI datasets. In all these respects it appears quite comparable to the original RIPPER implementation. However, memory consumption was not optimized in this implementation.
BibTeX:
@inproceedings{Cohen1995,
author = {William W. Cohen},
booktitle = {Twelfth International Conference on Machine Learning},
pages = {115-123},
publisher = {Morgan Kaufmann},
title = {Fast Effective Rule Induction},
year = {1995}
}
Valid options are:
-F <number of folds> Set the number of folds for REP; one fold is used as the pruning set. (default: 3)
-N <min. weights> Set the minimal total weight of the instances within a split. (default: 2.0)
-O <number of runs> Set the number of optimization runs. (default: 2)
-D Turn on debug mode. (default: false)
-S <seed> The seed for randomization. (default: 1)
-E Do NOT check whether the error rate >= 0.5 in the stopping criterion. (default: check)
-P Do NOT use pruning. (default: use pruning)
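As a sketch of how these flags map onto the learner's settings and defaults, here is a hypothetical minimal parser. The class and field names are mine for illustration; the real class parses these flags in setOptions.

```java
// Hypothetical sketch (not Weka's parser) of the JRip flags and their defaults.
public class JRipOptions {
    int folds = 3;                 // -F: folds for REP; one fold is the pruning set
    double minWeight = 2.0;        // -N: minimal total instance weight in a split
    int optimizations = 2;         // -O: number of optimization runs
    boolean debug = false;         // -D: debug output
    long seed = 1;                 // -S: randomization seed
    boolean checkErrorRate = true; // -E turns the error rate >= 0.5 check OFF
    boolean usePruning = true;     // -P turns pruning OFF

    static JRipOptions parse(String[] args) {
        JRipOptions o = new JRipOptions();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-F": o.folds = Integer.parseInt(args[++i]); break;
                case "-N": o.minWeight = Double.parseDouble(args[++i]); break;
                case "-O": o.optimizations = Integer.parseInt(args[++i]); break;
                case "-S": o.seed = Long.parseLong(args[++i]); break;
                case "-D": o.debug = true; break;
                case "-E": o.checkErrorRate = false; break; // NOT check
                case "-P": o.usePruning = false; break;     // NOT prune
            }
        }
        return o;
    }

    public static void main(String[] args) {
        JRipOptions o = parse(new String[] { "-F", "5", "-E" });
        System.out.println(o.folds);          // 5
        System.out.println(o.checkErrorRate); // false
        System.out.println(o.usePruning);     // true (default)
    }
}
```

Note that -E and -P are inverted switches: supplying them disables the corresponding behavior, which is why the defaults are "check" and "use pruning".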
- Version:
- $Revision: 15519 $
- Author:
- Xin Xu (xx5@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
- See Also:
-
Nested Class Summary
Nested Classes
Modifier and Type  Class  Description
class  JRip.Antd  The single antecedent in the rule, composed of an attribute and the corresponding value.
class  JRip.NominalAntd  The antecedent with a nominal attribute.
class  JRip.NumericAntd  The antecedent with a numeric attribute.
class  JRip.RipperRule  This class implements a single rule that predicts a specified class.
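As an illustration of this nested-class design, here is a simplified, self-contained stand-in for the antecedent and rule types listed above (not Weka's actual implementation):

```java
// Simplified stand-in for JRip's antecedent/rule design (not Weka's code).
public class AntecedentSketch {

    interface Antd {
        boolean covers(double value); // Weka encodes nominal values as doubles too
    }

    // Nominal antecedent: the attribute must equal one category index.
    static class NominalAntd implements Antd {
        final double valueIndex;
        NominalAntd(double valueIndex) { this.valueIndex = valueIndex; }
        public boolean covers(double value) { return value == valueIndex; }
    }

    // Numeric antecedent: the attribute must lie on one side of a split point.
    static class NumericAntd implements Antd {
        final double split;
        final boolean lessOrEqual;
        NumericAntd(double split, boolean lessOrEqual) {
            this.split = split;
            this.lessOrEqual = lessOrEqual;
        }
        public boolean covers(double value) {
            return lessOrEqual ? value <= split : value > split;
        }
    }

    // A rule is a conjunction of antecedents, one per tested attribute index.
    static boolean ruleCovers(Antd[] antds, int[] attrIdx, double[] instance) {
        for (int i = 0; i < antds.length; i++)
            if (!antds[i].covers(instance[attrIdx[i]])) return false;
        return true;
    }

    public static void main(String[] args) {
        Antd[] antds = { new NumericAntd(0.6, true), new NominalAntd(2) };
        int[] idx = { 0, 1 };
        System.out.println(ruleCovers(antds, idx, new double[] { 0.5, 2 })); // true
        System.out.println(ruleCovers(antds, idx, new double[] { 0.5, 1 })); // false
    }
}
```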
Field Summary
Fields inherited from class weka.classifiers.AbstractClassifier
BATCH_SIZE_DEFAULT, NUM_DECIMAL_PLACES_DEFAULT -
Constructor Summary
Constructors -
Method Summary
Modifier and Type  Method  Description
void  buildClassifier(Instances instances)  Builds Ripper in the order of class frequencies.
String  checkErrorRateTipText()  Returns the tip text for this property.
String  debugTipText()  Returns the tip text for this property.
double[]  distributionForInstance(Instance datum)  Classifies the test instance with the rule learner and provides the class distribution.
Enumeration<String>  enumerateMeasures()  Returns an enumeration of the additional measure names.
String  foldsTipText()  Returns the tip text for this property.
Capabilities  getCapabilities()  Returns default capabilities of the classifier.
boolean  getCheckErrorRate()  Gets whether the error rate check is included in the stopping criterion.
boolean  getDebug()  Gets whether debug information is output to the console.
int  getFolds()  Gets the number of folds.
double  getMeasure(String additionalMeasureName)  Returns the value of the named measure.
double  getMinNo()  Gets the minimum total weight of the instances in a rule.
int  getOptimizations()  Gets the number of optimization runs.
String[]  getOptions()  Gets the current settings of the Classifier.
String  getRevision()  Returns the revision string.
ArrayList<Rule>  getRuleset()  Gets the ruleset generated by Ripper.
RuleStats  getRuleStats(int pos)  Gets the statistics of the ruleset in the given position.
long  getSeed()  Gets the current seed value used to randomize the data.
TechnicalInformation  getTechnicalInformation()  Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
boolean  getUsePruning()  Gets whether pruning is performed.
String  globalInfo()  Returns a string describing the classifier.
Enumeration<Option>  listOptions()  Returns an enumeration describing the available options.
static void  main(String[] args)  Main method.
String  minNoTipText()  Returns the tip text for this property.
String  optimizationsTipText()  Returns the tip text for this property.
String  seedTipText()  Returns the tip text for this property.
void  setCheckErrorRate(boolean d)  Sets whether the error rate check is included in the stopping criterion.
void  setDebug(boolean d)  Sets whether debug information is output to the console.
void  setFolds(int fold)  Sets the number of folds to use.
void  setMinNo(double m)  Sets the minimum total weight of the instances in a rule.
void  setOptimizations(int run)  Sets the number of optimization runs.
void  setOptions(String[] options)  Parses a given list of options.
void  setSeed(long s)  Sets the seed value used to randomize the data.
void  setUsePruning(boolean d)  Sets whether pruning is performed.
String  toString()  Prints all the rules of the rule learner.
String  usePruningTipText()  Returns the tip text for this property.
Methods inherited from class weka.classifiers.AbstractClassifier
batchSizeTipText, classifyInstance, distributionsForInstances, doNotCheckCapabilitiesTipText, forName, getBatchSize, getDoNotCheckCapabilities, getNumDecimalPlaces, implementsMoreEfficientBatchPrediction, makeCopies, makeCopy, numDecimalPlacesTipText, postExecution, preExecution, run, runClassifier, setBatchSize, setDoNotCheckCapabilities, setNumDecimalPlaces
-
Constructor Details
-
JRip
public JRip()
-
-
Method Details
-
globalInfo
Returns a string describing the classifier.
- Returns:
- a description suitable for displaying in the explorer/experimenter gui
-
getTechnicalInformation
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
- Specified by:
getTechnicalInformation in interface TechnicalInformationHandler
- Returns:
- the technical information about this class
-
listOptions
Returns an enumeration describing the available options. Valid options are:
-F number
The number of folds for reduced error pruning. One fold is used as the pruning set. (default: 3)
-N number
The minimal total weight of the instances within a split. (default: 2)
-O number
The number of optimization runs. (default: 2)
-D
Turn on debug mode.
-S number
The seed for randomization used in Ripper. (default: 1)
-E
Do NOT check whether the error rate >= 0.5 in the stopping criterion. (default: check)
-P
Do NOT use pruning. (default: use pruning)
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class AbstractClassifier
- Returns:
- an enumeration of all the available options
-
setOptions
Parses a given list of options. Valid options are:
-F <number of folds> Set the number of folds for REP; one fold is used as the pruning set. (default: 3)
-N <min. weights> Set the minimal total weight of the instances within a split. (default: 2.0)
-O <number of runs> Set the number of optimization runs. (default: 2)
-D Turn on debug mode. (default: false)
-S <seed> The seed for randomization. (default: 1)
-E Do NOT check whether the error rate >= 0.5 in the stopping criterion. (default: check)
-P Do NOT use pruning. (default: use pruning)
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class AbstractClassifier
- Parameters:
options - the list of options as an array of strings
- Throws:
Exception - if an option is not supported
-
getOptions
Gets the current settings of the Classifier.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class AbstractClassifier
- Returns:
- an array of strings suitable for passing to setOptions
-
enumerateMeasures
Returns an enumeration of the additional measure names.
- Specified by:
enumerateMeasures in interface AdditionalMeasureProducer
- Returns:
- an enumeration of the measure names
-
getMeasure
Returns the value of the named measure.
- Specified by:
getMeasure in interface AdditionalMeasureProducer
- Parameters:
additionalMeasureName - the name of the measure to query for its value
- Returns:
- the value of the named measure
- Throws:
IllegalArgumentException - if the named measure is not supported
-
foldsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setFolds
public void setFolds(int fold)
Sets the number of folds to use.
- Parameters:
fold - the number of folds
-
getFolds
public int getFolds()
Gets the number of folds.
- Returns:
- the number of folds
-
minNoTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMinNo
public void setMinNo(double m)
Sets the minimum total weight of the instances in a rule.
- Parameters:
m - the minimum total weight of the instances in a rule
-
getMinNo
public double getMinNo()
Gets the minimum total weight of the instances in a rule.
- Returns:
- the minimum total weight of the instances in a rule
-
seedTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setSeed
public void setSeed(long s)
Sets the seed value to use in randomizing the data.
- Parameters:
s - the new seed value
-
getSeed
public long getSeed()
Gets the current seed value to use in randomizing the data.
- Returns:
- the seed value
-
optimizationsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setOptimizations
public void setOptimizations(int run)
Sets the number of optimization runs.
- Parameters:
run - the number of optimization runs
-
getOptimizations
public int getOptimizations()
Gets the number of optimization runs.
- Returns:
- the number of optimization runs
-
debugTipText
Returns the tip text for this property.
- Overrides:
debugTipText in class AbstractClassifier
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setDebug
public void setDebug(boolean d)
Sets whether debug information is output to the console.
- Overrides:
setDebug in class AbstractClassifier
- Parameters:
d - whether debug information is output to the console
-
getDebug
public boolean getDebug()
Gets whether debug information is output to the console.
- Overrides:
getDebug in class AbstractClassifier
- Returns:
- whether debug information is output to the console
-
checkErrorRateTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setCheckErrorRate
public void setCheckErrorRate(boolean d)
Sets whether the error rate check is included in the stopping criterion.
- Parameters:
d - whether to include the error rate check in the stopping criterion
-
getCheckErrorRate
public boolean getCheckErrorRate()
Gets whether the error rate check is included in the stopping criterion.
- Returns:
- true if the error rate check is included in the stopping criterion
-
usePruningTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setUsePruning
public void setUsePruning(boolean d)
Sets whether pruning is performed.
- Parameters:
d - whether pruning is performed
-
getUsePruning
public boolean getUsePruning()
Gets whether pruning is performed.
- Returns:
- true if pruning is performed
-
getRuleset
Gets the ruleset generated by Ripper.
- Returns:
- the ruleset
-
getRuleStats
Gets the statistics of the ruleset in the given position.
- Parameters:
pos - the position of the stats, assumed to be valid
- Returns:
- the statistics of the ruleset in the given position
-
getCapabilities
Returns default capabilities of the classifier.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Specified by:
getCapabilities in interface Classifier
- Overrides:
getCapabilities in class AbstractClassifier
- Returns:
- the capabilities of this classifier
- See Also:
-
buildClassifier
Builds Ripper in the order of class frequencies. For each class, it is built in two stages: building and optimization.
- Specified by:
buildClassifier in interface Classifier
- Parameters:
instances - the training data
- Throws:
Exception - if the classifier can't be built successfully
-
distributionForInstance
Classifies the test instance with the rule learner and provides the class distribution.
- Specified by:
distributionForInstance in interface Classifier
- Overrides:
distributionForInstance in class AbstractClassifier
- Parameters:
datum - the instance to be classified
- Returns:
- the distribution
-
toString
Prints all the rules of the rule learner.
getRevision
Returns the revision string.
- Specified by:
getRevision in interface RevisionHandler
- Overrides:
getRevision in class AbstractClassifier
- Returns:
- the revision
-
main
Main method.
- Parameters:
args - the options for the classifier
-