Package weka.classifiers.rules
Class JRip
java.lang.Object
weka.classifiers.AbstractClassifier
weka.classifiers.rules.JRip
- All Implemented Interfaces:
Serializable, Cloneable, Classifier, AdditionalMeasureProducer, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, TechnicalInformationHandler, WeightedInstancesHandler
public class JRip
extends AbstractClassifier
implements AdditionalMeasureProducer, WeightedInstancesHandler, TechnicalInformationHandler
This class implements a propositional rule learner,
Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which was
proposed by William W. Cohen as an optimized version of IREP.
The algorithm is briefly described as follows:
Initialize RS = {}, and for each class, from the least prevalent to the most frequent, DO:
1. Building stage:
Repeat 1.1 and 1.2 until the description length (DL) of the ruleset and examples is 64 bits greater than the smallest DL found so far, there are no positive examples left, or the error rate is >= 50%.
1.1. Grow phase:
Grow one rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e. 100% accurate). The procedure tries every possible value of each attribute and selects the condition with the highest information gain: p(log(p/t) - log(P/T)).
1.2. Prune phase:
Incrementally prune each rule, allowing the pruning of any final sequence of antecedents. The stated pruning metric is (p-n)/(p+n); since this equals 2p/(p+n) - 1, this implementation simply uses p/(p+n) (actually (p+1)/(p+n+2), so that the value is 0.5 when p+n is 0). Both the grow and prune metrics are illustrated in the sketch following the pseudocode below.
2. Optimization stage:
After generating the initial ruleset {Ri}, generate and prune two variants of each rule Ri from randomized data using procedures 1.1 and 1.2. One variant is generated from an empty rule, while the other is generated by greedily adding antecedents to the original rule; here the pruning metric is (TP+TN)/(P+N). The smallest possible DL for each variant and for the original rule is then computed, and the variant with the minimal DL is selected as the final representative of Ri in the ruleset. After all the rules in {Ri} have been examined, if there are still residual positive examples, more rules are generated for them using the building stage again.
3. Delete any rule from the ruleset that would increase the DL of the whole ruleset if it were kept, and add the resulting ruleset to RS.
ENDDO
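
Both selection metrics above can be stated compactly in code. The following is a minimal sketch for illustration only (not Weka's internal implementation), assuming p and n are the positive and negative examples covered by the candidate rule, t = p + n, and P, T are the corresponding counts before the condition is added; base-2 logarithms are assumed here, though the base only rescales the gain:

public final class RipperMetricsSketch {

    // Grow-phase information gain: p * (log(p/t) - log(P/T)).
    static double growGain(double p, double t, double P, double T) {
        return p * (log2(p / t) - log2(P / T));
    }

    // Prune-phase metric: (p + 1) / (p + n + 2), the smoothed form of
    // p / (p + n); it evaluates to 0.5 when p + n = 0, as described above.
    static double pruneMetric(double p, double n) {
        return (p + 1) / (p + n + 2);
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        // Toy counts: before the condition, 30 of 100 covered examples are
        // positive; after, 20 of 25 are. The gain is positive, so the
        // greedy grow phase would keep this condition.
        System.out.println(growGain(20, 25, 30, 100)); // ~ 28.3
        System.out.println(pruneMetric(20, 5));        // ~ 0.78
    }
}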
Note that there appear to be two bugs in the original RIPPER program that slightly affect ruleset size and accuracy. This implementation avoids these bugs and thus differs slightly from Cohen's original implementation. Even after fixing the bugs, since the order of classes with the same frequency is undefined in RIPPER, there remain trivial differences between this implementation and the original, especially for the audiology data in the UCI repository, where there are many classes with few instances.
For details, see:
William W. Cohen: Fast Effective Rule Induction. In: Twelfth International Conference on Machine Learning, 115-123, 1995.
PS. We have compared this implementation with the original RIPPER implementation with respect to accuracy, ruleset size, and running time on both the artificial data "ab+bcd+defg" and UCI datasets. In all these respects it appears quite comparable to the original RIPPER implementation. However, memory-consumption optimization was not considered in this implementation.
BibTeX:
@inproceedings{Cohen1995,
  author = {William W. Cohen},
  booktitle = {Twelfth International Conference on Machine Learning},
  pages = {115-123},
  publisher = {Morgan Kaufmann},
  title = {Fast Effective Rule Induction},
  year = {1995}
}

Valid options are:
-F <number of folds> Set the number of folds for reduced-error pruning; one fold is used as the pruning set. (default: 3)
-N <min. weights> Set the minimal total weight of the instances within a split. (default: 2.0)
-O <number of runs> Set the number of optimization runs. (default: 2)
-D Turn on debug mode. (default: false)
-S <seed> The seed for randomization. (default: 1)
-E Do NOT check whether the error rate >= 0.5 in the stopping criterion. (default: check)
-P Do NOT use pruning. (default: use pruning)
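
For reference, here is a minimal usage sketch (not part of the original documentation) showing the programmatic equivalents of these options; the dataset path "iris.arff" is a placeholder, and the setter values shown simply restate the defaults above:

import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipDemo {
    public static void main(String[] args) throws Exception {
        // Load any ARFF dataset; "iris.arff" is a placeholder path.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        JRip rip = new JRip();
        rip.setFolds(3);          // -F: folds for reduced-error pruning
        rip.setMinNo(2.0);        // -N: minimal total instance weight per split
        rip.setOptimizations(2);  // -O: number of optimization runs
        rip.setSeed(1);           // -S: randomization seed
        rip.setUsePruning(true);  // -P on the command line disables pruning

        rip.buildClassifier(data);
        System.out.println(rip);  // toString() prints the learned rules
    }
}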
- Version:
- $Revision: 15519 $
- Author:
- Xin Xu (xx5@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
- See Also:
-
Nested Class Summary
Modifier and Type | Class | Description
class | JRip.Antd | The single antecedent in the rule, which is composed of an attribute and the corresponding value.
class | JRip.NominalAntd | The antecedent with nominal attribute.
class | JRip.NumericAntd | The antecedent with numeric attribute.
class | JRip.RipperRule | This class implements a single rule that predicts the specified class.
-
Field Summary
Fields inherited from class weka.classifiers.AbstractClassifier
BATCH_SIZE_DEFAULT, NUM_DECIMAL_PLACES_DEFAULT
-
Constructor Summary
JRip()
-
Method Summary
Modifier and Type | Method | Description
void | buildClassifier(Instances instances) | Builds Ripper in the order of class frequencies.
String | checkErrorRateTipText() | Returns the tip text for this property.
String | debugTipText() | Returns the tip text for this property.
double[] | distributionForInstance(Instance datum) | Classifies the test instance with the rule learner and provides the class distribution.
Enumeration<String> | enumerateMeasures() | Returns an enumeration of the additional measure names.
String | foldsTipText() | Returns the tip text for this property.
Capabilities | getCapabilities() | Returns default capabilities of the classifier.
boolean | getCheckErrorRate() | Gets whether the error-rate check is included in the stopping criterion.
boolean | getDebug() | Gets whether debug information is output to the console.
int | getFolds() | Gets the number of folds.
double | getMeasure(String additionalMeasureName) | Returns the value of the named measure.
double | getMinNo() | Gets the minimum total weight of the instances in a rule.
int | getOptimizations() | Gets the number of optimization runs.
String[] | getOptions() | Gets the current settings of the Classifier.
String | getRevision() | Returns the revision string.
ArrayList<Rule> | getRuleset() | Gets the ruleset generated by Ripper.
RuleStats | getRuleStats(int pos) | Gets the statistics of the ruleset in the given position.
long | getSeed() | Gets the current seed value to use in randomizing the data.
TechnicalInformation | getTechnicalInformation() | Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., the paper or book this class is based on.
boolean | getUsePruning() | Gets whether pruning is performed.
String | globalInfo() | Returns a string describing the classifier.
Enumeration<Option> | listOptions() | Returns an enumeration describing the available options.
static void | main(String[] args) | Main method.
String | minNoTipText() | Returns the tip text for this property.
String | optimizationsTipText() | Returns the tip text for this property.
String | seedTipText() | Returns the tip text for this property.
void | setCheckErrorRate(boolean d) | Sets whether the error-rate check is included in the stopping criterion.
void | setDebug(boolean d) | Sets whether debug information is output to the console.
void | setFolds(int fold) | Sets the number of folds to use.
void | setMinNo(double m) | Sets the minimum total weight of the instances in a rule.
void | setOptimizations(int run) | Sets the number of optimization runs.
void | setOptions(String[] options) | Parses a given list of options.
void | setSeed(long s) | Sets the seed value to use in randomizing the data.
void | setUsePruning(boolean d) | Sets whether pruning is performed.
String | toString() | Prints all the rules of the rule learner.
String | usePruningTipText() | Returns the tip text for this property.

Methods inherited from class weka.classifiers.AbstractClassifier
batchSizeTipText, classifyInstance, distributionsForInstances, doNotCheckCapabilitiesTipText, forName, getBatchSize, getDoNotCheckCapabilities, getNumDecimalPlaces, implementsMoreEfficientBatchPrediction, makeCopies, makeCopy, numDecimalPlacesTipText, postExecution, preExecution, run, runClassifier, setBatchSize, setDoNotCheckCapabilities, setNumDecimalPlaces
-
Constructor Details
-
JRip
public JRip()
-
-
Method Details
-
globalInfo
public String globalInfo()
Returns a string describing the classifier.
- Returns:
- a description suitable for displaying in the explorer/experimenter gui
-
getTechnicalInformation
public TechnicalInformation getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., the paper or book this class is based on.
- Specified by:
getTechnicalInformation
in interfaceTechnicalInformationHandler
- Returns:
- the technical information about this class
-
listOptions
public Enumeration<Option> listOptions()
Returns an enumeration describing the available options. Valid options are:

-F number
The number of folds for reduced-error pruning. One fold is used as the pruning set. (default: 3)

-N number
The minimal total weight of the instances within a split. (default: 2)

-O number
The number of optimization runs. (default: 2)

-D
Whether to turn on debug mode.

-S number
The seed for randomization used in Ripper. (default: 1)

-E
Whether NOT to check the error rate >= 0.5 in the stopping criterion. (default: check)

-P
Whether NOT to use pruning. (default: use pruning)

- Specified by:
listOptions
in interfaceOptionHandler
- Overrides:
listOptions
in classAbstractClassifier
- Returns:
- an enumeration of all the available options
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:

-F <number of folds> Set the number of folds for reduced-error pruning; one fold is used as the pruning set. (default: 3)
-N <min. weights> Set the minimal total weight of the instances within a split. (default: 2.0)
-O <number of runs> Set the number of optimization runs. (default: 2)
-D Turn on debug mode. (default: false)
-S <seed> The seed for randomization. (default: 1)
-E Do NOT check whether the error rate >= 0.5 in the stopping criterion. (default: check)
-P Do NOT use pruning. (default: use pruning)
- Specified by:
setOptions
in interfaceOptionHandler
- Overrides:
setOptions
in classAbstractClassifier
- Parameters:
options
- the list of options as an array of strings- Throws:
Exception
- if an option is not supported
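
As a short hedged sketch (the option values are arbitrary examples), the same options can be parsed from a single command-line style string using weka.core.Utils:

import weka.classifiers.rules.JRip;
import weka.core.Utils;

public class JRipOptionsSketch {
    public static void main(String[] args) throws Exception {
        JRip rip = new JRip();
        // Split the raw string into tokens and hand them to setOptions().
        rip.setOptions(Utils.splitOptions("-F 4 -N 2.0 -O 2 -S 42"));
        // getOptions() returns an array suitable for passing back to setOptions().
        System.out.println(Utils.joinOptions(rip.getOptions()));
    }
}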
-
getOptions
public String[] getOptions()
Gets the current settings of the Classifier.
- Specified by:
getOptions
in interfaceOptionHandler
- Overrides:
getOptions
in classAbstractClassifier
- Returns:
- an array of strings suitable for passing to setOptions
-
enumerateMeasures
public Enumeration<String> enumerateMeasures()
Returns an enumeration of the additional measure names.
- Specified by:
enumerateMeasures
in interfaceAdditionalMeasureProducer
- Returns:
- an enumeration of the measure names
-
getMeasure
public double getMeasure(String additionalMeasureName)
Returns the value of the named measure.
- Specified by:
getMeasure
in interfaceAdditionalMeasureProducer
- Parameters:
additionalMeasureName
- the name of the measure to query for its value
- Returns:
- the value of the named measure
- Throws:
IllegalArgumentException
- if the named measure is not supported
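
A hedged sketch of querying the additional measures on a trained model; enumerateMeasures() supplies the valid names (for JRip this should include the rule count), and "iris.arff" is a placeholder path:

import java.util.Enumeration;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipMeasureSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        JRip rip = new JRip();
        rip.buildClassifier(data);

        // Print every additional measure the classifier exposes.
        Enumeration<String> names = rip.enumerateMeasures();
        while (names.hasMoreElements()) {
            String name = names.nextElement();
            System.out.println(name + " = " + rip.getMeasure(name));
        }
    }
}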
-
foldsTipText
public String foldsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setFolds
public void setFolds(int fold)
Sets the number of folds to use.
- Parameters:
fold
- the number of folds
-
getFolds
public int getFolds()
Gets the number of folds.
- Returns:
- the number of folds
-
minNoTipText
public String minNoTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMinNo
public void setMinNo(double m)
Sets the minimum total weight of the instances in a rule.
- Parameters:
m
- the minimum total weight of the instances in a rule
-
getMinNo
public double getMinNo()
Gets the minimum total weight of the instances in a rule.
- Returns:
- the minimum total weight of the instances in a rule
-
seedTipText
public String seedTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setSeed
public void setSeed(long s)
Sets the seed value to use in randomizing the data.
- Parameters:
s
- the new seed value
-
getSeed
public long getSeed()
Gets the current seed value to use in randomizing the data.
- Returns:
- the seed value
-
optimizationsTipText
public String optimizationsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setOptimizations
public void setOptimizations(int run)
Sets the number of optimization runs.
- Parameters:
run
- the number of optimization runs
-
getOptimizations
public int getOptimizations()
Gets the number of optimization runs.
- Returns:
- the number of optimization runs
-
debugTipText
public String debugTipText()
Returns the tip text for this property.
- Overrides:
debugTipText
in classAbstractClassifier
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setDebug
public void setDebug(boolean d)
Sets whether debug information is output to the console.
- Overrides:
setDebug
in classAbstractClassifier
- Parameters:
d
- whether debug information is output to the console
-
getDebug
public boolean getDebug()
Gets whether debug information is output to the console.
- Overrides:
getDebug
in classAbstractClassifier
- Returns:
- whether debug information is output to the console
-
checkErrorRateTipText
public String checkErrorRateTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setCheckErrorRate
public void setCheckErrorRate(boolean d)
Sets whether the error-rate check is included in the stopping criterion.
- Parameters:
d
- whether the error-rate check is included in the stopping criterion
-
getCheckErrorRate
public boolean getCheckErrorRate()
Gets whether the error-rate check is included in the stopping criterion.
- Returns:
- true if the error-rate check is included in the stopping criterion
-
usePruningTipText
public String usePruningTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setUsePruning
public void setUsePruning(boolean d)
Sets whether pruning is performed.
- Parameters:
d
- whether pruning is performed
-
getUsePruning
public boolean getUsePruning()
Gets whether pruning is performed.
- Returns:
- true if pruning is performed
-
getRuleset
public ArrayList<Rule> getRuleset()
Gets the ruleset generated by Ripper.
- Returns:
- the ruleset
-
getRuleStats
public RuleStats getRuleStats(int pos)
Gets the statistics of the ruleset in the given position.
- Parameters:
pos
- the position of the stats, assumed to be valid
- Returns:
- the statistics of the ruleset in the given position
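
A minimal hedged sketch of inspecting the learned ruleset and its statistics after training ("iris.arff" again stands in for a real dataset):

import weka.classifiers.rules.JRip;
import weka.classifiers.rules.RuleStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipRulesetSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        JRip rip = new JRip();
        rip.buildClassifier(data);

        // All induced rules, in the order they were learned.
        System.out.println("Number of rules: " + rip.getRuleset().size());

        // Statistics object for the ruleset at position 0.
        RuleStats stats = rip.getRuleStats(0);
        System.out.println("Rules in first stats object: "
            + stats.getRuleset().size());
    }
}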
-
getCapabilities
public Capabilities getCapabilities()
Returns default capabilities of the classifier.
- Specified by:
getCapabilities
in interfaceCapabilitiesHandler
- Specified by:
getCapabilities
in interfaceClassifier
- Overrides:
getCapabilities
in classAbstractClassifier
- Returns:
- the capabilities of this classifier
- See Also:
-
buildClassifier
public void buildClassifier(Instances instances) throws Exception
Builds Ripper in the order of class frequencies. For each class, the ruleset is built in two stages: building and optimization.
- Specified by:
buildClassifier
in interfaceClassifier
- Parameters:
instances
- the training data
- Throws:
Exception
- if classifier can't be built successfully
-
distributionForInstance
public double[] distributionForInstance(Instance datum)
Classifies the test instance with the rule learner and provides the class distribution.
- Specified by:
distributionForInstance
in interfaceClassifier
- Overrides:
distributionForInstance
in classAbstractClassifier
- Parameters:
datum
- the instance to be classified
- Returns:
- the distribution
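
A hedged usage sketch: after training, the per-class probability distribution for a single instance can be printed as follows ("iris.arff" is a placeholder path):

import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipPredictSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        JRip rip = new JRip();
        rip.buildClassifier(data);

        // One probability per class value for the first instance.
        double[] dist = rip.distributionForInstance(data.instance(0));
        for (int i = 0; i < dist.length; i++) {
            System.out.printf("%s: %.3f%n", data.classAttribute().value(i), dist[i]);
        }
    }
}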
-
toString
public String toString()
Prints all the rules of the rule learner.
-
getRevision
public String getRevision()
Returns the revision string.
- Specified by:
getRevision
in interfaceRevisionHandler
- Overrides:
getRevision
in classAbstractClassifier
- Returns:
- the revision
-
main
public static void main(String[] args)
Main method.
- Parameters:
args
- the options for the classifier
-