Package weka.classifiers.rules
Class RuleStats
java.lang.Object
weka.classifiers.rules.RuleStats
 All Implemented Interfaces:
 Serializable, RevisionHandler
This class implements the statistics functions used in the propositional rule
learner, from simple ones such as counting true/false positives/negatives and
filtering data based on the ruleset, to more sophisticated ones such as
MDL calculation and rule variant generation for each rule in the ruleset.
These functions need specific data and a specific ruleset, which are supplied when an object of this class is instantiated.
 Version:
 $Revision: 10153 $
 Author:
 Xin Xu (xx5@cs.waikato.ac.nz)

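Constructor Summary
RuleStats()
 Default constructor
RuleStats(Instances data, ArrayList<Rule> rules)
 Constructor that provides ruleset and data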

Method Summary
Modifier and Type, Method
 Description
void addAndUpdate(Rule lastRule)
 Add a rule to the ruleset and update the stats.
void cleanUp()
 Frees up memory after classifier has been built.
double combinedDL(double expFPRate, double predicted)
 Compute the combined DL of the ruleset in this class, i.e. theory DL and data DL.
void countData()
 Filter the data according to the ruleset and compute the basic stats: coverage/uncoverage, true/false positives/negatives of each rule.
void countData(int index, Instances uncovered, double[][] prevRuleStats)
 Count data from the position index in the ruleset, assuming that the given data are not covered by the rules in positions 0...(index-1) and that the statistics of these rules are provided. This procedure is typically useful when a temporary RuleStats object is constructed in order to efficiently calculate the relative DL of the rule in position index, so nothing else is needed.
static double dataDL(double expFPOverErr, double cover, double uncover, double fp, double fn)
 The description length of data given the parameters of the data based on the ruleset.
Instances getData()
 Get the data of the stats.
double[] getDistributions(int index)
 Get the class distribution predicted by the rule in the given position.
Instances[] getFiltered(int index)
 Get the data after filtering the given rule.
String getRevision()
 Returns the revision string.
ArrayList<Rule> getRuleset()
 Get the ruleset of the stats.
int getRulesetSize()
 Get the size of the ruleset in the stats.
double[] getSimpleStats(int index)
 Get the simple stats of one rule, including 6 parameters: 0: coverage; 1: uncoverage; 2: true positives; 3: true negatives; 4: false positives; 5: false negatives.
double minDataDLIfDeleted(int index, double expFPRate, boolean checkErr)
 Compute the minimal data description length of the ruleset if the rule in the given position is deleted. The min_data_DL_if_deleted = data_DL_if_deleted - potential
double minDataDLIfExists(int index, double expFPRate, boolean checkErr)
 Compute the minimal data description length of the ruleset if the rule in the given position is NOT deleted. The min_data_DL_if_n_deleted = data_DL_if_n_deleted - potential
static double numAllConditions(Instances data)
 Compute the number of all possible conditions that could appear in a rule of a given data.
static final Instances[] partition(Instances data, int numFolds)
 Partition the data into 2, the first of which has (numFolds-1)/numFolds of the data and the second has 1/numFolds of the data.
double potential(int index, double expFPOverErr, double[] rulesetStat, double[] ruleStat, boolean checkErr)
 Calculate the potential to decrease DL of the ruleset, i.e. the possible DL that could be decreased by deleting the rule whose index and simple statistics are given.
void reduceDL(double expFPRate, boolean checkErr)
 Try to reduce the DL of the ruleset by testing removing the rules one by one in reverse order and update all the stats.
double relativeDL(int index, double expFPRate, boolean checkErr)
 The description length (DL) of the ruleset relative to if the rule in the given position is deleted, which is obtained by: MDL if the rule exists - MDL if the rule does not exist. Note the minimal possible DL of the ruleset is calculated (i.e. some other rules may also be deleted) instead of the DL of the current ruleset.
void removeLast()
 Remove the last rule in the ruleset as well as its stats.
static Instances rmCoveredBySuccessives(Instances data, ArrayList<Rule> rules, int index)
 Static utility function to count the data covered by the rules after the given index in the given rules, and then remove them.
void setData(Instances data)
 Set the data of the stats, overwriting the old one if any.
void setMDLTheoryWeight(double weight)
 Set the weight of theory in MDL calculation.
void setNumAllConds(double total)
 Set the number of all conditions that could appear in a rule in this RuleStats object. If the number set is smaller than 0 (typically -1), it is calculated from the stored data.
void setRuleset(ArrayList<Rule> rules)
 Set the ruleset of the stats, overwriting the old one if any.
static final Instances stratify(Instances data, int folds, Random rand)
 Stratify the given data into the given number of bags based on the class values.
static double subsetDL(double t, double k, double p)
 Subset description length: S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p). For details see Quinlan: "MDL and categorical theories (Continued)", ML95
double theoryDL(int index)
 The description length of the theory for a given rule.

Constructor Details

RuleStats
public RuleStats()
Default constructor

RuleStats
public RuleStats(Instances data, ArrayList<Rule> rules)
Constructor that provides ruleset and data
 Parameters:
data
 the data
rules
 the ruleset


Method Details

cleanUp
public void cleanUp()
Frees up memory after classifier has been built.

setNumAllConds
public void setNumAllConds(double total)
Set the number of all conditions that could appear in a rule in this RuleStats object. If the number set is smaller than 0 (typically -1), it is calculated from the stored data.
 Parameters:
total
 the set number

setData
public void setData(Instances data)
Set the data of the stats, overwriting the old one if any
 Parameters:
data
 the data to be set

getData
public Instances getData()
Get the data of the stats
 Returns:
 the data

setRuleset
public void setRuleset(ArrayList<Rule> rules)
Set the ruleset of the stats, overwriting the old one if any
 Parameters:
rules
 the set of rules to be set

getRuleset
public ArrayList<Rule> getRuleset()
Get the ruleset of the stats
 Returns:
 the set of rules

getRulesetSize
public int getRulesetSize()
Get the size of the ruleset in the stats
 Returns:
 the size of ruleset

getSimpleStats
public double[] getSimpleStats(int index)
Get the simple stats of one rule, including 6 parameters: 0: coverage; 1: uncoverage; 2: true positives; 3: true negatives; 4: false positives; 5: false negatives
 Parameters:
index
 the index of the rule
 Returns:
 the stats
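The index layout listed for getSimpleStats can be made explicit with named constants. The following is a minimal sketch and not part of RuleStats: the class name, the constant names and the helper errorOnCovered are hypothetical, and it assumes that coverage equals true positives plus false positives.

```java
// Hypothetical helper illustrating the 6-element layout returned by getSimpleStats.
class SimpleStatsLayout {
    static final int COVERAGE = 0;
    static final int UNCOVERAGE = 1;
    static final int TRUE_POS = 2;
    static final int TRUE_NEG = 3;
    static final int FALSE_POS = 4;
    static final int FALSE_NEG = 5;

    /** Fraction of the covered weight that the rule classifies wrongly (FP / coverage). */
    static double errorOnCovered(double[] stats) {
        // Guard against an empty rule: no covered data means no measurable error.
        return stats[COVERAGE] > 0.0 ? stats[FALSE_POS] / stats[COVERAGE] : 0.0;
    }
}
```

With such constants, a caller can write stats[SimpleStatsLayout.FALSE_POS] instead of stats[4] when reading the array.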

getFiltered
public Instances[] getFiltered(int index)
Get the data after filtering the given rule
 Parameters:
index
 the index of the rule
 Returns:
 the data covered and uncovered by the rule

getDistributions
public double[] getDistributions(int index)
Get the class distribution predicted by the rule in the given position
 Parameters:
index
 the position index of the rule
 Returns:
 the class distributions

setMDLTheoryWeight
public void setMDLTheoryWeight(double weight)
Set the weight of theory in MDL calculation
 Parameters:
weight
 the weight to be set

numAllConditions
public static double numAllConditions(Instances data)
Compute the number of all possible conditions that could appear in a rule of a given data. For nominal attributes, it's the number of values that could appear; for numeric attributes, it's the number of values * 2, i.e. <= and >= are counted as different possible conditions.
 Parameters:
data
 the given data
 Returns:
 number of all conditions of the data
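The counting rule described for numAllConditions can be illustrated without Weka. This is only a sketch under stated assumptions: the class name and the two int[] inputs (values per nominal attribute, distinct values per numeric attribute) are hypothetical stand-ins for what the real method reads from an Instances object.

```java
// Hypothetical stand-alone version of the counting rule; it does not use weka.core.Instances.
class ConditionCountSketch {
    /**
     * @param nominalValueCounts number of values of each nominal attribute
     * @param numericValueCounts number of values of each numeric attribute
     */
    static double numAllConditions(int[] nominalValueCounts, int[] numericValueCounts) {
        double total = 0.0;
        for (int n : nominalValueCounts) {
            total += n;          // one possible condition per nominal value
        }
        for (int n : numericValueCounts) {
            total += 2.0 * n;    // "<=" and ">=" count as separate conditions
        }
        return total;
    }
}
```

For example, two nominal attributes with 3 and 2 values plus one numeric attribute with 4 values would give 3 + 2 + 2*4 = 13 possible conditions.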

countData
public void countData()
Filter the data according to the ruleset and compute the basic stats: coverage/uncoverage, true/false positives/negatives of each rule

countData
public void countData(int index, Instances uncovered, double[][] prevRuleStats)
Count data from the position index in the ruleset, assuming that the given data are not covered by the rules in positions 0...(index-1) and that the statistics of these rules are provided.
This procedure is typically useful when a temporary RuleStats object is constructed in order to efficiently calculate the relative DL of the rule in position index, so nothing else is needed.
 Parameters:
index
 the given position
uncovered
 the data not covered by rules before index
prevRuleStats
 the provided stats of previous rules
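The sequential filtering that countData describes, where each rule only sees the data left uncovered by earlier rules, can be sketched as follows. This is an illustration under assumptions rather than the class's implementation: the row layout, the use of Predicate as a rule, and the single target class are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: rows are {attributeValue, isTargetClass}, rules are predicates over rows.
class CountDataSketch {
    /** Returns one 6-element stats array per rule: cover, uncover, TP, TN, FP, FN. */
    static List<double[]> countData(List<int[]> data, List<Predicate<int[]>> rules) {
        List<double[]> stats = new ArrayList<>();
        List<int[]> remaining = new ArrayList<>(data);
        for (Predicate<int[]> rule : rules) {
            double[] s = new double[6];
            List<int[]> uncovered = new ArrayList<>();
            for (int[] row : remaining) {
                boolean positive = row[1] == 1;
                if (rule.test(row)) {
                    s[0]++;                          // coverage
                    if (positive) s[2]++; else s[4]++; // TP or FP
                } else {
                    s[1]++;                          // uncoverage
                    uncovered.add(row);
                    if (positive) s[5]++; else s[3]++; // FN or TN
                }
            }
            stats.add(s);
            remaining = uncovered; // the next rule only sees what is still uncovered
        }
        return stats;
    }
}
```

Because the uncovered data is carried forward, the stats for a rule at position index depend only on the data not covered by rules 0...(index-1), which is the situation the three-argument countData assumes as its starting point.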

addAndUpdate
public void addAndUpdate(Rule lastRule)
Add a rule to the ruleset and update the stats
 Parameters:
lastRule
 the rule to be added

subsetDL
public static double subsetDL(double t, double k, double p)
Subset description length:
S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p)
For details see Quinlan: "MDL and categorical theories (Continued)", ML95
 Parameters:
t
 the number of elements in a known set
k
 the number of elements in a subset
p
 the expected proportion of subset known by recipient
 Returns:
 the subset description length
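As a worked illustration of the subset description length, the sketch below reads the formula as S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p), with t taken as the size of the known set. The class name is hypothetical, and the guard for p = 0 is an assumption added to avoid evaluating log2(0); the sketch assumes 0 <= p < 1.

```java
// Hypothetical stand-alone sketch of the subset description length formula.
class SubsetDLSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p), in bits; assumes 0 <= p < 1. */
    static double subsetDL(double t, double k, double p) {
        // Assumption: when p is 0 the first term is treated as 0 instead of evaluating log2(0).
        double bits = p > 0.0 ? -k * log2(p) : 0.0;
        bits -= (t - k) * log2(1.0 - p);
        return bits;
    }
}
```

For t = 10, k = 5 and p = 0.5, each of the 10 elements costs one bit, so the result is 10 bits.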

theoryDL
public double theoryDL(int index)
The description length of the theory for a given rule. Computed as:
0.5 * [||k|| + S(t, k, k/t)]
where k is the number of antecedents of the rule; t is the total possible antecedents that could appear in a rule; ||k|| is the universal prior for k, log2*(k); and S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p) is the subset encoding length.
For details see Quinlan: "MDL and categorical theories (Continued)", ML95
 Parameters:
index
 the index of the given rule (assuming correct)
 Returns:
 the theory DL, weighted if weight != 1.0
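The theory DL formula, 0.5 * [universal prior for k + S(t, k, k/t)], can be sketched as below. Several points are assumptions not stated on this page: the universal prior log2*(k) is approximated by log2(k) + 2*log2(log2(k)) for k > 1, an empty rule costs 0 bits, and the weight is applied as a plain multiplier. The class and parameter names are hypothetical, and the sketch assumes k < t.

```java
// Hypothetical stand-alone sketch of the theory description length of a single rule.
class TheoryDLSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Subset encoding length S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p). */
    static double subsetDL(double t, double k, double p) {
        double bits = p > 0.0 ? -k * log2(p) : 0.0;
        return bits - (t - k) * log2(1.0 - p);
    }

    /**
     * @param k      number of antecedents of the rule
     * @param t      number of all possible antecedents (must exceed k)
     * @param weight weight of the theory in the MDL calculation (1.0 for unweighted)
     */
    static double theoryDL(double k, double t, double weight) {
        if (k == 0) {
            return 0.0; // assumption: a rule without antecedents costs nothing
        }
        double bits = log2(k);
        if (k > 1) {
            bits += 2.0 * log2(bits); // assumed approximation of log2*(k)
        }
        bits += subsetDL(t, k, k / t);
        return weight * 0.5 * bits; // 0.5 is the factor from the formula
    }
}
```

Under these assumptions a one-antecedent rule has a zero prior term, so its theory DL reduces to half of S(t, 1, 1/t).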

dataDL
public static double dataDL(double expFPOverErr, double cover, double uncover, double fp, double fn)
The description length of data given the parameters of the data based on the ruleset.
For details see Quinlan: "MDL and categorical theories (Continued)", ML95
 Parameters:
expFPOverErr
 expected FP/(FP+FN)
cover
 coverage
uncover
 uncoverage
fp
 False Positive
fn
 False Negative
 Returns:
 the description length

potential
public double potential(int index, double expFPOverErr, double[] rulesetStat, double[] ruleStat, boolean checkErr)
Calculate the potential to decrease DL of the ruleset, i.e. the possible DL that could be decreased by deleting the rule whose index and simple statistics are given. If there is no potential (i.e. potential <= 0 && error rate < 0.5), it returns NaN.
The procedure is copied from the original RIPPER implementation and is quite bizarre: when testing each rule, it no longer recursively updates the stats of the following rules, which means it assumes that after deletion no data is covered by the following rules (or it regards the deleted rule as the last rule). Reasonable assumption?
 Parameters:
index
 the index of the rule in m_Ruleset to be deleted
expFPOverErr
 expected FP/(FP+FN)
rulesetStat
 the simple statistics of the ruleset, updated if the rule should be deleted
ruleStat
 the simple statistics of the rule to be deleted
checkErr
 whether to check if the error rate is >= 0.5
 Returns:
 the potential DL that could be decreased

minDataDLIfDeleted
public double minDataDLIfDeleted(int index, double expFPRate, boolean checkErr)
Compute the minimal data description length of the ruleset if the rule in the given position is deleted.
The min_data_DL_if_deleted = data_DL_if_deleted - potential
 Parameters:
index
 the index of the rule in question
expFPRate
 expected FP/(FP+FN), used in dataDL calculation
checkErr
 whether to check if the error rate is >= 0.5
 Returns:
 the minDataDL

minDataDLIfExists
public double minDataDLIfExists(int index, double expFPRate, boolean checkErr)
Compute the minimal data description length of the ruleset if the rule in the given position is NOT deleted.
The min_data_DL_if_n_deleted = data_DL_if_n_deleted - potential
 Parameters:
index
 the index of the rule in question
expFPRate
 expected FP/(FP+FN), used in dataDL calculation
checkErr
 whether to check if the error rate is >= 0.5
 Returns:
 the minDataDL

relativeDL
public double relativeDL(int index, double expFPRate, boolean checkErr)
The description length (DL) of the ruleset relative to if the rule in the given position is deleted, which is obtained by:
MDL if the rule exists - MDL if the rule does not exist
Note the minimal possible DL of the ruleset is calculated (i.e. some other rules may also be deleted) instead of the DL of the current ruleset.
 Parameters:
index
 the given position of the rule in question (assuming correct)
expFPRate
 expected FP/(FP+FN), used in dataDL calculation
checkErr
 whether to check if the error rate is >= 0.5
 Returns:
 the relative DL

reduceDL
public void reduceDL(double expFPRate, boolean checkErr)
Try to reduce the DL of the ruleset by testing removing the rules one by one in reverse order and update all the stats
 Parameters:
expFPRate
 expected FP/(FP+FN), used in dataDL calculation
checkErr
 whether to check if the error rate is >= 0.5

removeLast
public void removeLast()
Remove the last rule in the ruleset as well as its stats. It might be useful when the last rule was added for testing purposes and then the test failed

rmCoveredBySuccessives
public static Instances rmCoveredBySuccessives(Instances data, ArrayList<Rule> rules, int index)
Static utility function to count the data covered by the rules after the given index in the given rules, and then remove them. It returns the data not covered by the successive rules.
 Parameters:
data
 the data to be processed
rules
 the ruleset
index
 the given index
 Returns:
 the data after processing

stratify
public static final Instances stratify(Instances data, int folds, Random rand)
Stratify the given data into the given number of bags based on the class values. It differs from Instances.stratify(int fold) in that, before stratification, it sorts the instances according to the class order in the header file. It assumes no missing values in the class.
 Parameters:
data
 the given data
folds
 the given number of folds
rand
 the random object used to randomize the instances
 Returns:
 the stratified instances
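One plausible way to realise the stratification described for stratify is to group instances by class in class order, shuffle within each group, and then deal the concatenated groups out to the folds. The sketch below is an assumption about the approach, not a copy of the method: the class name, the {classIndex, id} row layout and the numClasses parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: each row is {classIndex, id}, with classIndex in [0, numClasses).
class StratifySketch {
    static List<int[]> stratify(List<int[]> data, int numClasses, int folds, Random rand) {
        // Group rows by class, keeping the classes in their declared order.
        List<List<int[]>> bags = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            bags.add(new ArrayList<>());
        }
        for (int[] row : data) {
            bags.get(row[0]).add(row);
        }
        // Shuffle inside each class, then lay the classes out one after another.
        List<int[]> ordered = new ArrayList<>();
        for (List<int[]> bag : bags) {
            Collections.shuffle(bag, rand);
            ordered.addAll(bag);
        }
        // Fold k takes every folds-th row starting at offset k, so each fold samples all classes.
        List<int[]> result = new ArrayList<>();
        for (int k = 0; k < folds; k++) {
            for (int i = k; i < ordered.size(); i += folds) {
                result.add(ordered.get(i));
            }
        }
        return result;
    }
}
```

When the data size is a multiple of the number of folds, consecutive equally sized blocks of the result each carry roughly the same class proportions as the full data.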

combinedDL
public double combinedDL(double expFPRate, double predicted)
Compute the combined DL of the ruleset in this class, i.e. theory DL and data DL. Note this procedure computes the combined DL according to the current status of the ruleset in this class
 Parameters:
expFPRate
 expected FP/(FP+FN), used in dataDL calculation
predicted
 the default classification if ruleset covers null
 Returns:
 the combined DL

partition
public static final Instances[] partition(Instances data, int numFolds)
Partition the data into 2, the first of which has (numFolds-1)/numFolds of the data and the second has 1/numFolds of the data
 Parameters:
data
 the given data
numFolds
 the given number of folds
 Returns:
 the partitioned instances
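The split proportions described for partition can be illustrated on a plain list. This is a sketch only: the class name and the generic List-based signature are hypothetical, and using integer division to place the split point is an assumption about rounding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a (numFolds-1)/numFolds versus 1/numFolds split.
class PartitionSketch {
    static <T> List<List<T>> partition(List<T> data, int numFolds) {
        // Assumption: the split point is rounded down by integer division.
        int split = data.size() * (numFolds - 1) / numFolds;
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(data.subList(0, split)));           // first part
        parts.add(new ArrayList<>(data.subList(split, data.size()))); // remaining part
        return parts;
    }
}
```

With 10 items and numFolds = 5, the first part holds 8 items and the second holds 2.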

getRevision
public String getRevision()
Returns the revision string.
 Specified by:
 getRevision in interface RevisionHandler
 Returns:
 the revision
