Package weka.classifiers.bayes
Class NaiveBayesMultinomialText
java.lang.Object
weka.classifiers.AbstractClassifier
weka.classifiers.bayes.NaiveBayesMultinomialText
- All Implemented Interfaces:
Serializable, Cloneable, Classifier, UpdateableBatchProcessor, UpdateableClassifier, Aggregateable<NaiveBayesMultinomialText>, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, WeightedInstancesHandler
public class NaiveBayesMultinomialText
extends AbstractClassifier
implements UpdateableClassifier, UpdateableBatchProcessor, WeightedInstancesHandler, Aggregateable<NaiveBayesMultinomialText>
Multinomial naive Bayes for text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.
Valid options are:
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with a frequency below this value are ignored. If periodic pruning is turned on, this value is also used to determine which words to remove from the dictionary (default = 3).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-output-debug-info If set, classifier is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, classifier capabilities are not checked before classifier is built (use with caution).
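The -normalize, -norm and -lnorm options together describe a rescaling of each document's word-count vector so that its L-norm equals a target value. The following is a minimal sketch of that idea in plain Java, not Weka's actual implementation. The class and method names are hypothetical, and it assumes a document is held as a map from word to count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: rescales a word-count vector so that its L-norm equals a target norm. */
final class DocLengthNormalizationSketch {

    /** Returns the L-norm (sum of |x|^l) ^ (1/l) of the counts. */
    static double lNorm(Map<String, Double> counts, double l) {
        double sum = 0.0;
        for (double c : counts.values()) {
            sum += Math.pow(Math.abs(c), l);
        }
        return Math.pow(sum, 1.0 / l);
    }

    /** Returns a copy of counts rescaled so that its L-norm equals targetNorm. */
    static Map<String, Double> normalize(Map<String, Double> counts, double targetNorm, double l) {
        double length = lNorm(counts, l);
        Map<String, Double> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            // An all-zero vector cannot be rescaled, so it is copied unchanged.
            double value = (length == 0.0) ? e.getValue() : e.getValue() * targetNorm / length;
            scaled.put(e.getKey(), value);
        }
        return scaled;
    }

    public static void main(String[] args) {
        Map<String, Double> counts = new LinkedHashMap<>();
        counts.put("spam", 3.0);
        counts.put("offer", 4.0);
        // Corresponds to -norm 1.0 -lnorm 2.0: the Euclidean length 5 is scaled down to 1.
        System.out.println(normalize(counts, 1.0, 2.0));
    }
}
```

With the defaults (norm 1.0, L-norm 2.0) every document ends up with unit Euclidean length, so long documents do not dominate the counts purely because of their size.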
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com), Andrew Golightly (acg4@cs.waikato.ac.nz), Bernhard Pfahringer (bernhard@cs.waikato.ac.nz)
-
Field Summary
Fields inherited from class weka.classifiers.AbstractClassifier
BATCH_SIZE_DEFAULT, NUM_DECIMAL_PLACES_DEFAULT
-
Constructor Summary
-
Method Summary
Modifier and Type, Method: Description
NaiveBayesMultinomialText aggregate(NaiveBayesMultinomialText toAggregate): Aggregate an object with this one.
void batchFinished(): Signal that the training data is finished (for now).
void buildClassifier(Instances data): Generates the classifier.
double[] distributionForInstance(Instance instance): Calculates the class membership probabilities for the given test instance.
void finalizeAggregation(): Call to complete the aggregation process.
Capabilities getCapabilities(): Returns default capabilities of the classifier.
double getLNorm(): Get the L-norm used.
boolean getLowercaseTokens(): Get whether to convert all tokens to lowercase.
double getMinWordFrequency(): Get the minimum word frequency.
double getNorm(): Get the norm of the instances.
boolean getNormalizeDocLength(): Get whether to normalize the length of each document.
String[] getOptions(): Gets the current settings of the classifier.
int getPeriodicPruning(): Get how often to prune the dictionary.
String getRevision(): Returns the revision string.
Stemmer getStemmer(): Returns the current stemming algorithm, null if none is used.
StopwordsHandler getStopwordsHandler(): Gets the stopwords handler.
Tokenizer getTokenizer(): Returns the current tokenizer algorithm.
boolean getUseWordFrequencies(): Get whether to use word frequencies rather than binary bag of words representation.
String globalInfo(): Returns a string describing the classifier.
Enumeration<Option> listOptions(): Returns an enumeration describing the available options.
String LNormTipText(): Returns the tip text for this property.
String lowercaseTokensTipText(): Returns the tip text for this property.
static void main(String[] args): Main method for testing this class.
String minWordFrequencyTipText(): Returns the tip text for this property.
String normalizeDocLengthTipText(): Returns the tip text for this property.
String normTipText(): Returns the tip text for this property.
String periodicPruningTipText(): Returns the tip text for this property.
void reset(): Reset the classifier.
void setLNorm(double newLNorm): Set the L-norm to use.
void setLowercaseTokens(boolean l): Set whether to convert all tokens to lowercase.
void setMinWordFrequency(double minFreq): Set the minimum word frequency.
void setNorm(double newNorm): Set the norm of the instances.
void setNormalizeDocLength(boolean norm): Set whether to normalize the length of each document.
void setOptions(String[] options): Parses a given list of options.
void setPeriodicPruning(int p): Set how often to prune the dictionary.
void setStemmer(Stemmer value): Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void setStopwordsHandler(StopwordsHandler value): Sets the stopwords handler to use.
void setTokenizer(Tokenizer value): Sets the tokenizer algorithm to use.
void setUseWordFrequencies(boolean u): Set whether to use word frequencies rather than binary bag of words representation.
String stemmerTipText(): Returns the tip text for this property.
String stopwordsHandlerTipText(): Returns the tip text for this property.
String tokenizerTipText(): Returns the tip text for this property.
String toString(): Returns a textual description of this classifier.
void updateClassifier(Instance instance): Updates the classifier with the given instance.
String useWordFrequenciesTipText(): Returns the tip text for this property.
Methods inherited from class weka.classifiers.AbstractClassifier
batchSizeTipText, classifyInstance, debugTipText, distributionsForInstances, doNotCheckCapabilitiesTipText, forName, getBatchSize, getDebug, getDoNotCheckCapabilities, getNumDecimalPlaces, implementsMoreEfficientBatchPrediction, makeCopies, makeCopy, numDecimalPlacesTipText, postExecution, preExecution, run, runClassifier, setBatchSize, setDebug, setDoNotCheckCapabilities, setNumDecimalPlaces
-
Constructor Details
-
NaiveBayesMultinomialText
public NaiveBayesMultinomialText()
-
-
Method Details
-
globalInfo
public String globalInfo()
Returns a string describing the classifier.
- Returns:
- a description suitable for displaying in the explorer/experimenter gui
-
getCapabilities
public Capabilities getCapabilities()
Returns default capabilities of the classifier.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Specified by:
getCapabilities in interface Classifier
- Overrides:
getCapabilities in class AbstractClassifier
- Returns:
- the capabilities of this classifier
-
buildClassifier
public void buildClassifier(Instances data) throws Exception
Generates the classifier.
- Specified by:
buildClassifier in interface Classifier
- Parameters:
data
- set of instances serving as training data
- Throws:
Exception
- if the classifier has not been generated successfully
-
updateClassifier
public void updateClassifier(Instance instance) throws Exception
Updates the classifier with the given instance.
- Specified by:
updateClassifier in interface UpdateableClassifier
- Parameters:
instance
- the new training instance to include in the model
- Throws:
Exception
- if the instance could not be incorporated in the model.
-
distributionForInstance
public double[] distributionForInstance(Instance instance) throws Exception
Calculates the class membership probabilities for the given test instance.
- Specified by:
distributionForInstance in interface Classifier
- Overrides:
distributionForInstance in class AbstractClassifier
- Parameters:
instance
- the instance to be classified
- Returns:
- predicted class probability distribution
- Throws:
Exception
- if there is a problem generating the prediction
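As background for what a class membership distribution means here, the sketch below computes textbook multinomial naive Bayes probabilities from per-class word counts. It is an illustration under stated assumptions, not the Weka source: Laplace (add-one) smoothing, the class name, and the parameter layout are all choices made for this example, and every class is assumed to have at least one training document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: textbook multinomial naive Bayes with add-one smoothing. */
final class MultinomialNaiveBayesSketch {

    /**
     * @param docsPerClass       number of training documents seen for each class
     * @param wordCountsPerClass for each class, a map from word to accumulated count
     * @param vocabularySize     number of distinct words across all classes
     * @param document           the document to classify, as a map from word to count
     * @return class probabilities that sum to 1
     */
    static double[] distribution(double[] docsPerClass,
                                 List<Map<String, Double>> wordCountsPerClass,
                                 int vocabularySize,
                                 Map<String, Double> document) {
        int numClasses = docsPerClass.length;
        double totalDocs = 0.0;
        for (double d : docsPerClass) {
            totalDocs += d;
        }

        double[] logScores = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            Map<String, Double> counts = wordCountsPerClass.get(c);
            double totalWords = 0.0;
            for (double v : counts.values()) {
                totalWords += v;
            }
            // log P(c) + sum over words of count(w, doc) * log P(w | c)
            double score = Math.log(docsPerClass[c] / totalDocs);
            for (Map.Entry<String, Double> e : document.entrySet()) {
                double count = counts.getOrDefault(e.getKey(), 0.0);
                score += e.getValue() * Math.log((count + 1.0) / (totalWords + vocabularySize));
            }
            logScores[c] = score;
            max = Math.max(max, score);
        }

        // Subtract the maximum before exponentiating to avoid underflow, then normalize.
        double[] probs = new double[numClasses];
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            probs[c] = Math.exp(logScores[c] - max);
            sum += probs[c];
        }
        for (int c = 0; c < numClasses; c++) {
            probs[c] /= sum;
        }
        return probs;
    }

    public static void main(String[] args) {
        Map<String, Double> spam = new HashMap<>();
        spam.put("offer", 3.0);
        spam.put("free", 1.0);
        Map<String, Double> ham = new HashMap<>();
        ham.put("meeting", 2.0);
        ham.put("free", 1.0);
        List<Map<String, Double>> perClass = new ArrayList<>();
        perClass.add(spam);
        perClass.add(ham);

        Map<String, Double> doc = new HashMap<>();
        doc.put("offer", 1.0);
        System.out.println(Arrays.toString(distribution(new double[] {2, 2}, perClass, 3, doc)));
    }
}
```

Working in log space and normalizing at the end keeps the computation stable for long documents, where multiplying many small word probabilities directly would underflow.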
-
reset
public void reset()
Reset the classifier.
-
setStemmer
public void setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value
- the configured stemming algorithm, or null
-
getStemmer
public Stemmer getStemmer()
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
stemmerTipText
public String stemmerTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setTokenizer
public void setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
- Parameters:
value
- the configured tokenizing algorithm
-
getTokenizer
public Tokenizer getTokenizer()
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
-
tokenizerTipText
public String tokenizerTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
useWordFrequenciesTipText
public String useWordFrequenciesTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setUseWordFrequencies
public void setUseWordFrequencies(boolean u)
Set whether to use word frequencies rather than binary bag of words representation.
- Parameters:
u
- true if word frequencies are to be used.
-
getUseWordFrequencies
public boolean getUseWordFrequencies()
Get whether to use word frequencies rather than binary bag of words representation.
- Returns:
- true if word frequencies are to be used.
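The difference between the two representations can be sketched as follows. This is a hypothetical helper, not part of Weka: it splits on whitespace as a stand-in for the configured tokenizer, and its lowercaseTokens flag mirrors the lowercase option described on this page.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: binary versus frequency-based bag of words for one document. */
final class BagOfWordsSketch {

    static Map<String, Double> vectorize(String text, boolean useWordFrequencies, boolean lowercaseTokens) {
        Map<String, Double> vector = new LinkedHashMap<>();
        // Whitespace splitting is an assumption; a real tokenizer would also handle punctuation.
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String word = lowercaseTokens ? token.toLowerCase(Locale.ROOT) : token;
            if (useWordFrequencies) {
                vector.merge(word, 1.0, Double::sum); // count every occurrence
            } else {
                vector.put(word, 1.0);                // record presence only
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(vectorize("Free free offer", true, true));  // prints {free=2.0, offer=1.0}
        System.out.println(vectorize("Free free offer", false, true)); // prints {free=1.0, offer=1.0}
    }
}
```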
-
lowercaseTokensTipText
public String lowercaseTokensTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setLowercaseTokens
public void setLowercaseTokens(boolean l)
Set whether to convert all tokens to lowercase.
- Parameters:
l
- true if all tokens are to be converted to lowercase
-
getLowercaseTokens
public boolean getLowercaseTokens()
Get whether to convert all tokens to lowercase.
- Returns:
- true if all tokens are to be converted to lowercase
-
periodicPruningTipText
public String periodicPruningTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setPeriodicPruning
public void setPeriodicPruning(int p)
Set how often to prune the dictionary.
- Parameters:
p
- how often to prune
-
getPeriodicPruning
public int getPeriodicPruning()
Get how often to prune the dictionary.
- Returns:
- how often to prune the dictionary
-
minWordFrequencyTipText
public String minWordFrequencyTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMinWordFrequency
public void setMinWordFrequency(double minFreq)
Set the minimum word frequency. Words that occur fewer than minFreq times are ignored when updating weights. If periodic pruning is turned on, the minimum frequency is also used when removing words from the dictionary.
- Parameters:
minFreq
- the minimum word frequency to use
-
getMinWordFrequency
public double getMinWordFrequency()
Get the minimum word frequency. Words that occur fewer than this many times are ignored when updating weights. If periodic pruning is turned on, the minimum frequency is also used when removing words from the dictionary.
- Returns:
- the minimum word frequency to use
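The interaction between the minimum word frequency and periodic pruning can be pictured with the sketch below. It is a simplified illustration, not Weka's code: the dictionary is assumed to be a single map from word to accumulated frequency, and the rule "prune after every period-th instance" is an assumption about the schedule. Both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: pruning low-frequency words from a dictionary at fixed intervals. */
final class DictionaryPruningSketch {

    /** Removes every word whose frequency is below minFrequency and returns how many were removed. */
    static int prune(Map<String, Double> dictionary, double minFrequency) {
        int before = dictionary.size();
        dictionary.values().removeIf(count -> count < minFrequency);
        return before - dictionary.size();
    }

    /** A period of 0 (the default described above) or less means the dictionary is never pruned. */
    static boolean shouldPrune(long instancesSeen, int period) {
        return period > 0 && instancesSeen > 0 && instancesSeen % period == 0;
    }

    public static void main(String[] args) {
        Map<String, Double> dictionary = new HashMap<>();
        dictionary.put("offer", 5.0);
        dictionary.put("zzz", 2.0);
        dictionary.put("free", 3.0);
        if (shouldPrune(100, 50)) {
            int removed = prune(dictionary, 3.0);
            System.out.println(removed + " word(s) removed, " + dictionary.size() + " kept");
        }
    }
}
```

Pruning bounds the memory used by the dictionary during incremental training, at the cost of discarding counts for words that might have become frequent later.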
-
normalizeDocLengthTipText
public String normalizeDocLengthTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setNormalizeDocLength
public void setNormalizeDocLength(boolean norm)
Set whether to normalize the length of each document.
- Parameters:
norm
- true if document lengths are to be normalized
-
getNormalizeDocLength
public boolean getNormalizeDocLength()
Get whether to normalize the length of each document.
- Returns:
- true if document lengths are to be normalized
-
normTipText
public String normTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNorm
public double getNorm()
Get the norm of the instances.
- Returns:
- the Norm
-
setNorm
public void setNorm(double newNorm)
Set the norm of the instances.
- Parameters:
newNorm
- the norm to which the instances must be set
-
LNormTipText
public String LNormTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getLNorm
public double getLNorm()
Get the L-norm used.
- Returns:
- the L-norm used
-
setLNorm
public void setLNorm(double newLNorm)
Set the L-norm to use.
- Parameters:
newLNorm
- the L-norm
-
setStopwordsHandler
public void setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
- Parameters:
value
- the stopwords handler; if null, the Null stopwords handler is used
-
getStopwordsHandler
public StopwordsHandler getStopwordsHandler()
Gets the stopwords handler.
- Returns:
- the stopwords handler
-
stopwordsHandlerTipText
public String stopwordsHandlerTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
listOptions
public Enumeration<Option> listOptions()
Returns an enumeration describing the available options.
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class AbstractClassifier
- Returns:
- an enumeration of all the available options.
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with a frequency below this value are ignored. If periodic pruning is turned on, this value is also used to determine which words to remove from the dictionary (default = 3).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-output-debug-info If set, classifier is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, classifier capabilities are not checked before classifier is built (use with caution).
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class AbstractClassifier
- Parameters:
options
- the list of options as an array of strings
- Throws:
Exception
- if an option is not supported
-
getOptions
public String[] getOptions()
Gets the current settings of the classifier.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class AbstractClassifier
- Returns:
- an array of strings suitable for passing to setOptions
-
toString
public String toString()
Returns a textual description of this classifier.
-
getRevision
public String getRevision()
Returns the revision string.
- Specified by:
getRevision in interface RevisionHandler
- Overrides:
getRevision in class AbstractClassifier
- Returns:
- the revision
-
aggregate
public NaiveBayesMultinomialText aggregate(NaiveBayesMultinomialText toAggregate) throws Exception
Description copied from interface: Aggregateable
Aggregate an object with this one.
- Specified by:
aggregate in interface Aggregateable<NaiveBayesMultinomialText>
- Parameters:
toAggregate
- the object to aggregate
- Returns:
- the result of aggregation
- Throws:
Exception
- if the supplied object can't be aggregated for some reason
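Count-based models lend themselves to aggregation because partial models trained on separate data can be combined by adding their counts. The sketch below illustrates that general idea only. How this class actually merges its internal state is not described on this page, so the per-class word-count layout and the helper name are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: merging the per-class word counts of two partial models. */
final class CountAggregationSketch {

    /** Adds the counts in other to target, class by class; both lists must cover the same classes. */
    static void mergeInto(List<Map<String, Double>> target, List<Map<String, Double>> other) {
        if (target.size() != other.size()) {
            throw new IllegalArgumentException("Models were built for a different number of classes");
        }
        for (int c = 0; c < target.size(); c++) {
            Map<String, Double> dest = target.get(c);
            for (Map.Entry<String, Double> e : other.get(c).entrySet()) {
                dest.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Double> first = new HashMap<>();
        first.put("offer", 1.0);
        Map<String, Double> second = new HashMap<>();
        second.put("offer", 2.0);
        second.put("free", 3.0);

        List<Map<String, Double>> modelA = new ArrayList<>();
        modelA.add(first);
        List<Map<String, Double>> modelB = new ArrayList<>();
        modelB.add(second);

        mergeInto(modelA, modelB);
        System.out.println(modelA.get(0).get("offer") + " " + modelA.get(0).get("free"));
    }
}
```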
-
finalizeAggregation
public void finalizeAggregation() throws Exception
Description copied from interface: Aggregateable
Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
- Specified by:
finalizeAggregation in interface Aggregateable<NaiveBayesMultinomialText>
- Throws:
Exception
- if the aggregation can't be finalized for some reason
-
batchFinished
public void batchFinished() throws Exception
Description copied from interface: UpdateableBatchProcessor
Signal that the training data is finished (for now).
- Specified by:
batchFinished in interface UpdateableBatchProcessor
- Throws:
Exception
- if a problem occurs
-
main
public static void main(String[] args)
Main method for testing this class.
- Parameters:
args
- the options
-