java.lang.Object

weka.classifiers.AbstractClassifier

weka.classifiers.bayes.NaiveBayesMultinomialText

All Implemented Interfaces: Serializable, Cloneable, Classifier, UpdateableBatchProcessor, UpdateableClassifier, Aggregateable<NaiveBayesMultinomialText>, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, WeightedInstancesHandler

public class NaiveBayesMultinomialText extends AbstractClassifier implements UpdateableClassifier, UpdateableBatchProcessor, WeightedInstancesHandler, Aggregateable<NaiveBayesMultinomialText>

Multinomial naive Bayes for text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.
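
To make the idea behind the classifier concrete, here is a minimal textbook sketch of multinomial naive Bayes over already-tokenized documents. It is not Weka's implementation: the class name `MultinomialNbSketch`, its methods, and the add-one smoothing of priors and word counts are assumptions chosen for illustration, and tokenizing, stemming, stopword handling, and instance weights are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Textbook multinomial naive Bayes over token arrays (illustrative, not Weka's code). */
class MultinomialNbSketch {
    private final double[] docsPerClass;   // training documents seen per class
    private final double[] tokensPerClass; // total token occurrences per class
    private final List<Map<String, Double>> counts = new ArrayList<>(); // per-class token counts
    private final Set<String> vocabulary = new HashSet<>();

    MultinomialNbSketch(int numClasses) {
        docsPerClass = new double[numClasses];
        tokensPerClass = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            counts.add(new HashMap<>());
        }
    }

    /** Incrementally adds one tokenized document with a known class index. */
    void update(String[] tokens, int classIndex) {
        docsPerClass[classIndex]++;
        for (String token : tokens) {
            counts.get(classIndex).merge(token, 1.0, Double::sum);
            tokensPerClass[classIndex]++;
            vocabulary.add(token);
        }
    }

    /** Returns the class probability distribution for a tokenized document. */
    double[] distribution(String[] tokens) {
        int numClasses = docsPerClass.length;
        double totalDocs = 0.0;
        for (double d : docsPerClass) {
            totalDocs += d;
        }
        double[] scores = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            // Assumption: add-one smoothing on the class prior.
            double logScore = Math.log((docsPerClass[c] + 1.0) / (totalDocs + numClasses));
            for (String token : tokens) {
                if (!vocabulary.contains(token)) {
                    continue; // words never seen in training carry no evidence
                }
                double count = counts.get(c).getOrDefault(token, 0.0);
                // Assumption: add-one smoothing on the per-class word counts.
                logScore += Math.log((count + 1.0) / (tokensPerClass[c] + vocabulary.size()));
            }
            scores[c] = logScore;
            max = Math.max(max, logScore);
        }
        // Convert log scores to probabilities, subtracting the maximum for numerical stability.
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            scores[c] = Math.exp(scores[c] - max);
            sum += scores[c];
        }
        for (int c = 0; c < numClasses; c++) {
            scores[c] /= sum;
        }
        return scores;
    }

    public static void main(String[] args) {
        MultinomialNbSketch nb = new MultinomialNbSketch(2);
        nb.update(new String[] {"cheap", "pills", "cheap"}, 0);
        nb.update(new String[] {"meeting", "agenda", "notes"}, 1);
        System.out.println(Arrays.toString(nb.distribution(new String[] {"cheap", "pills"})));
    }
}
```

The `update` and `distribution` methods of the sketch loosely mirror the roles of `updateClassifier` and `distributionForInstance` documented on this page, with token arrays standing in for Weka instances.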

Valid options are:

 -W
  Use word frequencies instead of binary bag of words.

 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)

 -M <double>
  Minimum word frequency. Words with a frequency below this value are ignored.
  If periodic pruning is turned on, this value is also used to determine which
  words to remove from the dictionary (default = 3).

 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)

 -norm <num>
  Specify the norm that each instance must have (default 1.0)

 -lnorm <num>
  Specify L-norm to use (default 2.0)

 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.

 -stopwords-handler
  The stopwords handler to use (default Null).

 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)

 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.

 -output-debug-info
  If set, classifier is run in debug mode and
  may output additional info to the console

 -do-not-check-capabilities
  If set, classifier capabilities are not checked before classifier is built
  (use with caution).
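
The options above are passed as an array of strings, with each flag and each flag value as a separate element. The sketch below only assembles such an array in plain Java; the helper `buildOptions` and the chosen values are hypothetical, and the result has the shape that `setOptions(String[] options)` accepts.

```java
import java.util.ArrayList;
import java.util.List;

/** Assembles an option array from a few of the documented flags (values are illustrative). */
class OptionsSketch {
    // Hypothetical helper: not part of Weka.
    static String[] buildOptions(boolean useWordFrequencies, int pruneEvery,
                                 double minFrequency, boolean lowercase) {
        List<String> opts = new ArrayList<>();
        if (useWordFrequencies) {
            opts.add("-W"); // flag without a value
        }
        opts.add("-P"); // prune interval in instances, 0 means no pruning
        opts.add(Integer.toString(pruneEvery));
        opts.add("-M"); // minimum word frequency
        opts.add(Double.toString(minFrequency));
        if (lowercase) {
            opts.add("-lowercase");
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] options = buildOptions(true, 1000, 2.0, true);
        System.out.println(String.join(" ", options)); // prints "-W -P 1000 -M 2.0 -lowercase"
    }
}
```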

Author:

Mark Hall (mhall{[at]}pentaho{[dot]}com), Andrew Golightly (acg4@cs.waikato.ac.nz), Bernhard Pfahringer (bernhard@cs.waikato.ac.nz)

See Also:

Serialized Form

Field Summary

Fields inherited from class weka.classifiers.AbstractClassifier
BATCH_SIZE_DEFAULT, NUM_DECIMAL_PLACES_DEFAULT
Constructor Summary

Constructors

Constructor

Description

NaiveBayesMultinomialText()
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| NaiveBayesMultinomialText | aggregate(NaiveBayesMultinomialText toAggregate) | Aggregate an object with this one |
| void | batchFinished() | Signal that the training data is finished (for now). |
| void | buildClassifier(Instances data) | Generates the classifier. |
| double[] | distributionForInstance(Instance instance) | Calculates the class membership probabilities for the given test instance. |
| void | finalizeAggregation() | Call to complete the aggregation process. |
| Capabilities | getCapabilities() | Returns default capabilities of the classifier. |
| double | getLNorm() | Get the L Norm used. |
| boolean | getLowercaseTokens() | Get whether to convert all tokens to lowercase |
| double | getMinWordFrequency() | Get the minimum word frequency. |
| double | getNorm() | Get the instance's Norm. |
| boolean | getNormalizeDocLength() | Get whether to normalize the length of each document |
| String[] | getOptions() | Gets the current settings of the classifier. |
| int | getPeriodicPruning() | Get how often to prune the dictionary |
| String | getRevision() | Returns the revision string. |
| Stemmer | getStemmer() | Returns the current stemming algorithm, null if none is used. |
| StopwordsHandler | getStopwordsHandler() | Gets the stopwords handler. |
| Tokenizer | getTokenizer() | Returns the current tokenizer algorithm. |
| boolean | getUseWordFrequencies() | Get whether to use word frequencies rather than binary bag of words representation. |
| String | globalInfo() | Returns a string describing classifier |
| Enumeration<Option> | listOptions() | Returns an enumeration describing the available options. |
| String | LNormTipText() | Returns the tip text for this property |
| String | lowercaseTokensTipText() | Returns the tip text for this property |
| static void | main(String[] args) | Main method for testing this class. |
| String | minWordFrequencyTipText() | Returns the tip text for this property |
| String | normalizeDocLengthTipText() | Returns the tip text for this property |
| String | normTipText() | Returns the tip text for this property |
| String | periodicPruningTipText() | Returns the tip text for this property |
| void | reset() | Reset the classifier. |
| void | setLNorm(double newLNorm) | Set the L-norm to use |
| void | setLowercaseTokens(boolean l) | Set whether to convert all tokens to lowercase |
| void | setMinWordFrequency(double minFreq) | Set the minimum word frequency. |
| void | setNorm(double newNorm) | Set the norm of the instances |
| void | setNormalizeDocLength(boolean norm) | Set whether to normalize the length of each document |
| void | setOptions(String[] options) | Parses a given list of options. |
| void | setPeriodicPruning(int p) | Set how often to prune the dictionary |
| void | setStemmer(Stemmer value) | Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
| void | setStopwordsHandler(StopwordsHandler value) | Sets the stopwords handler to use. |
| void | setTokenizer(Tokenizer value) | Sets the tokenizer algorithm to use. |
| void | setUseWordFrequencies(boolean u) | Set whether to use word frequencies rather than binary bag of words representation. |
| String | stemmerTipText() | Returns the tip text for this property. |
| String | stopwordsHandlerTipText() | Returns the tip text for this property. |
| String | tokenizerTipText() | Returns the tip text for this property. |
| String | toString() | Returns a textual description of this classifier. |
| void | updateClassifier(Instance instance) | Updates the classifier with the given instance. |
| String | useWordFrequenciesTipText() | Returns the tip text for this property |

Methods inherited from class weka.classifiers.AbstractClassifier
batchSizeTipText, classifyInstance, debugTipText, distributionsForInstances, doNotCheckCapabilitiesTipText, forName, getBatchSize, getDebug, getDoNotCheckCapabilities, getNumDecimalPlaces, implementsMoreEfficientBatchPrediction, makeCopies, makeCopy, numDecimalPlacesTipText, postExecution, preExecution, run, runClassifier, setBatchSize, setDebug, setDoNotCheckCapabilities, setNumDecimalPlaces

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Details
- NaiveBayesMultinomialText
  
  public NaiveBayesMultinomialText()
Method Details
- globalInfo
  
  public String globalInfo()
  
  Returns a string describing classifier
  
  Returns:
  
  a description suitable for displaying in the explorer/experimenter gui
- getCapabilities
  
  public Capabilities getCapabilities()
  
  Returns default capabilities of the classifier.
  Specified by:
  
  getCapabilities in interface CapabilitiesHandler
  
  Specified by:
  
  getCapabilities in interface Classifier
  
  Overrides:
  
  getCapabilities in class AbstractClassifier
  
  Returns:
  
  the capabilities of this classifier
  
  See Also:
  
  Capabilities
- buildClassifier
  
  public void buildClassifier(Instances data) throws Exception
  
  Generates the classifier.
  
  Specified by:
  
  buildClassifier in interface Classifier
  
  Parameters:
  
  data - set of instances serving as training data
  
  Throws:
  
  Exception - if the classifier has not been generated successfully
- updateClassifier
  
  public void updateClassifier(Instance instance) throws Exception
  
  Updates the classifier with the given instance.
  
  Specified by:
  
  updateClassifier in interface UpdateableClassifier
  
  Parameters:
  
  instance - the new training instance to include in the model
  
  Throws:
  
  Exception - if the instance could not be incorporated in the model.
- distributionForInstance
  
  public double[] distributionForInstance(Instance instance) throws Exception
  
  Calculates the class membership probabilities for the given test instance.
  
  Specified by:
  
  distributionForInstance in interface Classifier
  
  Overrides:
  
  distributionForInstance in class AbstractClassifier
  
  Parameters:
  
  instance - the instance to be classified
  
  Returns:
  
  predicted class probability distribution
  
  Throws:
  
  Exception - if there is a problem generating the prediction
- reset
  
  public void reset()
  
  Reset the classifier.
- setStemmer
  
  public void setStemmer(Stemmer value)
  
  Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
  Parameters:
  
  value - the configured stemming algorithm, or null
  
  See Also:
  
  NullStemmer
- getStemmer
  
  public Stemmer getStemmer()
  
  Returns the current stemming algorithm, null if none is used.
  
  Returns:
  
  the current stemming algorithm, null if none set
- stemmerTipText
  
  public String stemmerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setTokenizer
  
  public void setTokenizer(Tokenizer value)
  
  Sets the tokenizer algorithm to use.
  
  Parameters:
  
  value - the configured tokenizing algorithm
- getTokenizer
  
  public Tokenizer getTokenizer()
  
  Returns the current tokenizer algorithm.
  
  Returns:
  
  the current tokenizer algorithm
- tokenizerTipText
  
  public String tokenizerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- useWordFrequenciesTipText
  
  public String useWordFrequenciesTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setUseWordFrequencies
  
  public void setUseWordFrequencies(boolean u)
  
  Set whether to use word frequencies rather than binary bag of words representation.
  
  Parameters:
  
  u - true if word frequencies are to be used.
- getUseWordFrequencies
  
  public boolean getUseWordFrequencies()
  
  Get whether to use word frequencies rather than binary bag of words representation.
  
  Returns:
  
  true if word frequencies are to be used.
- lowercaseTokensTipText
  
  public String lowercaseTokensTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setLowercaseTokens
  
  public void setLowercaseTokens(boolean l)
  
  Set whether to convert all tokens to lowercase
  
  Parameters:
  
  l - true if all tokens are to be converted to lowercase
- getLowercaseTokens
  
  public boolean getLowercaseTokens()
  
  Get whether to convert all tokens to lowercase
  
  Returns:
  
  true if all tokens are to be converted to lowercase
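
The word-frequency and lowercase settings described above both change what ends up in a document's bag of words. The following is a minimal sketch in plain Java, not Weka's code: `BagOfWordsSketch.bag` is a hypothetical helper, and whitespace splitting stands in for a configured tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Builds a bag of words as either occurrence counts or binary presence values. */
class BagOfWordsSketch {
    // Hypothetical helper: whitespace splitting replaces a real tokenizer here.
    static Map<String, Double> bag(String text, boolean useWordFrequencies, boolean lowercase) {
        Map<String, Double> bag = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            String word = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            if (useWordFrequencies) {
                bag.merge(word, 1.0, Double::sum); // count every occurrence
            } else {
                bag.put(word, 1.0); // binary: presence only
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(bag("The cat the CAT", true, true));  // frequencies, lowercased
        System.out.println(bag("The cat the CAT", false, true)); // binary, lowercased
    }
}
```

Without lowercasing, "The" and "the" are separate dictionary entries, which is why the two settings interact.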
- periodicPruningTipText
  
  public String periodicPruningTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setPeriodicPruning
  
  public void setPeriodicPruning(int p)
  
  Set how often to prune the dictionary
  
  Parameters:
  
  p - how often to prune
- getPeriodicPruning
  
  public int getPeriodicPruning()
  
  Get how often to prune the dictionary
  
  Returns:
  
  how often to prune the dictionary
- minWordFrequencyTipText
  
  public String minWordFrequencyTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setMinWordFrequency
  
  public void setMinWordFrequency(double minFreq)
  
  Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
  
  Parameters:
  
  minFreq - the minimum word frequency to use
- getMinWordFrequency
  
  public double getMinWordFrequency()
  
  Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
  
  Returns:
  
  the minimum word frequency to use
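
As an illustration of how the minimum word frequency and periodic pruning fit together, the sketch below removes dictionary entries whose count is below the threshold and decides when a pruning pass is due. It is an assumption-laden simplification, not Weka's code: `PruneSketch`, `prune`, and `shouldPrune` are hypothetical names, and a single map stands in for the classifier's dictionary.

```java
import java.util.HashMap;
import java.util.Map;

/** Drops low-frequency words from a word-count dictionary. */
class PruneSketch {
    /** Removes words whose count is below minFrequency and returns how many were removed. */
    static int prune(Map<String, Double> dictionary, double minFrequency) {
        int before = dictionary.size();
        dictionary.values().removeIf(count -> count < minFrequency);
        return before - dictionary.size();
    }

    /** Pruning is due every pruneEvery instances; 0 disables it. */
    static boolean shouldPrune(long instancesSeen, int pruneEvery) {
        return pruneEvery > 0 && instancesSeen % pruneEvery == 0;
    }

    public static void main(String[] args) {
        Map<String, Double> dictionary = new HashMap<>();
        dictionary.put("rare", 1.0);
        dictionary.put("common", 3.0);
        dictionary.put("frequent", 5.0);
        if (shouldPrune(1000, 1000)) {
            System.out.println("removed " + prune(dictionary, 3.0) + " word(s)"); // prints "removed 1 word(s)"
        }
    }
}
```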
- normalizeDocLengthTipText
  
  public String normalizeDocLengthTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setNormalizeDocLength
  
  public void setNormalizeDocLength(boolean norm)
  
  Set whether to normalize the length of each document
  
  Parameters:
  
  norm - true if document lengths are to be normalized
- getNormalizeDocLength
  
  public boolean getNormalizeDocLength()
  
  Get whether to normalize the length of each document
  
  Returns:
  
  true if document lengths are to be normalized
- normTipText
  
  public String normTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getNorm
  
  public double getNorm()
  
  Get the instance's Norm.
  
  Returns:
  
  the Norm
- setNorm
  
  public void setNorm(double newNorm)
  
  Set the norm of the instances
  
  Parameters:
  
  newNorm - the norm to which the instances must be set
- LNormTipText
  
  public String LNormTipText()
  
  Returns the tip text for this property
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getLNorm
  
  public double getLNorm()
  
  Get the L Norm used.
  
  Returns:
  
  the L-norm used
- setLNorm
  
  public void setLNorm(double newLNorm)
  
  Set the L-norm to use
  
  Parameters:
  
  newLNorm - the L-norm
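
Document length normalization combines the two values above: the L-norm selects how a document's length is measured, and the norm is the length every document is scaled to. The sketch below shows one common way to do this on a plain array of word frequencies. It is an assumption about the arithmetic rather than a copy of Weka's code, and `NormalizeSketch.normalize` is a hypothetical helper.

```java
import java.util.Arrays;

/** Rescales a word-frequency vector so that its L-norm equals a target value. */
class NormalizeSketch {
    static double[] normalize(double[] frequencies, double norm, double lnorm) {
        // Length of the document under the chosen L-norm: (sum |f|^p)^(1/p).
        double sum = 0.0;
        for (double f : frequencies) {
            sum += Math.pow(Math.abs(f), lnorm);
        }
        double length = Math.pow(sum, 1.0 / lnorm);
        double[] scaled = frequencies.clone();
        if (length == 0.0) {
            return scaled; // empty document: nothing to rescale
        }
        for (int i = 0; i < scaled.length; i++) {
            scaled[i] *= norm / length;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // With the documented defaults (norm 1.0, L-norm 2.0) this is unit Euclidean length.
        System.out.println(Arrays.toString(normalize(new double[] {3.0, 4.0}, 1.0, 2.0)));
    }
}
```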
- setStopwordsHandler
  
  public void setStopwordsHandler(StopwordsHandler value)
  
  Sets the stopwords handler to use.
  
  Parameters:
  
  value - the stopwords handler; if null, the Null handler is used
- getStopwordsHandler
  
  public StopwordsHandler getStopwordsHandler()
  
  Gets the stopwords handler.
  
  Returns:
  
  the stopwords handler
- stopwordsHandlerTipText
  
  public String stopwordsHandlerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- listOptions
  
  public Enumeration<Option> listOptions()
  
  Returns an enumeration describing the available options.
  
  Specified by:
  
  listOptions in interface OptionHandler
  
  Overrides:
  
  listOptions in class AbstractClassifier
  
  Returns:
  
  an enumeration of all the available options.
- setOptions
  
  public void setOptions(String[] options) throws Exception
  Parses a given list of options.
  Valid options are:
  
  -W Use word frequencies instead of binary bag of words.
  
  -P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
  
  -M <double> Minimum word frequency. Words with a frequency below this value are ignored. If periodic pruning is turned on, this value is also used to determine which words to remove from the dictionary (default = 3).
  
  -normalize Normalize document length (use in conjunction with -norm and -lnorm)
  
  -norm <num> Specify the norm that each instance must have (default 1.0)
  
  -lnorm <num> Specify L-norm to use (default 2.0)
  
  -lowercase Convert all tokens to lowercase before adding to the dictionary.
  
  -stopwords-handler The stopwords handler to use (default Null).
  
  -tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
  
  -stemmer <spec> The stemming algorithm (classname plus parameters) to use.
  
  -output-debug-info If set, classifier is run in debug mode and may output additional info to the console
  
  -do-not-check-capabilities If set, classifier capabilities are not checked before classifier is built (use with caution).
  Specified by:
  
  setOptions in interface OptionHandler
  
  Overrides:
  
  setOptions in class AbstractClassifier
  
  Parameters:
  
  options - the list of options as an array of strings
  
  Throws:
  
  Exception - if an option is not supported
- getOptions
  
  public String[] getOptions()
  
  Gets the current settings of the classifier.
  
  Specified by:
  
  getOptions in interface OptionHandler
  
  Overrides:
  
  getOptions in class AbstractClassifier
  
  Returns:
  
  an array of strings suitable for passing to setOptions
- toString
  
  public String toString()
  
  Returns a textual description of this classifier.
  
  Overrides:
  
  toString in class Object
  
  Returns:
  
  a textual description of this classifier.
- getRevision
  
  public String getRevision()
  
  Returns the revision string.
  
  Specified by:
  
  getRevision in interface RevisionHandler
  
  Overrides:
  
  getRevision in class AbstractClassifier
  
  Returns:
  
  the revision
- aggregate
  
  public NaiveBayesMultinomialText aggregate(NaiveBayesMultinomialText toAggregate) throws Exception
  
  Description copied from interface: Aggregateable
  
  Aggregate an object with this one
  
  Specified by:
  
  aggregate in interface Aggregateable<NaiveBayesMultinomialText>
  
  Parameters:
  
  toAggregate - the object to aggregate
  
  Returns:
  
  the result of aggregation
  
  Throws:
  
  Exception - if the supplied object can't be aggregated for some reason
- finalizeAggregation
  
  public void finalizeAggregation() throws Exception
  
  Description copied from interface: Aggregateable
  
  Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
  
  Specified by:
  
  finalizeAggregation in interface Aggregateable<NaiveBayesMultinomialText>
  
  Throws:
  
  Exception - if the aggregation can't be finalized for some reason
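
For a model built from word counts, one plausible reading of aggregation is summing the counts collected by separately trained models. The sketch below shows only that idea on plain maps. It is an assumption for illustration, not a description of what `aggregate` or `finalizeAggregation` do internally, and `AggregateSketch.merge` is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;

/** Merges two word-count dictionaries by summing the counts of shared words. */
class AggregateSketch {
    static Map<String, Double> merge(Map<String, Double> target, Map<String, Double> other) {
        // Words present in both maps are added together; new words are copied over.
        other.forEach((word, count) -> target.merge(word, count, Double::sum));
        return target;
    }

    public static void main(String[] args) {
        Map<String, Double> first = new HashMap<>();
        first.put("cheap", 1.0);
        first.put("pills", 2.0);
        Map<String, Double> second = new HashMap<>();
        second.put("pills", 3.0);
        second.put("offer", 4.0);
        System.out.println(merge(first, second));
    }
}
```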
- batchFinished
  
  public void batchFinished() throws Exception
  
  Description copied from interface: UpdateableBatchProcessor
  
  Signal that the training data is finished (for now).
  
  Specified by:
  
  batchFinished in interface UpdateableBatchProcessor
  
  Throws:
  
  Exception - if a problem occurs
- main
  
  public static void main(String[] args)
  
  Main method for testing this class.
  
  Parameters:
  
  args - the options
