Class NaiveBayesMultinomialText

java.lang.Object
weka.classifiers.AbstractClassifier
weka.classifiers.bayes.NaiveBayesMultinomialText
All Implemented Interfaces:
Serializable, Cloneable, Classifier, UpdateableBatchProcessor, UpdateableClassifier, Aggregateable<NaiveBayesMultinomialText>, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, WeightedInstancesHandler

Multinomial naive Bayes for text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.

Valid options are:

 -W
  Use word frequencies instead of binary bag of words.
 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
 -M <double>
  Minimum word frequency. Words with less than this frequency are ignored.
  If periodic pruning is turned on then this is also used to determine which
  words to remove from the dictionary (default = 3).
 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)
 -norm <num>
  Specify the norm that each instance must have (default 1.0)
 -lnorm <num>
  Specify L-norm to use (default 2.0)
 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.
 -stopwords-handler
  The stopwords handler to use (default Null).
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 -output-debug-info
  If set, classifier is run in debug mode and
  may output additional info to the console
 -do-not-check-capabilities
  If set, classifier capabilities are not checked before classifier is built
  (use with caution).
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com), Andrew Golightly (acg4@cs.waikato.ac.nz), Bernhard Pfahringer (bernhard@cs.waikato.ac.nz)
  • Constructor Details

    • NaiveBayesMultinomialText

      public NaiveBayesMultinomialText()
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing the classifier.
      Returns:
      a description suitable for displaying in the explorer/experimenter gui
    • getCapabilities

      public Capabilities getCapabilities()
      Returns default capabilities of the classifier.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Specified by:
      getCapabilities in interface Classifier
      Overrides:
      getCapabilities in class AbstractClassifier
      Returns:
      the capabilities of this classifier
    • buildClassifier

      public void buildClassifier(Instances data) throws Exception
      Generates the classifier.
      Specified by:
      buildClassifier in interface Classifier
      Parameters:
      data - set of instances serving as training data
      Throws:
      Exception - if the classifier has not been generated successfully
    • updateClassifier

      public void updateClassifier(Instance instance) throws Exception
      Updates the classifier with the given instance.
      Specified by:
      updateClassifier in interface UpdateableClassifier
      Parameters:
      instance - the new training instance to include in the model
      Throws:
      Exception - if the instance could not be incorporated in the model.
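The incremental update described above can be pictured as folding one labelled document into running per-class word counts. The following is a minimal sketch under stated assumptions, not Weka's implementation: the class `IncrementalCountsSketch`, its methods, and the whitespace tokenization are all hypothetical and exist only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of incremental count updates; not Weka code. */
final class IncrementalCountsSketch {
    // wordCounts.get(c) maps each word to its accumulated weight for class c.
    private final Map<Integer, Map<String, Double>> wordCounts = new HashMap<>();
    // docCounts.get(c) is the total weight of documents seen for class c.
    private final Map<Integer, Double> docCounts = new HashMap<>();

    /** Folds one labelled, weighted document into the running counts. */
    void update(String text, int classIndex, double weight) {
        docCounts.merge(classIndex, weight, Double::sum);
        Map<String, Double> counts =
            wordCounts.computeIfAbsent(classIndex, k -> new HashMap<>());
        // Assumption: plain whitespace tokenization stands in for a real tokenizer.
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, weight, Double::sum);
            }
        }
    }

    /** Accumulated weight of a word for a class, or 0 if never seen. */
    double wordCount(int classIndex, String word) {
        return wordCounts
            .getOrDefault(classIndex, Collections.<String, Double>emptyMap())
            .getOrDefault(word, 0.0);
    }

    /** Total weight of documents seen for a class, or 0 if none. */
    double docCount(int classIndex) {
        return docCounts.getOrDefault(classIndex, 0.0);
    }
}
```

Because each call only adds to existing counts, no earlier training data needs to be revisited, which is the property an updateable classifier relies on.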
    • distributionForInstance

      public double[] distributionForInstance(Instance instance) throws Exception
      Calculates the class membership probabilities for the given test instance.
      Specified by:
      distributionForInstance in interface Classifier
      Overrides:
      distributionForInstance in class AbstractClassifier
      Parameters:
      instance - the instance to be classified
      Returns:
      predicted class probability distribution
      Throws:
      Exception - if there is a problem generating the prediction
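For orientation, class membership probabilities in a multinomial naive Bayes model are usually obtained by adding the log prior to the count-weighted log word probabilities and then normalizing across classes. The sketch below assumes the smoothed log probabilities are already available; `MultinomialPosteriorSketch` and its parameters are hypothetical names, and the smoothing Weka actually applies is not shown here.

```java
/** Hypothetical illustration of a multinomial naive Bayes posterior; not Weka code. */
final class MultinomialPosteriorSketch {
    /**
     * @param logPriors    logPriors[c] = log P(c)
     * @param logWordProbs logWordProbs[c][w] = log P(w | c), assumed already smoothed
     * @param docCounts    docCounts[w] = count (or 0/1 presence) of word w in the document
     * @return probabilities over classes that sum to 1
     */
    static double[] posterior(double[] logPriors, double[][] logWordProbs, double[] docCounts) {
        int numClasses = logPriors.length;
        double[] logScores = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            double score = logPriors[c];
            for (int w = 0; w < docCounts.length; w++) {
                score += docCounts[w] * logWordProbs[c][w];
            }
            logScores[c] = score;
            max = Math.max(max, score);
        }
        // Subtract the maximum before exponentiating to avoid underflow.
        double[] probs = new double[numClasses];
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            probs[c] = Math.exp(logScores[c] - max);
            sum += probs[c];
        }
        for (int c = 0; c < numClasses; c++) {
            probs[c] /= sum;
        }
        return probs;
    }
}
```

Working in log space matters for long documents, where the product of many small word probabilities would otherwise underflow to zero.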
    • reset

      public void reset()
      Reset the classifier.
    • setStemmer

      public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • stemmerTipText

      public String stemmerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setTokenizer

      public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • tokenizerTipText

      public String tokenizerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • useWordFrequenciesTipText

      public String useWordFrequenciesTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setUseWordFrequencies

      public void setUseWordFrequencies(boolean u)
      Set whether to use word frequencies rather than binary bag of words representation.
      Parameters:
      u - true if word frequencies are to be used.
    • getUseWordFrequencies

      public boolean getUseWordFrequencies()
      Get whether to use word frequencies rather than binary bag of words representation.
      Returns:
      true if word frequencies are to be used.
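The difference between the two representations can be shown on a single document: with word frequencies each token contributes its occurrence count, while the binary bag of words only records presence. This is a minimal sketch assuming whitespace tokenization; `BagOfWordsSketch.vectorize` is a hypothetical helper, not part of the Weka API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of binary vs. frequency bag of words; not Weka code. */
final class BagOfWordsSketch {
    /**
     * Builds a word-to-value map for one document.
     * With useWordFrequencies the value is the occurrence count; otherwise it is 1.0.
     */
    static Map<String, Double> vectorize(String text, boolean useWordFrequencies) {
        Map<String, Double> bag = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (useWordFrequencies) {
                bag.merge(token, 1.0, Double::sum);
            } else {
                bag.put(token, 1.0);
            }
        }
        return bag;
    }
}
```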
    • lowercaseTokensTipText

      public String lowercaseTokensTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setLowercaseTokens

      public void setLowercaseTokens(boolean l)
      Set whether to convert all tokens to lowercase
      Parameters:
      l - true if all tokens are to be converted to lowercase
    • getLowercaseTokens

      public boolean getLowercaseTokens()
      Get whether to convert all tokens to lowercase
      Returns:
      true if all tokens are to be converted to lowercase
    • periodicPruningTipText

      public String periodicPruningTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setPeriodicPruning

      public void setPeriodicPruning(int p)
      Set how often to prune the dictionary
      Parameters:
      p - how often to prune
    • getPeriodicPruning

      public int getPeriodicPruning()
      Get how often to prune the dictionary
      Returns:
      how often to prune the dictionary
    • minWordFrequencyTipText

      public String minWordFrequencyTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinWordFrequency

      public void setMinWordFrequency(double minFreq)
      Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
      Parameters:
      minFreq - the minimum word frequency to use
    • getMinWordFrequency

      public double getMinWordFrequency()
      Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
      Returns:
      the minimum word frequency to use
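The interaction between periodic pruning and the minimum word frequency can be sketched as follows: after every fixed number of documents, dictionary entries whose count is below the minimum are dropped, and a period of 0 disables pruning. The class `DictionaryPruningSketch` and its fields are hypothetical; how Weka tracks counts internally is not shown in this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of periodic dictionary pruning; not Weka code. */
final class DictionaryPruningSketch {
    private final Map<String, Double> dictionary = new HashMap<>();
    private final int pruneEvery;   // 0 means never prune
    private final double minFreq;   // words counted fewer times than this are removed
    private long documentsSeen = 0;

    DictionaryPruningSketch(int pruneEvery, double minFreq) {
        this.pruneEvery = pruneEvery;
        this.minFreq = minFreq;
    }

    /** Counts the words of one document and prunes when the period is reached. */
    void addDocument(String text) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                dictionary.merge(token, 1.0, Double::sum);
            }
        }
        documentsSeen++;
        if (pruneEvery > 0 && documentsSeen % pruneEvery == 0) {
            // Removing through the values view also removes the map entries.
            dictionary.values().removeIf(count -> count < minFreq);
        }
    }

    boolean contains(String word) {
        return dictionary.containsKey(word);
    }

    int size() {
        return dictionary.size();
    }
}
```

Pruning trades a little accuracy on rare words for a bounded dictionary, which is mainly of interest when training incrementally on a long stream of documents.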
    • normalizeDocLengthTipText

      public String normalizeDocLengthTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNormalizeDocLength

      public void setNormalizeDocLength(boolean norm)
      Set whether to normalize the length of each document
      Parameters:
      norm - true if document lengths are to be normalized
    • getNormalizeDocLength

      public boolean getNormalizeDocLength()
      Get whether to normalize the length of each document
      Returns:
      true if document lengths are to be normalized
    • normTipText

      public String normTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNorm

      public double getNorm()
      Get the instance's Norm.
      Returns:
      the Norm
    • setNorm

      public void setNorm(double newNorm)
      Set the norm of the instances
      Parameters:
      newNorm - the norm to which the instances must be set
    • LNormTipText

      public String LNormTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getLNorm

      public double getLNorm()
      Get the L Norm used.
      Returns:
      the L-norm used
    • setLNorm

      public void setLNorm(double newLNorm)
      Set the L-norm to use
      Parameters:
      newLNorm - the L-norm
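Taken together, the norm and L-norm settings suggest the usual form of document length normalization: compute the L-p length of the word-count vector and rescale it so that the length equals the target norm. The following is a sketch of that general technique under this assumption; `DocLengthNormSketch.normalize` is a hypothetical helper and may differ from what Weka does internally.

```java
/** Hypothetical illustration of L-p document length normalization; not Weka code. */
final class DocLengthNormSketch {
    /**
     * Rescales counts so that their L-p norm (p = lnorm) equals norm.
     * A vector of all zeros is returned unchanged.
     */
    static double[] normalize(double[] counts, double norm, double lnorm) {
        double sum = 0.0;
        for (double c : counts) {
            sum += Math.pow(Math.abs(c), lnorm);
        }
        double length = Math.pow(sum, 1.0 / lnorm);
        double[] out = new double[counts.length];
        if (length == 0.0) {
            return out;
        }
        for (int i = 0; i < counts.length; i++) {
            out[i] = counts[i] * norm / length;
        }
        return out;
    }
}
```

With the defaults listed in the options (norm 1.0, L-norm 2.0), this corresponds to scaling each document to unit Euclidean length.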
    • setStopwordsHandler

      public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler, if null, Null is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • stopwordsHandlerTipText

      public String stopwordsHandlerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class AbstractClassifier
      Returns:
      an enumeration of all the available options.
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Valid options are:

       -W
        Use word frequencies instead of binary bag of words.
       -P <# instances>
        How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
       -M <double>
        Minimum word frequency. Words with less than this frequency are ignored.
        If periodic pruning is turned on then this is also used to determine which
        words to remove from the dictionary (default = 3).
       -normalize
        Normalize document length (use in conjunction with -norm and -lnorm)
       -norm <num>
        Specify the norm that each instance must have (default 1.0)
       -lnorm <num>
        Specify L-norm to use (default 2.0)
       -lowercase
        Convert all tokens to lowercase before adding to the dictionary.
       -stopwords-handler
        The stopwords handler to use (default Null).
       -tokenizer <spec>
        The tokenizing algorithm (classname plus parameters) to use.
        (default: weka.core.tokenizers.WordTokenizer)
       -stemmer <spec>
        The stemming algorithm (classname plus parameters) to use.
       -output-debug-info
        If set, classifier is run in debug mode and
        may output additional info to the console
       -do-not-check-capabilities
        If set, classifier capabilities are not checked before classifier is built
        (use with caution).
      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class AbstractClassifier
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
    • getOptions

      public String[] getOptions()
      Gets the current settings of the classifier.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class AbstractClassifier
      Returns:
      an array of strings suitable for passing to setOptions
    • toString

      public String toString()
      Returns a textual description of this classifier.
      Overrides:
      toString in class Object
      Returns:
      a textual description of this classifier.
    • getRevision

      public String getRevision()
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Overrides:
      getRevision in class AbstractClassifier
      Returns:
      the revision
    • aggregate

      public NaiveBayesMultinomialText aggregate(NaiveBayesMultinomialText toAggregate) throws Exception
      Description copied from interface: Aggregateable
      Aggregate an object with this one
      Specified by:
      aggregate in interface Aggregateable<NaiveBayesMultinomialText>
      Parameters:
      toAggregate - the object to aggregate
      Returns:
      the result of aggregation
      Throws:
      Exception - if the supplied object can't be aggregated for some reason
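Count-based models lend themselves to aggregation because the statistics from separately trained models can simply be added. The sketch below illustrates that idea on plain word-count maps; `CountAggregationSketch.merge` is a hypothetical helper, and the page does not state which internal structures `aggregate` actually combines.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of aggregating word counts; not Weka code. */
final class CountAggregationSketch {
    /** Returns a new map holding the element-wise sum of both count maps. */
    static Map<String, Double> merge(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> out = new HashMap<>(a);
        b.forEach((word, count) -> out.merge(word, count, Double::sum));
        return out;
    }
}
```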
    • finalizeAggregation

      public void finalizeAggregation() throws Exception
      Description copied from interface: Aggregateable
      Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
      Specified by:
      finalizeAggregation in interface Aggregateable<NaiveBayesMultinomialText>
      Throws:
      Exception - if the aggregation can't be finalized for some reason
    • batchFinished

      public void batchFinished() throws Exception
      Description copied from interface: UpdateableBatchProcessor
      Signal that the training data is finished (for now).
      Specified by:
      batchFinished in interface UpdateableBatchProcessor
      Throws:
      Exception - if a problem occurs
    • main

      public static void main(String[] args)
      Main method for testing this class.
      Parameters:
      args - the options