Class SGDText

All Implemented Interfaces:
Serializable, Cloneable, Classifier, UpdateableBatchProcessor, UpdateableClassifier, Aggregateable<SGDText>, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, WeightedInstancesHandler

Implements stochastic gradient descent for learning a linear binary class SVM or binary class logistic regression on text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.
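
To make the description above concrete, here is a minimal sketch of stochastic gradient descent for a linear binary model over a sparse bag of words, with a choice of hinge or log loss. It is not Weka's implementation: the class `LinearSgdSketch`, its fields, and the plain L2 weight shrinkage are assumptions made for illustration, and labels are assumed to be encoded as -1 or +1.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not part of Weka. */
final class LinearSgdSketch {
    static final int HINGE = 0;   // mirrors the documented -F 0
    static final int LOGLOSS = 1; // mirrors the documented -F 1

    final Map<String, Double> weights = new HashMap<>();
    double bias = 0.0;
    final int loss;
    final double learningRate;
    final double lambda;

    LinearSgdSketch(int loss, double learningRate, double lambda) {
        this.loss = loss;
        this.learningRate = learningRate;
        this.lambda = lambda;
    }

    /** Raw linear output w.x + b for a sparse document vector. */
    double score(Map<String, Double> doc) {
        double z = bias;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            z += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return z;
    }

    /** Negative derivative of the loss with respect to the margin m = y * z. */
    double dloss(double margin) {
        if (loss == HINGE) {
            return margin < 1.0 ? 1.0 : 0.0; // hinge: max(0, 1 - m)
        }
        return 1.0 / (1.0 + Math.exp(margin)); // log loss: log(1 + exp(-m))
    }

    /** One SGD step on a single document; y must be -1 or +1. */
    void update(Map<String, Double> doc, double y) {
        double factor = learningRate * y * dloss(y * score(doc));
        // Assumed L2 regularization: shrink every existing weight.
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            e.setValue(e.getValue() * (1.0 - learningRate * lambda));
        }
        // Move the weights of the terms present in this document.
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            weights.merge(e.getKey(), factor * e.getValue(), Double::sum);
        }
        bias += factor;
    }
}
```

Calling `update` once per incoming document corresponds to incremental learning, while looping over a fixed training set several times corresponds to the batch mode controlled by the epochs option.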

Valid options are:

 -F
  Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression)
  (default = 0)
 -outputProbs
  Output probabilities for SVMs (fits a logistic
  model to the output of the SVM)
 -L
  The learning rate (default = 0.01).
 -R <double>
  The lambda regularization constant (default = 0.0001)
 -E <integer>
  The number of epochs to perform (batch learning only, default = 500)
 -W
  Use word frequencies instead of binary bag of words.
 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
 -M <double>
  Minimum word frequency. Words with a frequency lower than this are ignored.
  If periodic pruning is turned on then this is also used to determine which
  words to remove from the dictionary (default = 3).
 -min-coeff <double>
  Minimum absolute value of coefficients in the model.
  If periodic pruning is turned on then this
  is also used to prune words from the dictionary
  (default = 0.001)
 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)
 -norm <num>
  Specify the norm that each instance must have (default 1.0)
 -lnorm <num>
  Specify L-norm to use (default 2.0)
 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.
 -stopwords-handler
  The stopwords handler to use (default Null).
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 -S <num>
  Random number seed.
  (default 1)
 -output-debug-info
  If set, classifier is run in debug mode and
  may output additional info to the console
 -do-not-check-capabilities
  If set, classifier capabilities are not checked before classifier is built
  (use with caution).
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com), Eibe Frank (eibe{[at]}cs{[dot]}waikato{[dot]}ac{[dot]}nz)
  • Field Details

    • HINGE

      public static final int HINGE
      the hinge loss function.
    • LOGLOSS

      public static final int LOGLOSS
      the log loss function.
    • TAGS_SELECTION

      public static final Tag[] TAGS_SELECTION
      Loss functions to choose from
  • Constructor Details

    • SGDText

      public SGDText()
  • Method Details

    • getCapabilities

      public Capabilities getCapabilities()
      Returns default capabilities of the classifier.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Specified by:
      getCapabilities in interface Classifier
      Overrides:
      getCapabilities in class AbstractClassifier
      Returns:
      the capabilities of this classifier
    • setStemmer

      public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • stemmerTipText

      public String stemmerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setTokenizer

      public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • tokenizerTipText

      public String tokenizerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • useWordFrequenciesTipText

      public String useWordFrequenciesTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setUseWordFrequencies

      public void setUseWordFrequencies(boolean u)
      Set whether to use word frequencies rather than binary bag of words representation.
      Parameters:
      u - true if word frequencies are to be used.
    • getUseWordFrequencies

      public boolean getUseWordFrequencies()
      Get whether to use word frequencies rather than binary bag of words representation.
      Returns:
      true if word frequencies are to be used.
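
The difference between the two representations can be sketched as follows. This is an illustration under stated assumptions, not Weka's code: `BagOfWordsSketch` and `vectorize` are hypothetical names, tokens are split on whitespace rather than by a configurable tokenizer, and the `lowercase` flag stands in for the documented -lowercase option.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; names and whitespace tokenization are assumptions. */
final class BagOfWordsSketch {
    /**
     * Turns a document into a sparse term vector.
     * With useWordFrequencies == false every present term gets the value 1.0;
     * with useWordFrequencies == true each term gets its occurrence count.
     */
    static Map<String, Double> vectorize(String text, boolean useWordFrequencies,
                                         boolean lowercase) {
        Map<String, Double> vector = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace produces an empty first token
            }
            String term = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            if (useWordFrequencies) {
                vector.merge(term, 1.0, Double::sum); // count occurrences
            } else {
                vector.put(term, 1.0);                // presence only
            }
        }
        return vector;
    }
}
```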
    • lowercaseTokensTipText

      public String lowercaseTokensTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setLowercaseTokens

      public void setLowercaseTokens(boolean l)
      Set whether to convert all tokens to lowercase
      Parameters:
      l - true if all tokens are to be converted to lowercase
    • getLowercaseTokens

      public boolean getLowercaseTokens()
      Get whether to convert all tokens to lowercase
      Returns:
      true if all tokens are to be converted to lowercase
    • setStopwordsHandler

      public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler, if null, Null is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • stopwordsHandlerTipText

      public String stopwordsHandlerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • periodicPruningTipText

      public String periodicPruningTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setPeriodicPruning

      public void setPeriodicPruning(int p)
      Set how often to prune the dictionary
      Parameters:
      p - how often to prune
    • getPeriodicPruning

      public int getPeriodicPruning()
      Get how often to prune the dictionary
      Returns:
      how often to prune the dictionary
    • minWordFrequencyTipText

      public String minWordFrequencyTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinWordFrequency

      public void setMinWordFrequency(double minFreq)
      Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
      Parameters:
      minFreq - the minimum word frequency to use
    • getMinWordFrequency

      public double getMinWordFrequency()
      Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
      Returns:
      the minimum word frequency to use
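
As a rough illustration of frequency-based pruning, the sketch below removes every dictionary entry whose count falls below the threshold. It assumes a plain map from term to count; `DictionaryPruningSketch` and `prune` are hypothetical names, and the real class stores richer per-term information than a single number.

```java
import java.util.Iterator;
import java.util.Map;

/** Illustrative sketch only; not the data structure used by SGDText. */
final class DictionaryPruningSketch {
    /**
     * Removes all terms that occur fewer than minFreq times.
     * Returns the number of terms removed.
     */
    static int prune(Map<String, Double> counts, double minFreq) {
        int removed = 0;
        Iterator<Map.Entry<String, Double>> it = counts.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < minFreq) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }
}
```

With periodic pruning enabled, a step like this would run once every so many training instances; with it disabled, low-frequency terms stay in the dictionary and are only skipped during weight updates.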
    • minAbsoluteCoefficientValueTipText

      public String minAbsoluteCoefficientValueTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinAbsoluteCoefficientValue

      public void setMinAbsoluteCoefficientValue(double minCoeff)
      Set the minimum absolute magnitude for model coefficients. Terms with weights smaller than this value are ignored. If periodic pruning is turned on then this is also used to determine if a word should be removed from the dictionary.
      Parameters:
      minCoeff - the minimum absolute value of a model coefficient
    • getMinAbsoluteCoefficientValue

      public double getMinAbsoluteCoefficientValue()
      Get the minimum absolute magnitude for model coefficients. Terms with weights smaller than this value are ignored. If periodic pruning is turned on then this is also used to determine if a word should be removed from the dictionary.
      Returns:
      the minimum absolute value of a model coefficient
    • normalizeDocLengthTipText

      public String normalizeDocLengthTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNormalizeDocLength

      public void setNormalizeDocLength(boolean norm)
      Set whether to normalize the length of each document
      Parameters:
      norm - true if document lengths are to be normalized
    • getNormalizeDocLength

      public boolean getNormalizeDocLength()
      Get whether to normalize the length of each document
      Returns:
      true if document lengths are to be normalized
    • normTipText

      public String normTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNorm

      public double getNorm()
      Get the instance's Norm.
      Returns:
      the Norm
    • setNorm

      public void setNorm(double newNorm)
      Set the norm of the instances
      Parameters:
      newNorm - the norm to which the instances must be set
    • LNormTipText

      public String LNormTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getLNorm

      public double getLNorm()
      Get the L Norm used.
      Returns:
      the L-norm used
    • setLNorm

      public void setLNorm(double newLNorm)
      Set the L-norm to use
      Parameters:
      newLNorm - the L-norm
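
One way to read the interaction of the norm and L-norm settings is that each document vector is rescaled so that its L-p length (p being the L-norm) equals the requested norm. The sketch below shows that computation on a dense array; `DocLengthNormSketch` and `normalize` are hypothetical names, and the exact formula used by the class is an assumption here.

```java
/** Illustrative sketch only; the rescaling formula is an assumption. */
final class DocLengthNormSketch {
    /**
     * Rescales values so that (sum |v_i|^lnorm)^(1/lnorm) equals norm.
     * An all-zero vector is returned unchanged.
     */
    static double[] normalize(double[] values, double norm, double lnorm) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.pow(Math.abs(v), lnorm);
        }
        double length = Math.pow(sum, 1.0 / lnorm);
        double[] out = values.clone();
        if (length == 0.0) {
            return out; // nothing to scale
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] * norm / length;
        }
        return out;
    }
}
```

With the documented defaults (norm 1.0, L-norm 2.0) this is ordinary unit-length Euclidean normalization.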
    • lambdaTipText

      public String lambdaTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setLambda

      public void setLambda(double lambda)
      Set the value of lambda to use
      Parameters:
      lambda - the value of lambda to use
    • getLambda

      public double getLambda()
      Get the current value of lambda
      Returns:
      the current value of lambda
    • setLearningRate

      public void setLearningRate(double lr)
      Set the learning rate.
      Parameters:
      lr - the learning rate to use.
    • getLearningRate

      public double getLearningRate()
      Get the learning rate.
      Returns:
      the learning rate
    • learningRateTipText

      public String learningRateTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • epochsTipText

      public String epochsTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setEpochs

      public void setEpochs(int e)
      Set the number of epochs to use
      Parameters:
      e - the number of epochs to use
    • getEpochs

      public int getEpochs()
      Get current number of epochs
      Returns:
      the current number of epochs
    • setLossFunction

      public void setLossFunction(SelectedTag function)
      Set the loss function to use.
      Parameters:
      function - the loss function to use.
    • getLossFunction

      public SelectedTag getLossFunction()
      Get the current loss function.
      Returns:
      the current loss function.
    • lossFunctionTipText

      public String lossFunctionTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setOutputProbsForSVM

      public void setOutputProbsForSVM(boolean o)
      Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
      Parameters:
      o - true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
    • getOutputProbsForSVM

      public boolean getOutputProbsForSVM()
      Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
      Returns:
      true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
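
The idea of fitting a logistic regression to SVM outputs can be sketched as a one-dimensional model p = sigmoid(a * z + b), where z is the raw SVM output, trained with SGD on log loss. Everything below is an assumption for illustration: the class `SvmProbabilitySketch`, its parameters `a` and `b`, and the 0/1 label encoding are not taken from the Weka source.

```java
/** Illustrative sketch only; not the calibration code used by SGDText. */
final class SvmProbabilitySketch {
    double a = 0.0; // slope applied to the raw SVM output
    double b = 0.0; // intercept

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Fits a and b by SGD on log loss; labels must be 0 or 1. */
    void fit(double[] svmOutputs, int[] labels, int epochs, double learningRate) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < svmOutputs.length; i++) {
                double z = svmOutputs[i];
                double error = labels[i] - sigmoid(a * z + b);
                a += learningRate * error * z; // gradient step for the slope
                b += learningRate * error;     // gradient step for the intercept
            }
        }
    }

    /** Estimated probability of the positive class for a raw SVM output. */
    double probability(double svmOutput) {
        return sigmoid(a * svmOutput + b);
    }
}
```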
    • outputProbsForSVMTipText

      public String outputProbsForSVMTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class RandomizableClassifier
      Returns:
      an enumeration of all the available options.
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Valid options are:

       -F
        Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression)
        (default = 0)
       -outputProbs
        Output probabilities for SVMs (fits a logistic
        model to the output of the SVM)
       -L
        The learning rate (default = 0.01).
       -R <double>
        The lambda regularization constant (default = 0.0001)
       -E <integer>
        The number of epochs to perform (batch learning only, default = 500)
       -W
        Use word frequencies instead of binary bag of words.
       -P <# instances>
        How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
       -M <double>
        Minimum word frequency. Words with a frequency lower than this are ignored.
        If periodic pruning is turned on then this is also used to determine which
        words to remove from the dictionary (default = 3).
       -min-coeff <double>
        Minimum absolute value of coefficients in the model.
        If periodic pruning is turned on then this
        is also used to prune words from the dictionary
        (default = 0.001)
       -normalize
        Normalize document length (use in conjunction with -norm and -lnorm)
       -norm <num>
        Specify the norm that each instance must have (default 1.0)
       -lnorm <num>
        Specify L-norm to use (default 2.0)
       -lowercase
        Convert all tokens to lowercase before adding to the dictionary.
       -stopwords-handler
        The stopwords handler to use (default Null).
       -tokenizer <spec>
        The tokenizing algorithm (classname plus parameters) to use.
        (default: weka.core.tokenizers.WordTokenizer)
       -stemmer <spec>
        The stemming algorithm (classname plus parameters) to use.
       -S <num>
        Random number seed.
        (default 1)
       -output-debug-info
        If set, classifier is run in debug mode and
        may output additional info to the console
       -do-not-check-capabilities
        If set, classifier capabilities are not checked before classifier is built
        (use with caution).
      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class RandomizableClassifier
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
    • getOptions

      public String[] getOptions()
      Gets the current settings of the classifier.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class RandomizableClassifier
      Returns:
      an array of strings suitable for passing to setOptions
    • globalInfo

      public String globalInfo()
      Returns a string describing the classifier
      Returns:
      a description suitable for displaying in the explorer/experimenter gui
    • reset

      public void reset()
      Reset the classifier.
    • buildClassifier

      public void buildClassifier(Instances data) throws Exception
      Method for building the classifier.
      Specified by:
      buildClassifier in interface Classifier
      Parameters:
      data - the set of training instances.
      Throws:
      Exception - if the classifier can't be built successfully.
    • updateClassifier

      public void updateClassifier(Instance instance) throws Exception
      Updates the classifier with the given instance.
      Specified by:
      updateClassifier in interface UpdateableClassifier
      Parameters:
      instance - the new training instance to include in the model
      Throws:
      Exception - if the instance could not be incorporated in the model.
    • distributionForInstance

      public double[] distributionForInstance(Instance inst) throws Exception
      Description copied from class: AbstractClassifier
      Predicts the class memberships for a given instance. If an instance is unclassified, the returned array elements must be all zero. If the class is numeric, the array must consist of only one element, which contains the predicted value. Note that a classifier MUST implement either this or classifyInstance().
      Specified by:
      distributionForInstance in interface Classifier
      Overrides:
      distributionForInstance in class AbstractClassifier
      Parameters:
      inst - the instance to be classified
      Returns:
      an array containing the estimated membership probabilities of the test instance in each class or the numeric prediction
      Throws:
      Exception - if distribution could not be computed successfully
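
For a binary linear model, the returned array has two elements. The sketch below shows one plausible mapping from the raw linear output z to such an array: a sigmoid under log loss, and a hard 0/1 assignment under hinge loss when no probability model is fitted. The class `DistributionSketch`, the choice of index 1 as the positive class, and the hard assignment are assumptions for illustration, not documented behaviour.

```java
/** Illustrative sketch only; the mapping shown is an assumption. */
final class DistributionSketch {
    /**
     * Maps a raw linear output z to a two-element class distribution.
     * Index 1 is treated as the positive class.
     */
    static double[] distribution(double z, boolean logLoss) {
        double[] dist = new double[2];
        if (logLoss) {
            dist[1] = 1.0 / (1.0 + Math.exp(-z)); // logistic probability
            dist[0] = 1.0 - dist[1];
        } else if (z <= 0.0) {
            dist[0] = 1.0; // hinge loss: hard decision for the negative class
        } else {
            dist[1] = 1.0; // hinge loss: hard decision for the positive class
        }
        return dist;
    }
}
```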
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getDictionary

      public LinkedHashMap<String,SGDText.Count> getDictionary()
      Get this model's dictionary (including term weights).
      Returns:
      this model's dictionary.
    • getDictionarySize

      public int getDictionarySize()
      Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet).
      Returns:
      the size of the dictionary.
    • bias

      public double bias()
    • setBias

      public void setBias(double bias)
    • getRevision

      public String getRevision()
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Overrides:
      getRevision in class AbstractClassifier
      Returns:
      the revision
    • aggregate

      public SGDText aggregate(SGDText toAggregate) throws Exception
      Aggregate an object with this one
      Specified by:
      aggregate in interface Aggregateable<SGDText>
      Parameters:
      toAggregate - the object to aggregate
      Returns:
      the result of aggregation
      Throws:
      Exception - if the supplied object can't be aggregated for some reason
    • finalizeAggregation

      public void finalizeAggregation() throws Exception
      Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
      Specified by:
      finalizeAggregation in interface Aggregateable<SGDText>
      Throws:
      Exception - if the aggregation can't be finalized for some reason
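
The two-step protocol (accumulate with aggregate, then finish with finalizeAggregation) fits naturally with model averaging. The sketch below sums term weights and biases during aggregation and divides by the number of models at the end. Averaging is an assumption made for illustration; the documentation above only says that finalization may depend on how many objects were aggregated. `AveragingAggregationSketch` and its fields are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; averaging is an assumed aggregation strategy. */
final class AveragingAggregationSketch {
    final Map<String, Double> weights = new HashMap<>();
    double bias = 0.0;
    private int aggregated = 0; // number of other models folded into this one

    /** Adds the other model's weights and bias to this model's running sums. */
    AveragingAggregationSketch aggregate(AveragingAggregationSketch other) {
        other.weights.forEach((term, w) -> weights.merge(term, w, Double::sum));
        bias += other.bias;
        aggregated++;
        return this;
    }

    /** Turns the running sums into averages over all participating models. */
    void finalizeAggregation() {
        if (aggregated == 0) {
            return; // nothing was aggregated
        }
        int models = aggregated + 1; // the others plus this model
        weights.replaceAll((term, w) -> w / models);
        bias /= models;
        aggregated = 0;
    }
}
```

A typical use would train one model per data partition, call `aggregate` on the first model once for each of the others, and then call `finalizeAggregation` exactly once.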
    • batchFinished

      public void batchFinished() throws Exception
      Description copied from interface: UpdateableBatchProcessor
      Signal that the training data is finished (for now).
      Specified by:
      batchFinished in interface UpdateableBatchProcessor
      Throws:
      Exception - if a problem occurs
    • main

      public static void main(String[] args)
      Main method for testing this class.