Package weka.classifiers.functions
Class SGDText
java.lang.Object
weka.classifiers.AbstractClassifier
weka.classifiers.RandomizableClassifier
weka.classifiers.functions.SGDText
- All Implemented Interfaces:
Serializable, Cloneable, Classifier, UpdateableBatchProcessor, UpdateableClassifier, Aggregateable<SGDText>, BatchPredictor, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, WeightedInstancesHandler
public class SGDText
extends RandomizableClassifier
implements UpdateableClassifier, UpdateableBatchProcessor, WeightedInstancesHandler, Aggregateable<SGDText>
Implements stochastic gradient descent for learning a linear binary class SVM or binary class logistic regression on text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.
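The description above names two loss functions. As a rough illustration of what a single stochastic gradient step on a sparse bag-of-words vector can look like, the sketch below applies L2 shrinkage and then a hinge or log loss correction. It is a textbook-style update written for this page, not code taken from SGDText; the class SgdStepSketch, its fields, and the -1/+1 label encoding are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a textbook SGD step for a linear model on sparse text features. */
class SgdStepSketch {
    static final int HINGE = 0;   // same numbering as the -F option
    static final int LOGLOSS = 1;

    final Map<String, Double> weights = new HashMap<>();
    double bias = 0.0;

    /** Dot product between the weight vector and a sparse document vector. */
    double dot(Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    /** One update; y must be -1 or +1 (assumed encoding of the two classes). */
    void step(Map<String, Double> doc, int y, int loss, double learningRate, double lambda) {
        double z = y * (dot(doc) + bias);
        // L2 regularization: shrink every weight slightly toward zero.
        weights.replaceAll((word, w) -> w * (1.0 - learningRate * lambda));
        // Magnitude of the loss gradient with respect to the margin z.
        double dloss = (loss == HINGE) ? (z < 1.0 ? 1.0 : 0.0) : 1.0 / (1.0 + Math.exp(z));
        if (dloss == 0.0) {
            return; // hinge margin already satisfied: no correction needed
        }
        double factor = learningRate * y * dloss;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            weights.merge(e.getKey(), factor * e.getValue(), Double::sum);
        }
        bias += factor;
    }
}
```

With hinge loss the correction is applied only while the margin is below 1; with log loss every instance contributes, weighted by how badly it is currently predicted.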
Valid options are:
-F Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression) (default = 0)
-outputProbs Output probabilities for SVMs (fits a logistic model to the output of the SVM)
-L The learning rate (default = 0.01).
-R <double> The lambda regularization constant (default = 0.0001)
-E <integer> The number of epochs to perform (batch learning only, default = 500)
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with a frequency lower than this are ignored. If periodic pruning is turned on then this is also used to determine which words to remove from the dictionary (default = 3).
-min-coeff <double> Minimum absolute value of coefficients in the model. If periodic pruning is turned on then this is also used to prune words from the dictionary (default = 0.001).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-S <num> Random number seed. (default 1)
-output-debug-info If set, classifier is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, classifier capabilities are not checked before classifier is built (use with caution).
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com), Eibe Frank (eibe{[at]}cs{[dot]}waikato{[dot]}ac{[dot]}nz)
Nested Class Summary
-
Field Summary
Modifier and Type | Field | Description
static final int | HINGE | the hinge loss function.
static final int | LOGLOSS | the log loss function.
static final Tag[] | TAGS_SELECTION | Loss functions to choose from
Fields inherited from class weka.classifiers.AbstractClassifier
BATCH_SIZE_DEFAULT, NUM_DECIMAL_PLACES_DEFAULT
-
Constructor Summary
SGDText()
Method Summary
Modifier and Type | Method | Description
SGDText | aggregate(SGDText toAggregate) | Aggregate an object with this one
void | batchFinished() | Signal that the training data is finished (for now).
double | bias() |
void | buildClassifier(Instances data) | Method for building the classifier.
double[] | distributionForInstance(Instance inst) | Predicts the class memberships for a given instance.
String | epochsTipText() | Returns the tip text for this property
void | finalizeAggregation() | Call to complete the aggregation process.
Capabilities | getCapabilities() | Returns default capabilities of the classifier.
LinkedHashMap<String,SGDText.Count> | getDictionary() | Get this model's dictionary (including term weights).
int | getDictionarySize() | Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet).
int | getEpochs() | Get current number of epochs
double | getLambda() | Get the current value of lambda
double | getLearningRate() | Get the learning rate.
double | getLNorm() | Get the L Norm used.
SelectedTag | getLossFunction() | Get the current loss function.
boolean | getLowercaseTokens() | Get whether to convert all tokens to lowercase
double | getMinAbsoluteCoefficientValue() | Get the minimum absolute magnitude for model coefficients.
double | getMinWordFrequency() | Get the minimum word frequency.
double | getNorm() | Get the instance's Norm.
boolean | getNormalizeDocLength() | Get whether to normalize the length of each document
String[] | getOptions() | Gets the current settings of the classifier.
boolean | getOutputProbsForSVM() | Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
int | getPeriodicPruning() | Get how often to prune the dictionary
String | getRevision() | Returns the revision string.
Stemmer | getStemmer() | Returns the current stemming algorithm, null if none is used.
StopwordsHandler | getStopwordsHandler() | Gets the stopwords handler.
Tokenizer | getTokenizer() | Returns the current tokenizer algorithm.
boolean | getUseWordFrequencies() | Get whether to use word frequencies rather than binary bag of words representation.
String | globalInfo() | Returns a string describing classifier
String | lambdaTipText() | Returns the tip text for this property
String | learningRateTipText() | Returns the tip text for this property
Enumeration<Option> | listOptions() | Returns an enumeration describing the available options.
String | LNormTipText() | Returns the tip text for this property
String | lossFunctionTipText() | Returns the tip text for this property
String | lowercaseTokensTipText() | Returns the tip text for this property
static void | main(String[] args) | Main method for testing this class.
String | minAbsoluteCoefficientValueTipText() | Returns the tip text for this property
String | minWordFrequencyTipText() | Returns the tip text for this property
String | normalizeDocLengthTipText() | Returns the tip text for this property
String | normTipText() | Returns the tip text for this property
String | outputProbsForSVMTipText() | Returns the tip text for this property
String | periodicPruningTipText() | Returns the tip text for this property
void | reset() | Reset the classifier.
void | setBias(double bias) |
void | setEpochs(int e) | Set the number of epochs to use
void | setLambda(double lambda) | Set the value of lambda to use
void | setLearningRate(double lr) | Set the learning rate.
void | setLNorm(double newLNorm) | Set the L-norm to use
void | setLossFunction(SelectedTag function) | Set the loss function to use.
void | setLowercaseTokens(boolean l) | Set whether to convert all tokens to lowercase
void | setMinAbsoluteCoefficientValue(double minCoeff) | Set the minimum absolute magnitude for model coefficients.
void | setMinWordFrequency(double minFreq) | Set the minimum word frequency.
void | setNorm(double newNorm) | Set the norm of the instances
void | setNormalizeDocLength(boolean norm) | Set whether to normalize the length of each document
void | setOptions(String[] options) | Parses a given list of options.
void | setOutputProbsForSVM(boolean o) | Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
void | setPeriodicPruning(int p) | Set how often to prune the dictionary
void | setStemmer(Stemmer value) | Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void | setStopwordsHandler(StopwordsHandler value) | Sets the stopwords handler to use.
void | setTokenizer(Tokenizer value) | Sets the tokenizer algorithm to use.
void | setUseWordFrequencies(boolean u) | Set whether to use word frequencies rather than binary bag of words representation.
String | stemmerTipText() | Returns the tip text for this property.
String | stopwordsHandlerTipText() | Returns the tip text for this property.
String | tokenizerTipText() | Returns the tip text for this property.
String | toString() |
void | updateClassifier(Instance instance) | Updates the classifier with the given instance.
String | useWordFrequenciesTipText() | Returns the tip text for this property
Methods inherited from class weka.classifiers.RandomizableClassifier
getSeed, seedTipText, setSeed
Methods inherited from class weka.classifiers.AbstractClassifier
batchSizeTipText, classifyInstance, debugTipText, distributionsForInstances, doNotCheckCapabilitiesTipText, forName, getBatchSize, getDebug, getDoNotCheckCapabilities, getNumDecimalPlaces, implementsMoreEfficientBatchPrediction, makeCopies, makeCopy, numDecimalPlacesTipText, postExecution, preExecution, run, runClassifier, setBatchSize, setDebug, setDoNotCheckCapabilities, setNumDecimalPlaces
-
Field Details
-
HINGE
public static final int HINGE
The hinge loss function.
-
LOGLOSS
public static final int LOGLOSS
The log loss function.
-
TAGS_SELECTION
public static final Tag[] TAGS_SELECTION
Loss functions to choose from.
-
-
Constructor Details
-
SGDText
public SGDText()
-
-
Method Details
-
getCapabilities
public Capabilities getCapabilities()
Returns default capabilities of the classifier.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Specified by:
getCapabilities in interface Classifier
- Overrides:
getCapabilities in class AbstractClassifier
- Returns:
- the capabilities of this classifier
-
setStemmer
public void setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value - the configured stemming algorithm, or null
-
getStemmer
public Stemmer getStemmer()
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
stemmerTipText
public String stemmerTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setTokenizer
public void setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
- Parameters:
value - the configured tokenizing algorithm
-
getTokenizer
public Tokenizer getTokenizer()
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
-
tokenizerTipText
public String tokenizerTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
useWordFrequenciesTipText
public String useWordFrequenciesTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setUseWordFrequencies
public void setUseWordFrequencies(boolean u)
Set whether to use word frequencies rather than binary bag of words representation.
- Parameters:
u - true if word frequencies are to be used.
-
getUseWordFrequencies
public boolean getUseWordFrequencies()
Get whether to use word frequencies rather than binary bag of words representation.
- Returns:
- true if word frequencies are to be used.
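The difference between the two representations can be shown on an already tokenized document. The following is a minimal sketch, assuming a sparse map from token to value; BagOfWordsSketch and vectorize are hypothetical names and are not part of SGDText.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: binary bag of words versus word frequencies. */
class BagOfWordsSketch {
    /** Builds a sparse document vector from already tokenized text. */
    static Map<String, Double> vectorize(String[] tokens, boolean useWordFrequencies) {
        Map<String, Double> vector = new LinkedHashMap<>();
        for (String token : tokens) {
            if (useWordFrequencies) {
                vector.merge(token, 1.0, Double::sum); // count every occurrence
            } else {
                vector.put(token, 1.0);                // record presence only
            }
        }
        return vector;
    }
}
```

For the tokens spam, spam, ham the frequency variant stores 2.0 for spam, while the binary variant stores 1.0 for both words.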
-
lowercaseTokensTipText
public String lowercaseTokensTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setLowercaseTokens
public void setLowercaseTokens(boolean l)
Set whether to convert all tokens to lowercase.
- Parameters:
l - true if all tokens are to be converted to lowercase
-
getLowercaseTokens
public boolean getLowercaseTokens()
Get whether to convert all tokens to lowercase.
- Returns:
- true if all tokens are to be converted to lowercase
-
setStopwordsHandler
public void setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
- Parameters:
value - the stopwords handler, if null, Null is used
-
getStopwordsHandler
public StopwordsHandler getStopwordsHandler()
Gets the stopwords handler.
- Returns:
- the stopwords handler
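The tokenizer, lowercase, and stopwords settings together decide which strings reach the dictionary. The sketch below chains the three steps using only the Java standard library. The regular expression is a crude stand-in for a real tokenizer, the stopword set is supplied by the caller, and the order of the steps is an assumption; TokenPipelineSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: tokenize, optionally lowercase, then drop stopwords. */
class TokenPipelineSketch {
    static List<String> tokens(String text, boolean lowercase, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        // Split on whitespace and common punctuation (stand-in for a tokenizer).
        for (String token : text.split("[\\s.,;:'\"()?!]+")) {
            if (token.isEmpty()) {
                continue; // produced when the text starts with a delimiter
            }
            if (lowercase) {
                token = token.toLowerCase(Locale.ROOT);
            }
            if (stopwords.contains(token)) {
                continue;
            }
            result.add(token);
        }
        return result;
    }
}
```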
-
stopwordsHandlerTipText
public String stopwordsHandlerTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
periodicPruningTipText
public String periodicPruningTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setPeriodicPruning
public void setPeriodicPruning(int p)
Set how often to prune the dictionary.
- Parameters:
p - how often to prune
-
getPeriodicPruning
public int getPeriodicPruning()
Get how often to prune the dictionary.
- Returns:
- how often to prune the dictionary
-
minWordFrequencyTipText
public String minWordFrequencyTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setMinWordFrequency
public void setMinWordFrequency(double minFreq)
Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
- Parameters:
minFreq - the minimum word frequency to use
-
getMinWordFrequency
public double getMinWordFrequency()
Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
- Returns:
- the minimum word frequency to use
-
minAbsoluteCoefficientValueTipText
public String minAbsoluteCoefficientValueTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setMinAbsoluteCoefficientValue
public void setMinAbsoluteCoefficientValue(double minCoeff)
Set the minimum absolute magnitude for model coefficients. Terms with weights smaller than this value are ignored. If periodic pruning is turned on then this is also used to determine if a word should be removed from the dictionary.
- Parameters:
minCoeff - the minimum absolute value of a model coefficient
-
getMinAbsoluteCoefficientValue
public double getMinAbsoluteCoefficientValue()
Get the minimum absolute magnitude for model coefficients. Terms with weights smaller than this value are ignored. If periodic pruning is turned on then this is also used to determine if a word should be removed from the dictionary.
- Returns:
- the minimum absolute value of a model coefficient
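Both thresholds described above feed into dictionary pruning. One plausible reading is that a word is dropped when either its count or the magnitude of its weight falls below the respective minimum. The sketch below implements that reading; the Stats class, its field names, and the either-condition rule are assumptions, not a description of the actual implementation.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: prune a term dictionary by frequency and coefficient size. */
class DictionaryPruneSketch {
    /** Per-word statistics; the field names are hypothetical. */
    static final class Stats {
        final double count;
        final double weight;

        Stats(double count, double weight) {
            this.count = count;
            this.weight = weight;
        }
    }

    /** Removes words that are too rare or whose coefficient is too close to zero. */
    static void prune(Map<String, Stats> dictionary, double minWordFrequency,
                      double minAbsCoefficient) {
        Iterator<Map.Entry<String, Stats>> it = dictionary.entrySet().iterator();
        while (it.hasNext()) {
            Stats s = it.next().getValue();
            if (s.count < minWordFrequency || Math.abs(s.weight) < minAbsCoefficient) {
                it.remove(); // safe removal while iterating
            }
        }
    }
}
```

Calling prune every p training instances would correspond to a periodic pruning setting of p.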
-
normalizeDocLengthTipText
public String normalizeDocLengthTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setNormalizeDocLength
public void setNormalizeDocLength(boolean norm)
Set whether to normalize the length of each document.
- Parameters:
norm - true if document length is to be normalized
-
getNormalizeDocLength
public boolean getNormalizeDocLength()
Get whether to normalize the length of each document.
- Returns:
- true if document length is to be normalized
-
normTipText
public String normTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
getNorm
public double getNorm()
Get the instance's norm.
- Returns:
- the norm
-
setNorm
public void setNorm(double newNorm)
Set the norm of the instances.
- Parameters:
newNorm - the norm to which the instances must be set
-
LNormTipText
public String LNormTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
getLNorm
public double getLNorm()
Get the L-norm used.
- Returns:
- the L-norm used
-
setLNorm
public void setLNorm(double newLNorm)
Set the L-norm to use.
- Parameters:
newLNorm - the L-norm
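The norm and L-norm settings interact as follows: the L-norm picks the exponent p used to measure a document vector, and the norm is the length the vector is rescaled to. A minimal sketch of that rescaling, assuming dense arrays for brevity (DocNormSketch is a hypothetical name):

```java
/** Illustrative only: rescale a document vector to a target L-p norm. */
class DocNormSketch {
    /** Rescales values in place so that their L-p norm equals targetNorm. */
    static void normalize(double[] values, double targetNorm, double lnorm) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.pow(Math.abs(v), lnorm);
        }
        double current = Math.pow(sum, 1.0 / lnorm);
        if (current == 0.0) {
            return; // empty document: nothing to scale
        }
        for (int i = 0; i < values.length; i++) {
            values[i] *= targetNorm / current;
        }
    }
}
```

With the defaults (norm 1.0, L-norm 2.0) a vector such as (3, 4) has length 5 and is scaled to (0.6, 0.8).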
-
lambdaTipText
public String lambdaTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setLambda
public void setLambda(double lambda)
Set the value of lambda to use.
- Parameters:
lambda - the value of lambda to use
-
getLambda
public double getLambda()
Get the current value of lambda.
- Returns:
- the current value of lambda
-
setLearningRate
public void setLearningRate(double lr)
Set the learning rate.
- Parameters:
lr - the learning rate to use.
-
getLearningRate
public double getLearningRate()
Get the learning rate.
- Returns:
- the learning rate
-
learningRateTipText
public String learningRateTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
epochsTipText
public String epochsTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setEpochs
public void setEpochs(int e)
Set the number of epochs to use.
- Parameters:
e - the number of epochs to use
-
getEpochs
public int getEpochs()
Get current number of epochs.
- Returns:
- the current number of epochs
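In batch learning, the epoch count is the number of full passes made over the training data. The sketch below shows the general shape of such a loop with a seeded shuffle. Shuffling once before the first pass is an assumption made for the example, and EpochLoopSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

/** Illustrative only: repeat a per-instance update for a fixed number of epochs. */
class EpochLoopSketch {
    static <T> void train(List<T> data, int epochs, long seed, Consumer<T> update) {
        // Copy first so the caller's list keeps its order.
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // same seed, same order
        for (int e = 0; e < epochs; e++) {
            for (T instance : shuffled) {
                update.accept(instance); // one stochastic gradient step per instance
            }
        }
    }
}
```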
-
setLossFunction
public void setLossFunction(SelectedTag function)
Set the loss function to use.
- Parameters:
function - the loss function to use.
-
getLossFunction
public SelectedTag getLossFunction()
Get the current loss function.
- Returns:
- the current loss function.
-
lossFunctionTipText
public String lossFunctionTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
setOutputProbsForSVM
public void setOutputProbsForSVM(boolean o)
Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
- Parameters:
o - true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
-
getOutputProbsForSVM
public boolean getOutputProbsForSVM()
Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
- Returns:
- true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
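Fitting a logistic regression to SVM outputs means learning a one-dimensional model that maps the raw SVM score to a probability. A minimal sketch of that idea, trained with plain SGD on the log loss, is shown here. The parameter names, the 0/1 label encoding, and the fixed learning rate are assumptions; SvmProbSketch is hypothetical.

```java
/** Illustrative only: map SVM scores to probabilities with a 1-D logistic model. */
class SvmProbSketch {
    double slope = 0.0;     // a in p = sigmoid(a * score + b)
    double intercept = 0.0; // b

    /** Estimated probability of the positive class for a raw SVM score. */
    double probability(double svmScore) {
        return 1.0 / (1.0 + Math.exp(-(slope * svmScore + intercept)));
    }

    /** Fits the two parameters by SGD on the log loss; labels are 0 or 1. */
    void fit(double[] svmScores, int[] labels, int epochs, double learningRate) {
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < svmScores.length; i++) {
                // Gradient of the log loss for a logistic model: (label - p) * input.
                double error = labels[i] - probability(svmScores[i]);
                slope += learningRate * error * svmScores[i];
                intercept += learningRate * error;
            }
        }
    }
}
```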
-
outputProbsForSVMTipText
public String outputProbsForSVMTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the Explorer/Experimenter GUI
-
listOptions
public Enumeration<Option> listOptions()
Returns an enumeration describing the available options.
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class RandomizableClassifier
- Returns:
- an enumeration of all the available options.
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:
-F Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression) (default = 0)
-outputProbs Output probabilities for SVMs (fits a logistic model to the output of the SVM)
-L The learning rate (default = 0.01).
-R <double> The lambda regularization constant (default = 0.0001)
-E <integer> The number of epochs to perform (batch learning only, default = 500)
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with a frequency lower than this are ignored. If periodic pruning is turned on then this is also used to determine which words to remove from the dictionary (default = 3).
-min-coeff <double> Minimum absolute value of coefficients in the model. If periodic pruning is turned on then this is also used to prune words from the dictionary (default = 0.001).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-S <num> Random number seed. (default 1)
-output-debug-info If set, classifier is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, classifier capabilities are not checked before classifier is built (use with caution).
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class RandomizableClassifier
- Parameters:
options - the list of options as an array of strings
- Throws:
Exception - if an option is not supported
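The options array passed to setOptions holds one token per element: a flag, followed by its value where the flag takes one. The sketch below shows how such an array could be laid out and read back. OptionArraySketch and its two helpers are hypothetical stand-ins written for this page, not the parsing code used by the class.

```java
/** Illustrative only: reading flags and values from an options array. */
class OptionArraySketch {
    /** Example array: log loss, 100 epochs, word frequencies, lowercase tokens. */
    static final String[] EXAMPLE = {"-F", "1", "-E", "100", "-W", "-lowercase"};

    /** Returns the value following flag, or defaultValue if the flag is absent. */
    static String valueOf(String flag, String[] options, String defaultValue) {
        for (int i = 0; i < options.length - 1; i++) {
            if (options[i].equals(flag)) {
                return options[i + 1];
            }
        }
        return defaultValue;
    }

    /** Returns true if a value-less switch such as -W is present. */
    static boolean hasFlag(String flag, String[] options) {
        for (String option : options) {
            if (option.equals(flag)) {
                return true;
            }
        }
        return false;
    }
}
```

Flags that are absent fall back to the documented defaults, for example 0.0001 for -R.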
-
getOptions
public String[] getOptions()
Gets the current settings of the classifier.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class RandomizableClassifier
- Returns:
- an array of strings suitable for passing to setOptions
-
globalInfo
public String globalInfo()
Returns a string describing the classifier.
- Returns:
- a description suitable for displaying in the Explorer/Experimenter GUI
-
reset
public void reset()
Reset the classifier.
-
buildClassifier
public void buildClassifier(Instances data) throws Exception
Method for building the classifier.
- Specified by:
buildClassifier in interface Classifier
- Parameters:
data - the set of training instances.
- Throws:
Exception - if the classifier can't be built successfully.
-
updateClassifier
public void updateClassifier(Instance instance) throws Exception
Updates the classifier with the given instance.
- Specified by:
updateClassifier in interface UpdateableClassifier
- Parameters:
instance - the new training instance to include in the model
- Throws:
Exception - if the instance could not be incorporated in the model.
-
distributionForInstance
public double[] distributionForInstance(Instance inst) throws Exception
Description copied from class: AbstractClassifier
Predicts the class memberships for a given instance. If an instance is unclassified, the returned array elements must be all zero. If the class is numeric, the array must consist of only one element, which contains the predicted value. Note that a classifier MUST implement either this or classifyInstance().
- Specified by:
distributionForInstance in interface Classifier
- Overrides:
distributionForInstance in class AbstractClassifier
- Parameters:
inst - the instance to be classified
- Returns:
- an array containing the estimated membership probabilities of the test instance in each class or the numeric prediction
- Throws:
Exception - if distribution could not be computed successfully
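For a binary linear model, the returned array has two elements. A minimal sketch of how a raw linear score might be turned into such an array is given below: a sigmoid for log loss, and a hard 0/1 assignment for hinge loss when no probabilities are fitted. The class index convention (index 1 is the positive class) and the name DistributionSketch are assumptions.

```java
/** Illustrative only: turn a linear score into a two-class distribution. */
class DistributionSketch {
    /** score is the raw linear output (weights times inputs, plus bias). */
    static double[] distribution(double score, boolean logLoss) {
        double[] result = new double[2];
        if (logLoss) {
            // Logistic regression: the sigmoid of the score is P(positive class).
            result[1] = 1.0 / (1.0 + Math.exp(-score));
            result[0] = 1.0 - result[1];
        } else {
            // Hinge loss without fitted probabilities: all mass on the predicted side.
            result[score > 0 ? 1 : 0] = 1.0;
        }
        return result;
    }
}
```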
-
toString
public String toString()
-
getDictionary
public LinkedHashMap<String,SGDText.Count> getDictionary()
Get this model's dictionary (including term weights).
- Returns:
- this model's dictionary.
-
getDictionarySize
public int getDictionarySize()
Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet).
- Returns:
- the size of the dictionary.
-
bias
public double bias()
-
setBias
public void setBias(double bias)
-
getRevision
public String getRevision()
Returns the revision string.
- Specified by:
getRevision in interface RevisionHandler
- Overrides:
getRevision in class AbstractClassifier
- Returns:
- the revision
-
aggregate
public SGDText aggregate(SGDText toAggregate) throws Exception
Aggregate an object with this one.
- Specified by:
aggregate in interface Aggregateable<SGDText>
- Parameters:
toAggregate - the object to aggregate
- Returns:
- the result of aggregation
- Throws:
Exception - if the supplied object can't be aggregated for some reason
-
finalizeAggregation
public void finalizeAggregation() throws Exception
Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
- Specified by:
finalizeAggregation in interface Aggregateable<SGDText>
- Throws:
Exception - if the aggregation can't be finalized for some reason
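The two-phase contract (aggregate several times, then finalize once) suits models whose parameters can be summed and later divided by the number of contributors. The sketch below averages term weights that way. Whether SGDText combines models exactly like this is an assumption; AverageModelsSketch and its members are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: sum per-term weights, then average on finalization. */
class AverageModelsSketch {
    final Map<String, Double> weightSums = new HashMap<>();
    int numModels = 0;

    /** Adds one model's weights to the running sums. */
    AverageModelsSketch aggregate(Map<String, Double> modelWeights) {
        for (Map.Entry<String, Double> e : modelWeights.entrySet()) {
            weightSums.merge(e.getKey(), e.getValue(), Double::sum);
        }
        numModels++;
        return this;
    }

    /** Divides the sums by the number of models seen and returns the averages. */
    Map<String, Double> finalizeAggregation() {
        if (numModels == 0) {
            throw new IllegalStateException("no models aggregated");
        }
        Map<String, Double> averaged = new HashMap<>();
        for (Map.Entry<String, Double> e : weightSums.entrySet()) {
            averaged.put(e.getKey(), e.getValue() / numModels);
        }
        return averaged;
    }
}
```

A term missing from one model is treated here as having weight zero in that model, which pulls its average toward zero.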
-
batchFinished
public void batchFinished() throws Exception
Description copied from interface: UpdateableBatchProcessor
Signal that the training data is finished (for now).
- Specified by:
batchFinished in interface UpdateableBatchProcessor
- Throws:
Exception - if a problem occurs
-
main
public static void main(String[] args)
Main method for testing this class.
-