Class StringToWordVector
java.lang.Object
weka.filters.Filter
weka.filters.unsupervised.attribute.StringToWordVector
- All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, WeightedInstancesHandler, UnsupervisedFilter
public class StringToWordVector
extends Filter
implements UnsupervisedFilter, OptionHandler, WeightedInstancesHandler
Converts string attributes into a set of numeric attributes representing word occurrence
information from the text contained in the strings. The dictionary is determined from the first batch of data
filtered (typically training data). Note that this filter is not strictly unsupervised when a class attribute is set
because it creates a separate dictionary for each class and then merges them.
Valid options are:
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes only after the full dictionary has been created, and you may not have enough memory for that approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in the jth document (instance).
-I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in the jth document (instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-dictionary <path to save to> The file to save the dictionary to. (default is not to save the dictionary)
-binary-dict Save the dictionary file as a binary serialized object instead of in plain text form. Use in conjunction with -dictionary
- Version:
- $Revision: 14508 $
- Author:
- Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com), Gordon Paynter (gordon.paynter@ucr.edu), Ashraf M. Kibriya (amk14@cs.waikato.ac.nz)
-
Field Summary
Modifier and Type, Field, Description:
static final int FILTER_NONE - normalization: No normalization.
static final int FILTER_NORMALIZE_ALL - normalization: Normalize all data.
static final int FILTER_NORMALIZE_TEST_ONLY - normalization: Normalize test data only.
static final Tag[] TAGS_FILTER - Specifies whether document's (instance's) word frequencies are to be normalized.
-
Constructor Summary
Constructor, Description:
StringToWordVector() - Default constructor.
StringToWordVector(int wordsToKeep) - Constructor that allows specification of the target number of words in the output.
-
Method Summary
Modifier and Type, Method, Description:
String attributeIndicesTipText() - Returns the tip text for this property.
String attributeNamePrefixTipText() - Returns the tip text for this property.
boolean batchFinished() - Signify that this batch of input to the filter is finished.
String dictionaryFileToSaveToTipText() - Tip text for this property.
String doNotOperateOnPerClassBasisTipText() - Returns the tip text for this property.
String getAttributeIndices() - Gets the current range selection.
String getAttributeNamePrefix() - Get the attribute name prefix.
Capabilities getCapabilities() - Returns the Capabilities of this filter.
File getDictionaryFileToSaveTo() - Gets the dictionary file to save the dictionary to.
boolean getDoNotOperateOnPerClassBasis() - Get the DoNotOperateOnPerClassBasis value.
boolean getIDFTransform() - Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
boolean getInvertSelection() - Gets whether the supplied columns are to be processed or skipped.
boolean getLowerCaseTokens() - Gets whether the tokens are to be downcased or not.
int getMinTermFreq() - Get the MinTermFreq value.
SelectedTag getNormalizeDocLength() - Gets whether the word frequencies for a document (instance) should be normalized or not.
String[] getOptions() - Gets the current settings of the filter.
boolean getOutputWordCounts() - Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
double getPeriodicPruning() - Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
String getRevision() - Returns the revision string.
boolean getSaveDictionaryInBinaryForm() - Gets whether the dictionary is saved in binary serialized form rather than as plain text.
Range getSelectedRange() - Get the value of m_SelectedRange.
Stemmer getStemmer() - Returns the current stemming algorithm, null if none is used.
StopwordsHandler getStopwordsHandler() - Gets the stopwords handler.
boolean getTFTransform() - Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
Tokenizer getTokenizer() - Returns the current tokenizer algorithm.
int getWordsToKeep() - Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
String globalInfo() - Returns a string describing this filter.
String IDFTransformTipText() - Returns the tip text for this property.
boolean input(Instance instance) - Input an instance for filtering.
String invertSelectionTipText() - Returns the tip text for this property.
Enumeration<Option> listOptions() - Returns an enumeration describing the available options.
String lowerCaseTokensTipText() - Returns the tip text for this property.
static void main(String[] argv) - Main method for testing this class.
String minTermFreqTipText() - Returns the tip text for this property.
String normalizeDocLengthTipText() - Returns the tip text for this property.
String outputWordCountsTipText() - Returns the tip text for this property.
String periodicPruningTipText() - Returns the tip text for this property.
String saveDictionaryInBinaryFormTipText()
void setAttributeIndices(String rangeList) - Sets which attributes are to be worked on.
void setAttributeIndicesArray(int[] attributes) - Sets which attributes are to be processed.
void setAttributeNamePrefix(String newPrefix) - Set the attribute name prefix.
void setDictionaryFileToSaveTo(File toSaveTo) - Set the dictionary file to save the dictionary to.
void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) - Set the DoNotOperateOnPerClassBasis value.
void setIDFTransform(boolean IDFTransform) - Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
boolean setInputFormat(Instances instanceInfo) - Sets the format of the input instances.
void setInvertSelection(boolean invert) - Sets whether selected columns should be processed or skipped.
void setLowerCaseTokens(boolean downCaseTokens) - Sets whether the tokens are to be downcased or not.
void setMinTermFreq(int newMinTermFreq) - Set the MinTermFreq value.
void setNormalizeDocLength(SelectedTag newType) - Sets whether the word frequencies for a document (instance) should be normalized or not.
void setOptions(String[] options) - Parses a given list of options.
void setOutputWordCounts(boolean outputWordCounts) - Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
void setPeriodicPruning(double newPeriodicPruning) - Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
void setSaveDictionaryInBinaryForm(boolean saveAsBinary) - Set whether to save the dictionary in binary serialized form rather than as plain text.
void setSelectedRange(String newSelectedRange) - Set the value of m_SelectedRange.
void setStemmer(Stemmer value) - Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void setStopwordsHandler(StopwordsHandler value) - Sets the stopwords handler to use.
void setTFTransform(boolean TFTransform) - Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
void setTokenizer(Tokenizer value) - Sets the tokenizer algorithm to use.
void setWordsToKeep(int newWordsToKeep) - Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
String stemmerTipText() - Returns the tip text for this property.
String stopwordsHandlerTipText() - Returns the tip text for this property.
String TFTransformTipText() - Returns the tip text for this property.
String tokenizerTipText() - Returns the tip text for this property.
String wordsToKeepTipText() - Returns the tip text for this property.
Methods inherited from class weka.filters.Filter
batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOutputFormat, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, toString, useFilter, wekaStaticWrapper
-
Field Details
-
FILTER_NONE
public static final int FILTER_NONE
normalization: No normalization.
-
FILTER_NORMALIZE_ALL
public static final int FILTER_NORMALIZE_ALL
normalization: Normalize all data.
-
FILTER_NORMALIZE_TEST_ONLY
public static final int FILTER_NORMALIZE_TEST_ONLY
normalization: Normalize test data only.
-
TAGS_FILTER
Specifies whether document's (instance's) word frequencies are to be normalized. They are normalized to the average length of the documents specified as input format.
-
-
Constructor Details
-
StringToWordVector
public StringToWordVector()
Default constructor. Targets 1000 words in the output.
-
StringToWordVector
public StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the output.
- Parameters:
wordsToKeep - the number of words in the output vector (per class if assigned).
-
-
Method Details
-
listOptions
Returns an enumeration describing the available options.
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class Filter
- Returns:
- an enumeration of all the available options
-
setOptions
Parses a given list of options. Valid options are:
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes only after the full dictionary has been created, and you may not have enough memory for that approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in the jth document (instance).
-I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in the jth document (instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-dictionary <path to save to> The file to save the dictionary to. (default is not to save the dictionary)
-binary-dict Save the dictionary file as a binary serialized object instead of in plain text form. Use in conjunction with -dictionary
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class Filter
- Parameters:
options - the list of options as an array of strings
- Throws:
Exception - if an option is not supported
-
getOptions
Gets the current settings of the filter.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class Filter
- Returns:
- an array of strings suitable for passing to setOptions
-
getCapabilities
Returns the Capabilities of this filter.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Overrides:
getCapabilities in class Filter
- Returns:
- the capabilities of this object
-
setInputFormat
Sets the format of the input instances.
- Overrides:
setInputFormat in class Filter
- Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored; only the structure is required).
- Returns:
- true if the outputFormat may be collected immediately
- Throws:
Exception - if the input format can't be set successfully
-
input
Input an instance for filtering. The filter requires all training instances to be read before producing output.
- Overrides:
input in class Filter
- Parameters:
instance - the input instance.
- Returns:
- true if the filtered instance may now be collected with output().
- Throws:
IllegalStateException - if no input structure has been defined.
NullPointerException - if the input format has not been defined.
Exception - if the input instance was not of the correct format or if there was a problem with the filtering.
-
batchFinished
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.
- Overrides:
batchFinished in class Filter
- Returns:
- true if there are instances pending output.
- Throws:
IllegalStateException - if no input structure has been defined.
NullPointerException - if no input structure has been defined.
Exception - if there was a problem finishing the batch.
-
dictionaryFileToSaveToTipText
Tip text for this property.
- Returns:
- the tip text for this property
-
setDictionaryFileToSaveTo
Set the dictionary file to save the dictionary to. A file with an empty path or a path "-- set me --" means do not save the dictionary.
- Parameters:
toSaveTo - the path to save the dictionary to
-
getDictionaryFileToSaveTo
Gets the dictionary file to save the dictionary to. A file with an empty path or a path "-- set me --" means the dictionary is not saved.
- Returns:
- the path to save the dictionary to
-
saveDictionaryInBinaryFormTipText
-
setSaveDictionaryInBinaryForm
public void setSaveDictionaryInBinaryForm(boolean saveAsBinary)
Set whether to save the dictionary in binary serialized form rather than as plain text.
- Parameters:
saveAsBinary - true to save the dictionary in binary form
-
getSaveDictionaryInBinaryForm
public boolean getSaveDictionaryInBinaryForm()
Gets whether the dictionary is saved in binary serialized form rather than as plain text.
- Returns:
- true if the dictionary is saved in binary form
-
globalInfo
Returns a string describing this filter.
- Returns:
- a description of the filter suitable for displaying in the explorer/experimenter gui
-
getOutputWordCounts
public boolean getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
- Returns:
- true if word counts should be output.
-
setOutputWordCounts
public void setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
- Parameters:
outputWordCounts - true if word counts should be output.
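The difference between word presence and word counts can be illustrated with a minimal sketch. It does not use Weka: the class WordVectorSketch, its whitespace tokenizer and the fixed dictionary are hypothetical stand-ins for the filter's configurable tokenizer and the dictionary it builds from the first batch.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a whitespace tokenizer and a fixed dictionary stand in
// for the filter's configurable tokenizer and learned dictionary.
final class WordVectorSketch {

    // Returns one value per dictionary word: a count when outputWordCounts
    // is true, otherwise 1.0 for present and 0.0 for absent.
    static double[] vectorize(String text, List<String> dictionary, boolean outputWordCounts) {
        double[] vector = new double[dictionary.size()];
        for (String token : text.trim().split("\\s+")) {
            int index = dictionary.indexOf(token);
            if (index < 0) {
                continue; // token is not in the dictionary
            }
            vector[index] = outputWordCounts ? vector[index] + 1.0 : 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("the", "cat", "dog");
        String text = "the cat saw the dog";
        System.out.println(Arrays.toString(vectorize(text, dictionary, false)));
        System.out.println(Arrays.toString(vectorize(text, dictionary, true)));
    }
}
```

With the sample text, the presence vector is [1.0, 1.0, 1.0] and the count vector is [2.0, 1.0, 1.0], because "the" occurs twice.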
-
outputWordCountsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getSelectedRange
Get the value of m_SelectedRange.
- Returns:
- Value of m_SelectedRange.
-
setSelectedRange
Set the value of m_SelectedRange.
- Parameters:
newSelectedRange - Value to assign to m_SelectedRange.
-
attributeIndicesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getAttributeIndices
Gets the current range selection.
- Returns:
- a string containing a comma separated list of ranges
-
setAttributeIndices
Sets which attributes are to be worked on.
- Parameters:
rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1, e.g.: first-3,5,6-last
- Throws:
IllegalArgumentException - if an invalid range list is supplied
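To make the 1-based range syntax concrete, the following sketch expands a range list such as first-3,5,6-last into 0-based indices. RangeListSketch is a hypothetical helper written for this illustration; it is not Weka's own range handling, and rejecting reversed ranges is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Weka's own range parsing: expands a 1-based
// range list such as "first-3,5,6-last" into 0-based attribute indices.
final class RangeListSketch {

    static int[] expand(String rangeList, int numAttributes) {
        List<Integer> indices = new ArrayList<>();
        for (String part : rangeList.split(",")) {
            String item = part.trim();
            int dash = item.indexOf('-');
            int from;
            int to;
            if (dash < 0) {
                from = position(item, numAttributes);
                to = from;
            } else {
                from = position(item.substring(0, dash), numAttributes);
                to = position(item.substring(dash + 1), numAttributes);
            }
            if (from > to) {
                // Assumption of this sketch: reversed ranges are rejected.
                throw new IllegalArgumentException("Reversed range: " + item);
            }
            for (int i = from; i <= to; i++) {
                indices.add(i - 1); // convert 1-based position to 0-based index
            }
        }
        return indices.stream().mapToInt(Integer::intValue).toArray();
    }

    // Resolves "first", "last" or a number to a 1-based position.
    private static int position(String token, int numAttributes) {
        String t = token.trim();
        int pos;
        if (t.equals("first")) {
            pos = 1;
        } else if (t.equals("last")) {
            pos = numAttributes;
        } else {
            try {
                pos = Integer.parseInt(t);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid range item: " + token, e);
            }
        }
        if (pos < 1 || pos > numAttributes) {
            throw new IllegalArgumentException("Position out of bounds: " + token);
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(expand("first-3,5,6-last", 8)));
    }
}
```

For a dataset with 8 attributes, the example list covers positions 1 to 3, 5, and 6 to 8, which correspond to the 0-based indices 0, 1, 2, 4, 5, 6 and 7.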
-
setAttributeIndicesArray
public void setAttributeIndicesArray(int[] attributes)
Sets which attributes are to be processed.
- Parameters:
attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
- Throws:
IllegalArgumentException - if an invalid set of ranges is supplied
-
invertSelectionTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getInvertSelection
public boolean getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
- Returns:
- true if the supplied columns will be kept
-
setInvertSelection
public void setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
- Parameters:
invert - the new invert setting
-
getAttributeNamePrefix
Get the attribute name prefix.
- Returns:
- The current attribute name prefix.
-
setAttributeNamePrefix
Set the attribute name prefix.
- Parameters:
newPrefix - String to use as the attribute name prefix.
-
attributeNamePrefixTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getWordsToKeep
public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Returns:
- the target number of words in the output vector (per class if assigned).
-
setWordsToKeep
public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Parameters:
newWordsToKeep - the target number of words in the output vector (per class if assigned).
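One way to see why the number of words kept is only a target is to prune by a frequency threshold. The sketch below is an assumption about how such pruning could work, not a statement of the filter's internals: WordPruningSketch is a hypothetical class that keeps every word whose count reaches the count of the wordsToKeep-th most frequent word, subject to a minimum term frequency, so ties can push the result above the target.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of threshold-based dictionary pruning; the exact
// rule used by the filter is not stated in this document.
final class WordPruningSketch {

    static Set<String> keep(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
        if (wordsToKeep < 1) {
            throw new IllegalArgumentException("wordsToKeep must be at least 1");
        }
        int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
        int threshold = minTermFreq;
        if (sorted.length >= wordsToKeep) {
            // Count of the wordsToKeep-th most frequent word.
            threshold = Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey()); // ties at the threshold are all kept
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 5);
        counts.put("b", 3);
        counts.put("c", 3);
        counts.put("d", 1);
        System.out.println(keep(counts, 2, 1));
    }
}
```

In the example, asking for 2 words keeps three (a, b and c), because b and c share the threshold count of 3.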
-
wordsToKeepTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getPeriodicPruning
public double getPeriodicPruning()
Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
- Returns:
- the rate at which the dictionary is periodically pruned
-
setPeriodicPruning
public void setPeriodicPruning(double newPeriodicPruning)
Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
- Parameters:
newPeriodicPruning - the rate at which the dictionary is periodically pruned
-
periodicPruningTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getTFTransform
public boolean getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
- Returns:
- true if word frequencies are to be transformed.
-
setTFTransform
public void setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
- Parameters:
TFTransform - true if word frequencies are to be transformed.
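The formula log(1+fij) can be sketched in a few lines. TfTransformSketch is a hypothetical class for illustration, and the use of the natural logarithm is an assumption, since the text only says "log".

```java
import java.util.Arrays;

// Hypothetical illustration of the log(1 + fij) transform.
// Assumption: "log" is the natural logarithm.
final class TfTransformSketch {

    // Applies log(1 + f) to every raw word frequency of one document.
    static double[] tfTransform(double[] frequencies) {
        double[] transformed = new double[frequencies.length];
        for (int i = 0; i < frequencies.length; i++) {
            transformed[i] = Math.log(1.0 + frequencies[i]);
        }
        return transformed;
    }

    public static void main(String[] args) {
        double[] frequencies = {0.0, 1.0, 9.0, 99.0};
        System.out.println(Arrays.toString(tfTransform(frequencies)));
    }
}
```

A frequency of 0 stays 0, and large frequencies are compressed: a word seen 99 times is weighted only about twice as much as one seen 9 times.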
-
TFTransformTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getIDFTransform
public boolean getIDFTransform()
Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
- Returns:
- true if the word frequencies are to be transformed.
-
setIDFTransform
public void setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
- Parameters:
IDFTransform - true if the word frequencies are to be transformed
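As a worked illustration of fij*log(num of Docs/num of Docs with word i), the sketch below computes the weight for a single word. IdfTransformSketch and its parameter names are hypothetical, and the natural logarithm is again an assumption.

```java
// Hypothetical illustration of fij * log(numDocs / docsWithWord).
// Assumption: "log" is the natural logarithm.
final class IdfTransformSketch {

    static double idfTransform(double fij, int numDocs, int docsWithWord) {
        if (numDocs <= 0 || docsWithWord <= 0 || docsWithWord > numDocs) {
            throw new IllegalArgumentException("Expected 0 < docsWithWord <= numDocs");
        }
        // Cast to double so the ratio is not truncated by integer division.
        return fij * Math.log((double) numDocs / docsWithWord);
    }

    public static void main(String[] args) {
        // A word that occurs in every document gets weight zero.
        System.out.println(idfTransform(3.0, 10, 10));
        // The same frequency for a rare word gets a much larger weight.
        System.out.println(idfTransform(3.0, 10, 1));
    }
}
```

The design point is that the weight depends on how widely a word is spread across documents, not only on how often it occurs in one of them.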
-
IDFTransformTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNormalizeDocLength
Gets whether the word frequencies for a document (instance) should be normalized or not.
- Returns:
- the selected normalization setting (see TAGS_FILTER).
-
setNormalizeDocLength
Sets whether the word frequencies for a document (instance) should be normalized or not.
- Parameters:
newType - the new type.
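Normalizing to the average length of the training documents can be sketched as a rescaling of each word-frequency vector. DocLengthNormalizationSketch is hypothetical, and treating "length" as the Euclidean norm of the vector is an assumption of this sketch; the text does not define it.

```java
import java.util.Arrays;

// Hypothetical sketch of document length normalization.
// Assumption: a document's length is the Euclidean norm of its vector.
final class DocLengthNormalizationSketch {

    static double length(double[] doc) {
        double sumOfSquares = 0.0;
        for (double value : doc) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Average length over the training documents.
    static double averageLength(double[][] trainingDocs) {
        double total = 0.0;
        for (double[] doc : trainingDocs) {
            total += length(doc);
        }
        return total / trainingDocs.length;
    }

    // Rescales a document so that its length equals targetLength.
    static double[] normalize(double[] doc, double targetLength) {
        double current = length(doc);
        if (current == 0.0) {
            return doc.clone(); // an all-zero vector cannot be rescaled
        }
        double[] scaled = new double[doc.length];
        for (int i = 0; i < doc.length; i++) {
            scaled[i] = doc[i] * targetLength / current;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] training = {{3.0, 4.0}, {0.0, 10.0}};
        double target = averageLength(training);
        System.out.println(Arrays.toString(normalize(new double[] {6.0, 8.0}, target)));
    }
}
```

Under this reading, the "normalize test data only" setting would apply normalize to later batches using the average computed from the first batch, while "normalize all data" would also apply it to the first batch.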
-
normalizeDocLengthTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getLowerCaseTokens
public boolean getLowerCaseTokens()
Gets whether the tokens are to be downcased or not.
- Returns:
- true if the tokens are to be downcased.
-
setLowerCaseTokens
public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens.)
- Parameters:
downCaseTokens - should be true if only lower case tokens are to be formed.
-
doNotOperateOnPerClassBasisTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getDoNotOperateOnPerClassBasis
public boolean getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
- Returns:
- the DoNotOperateOnPerClassBasis value.
-
setDoNotOperateOnPerClassBasis
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
- Parameters:
newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
-
minTermFreqTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getMinTermFreq
public int getMinTermFreq()
Get the MinTermFreq value.
- Returns:
- the MinTermFreq value.
-
setMinTermFreq
public void setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
- Parameters:
newMinTermFreq - The new MinTermFreq value.
-
lowerCaseTokensTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setStemmer
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value - the configured stemming algorithm, or null
-
getStemmer
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
stemmerTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setStopwordsHandler
Sets the stopwords handler to use.
- Parameters:
value - the stopwords handler, if null, Null is used
-
getStopwordsHandler
Gets the stopwords handler.
- Returns:
- the stopwords handler
-
stopwordsHandlerTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setTokenizer
Sets the tokenizer algorithm to use.
- Parameters:
value - the configured tokenizing algorithm
-
getTokenizer
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
-
tokenizerTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getRevision
Returns the revision string.
- Specified by:
getRevision in interface RevisionHandler
- Overrides:
getRevision in class Filter
- Returns:
- the revision
-
main
Main method for testing this class.
- Parameters:
argv - should contain arguments to the filter: use -h for help
-