weka.filters.unsupervised.attribute.StringToWordVector

All Implemented Interfaces: Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, WeightedInstancesHandler, UnsupervisedFilter

public class StringToWordVector extends Filter implements UnsupervisedFilter, OptionHandler, WeightedInstancesHandler

Converts string attributes into a set of numeric attributes representing word occurrence information from the text contained in the strings. The dictionary is determined from the first batch of data filtered (typically training data). Note that this filter is not strictly unsupervised when a class attribute is set because it creates a separate dictionary for each class and then merges them.
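
The idea of a dictionary fixed by the first batch can be illustrated with a small, self-contained Java sketch. This is not Weka's implementation: the class WordVectorSketch, its methods, and the whitespace-only tokenizer are hypothetical names and simplifications chosen for illustration. It only shows how a fixed dictionary turns text into word-presence or word-count columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; hypothetical names, not Weka's source. */
final class WordVectorSketch {
    // Dictionary: word -> column index, fixed once the first batch is seen.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    /** Builds the dictionary from the first (typically training) batch. */
    void buildDictionary(List<String> firstBatch) {
        for (String doc : firstBatch) {
            for (String token : tokenize(doc)) {
                dictionary.putIfAbsent(token, dictionary.size());
            }
        }
    }

    /** Maps a document to word counts, or to 0/1 presence values. */
    double[] vectorize(String doc, boolean outputWordCounts) {
        double[] vector = new double[dictionary.size()];
        for (String token : tokenize(doc)) {
            Integer column = dictionary.get(token);
            if (column == null) {
                continue; // words outside the dictionary get no column
            }
            vector[column] = outputWordCounts ? vector[column] + 1 : 1;
        }
        return vector;
    }

    // Assumption: tokens are separated by whitespace only.
    private static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String token : doc.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

In this sketch, a later document containing a word that the first batch never saw simply leaves that word out of the resulting vector.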

Valid options are:

 -C
  Output word counts rather than boolean word presence.

 -R <index1,index2-index4,...>
  Specify list of string attributes to convert to words (as weka Range).
  (default: select all string attributes)

 -V
  Invert matching sense of column indexes.

 -P <attribute name prefix>
  Specify a prefix for the created attribute names.
  (default: "")

 -W <number of words to keep>
  Specify approximate number of word fields to create.
  Surplus words will be discarded.
  (default: 1000)

 -prune-rate <rate as a percentage of dataset>
  Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary.
  -W prunes after creating a full dictionary. You may not have enough memory for this approach.
  (default: no periodic pruning)

 -T
  Transform the word frequencies into log(1+fij)
  where fij is the frequency of word i in jth document(instance).

 -I
  Transform each word frequency into:
  fij*log(num of Documents/num of documents containing word i)
    where fij is the frequency of word i in the jth document (instance).

 -N
  Whether to 0=not normalize/1=normalize all data/2=normalize test data only
  to average length of training documents (default 0=don't normalize).

 -L
  Convert all tokens to lowercase before adding to the dictionary.

 -stopwords-handler
  The stopwords handler to use (default Null).

 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.

 -M <int>
  The minimum term frequency (default = 1).

 -O
  If this is set, the maximum number of words and the 
  minimum term frequency is not enforced on a per-class 
  basis but based on the documents in all the classes 
  (even if a class attribute is set).

 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)

 -dictionary <path to save to>
  The file to save the dictionary to.
  (default is not to save the dictionary)

 -binary-dict
  Save the dictionary file as a binary serialized object
  instead of in plain text form. Use in conjunction with
  -dictionary
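
The -T and -I formulas above can be written out directly. The following is a minimal Java sketch of the two formulas as stated, not Weka's source. The names TfIdfSketch, tfTransform, and idfTransform are hypothetical, and the use of the natural logarithm is an assumption, since the option text only says "log".

```java
/** Illustrative only; hypothetical names, not Weka's source. */
final class TfIdfSketch {

    /** -T: log(1 + fij), where fij is the frequency of word i in document j. */
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij); // assumption: natural logarithm
    }

    /**
     * -I: fij * log(numDocs / docsWithWord), where docsWithWord is the number
     * of documents that contain word i (must be at least 1).
     */
    static double idfTransform(double fij, int numDocs, int docsWithWord) {
        if (docsWithWord <= 0) {
            throw new IllegalArgumentException("docsWithWord must be positive");
        }
        return fij * Math.log((double) numDocs / docsWithWord);
    }
}
```

Under these formulas a word that occurs in every document gets an IDF-transformed value of zero, because log(1) is zero, while rarer words are weighted up.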

Version:

$Revision: 14508 $

Author:

Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com), Gordon Paynter (gordon.paynter@ucr.edu), Ashraf M. Kibriya (amk14@cs.waikato.ac.nz)

See Also:

Serialized Form

Field Summary

Fields

Modifier and Type

Field

Description

static final int

FILTER_NONE

normalization: No normalization.

static final int

FILTER_NORMALIZE_ALL

normalization: Normalize all data.

static final int

FILTER_NORMALIZE_TEST_ONLY

normalization: Normalize test data only.

static final Tag[]

TAGS_FILTER

Specifies whether a document's (instance's) word frequencies are to be normalized.

Constructor Summary

Constructors

Constructor

Description

StringToWordVector()

Default constructor.

StringToWordVector(int wordsToKeep)

Constructor that allows specification of the target number of words in the output.

Method Summary

Modifier and Type

Method

Description

String

attributeIndicesTipText()

Returns the tip text for this property.

String

attributeNamePrefixTipText()

Returns the tip text for this property.

boolean

batchFinished()

Signify that this batch of input to the filter is finished.

String

dictionaryFileToSaveToTipText()

Tip text for this property

String

doNotOperateOnPerClassBasisTipText()

Returns the tip text for this property.

String

getAttributeIndices()

Gets the current range selection.

String

getAttributeNamePrefix()

Get the attribute name prefix.

Capabilities

getCapabilities()

Returns the Capabilities of this filter.

File

getDictionaryFileToSaveTo()

Gets the file that the dictionary is saved to.

boolean

getDoNotOperateOnPerClassBasis()

Get the DoNotOperateOnPerClassBasis value.

boolean

getIDFTransform()

Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document(instance) j.

boolean

getInvertSelection()

Gets whether the supplied columns are to be processed or skipped.

boolean

getLowerCaseTokens()

Gets whether the tokens are to be converted to lowercase.

int

getMinTermFreq()

Get the MinTermFreq value.

SelectedTag

getNormalizeDocLength()

Gets whether the word frequencies for a document (instance) should be normalized.

String[]

getOptions()

Gets the current settings of the filter.

boolean

getOutputWordCounts()

Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

double

getPeriodicPruning()

Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.

String

getRevision()

Returns the revision string.

boolean

getSaveDictionaryInBinaryForm()

Gets whether the dictionary is saved in binary serialized form rather than as plain text.

Range

getSelectedRange()

Get the value of m_SelectedRange.

Stemmer

getStemmer()

Returns the current stemming algorithm, null if none is used.

StopwordsHandler

getStopwordsHandler()

Gets the stopwords handler.

boolean

getTFTransform()

Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

Tokenizer

getTokenizer()

Returns the current tokenizer algorithm.

int

getWordsToKeep()

Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

String

globalInfo()

Returns a string describing this filter.

String

IDFTransformTipText()

Returns the tip text for this property.

boolean

input(Instance instance)

Input an instance for filtering.

String

invertSelectionTipText()

Returns the tip text for this property.

Enumeration<Option>

listOptions()

Returns an enumeration describing the available options.

String

lowerCaseTokensTipText()

Returns the tip text for this property.

static void

main(String[] argv)

Main method for testing this class.

String

minTermFreqTipText()

Returns the tip text for this property.

String

normalizeDocLengthTipText()

Returns the tip text for this property.

String

outputWordCountsTipText()

Returns the tip text for this property.

String

periodicPruningTipText()

Returns the tip text for this property.

String

saveDictionaryInBinaryFormTipText()

Returns the tip text for this property.

void

setAttributeIndices(String rangeList)

Sets which attributes are to be worked on.

void

setAttributeIndicesArray(int[] attributes)

Sets which attributes are to be processed.

void

setAttributeNamePrefix(String newPrefix)

Set the attribute name prefix.

void

setDictionaryFileToSaveTo(File toSaveTo)

Set the dictionary file to save the dictionary to.

void

setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)

Set the DoNotOperateOnPerClassBasis value.

void

setIDFTransform(boolean IDFTransform)

Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document(instance) j.

boolean

setInputFormat(Instances instanceInfo)

Sets the format of the input instances.

void

setInvertSelection(boolean invert)

Sets whether selected columns should be processed or skipped.

void

setLowerCaseTokens(boolean downCaseTokens)

Sets whether the tokens are to be converted to lowercase.

void

setMinTermFreq(int newMinTermFreq)

Set the MinTermFreq value.

void

setNormalizeDocLength(SelectedTag newType)

Sets whether the word frequencies for a document (instance) should be normalized.

void

setOptions(String[] options)

Parses a given list of options.

void

setOutputWordCounts(boolean outputWordCounts)

Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

void

setPeriodicPruning(double newPeriodicPruning)

Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.

void

setSaveDictionaryInBinaryForm(boolean saveAsBinary)

Sets whether to save the dictionary in binary serialized form rather than as plain text.

void

setSelectedRange(String newSelectedRange)

Set the value of m_SelectedRange.

void

setStemmer(Stemmer value)

Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).

void

setStopwordsHandler(StopwordsHandler value)

Sets the stopwords handler to use.

void

setTFTransform(boolean TFTransform)

Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

void

setTokenizer(Tokenizer value)

Sets the tokenizer algorithm to use.

void

setWordsToKeep(int newWordsToKeep)

Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

String

stemmerTipText()

Returns the tip text for this property.

String

stopwordsHandlerTipText()

Returns the tip text for this property.

String

TFTransformTipText()

Returns the tip text for this property.

String

tokenizerTipText()

Returns the tip text for this property.

String

wordsToKeepTipText()

Returns the tip text for this property.

Methods inherited from class weka.filters.Filter
batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOutputFormat, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, toString, useFilter, wekaStaticWrapper

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Details
- FILTER_NONE
  
  public static final int FILTER_NONE
  
  normalization: No normalization.
  See Also:
  
  Constant Field Values
- FILTER_NORMALIZE_ALL
  
  public static final int FILTER_NORMALIZE_ALL
  
  normalization: Normalize all data.
  See Also:
  
  Constant Field Values
- FILTER_NORMALIZE_TEST_ONLY
  
  public static final int FILTER_NORMALIZE_TEST_ONLY
  
  normalization: Normalize test data only.
  See Also:
  
  Constant Field Values
- TAGS_FILTER
  
  public static final Tag[] TAGS_FILTER
  
  Specifies whether a document's (instance's) word frequencies are to be normalized. They are normalized to the average length of the documents specified as input format.
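
A rough Java sketch of what normalizing to the average document length could look like is given below. It is not Weka's source. The class and method names are hypothetical, and treating a document's length as the Euclidean norm of its word-frequency vector is an assumption made here, since the text above only speaks of "average length".

```java
/** Illustrative only; hypothetical names, not Weka's source. */
final class DocLengthNormalizationSketch {

    /** Assumption: a document's length is the Euclidean norm of its vector. */
    static double length(double[] doc) {
        double sumOfSquares = 0.0;
        for (double value : doc) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Average length over the documents supplied as the first batch. */
    static double averageLength(double[][] firstBatch) {
        double total = 0.0;
        for (double[] doc : firstBatch) {
            total += length(doc);
        }
        return firstBatch.length == 0 ? 0.0 : total / firstBatch.length;
    }

    /** Rescales one document so that its length equals avgLength. */
    static double[] normalize(double[] doc, double avgLength) {
        double docLength = length(doc);
        double[] scaled = doc.clone();
        if (docLength == 0.0) {
            return scaled; // an all-zero document cannot be rescaled
        }
        for (int i = 0; i < scaled.length; i++) {
            scaled[i] = doc[i] * avgLength / docLength;
        }
        return scaled;
    }
}
```

In terms of the constants above, FILTER_NORMALIZE_ALL would correspond to applying such a rescaling to every batch, and FILTER_NORMALIZE_TEST_ONLY to applying it only to batches after the first.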
Constructor Details
- StringToWordVector
  
  public StringToWordVector()
  
  Default constructor. Targets 1000 words in the output.
- StringToWordVector
  
  public StringToWordVector(int wordsToKeep)
  
  Constructor that allows specification of the target number of words in the output.
  
  Parameters:
  
  wordsToKeep - the number of words in the output vector (per class if assigned).
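
Because wordsToKeep applies per class when a class attribute is assigned, the merged dictionary can end up larger than wordsToKeep. The Java sketch below illustrates that effect under simplifying assumptions; it is not Weka's algorithm. The name PerClassDictionarySketch.mergeTopWords is hypothetical, and ties are broken alphabetically here purely to make the result deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only; hypothetical names, not Weka's source. */
final class PerClassDictionarySketch {

    /**
     * Keeps the wordsToKeep most frequent words of each class and returns
     * the union of the per-class selections.
     *
     * @param countsByClass class label -> (word -> count within that class)
     */
    static Set<String> mergeTopWords(
            Map<String, Map<String, Integer>> countsByClass, int wordsToKeep) {
        Set<String> merged = new TreeSet<>();
        for (Map<String, Integer> counts : countsByClass.values()) {
            List<Map.Entry<String, Integer>> ranked =
                    new ArrayList<>(counts.entrySet());
            // Highest count first; alphabetical order breaks ties (assumption).
            ranked.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            int limit = Math.min(wordsToKeep, ranked.size());
            for (int i = 0; i < limit; i++) {
                merged.add(ranked.get(i).getKey());
            }
        }
        return merged;
    }
}
```

With two classes and wordsToKeep set to 2, the union holds between two and four words, depending on how much the per-class selections overlap.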
Method Details
- listOptions
  
  public Enumeration<Option> listOptions()
  
  Returns an enumeration describing the available options.
  
  Specified by:
  
  listOptions in interface OptionHandler
  
  Overrides:
  
  listOptions in class Filter
  
  Returns:
  
  an enumeration of all the available options
- setOptions
  
  public void setOptions(String[] options) throws Exception
  Parses a given list of options.
  Valid options are:
  
  -C Output word counts rather than boolean word presence.
  
  -R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
  
  -V Invert matching sense of column indexes.
  
  -P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
  
  -W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
  
  -prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
  
  -T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in jth document(instance).
  
  -I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in the jth document (instance).
  
  -N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
  
  -L Convert all tokens to lowercase before adding to the dictionary.
  
  -stopwords-handler The stopwords handler to use (default Null).
  
  -stemmer <spec> The stemming algorithm (classname plus parameters) to use.
  
  -M <int> The minimum term frequency (default = 1).
  
  -O If this is set, the maximum number of words and the minimum term frequency is not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
  
  -tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
  
  -dictionary <path to save to> The file to save the dictionary to. (default is not to save the dictionary)
  
  -binary-dict Save the dictionary file as a binary serialized object instead of in plain text form. Use in conjunction with -dictionary
  Specified by:
  
  setOptions in interface OptionHandler
  
  Overrides:
  
  setOptions in class Filter
  
  Parameters:
  
  options - the list of options as an array of strings
  
  Throws:
  
  Exception - if an option is not supported
- getOptions
  
  public String[] getOptions()
  
  Gets the current settings of the filter.
  
  Specified by:
  
  getOptions in interface OptionHandler
  
  Overrides:
  
  getOptions in class Filter
  
  Returns:
  
  an array of strings suitable for passing to setOptions
- getCapabilities
  
  public Capabilities getCapabilities()
  
  Returns the Capabilities of this filter.
  Specified by:
  
  getCapabilities in interface CapabilitiesHandler
  
  Overrides:
  
  getCapabilities in class Filter
  
  Returns:
  
  the capabilities of this object
  
  See Also:
  
  Capabilities
- setInputFormat
  
  public boolean setInputFormat(Instances instanceInfo) throws Exception
  
  Sets the format of the input instances.
  
  Overrides:
  
  setInputFormat in class Filter
  
  Parameters:
  
  instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
  
  Returns:
  
  true if the outputFormat may be collected immediately
  
  Throws:
  
  Exception - if the input format can't be set successfully
- input
  
  public boolean input(Instance instance) throws Exception
  
  Input an instance for filtering. Filter requires all training instances be read before producing output.
  
  Overrides:
  
  input in class Filter
  
  Parameters:
  
  instance - the input instance.
  
  Returns:
  
  true if the filtered instance may now be collected with output().
  
  Throws:
  
  IllegalStateException - if no input structure has been defined.
  
  NullPointerException - if the input format has not been defined.
  
  Exception - if the input instance was not of the correct format or if there was a problem with the filtering.
- batchFinished
  
  public boolean batchFinished() throws Exception
  
  Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.
  
  Overrides:
  
  batchFinished in class Filter
  
  Returns:
  
  true if there are instances pending output.
  
  Throws:
  
  IllegalStateException - if no input structure has been defined.
  
  NullPointerException - if no input structure has been defined.
  
  Exception - if there was a problem finishing the batch.
- dictionaryFileToSaveToTipText
  
  public String dictionaryFileToSaveToTipText()
  
  Tip text for this property
  
  Returns:
  
  the tip text for this property
- setDictionaryFileToSaveTo
  
  public void setDictionaryFileToSaveTo(File toSaveTo)
  
  Set the dictionary file to save the dictionary to. A file with an empty path or a path "-- set me --" means do not save the dictionary.
  
  Parameters:
  
  toSaveTo - the path to save the dictionary to
- getDictionaryFileToSaveTo
  
  public File getDictionaryFileToSaveTo()
  
  Gets the file that the dictionary is saved to. A file with an empty path or a path "-- set me --" means the dictionary is not saved.
  
  Returns:
  
  the path to save the dictionary to
- saveDictionaryInBinaryFormTipText
  
  public String saveDictionaryInBinaryFormTipText()
- setSaveDictionaryInBinaryForm
  
  public void setSaveDictionaryInBinaryForm(boolean saveAsBinary)
  
  Sets whether to save the dictionary in binary serialized form rather than as plain text.
  
  Parameters:
  
  saveAsBinary - true to save the dictionary in binary form
- getSaveDictionaryInBinaryForm
  
  public boolean getSaveDictionaryInBinaryForm()
  
  Gets whether the dictionary is saved in binary serialized form rather than as plain text.
  
  Returns:
  
  true if the dictionary is saved in binary form
- globalInfo
  
  public String globalInfo()
  
  Returns a string describing this filter.
  
  Returns:
  
  a description of the filter suitable for displaying in the explorer/experimenter gui
- getOutputWordCounts
  
  public boolean getOutputWordCounts()
  
  Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
  
  Returns:
  
  true if word counts should be output.
- setOutputWordCounts
  
  public void setOutputWordCounts(boolean outputWordCounts)
  
  Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
  
  Parameters:
  
  outputWordCounts - true if word counts should be output.
- outputWordCountsTipText
  
  public String outputWordCountsTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getSelectedRange
  
  public Range getSelectedRange()
  
  Get the value of m_SelectedRange.
  
  Returns:
  
  Value of m_SelectedRange.
- setSelectedRange
  
  public void setSelectedRange(String newSelectedRange)
  
  Set the value of m_SelectedRange.
  
  Parameters:
  
  newSelectedRange - Value to assign to m_SelectedRange.
- attributeIndicesTipText
  
  public String attributeIndicesTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getAttributeIndices
  
  public String getAttributeIndices()
  
  Gets the current range selection.
  
  Returns:
  
  a string containing a comma separated list of ranges
- setAttributeIndices
  
  public void setAttributeIndices(String rangeList)
  
  Sets which attributes are to be worked on.
  
  Parameters:
  
  rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
  e.g., first-3,5,6-last
  
  Throws:
  
  IllegalArgumentException - if an invalid range list is supplied
- setAttributeIndicesArray
  
  public void setAttributeIndicesArray(int[] attributes)
  
  Sets which attributes are to be processed.
  
  Parameters:
  
  attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
  
  Throws:
  
  IllegalArgumentException - if an invalid set of ranges is supplied
- invertSelectionTipText
  
  public String invertSelectionTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getInvertSelection
  
  public boolean getInvertSelection()
  
  Gets whether the supplied columns are to be processed or skipped.
  
  Returns:
  
  true if the supplied columns will be kept
- setInvertSelection
  
  public void setInvertSelection(boolean invert)
  
  Sets whether selected columns should be processed or skipped.
  
  Parameters:
  
  invert - the new invert setting
- getAttributeNamePrefix
  
  public String getAttributeNamePrefix()
  
  Get the attribute name prefix.
  
  Returns:
  
  The current attribute name prefix.
- setAttributeNamePrefix
  
  public void setAttributeNamePrefix(String newPrefix)
  
  Set the attribute name prefix.
  
  Parameters:
  
  newPrefix - String to use as the attribute name prefix.
- attributeNamePrefixTipText
  
  public String attributeNamePrefixTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getWordsToKeep
  
  public int getWordsToKeep()
  
  Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
  
  Returns:
  
  the target number of words in the output vector (per class if assigned).
- setWordsToKeep
  
  public void setWordsToKeep(int newWordsToKeep)
  
  Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
  
  Parameters:
  
  newWordsToKeep - the target number of words in the output vector (per class if assigned).
- wordsToKeepTipText
  
  public String wordsToKeepTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getPeriodicPruning
  
  public double getPeriodicPruning()
  
  Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
  
  Returns:
  
  the rate at which the dictionary is periodically pruned
- setPeriodicPruning
  
  public void setPeriodicPruning(double newPeriodicPruning)
  
  Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
  
  Parameters:
  
  newPeriodicPruning - the rate at which the dictionary is periodically pruned
- periodicPruningTipText
  
  public String periodicPruningTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getTFTransform
  
  public boolean getTFTransform()
  
  Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
  
  Returns:
  
  true if word frequencies are to be transformed.
- setTFTransform
  
  public void setTFTransform(boolean TFTransform)
  
  Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
  
  Parameters:
  
  TFTransform - true if word frequencies are to be transformed.
- TFTransformTipText
  
  public String TFTransformTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getIDFTransform
  
  public boolean getIDFTransform()
  
  Gets whether the word frequencies in a document should be transformed into:
  fij*log(num of Docs/num of Docs with word i)
  where fij is the frequency of word i in document(instance) j.
  
  Returns:
  
  true if the word frequencies are to be transformed.
- setIDFTransform
  
  public void setIDFTransform(boolean IDFTransform)
  
  Sets whether the word frequencies in a document should be transformed into:
  fij*log(num of Docs/num of Docs with word i)
  where fij is the frequency of word i in document (instance) j.
  
  Parameters:
  
  IDFTransform - true if the word frequencies are to be transformed
- IDFTransformTipText
  
  public String IDFTransformTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getNormalizeDocLength
  
  public SelectedTag getNormalizeDocLength()
  
  Gets whether the word frequencies for a document (instance) should be normalized.
  
  Returns:
  
  the selected normalization type (one of the TAGS_FILTER values).
- setNormalizeDocLength
  
  public void setNormalizeDocLength(SelectedTag newType)
  
  Sets whether the word frequencies for a document (instance) should be normalized.
  
  Parameters:
  
  newType - the new type.
- normalizeDocLengthTipText
  
  public String normalizeDocLengthTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getLowerCaseTokens
  
  public boolean getLowerCaseTokens()
  
  Gets whether the tokens are to be converted to lowercase.
  
  Returns:
  
  true if the tokens are to be converted to lowercase.
- setLowerCaseTokens
  
  public void setLowerCaseTokens(boolean downCaseTokens)
  
  Sets whether the tokens are to be converted to lowercase. (Doesn't affect non-alphabetic characters in tokens.)
  
  Parameters:
  
  downCaseTokens - should be true if only lower case tokens are to be formed.
- doNotOperateOnPerClassBasisTipText
  
  public String doNotOperateOnPerClassBasisTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getDoNotOperateOnPerClassBasis
  
  public boolean getDoNotOperateOnPerClassBasis()
  
  Get the DoNotOperateOnPerClassBasis value.
  
  Returns:
  
  the DoNotOperateOnPerClassBasis value.
- setDoNotOperateOnPerClassBasis
  
  public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
  
  Set the DoNotOperateOnPerClassBasis value.
  
  Parameters:
  
  newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
- minTermFreqTipText
  
  public String minTermFreqTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getMinTermFreq
  
  public int getMinTermFreq()
  
  Get the MinTermFreq value.
  
  Returns:
  
  the MinTermFreq value.
- setMinTermFreq
  
  public void setMinTermFreq(int newMinTermFreq)
  
  Set the MinTermFreq value.
  
  Parameters:
  
  newMinTermFreq - The new MinTermFreq value.
- lowerCaseTokensTipText
  
  public String lowerCaseTokensTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setStemmer
  
  public void setStemmer(Stemmer value)
  
  Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
  Parameters:
  
  value - the configured stemming algorithm, or null
  
  See Also:
  
  NullStemmer
- getStemmer
  
  public Stemmer getStemmer()
  
  Returns the current stemming algorithm, null if none is used.
  
  Returns:
  
  the current stemming algorithm, null if none set
- stemmerTipText
  
  public String stemmerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setStopwordsHandler
  
  public void setStopwordsHandler(StopwordsHandler value)
  
  Sets the stopwords handler to use.
  
  Parameters:
  
  value - the stopwords handler, if null, Null is used
- getStopwordsHandler
  
  public StopwordsHandler getStopwordsHandler()
  
  Gets the stopwords handler.
  
  Returns:
  
  the stopwords handler
- stopwordsHandlerTipText
  
  public String stopwordsHandlerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- setTokenizer
  
  public void setTokenizer(Tokenizer value)
  
  Sets the tokenizer algorithm to use.
  
  Parameters:
  
  value - the configured tokenizing algorithm
- getTokenizer
  
  public Tokenizer getTokenizer()
  
  Returns the current tokenizer algorithm.
  
  Returns:
  
  the current tokenizer algorithm
- tokenizerTipText
  
  public String tokenizerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getRevision
  
  public String getRevision()
  
  Returns the revision string.
  
  Specified by:
  
  getRevision in interface RevisionHandler
  
  Overrides:
  
  getRevision in class Filter
  
  Returns:
  
  the revision
- main
  
  public static void main(String[] argv)
  
  Main method for testing this class.
  
  Parameters:
  
  argv - should contain arguments to the filter: use -h for help
