Class StringToWordVector

java.lang.Object
weka.filters.Filter
weka.filters.unsupervised.attribute.StringToWordVector
All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, RevisionHandler, WeightedInstancesHandler, UnsupervisedFilter

public class StringToWordVector extends Filter implements UnsupervisedFilter, OptionHandler, WeightedInstancesHandler
Converts string attributes into a set of numeric attributes representing word occurrence information from the text contained in the strings. The dictionary is determined from the first batch of data filtered (typically training data). Note that this filter is not strictly unsupervised when a class attribute is set because it creates a separate dictionary for each class and then merges them.
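
As a rough illustration of the behavior described above, the following sketch builds a dictionary from a first batch of strings and then encodes a later string against that fixed dictionary. It is not Weka code: the class `FirstBatchDictionarySketch`, its methods, and the whitespace tokenization are assumptions made for illustration, and the per-class dictionaries are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration, not part of Weka.
final class FirstBatchDictionarySketch {

    // Collect the distinct tokens of the first batch, in sorted order.
    static List<String> buildDictionary(List<String> firstBatch) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : firstBatch) {
            for (String token : doc.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return new ArrayList<>(words);
    }

    // Encode one document as 0/1 word presence against a fixed dictionary.
    // Words that are not in the dictionary are dropped.
    static int[] encode(String doc, List<String> dictionary) {
        int[] vector = new int[dictionary.size()];
        for (String token : doc.trim().split("\\s+")) {
            int index = dictionary.indexOf(token);
            if (index >= 0) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("the cat sat", "the dog barked");
        List<String> dictionary = buildDictionary(train);
        System.out.println(dictionary);
        // "bird" was never seen in the first batch, so it has no column.
        System.out.println(Arrays.toString(encode("the bird sat", dictionary)));
    }
}
```

The point of the sketch is that the set of output columns is fixed once the first batch has been seen; later data cannot add new words.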

Valid options are:

 -C
  Output word counts rather than boolean word presence.
 
 -R <index1,index2-index4,...>
  Specify list of string attributes to convert to words (as weka Range).
  (default: select all string attributes)
 -V
  Invert matching sense of column indexes.
 -P <attribute name prefix>
  Specify a prefix for the created attribute names.
  (default: "")
 -W <number of words to keep>
  Specify approximate number of word fields to create.
  Surplus words will be discarded.
  (default: 1000)
 -prune-rate <rate as a percentage of dataset>
  Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary.
  -W prunes after creating a full dictionary. You may not have enough memory for this approach.
  (default: no periodic pruning)
 -T
  Transform the word frequencies into log(1+fij)
  where fij is the frequency of word i in the jth document (instance).
 
 -I
  Transform each word frequency into:
  fij*log(num of Documents/num of documents containing word i)
    where fij is the frequency of word i in the jth document (instance)
 -N
  Whether to normalize word frequencies to the average length of the
  training documents: 0=do not normalize, 1=normalize all data,
  2=normalize test data only (default: 0).
 -L
  Convert all tokens to lowercase before adding to the dictionary.
 -stopwords-handler
  The stopwords handler to use (default Null).
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 -M <int>
  The minimum term frequency (default = 1).
 -O
  If this is set, the maximum number of words and the
  minimum term frequency are not enforced on a per-class
  basis but based on the documents in all the classes
  (even if a class attribute is set).
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 -dictionary <path to save to>
  The file to save the dictionary to.
  (default is not to save the dictionary)
 -binary-dict
  Save the dictionary file as a binary serialized object
  instead of in plain text form. Use in conjunction with
  -dictionary
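
The two formulas given for -T and -I can be worked through with a few numbers. The sketch below is plain Java, not Weka code; the class `TfIdfSketch` and its method names are hypothetical, and the use of the natural logarithm is an assumption, since the option text only says "log".

```java
// Hypothetical illustration of the -T and -I formulas, not Weka code.
final class TfIdfSketch {

    // -T: log(1 + fij), using the natural logarithm (an assumption).
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    // -I: fij * log(numDocs / numDocsWithWord).
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        double fij = 3.0;          // word i occurs 3 times in document j
        int numDocs = 100;         // total number of documents
        int numDocsWithWord = 10;  // documents containing word i

        System.out.println("TF:  " + tfTransform(fij));
        System.out.println("IDF: " + idfTransform(fij, numDocs, numDocsWithWord));
        // A word that occurs in every document gets weight 0 under -I.
        System.out.println("IDF, word in all docs: " + idfTransform(fij, numDocs, numDocs));
    }
}
```

Under these formulas, -T dampens large counts, while -I scales a count down as the word appears in more documents.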
Version:
$Revision: 14508 $
Author:
Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com), Gordon Paynter (gordon.paynter@ucr.edu), Ashraf M. Kibriya (amk14@cs.waikato.ac.nz)
  • Field Details

    • FILTER_NONE

      public static final int FILTER_NONE
      normalization: No normalization.
    • FILTER_NORMALIZE_ALL

      public static final int FILTER_NORMALIZE_ALL
      normalization: Normalize all data.
    • FILTER_NORMALIZE_TEST_ONLY

      public static final int FILTER_NORMALIZE_TEST_ONLY
      normalization: Normalize test data only.
    • TAGS_FILTER

      public static final Tag[] TAGS_FILTER
      Specifies whether a document's (instance's) word frequencies are to be normalized. They are normalized to the average length of the documents specified as input format.
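
A minimal sketch of what normalizing to the average document length could look like, assuming that "length" means the Euclidean norm of a document's word-frequency vector (this page does not define it). The class `DocLengthNormalizationSketch` and its methods are hypothetical and are not part of Weka.

```java
import java.util.Arrays;

// Hypothetical illustration, not Weka code.
// Assumption: a document's "length" is the Euclidean norm of its vector.
final class DocLengthNormalizationSketch {

    static double length(double[] doc) {
        double sumOfSquares = 0.0;
        for (double value : doc) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Average length over a set of documents; 0 for an empty set.
    static double averageLength(double[][] docs) {
        if (docs.length == 0) {
            return 0.0;
        }
        double total = 0.0;
        for (double[] doc : docs) {
            total += length(doc);
        }
        return total / docs.length;
    }

    // Rescale a document so that its length equals targetLength.
    static double[] normalize(double[] doc, double targetLength) {
        double current = length(doc);
        double[] scaled = new double[doc.length];
        if (current == 0.0) {
            return scaled; // an all-zero document stays all zeros
        }
        for (int i = 0; i < doc.length; i++) {
            scaled[i] = doc[i] * targetLength / current;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] train = {{3.0, 4.0}, {6.0, 8.0}};
        double average = averageLength(train);
        System.out.println("average length: " + average);
        System.out.println(Arrays.toString(normalize(new double[] {0.0, 2.0}, average)));
    }
}
```

With FILTER_NORMALIZE_TEST_ONLY, only data after the first batch would be rescaled in this way; with FILTER_NORMALIZE_ALL, the first batch would be rescaled as well.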
  • Constructor Details

    • StringToWordVector

      public StringToWordVector()
      Default constructor. Targets 1000 words in the output.
    • StringToWordVector

      public StringToWordVector(int wordsToKeep)
      Constructor that allows specification of the target number of words in the output.
      Parameters:
      wordsToKeep - the number of words in the output vector (per class if assigned).
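
The number of words to keep is described as a target rather than an exact size. One way this can come about is sketched below: a count threshold is taken from the wordsToKeep-th most frequent word, and every word at or above that threshold is kept, so ties can push the result past the target. This is an assumption for illustration, not a statement of Weka's implementation; `WordsToKeepSketch` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not Weka code.
final class WordsToKeepSketch {

    // Smallest count a word needs in order to be kept.
    static int threshold(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
        if (wordsToKeep <= 0) {
            throw new IllegalArgumentException("wordsToKeep must be positive");
        }
        int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
        if (sorted.length <= wordsToKeep) {
            return minTermFreq; // fewer words than the target: only the minimum applies
        }
        // Count of the wordsToKeep-th most frequent word.
        int cutoff = sorted[sorted.length - wordsToKeep];
        return Math.max(minTermFreq, cutoff);
    }

    // All words whose count reaches the threshold, ties included.
    static Set<String> keep(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
        int minCount = threshold(counts, wordsToKeep, minTermFreq);
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 5);
        counts.put("cat", 3);
        counts.put("dog", 3);
        counts.put("sat", 1);
        // Target of 2 words, but "cat" and "dog" tie at the cutoff, so 3 are kept.
        System.out.println(keep(counts, 2, 1));
    }
}
```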
  • Method Details

    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class Filter
      Returns:
      an enumeration of all the available options
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class Filter
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
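
The options documented for this class are passed to setOptions as a `String[]`: a flag without an argument is a single element, and a flag with an argument is followed by its value as the next element. The sketch below only assembles such an array and does not call Weka; `OptionArraySketch` and `buildOptions` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka; it only assembles a String[].
final class OptionArraySketch {

    static String[] buildOptions(boolean wordCounts, int wordsToKeep,
                                 boolean lowerCase, int minTermFreq) {
        List<String> options = new ArrayList<>();
        if (wordCounts) {
            options.add("-C");                        // flag without argument
        }
        options.add("-W");                            // flag ...
        options.add(Integer.toString(wordsToKeep));   // ... and its argument
        if (lowerCase) {
            options.add("-L");
        }
        options.add("-M");
        options.add(Integer.toString(minTermFreq));
        return options.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] options = buildOptions(true, 2000, true, 2);
        System.out.println(String.join(" ", options));
    }
}
```

An array built this way has the same shape as the one returned by getOptions.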
    • getOptions

      public String[] getOptions()
      Gets the current settings of the filter.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class Filter
      Returns:
      an array of strings suitable for passing to setOptions
    • getCapabilities

      public Capabilities getCapabilities()
      Returns the Capabilities of this filter.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Overrides:
      getCapabilities in class Filter
      Returns:
      the capabilities of this object
    • setInputFormat

      public boolean setInputFormat(Instances instanceInfo) throws Exception
      Sets the format of the input instances.
      Overrides:
      setInputFormat in class Filter
      Parameters:
      instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
      Returns:
      true if the outputFormat may be collected immediately
      Throws:
      Exception - if the input format can't be set successfully
    • input

      public boolean input(Instance instance) throws Exception
      Inputs an instance for filtering. The filter requires all training instances to be read before producing output.
      Overrides:
      input in class Filter
      Parameters:
      instance - the input instance.
      Returns:
      true if the filtered instance may now be collected with output().
      Throws:
      IllegalStateException - if no input structure has been defined.
      NullPointerException - if the input format has not been defined.
      Exception - if the input instance was not of the correct format or if there was a problem with the filtering.
    • batchFinished

      public boolean batchFinished() throws Exception
      Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.
      Overrides:
      batchFinished in class Filter
      Returns:
      true if there are instances pending output.
      Throws:
      IllegalStateException - if no input structure has been defined.
      NullPointerException - if no input structure has been defined.
      Exception - if there was a problem finishing the batch.
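
The calling sequence described for input and batchFinished can be sketched with a small stand-in class. This is a mock, not Weka's Filter: `BatchProtocolSketch` works on strings instead of instances, lowercasing stands in for the real conversion, and the assumption that batches after the first are converted immediately is made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Queue;

// Hypothetical mock of the calling protocol; it is not Weka's Filter class.
final class BatchProtocolSketch {
    private final List<String> buffer = new ArrayList<>();
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean firstBatchDone = false;

    // Returns true if a filtered item can now be collected with output().
    boolean input(String item) {
        if (!firstBatchDone) {
            buffer.add(item); // first batch: hold everything back
            return false;
        }
        pending.add(transform(item));
        return true;
    }

    // Ends the current batch; returns true if items are pending output.
    boolean batchFinished() {
        if (!firstBatchDone) {
            for (String item : buffer) {
                pending.add(transform(item));
            }
            buffer.clear();
            firstBatchDone = true;
        }
        return !pending.isEmpty();
    }

    // Returns the next filtered item, or null if none is pending.
    String output() {
        return pending.poll();
    }

    // Stand-in for the real conversion.
    private String transform(String item) {
        return item.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        BatchProtocolSketch filter = new BatchProtocolSketch();
        System.out.println(filter.input("First"));    // buffered, nothing to collect yet
        System.out.println(filter.batchFinished());   // first batch is now available
        System.out.println(filter.output());
        System.out.println(filter.input("Second"));   // later data is available at once
        System.out.println(filter.output());
    }
}
```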
    • dictionaryFileToSaveToTipText

      public String dictionaryFileToSaveToTipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setDictionaryFileToSaveTo

      public void setDictionaryFileToSaveTo(File toSaveTo)
      Set the dictionary file to save the dictionary to. A file with an empty path or a path "-- set me --" means do not save the dictionary.
      Parameters:
      toSaveTo - the path to save the dictionary to
    • getDictionaryFileToSaveTo

      public File getDictionaryFileToSaveTo()
      Get the dictionary file to save the dictionary to. A file with an empty path or a path "-- set me --" means do not save the dictionary.
      Returns:
      the path to save the dictionary to
    • saveDictionaryInBinaryFormTipText

      public String saveDictionaryInBinaryFormTipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setSaveDictionaryInBinaryForm

      public void setSaveDictionaryInBinaryForm(boolean saveAsBinary)
      Set whether to save the dictionary in binary serialized form rather than as plain text
      Parameters:
      saveAsBinary - true to save the dictionary in binary form
    • getSaveDictionaryInBinaryForm

      public boolean getSaveDictionaryInBinaryForm()
      Get whether to save the dictionary in binary serialized form rather than as plain text
      Returns:
      true if the dictionary is saved in binary form
    • globalInfo

      public String globalInfo()
      Returns a string describing this filter.
      Returns:
      a description of the filter suitable for displaying in the explorer/experimenter gui
    • getOutputWordCounts

      public boolean getOutputWordCounts()
      Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Returns:
      true if word counts should be output.
    • setOutputWordCounts

      public void setOutputWordCounts(boolean outputWordCounts)
      Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Parameters:
      outputWordCounts - true if word counts should be output.
    • outputWordCountsTipText

      public String outputWordCountsTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getSelectedRange

      public Range getSelectedRange()
      Get the value of m_SelectedRange.
      Returns:
      Value of m_SelectedRange.
    • setSelectedRange

      public void setSelectedRange(String newSelectedRange)
      Set the value of m_SelectedRange.
      Parameters:
      newSelectedRange - Value to assign to m_SelectedRange.
    • attributeIndicesTipText

      public String attributeIndicesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getAttributeIndices

      public String getAttributeIndices()
      Gets the current range selection.
      Returns:
      a string containing a comma separated list of ranges
    • setAttributeIndices

      public void setAttributeIndices(String rangeList)
      Sets which attributes are to be worked on.
      Parameters:
      rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
      e.g.: first-3,5,6-last
      Throws:
      IllegalArgumentException - if an invalid range list is supplied
    • setAttributeIndicesArray

      public void setAttributeIndicesArray(int[] attributes)
      Sets which attributes are to be processed.
      Parameters:
      attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
      Throws:
      IllegalArgumentException - if an invalid set of ranges is supplied
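
The two setters use different index bases: range strings count from 1, index arrays count from 0. The sketch below converts a range list such as first-3,5,6-last into the equivalent 0-based array. It is a hypothetical re-implementation for illustration (Weka ships its own Range class); `RangeListSketch` and its methods are not Weka API, and the number of attributes is passed in explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical re-implementation for illustration, not Weka's Range class.
final class RangeListSketch {

    // Convert one 1-based token ("first", "last", or a number) to a 0-based index.
    static int toIndex(String token, int numAttributes) {
        String t = token.trim();
        int oneBased;
        if (t.equals("first")) {
            oneBased = 1;
        } else if (t.equals("last")) {
            oneBased = numAttributes;
        } else {
            oneBased = Integer.parseInt(t); // NumberFormatException is an IllegalArgumentException
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of range: " + token);
        }
        return oneBased - 1;
    }

    // Expand a comma-separated list of 1-based ranges into 0-based indices.
    static int[] parse(String rangeList, int numAttributes) {
        List<Integer> indices = new ArrayList<>();
        for (String part : rangeList.split(",")) {
            String[] ends = part.split("-");
            if (ends.length == 1) {
                indices.add(toIndex(ends[0], numAttributes));
            } else if (ends.length == 2) {
                int from = toIndex(ends[0], numAttributes);
                int to = toIndex(ends[1], numAttributes);
                if (from > to) {
                    throw new IllegalArgumentException("Invalid range: " + part);
                }
                for (int i = from; i <= to; i++) {
                    indices.add(i);
                }
            } else {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
        }
        return indices.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // With 8 attributes, "first-3,5,6-last" selects attributes 1-3, 5 and 6-8.
        System.out.println(Arrays.toString(parse("first-3,5,6-last", 8)));
    }
}
```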
    • invertSelectionTipText

      public String invertSelectionTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getInvertSelection

      public boolean getInvertSelection()
      Gets whether the supplied columns are to be processed or skipped.
      Returns:
      true if the selection is inverted, i.e. the supplied columns are skipped and the remaining ones are processed
    • setInvertSelection

      public void setInvertSelection(boolean invert)
      Sets whether selected columns should be processed or skipped.
      Parameters:
      invert - the new invert setting
    • getAttributeNamePrefix

      public String getAttributeNamePrefix()
      Get the attribute name prefix.
      Returns:
      The current attribute name prefix.
    • setAttributeNamePrefix

      public void setAttributeNamePrefix(String newPrefix)
      Set the attribute name prefix.
      Parameters:
      newPrefix - String to use as the attribute name prefix.
    • attributeNamePrefixTipText

      public String attributeNamePrefixTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getWordsToKeep

      public int getWordsToKeep()
      Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      Returns:
      the target number of words in the output vector (per class if assigned).
    • setWordsToKeep

      public void setWordsToKeep(int newWordsToKeep)
      Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      Parameters:
      newWordsToKeep - the target number of words in the output vector (per class if assigned).
    • wordsToKeepTipText

      public String wordsToKeepTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getPeriodicPruning

      public double getPeriodicPruning()
      Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
      Returns:
      the rate at which the dictionary is periodically pruned
    • setPeriodicPruning

      public void setPeriodicPruning(double newPeriodicPruning)
      Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size.
      Parameters:
      newPeriodicPruning - the rate at which the dictionary is periodically pruned
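
Because the rate is a percentage of the dataset size, it translates into a number of instances between pruning passes. The sketch below shows that arithmetic only. The rounding and the treatment of a non-positive rate as "no periodic pruning" are assumptions, and `PruneRateSketch` is a hypothetical name, not Weka API.

```java
// Hypothetical illustration, not Weka code.
final class PruneRateSketch {

    // Number of instances between two pruning passes; 0 means no periodic pruning.
    // Assumptions: the result is rounded to the nearest integer, and a
    // non-positive rate disables periodic pruning.
    static long pruneInterval(double ratePercent, int numInstances) {
        if (ratePercent <= 0.0) {
            return 0L;
        }
        return Math.round(ratePercent / 100.0 * numInstances);
    }

    public static void main(String[] args) {
        // A rate of 10 on a dataset of 250 instances: prune every 25 instances.
        System.out.println(pruneInterval(10.0, 250));
    }
}
```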
    • periodicPruningTipText

      public String periodicPruningTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getTFTransform

      public boolean getTFTransform()
      Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Returns:
      true if word frequencies are to be transformed.
    • setTFTransform

      public void setTFTransform(boolean TFTransform)
      Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Parameters:
      TFTransform - true if word frequencies are to be transformed.
    • TFTransformTipText

      public String TFTransformTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getIDFTransform

      public boolean getIDFTransform()
      Gets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document (instance) j.
      Returns:
      true if the word frequencies are to be transformed.
    • setIDFTransform

      public void setIDFTransform(boolean IDFTransform)
      Sets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document (instance) j.
      Parameters:
      IDFTransform - true if the word frequencies are to be transformed
    • IDFTransformTipText

      public String IDFTransformTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNormalizeDocLength

      public SelectedTag getNormalizeDocLength()
      Gets whether the word frequencies for a document (instance) should be normalized.
      Returns:
      the selected normalization type (one of the TAGS_FILTER values).
    • setNormalizeDocLength

      public void setNormalizeDocLength(SelectedTag newType)
      Sets whether the word frequencies for a document (instance) should be normalized.
      Parameters:
      newType - the new type.
    • normalizeDocLengthTipText

      public String normalizeDocLengthTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getLowerCaseTokens

      public boolean getLowerCaseTokens()
      Gets whether the tokens are to be converted to lowercase.
      Returns:
      true if the tokens are to be converted to lowercase.
    • setLowerCaseTokens

      public void setLowerCaseTokens(boolean downCaseTokens)
      Sets whether the tokens are to be converted to lowercase. (Doesn't affect non-alphabetic characters in tokens.)
      Parameters:
      downCaseTokens - true if all tokens are to be converted to lowercase.
    • doNotOperateOnPerClassBasisTipText

      public String doNotOperateOnPerClassBasisTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getDoNotOperateOnPerClassBasis

      public boolean getDoNotOperateOnPerClassBasis()
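      Gets whether the maximum number of words and the minimum term frequency are enforced over the documents in all classes rather than on a per-class basis.
      Returns:
      the DoNotOperateOnPerClassBasis value.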
    • setDoNotOperateOnPerClassBasis

      public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
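      Sets whether the maximum number of words and the minimum term frequency are enforced over the documents in all classes rather than on a per-class basis.
      Parameters:
      newDoNotOperateOnPerClassBasis - the new DoNotOperateOnPerClassBasis value.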
    • minTermFreqTipText

      public String minTermFreqTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getMinTermFreq

      public int getMinTermFreq()
      Gets the minimum term frequency.
      Returns:
      the minimum term frequency.
    • setMinTermFreq

      public void setMinTermFreq(int newMinTermFreq)
      Sets the minimum term frequency.
      Parameters:
      newMinTermFreq - the new minimum term frequency.
    • lowerCaseTokensTipText

      public String lowerCaseTokensTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setStemmer

      public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • stemmerTipText

      public String stemmerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setStopwordsHandler

      public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler; if null, the Null handler is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • stopwordsHandlerTipText

      public String stopwordsHandlerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setTokenizer

      public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • tokenizerTipText

      public String tokenizerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getRevision

      public String getRevision()
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Overrides:
      getRevision in class Filter
      Returns:
      the revision
    • main

      public static void main(String[] argv)
      Main method for testing this class.
      Parameters:
      argv - should contain arguments to the filter: use -h for help