Class FixedDictionaryStringToWordVector

java.lang.Object
weka.filters.Filter
weka.filters.SimpleFilter
weka.filters.SimpleStreamFilter
weka.filters.unsupervised.attribute.FixedDictionaryStringToWordVector
All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, EnvironmentHandler, OptionHandler, RevisionHandler, WeightedInstancesHandler, StreamableFilter, UnsupervisedFilter

public class FixedDictionaryStringToWordVector extends SimpleStreamFilter implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler
Converts String attributes into a set of attributes representing word occurrence information (depending on the tokenizer) from the text contained in the strings. The set of words (attributes) is taken from a user-supplied dictionary, either in plain text form or as a serialized Java object.
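
The idea of mapping text onto a fixed word list can be sketched in plain Java. This is an illustration only, not Weka's implementation: the class FixedDictionarySketch and its vectorize method are hypothetical names, and the document is assumed to be tokenized already. It shows the difference between presence (0 or 1) output and word-count output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not Weka code. Class and method names are hypothetical.
final class FixedDictionarySketch {

    /**
     * Maps a tokenized document onto a fixed dictionary.
     * Tokens that do not appear in the dictionary are ignored.
     */
    static double[] vectorize(List<String> dictionary, List<String> tokens,
                              boolean outputWordCounts) {
        // Remember the output position of every dictionary word.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String word : dictionary) {
            index.putIfAbsent(word, index.size());
        }

        double[] vector = new double[index.size()];
        for (String token : tokens) {
            Integer position = index.get(token);
            if (position == null) {
                continue; // token is not in the dictionary
            }
            if (outputWordCounts) {
                vector[position] += 1.0; // count every occurrence
            } else {
                vector[position] = 1.0;  // record presence only
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("cat", "dog", "fish");
        List<String> tokens = List.of("dog", "bird", "dog", "cat");
        // prints [1.0, 1.0, 0.0]
        System.out.println(Arrays.toString(vectorize(dictionary, tokens, false)));
        // prints [1.0, 2.0, 0.0]
        System.out.println(Arrays.toString(vectorize(dictionary, tokens, true)));
    }
}
```

The token "bird" contributes nothing because the attribute set is fixed by the dictionary rather than learned from the data.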

Valid options are:

  -dictionary <path to dictionary file>
  The path to the dictionary to use
 
  -binary-dict
  Dictionary file contains a binary serialized dictionary
 
  -C
  Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word).
 
  -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.
 
  -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed
 
  -P <attribute name prefix>
  Specify a prefix for the created attribute names (default: "")
 
  -T
  Set whether the word frequencies should be transformed into
  log(1+fij), where fij is the frequency of word i in document (instance) j.
 
  -I
  Set whether the word frequencies in a document should be transformed into
  fij*log(num of Docs/num of docs with word i), where fij is the frequency
  of word i in document (instance) j.
 
  -N
  Whether to normalize to average length of documents seen during dictionary construction
 
  -L
  Convert all tokens to lowercase when matching against dictionary entries.
 
  -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 
  -stopwords-handler <spec>
  The stopwords handler to use (default = Null)
 
  -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 
  -output-debug-info
  If set, filter is run in debug mode and
  may output additional info to the console
 
  -do-not-check-capabilities
  If set, filter capabilities are not checked before filter is built
  (use with caution).
 
Version:
$Revision: 15573 $
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com)
  • Constructor Details

    • FixedDictionaryStringToWordVector

      public FixedDictionaryStringToWordVector()
  • Method Details

    • getCapabilities

      public Capabilities getCapabilities()
      Returns the Capabilities of this filter.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Overrides:
      getCapabilities in class Filter
      Returns:
      the capabilities of this object
    • getDictionaryHandler

      public DictionaryBuilder getDictionaryHandler()
      Get the dictionary builder used to manage the dictionary and perform the actual vectorization
      Returns:
      the DictionaryBuilder in use
    • setDictionarySource

      public void setDictionarySource(InputStream source)
      Set an input stream to load a binary serialized dictionary from, rather than source it from a file
      Parameters:
      source - the input stream to read the dictionary from
    • setDictionarySource

      public void setDictionarySource(Reader source)
      Set an input reader to load a textual dictionary from, rather than source it from a file
      Parameters:
      source - the input reader to read the dictionary from
    • setDictionaryFile

      @OptionMetadata(displayName="Dictionary file", description="The path to the dictionary to use", commandLineParamName="dictionary", commandLineParamSynopsis="-dictionary <path to dictionary file>", displayOrder=1) @FilePropertyMetadata(fileChooserDialogType=0, directoriesOnly=false) public void setDictionaryFile(File file)
      Set the dictionary file to read from
      Parameters:
      file - the file to read from
    • getDictionaryFile

      public File getDictionaryFile()
      Get the dictionary file to read from
      Returns:
      the dictionary file to read from
    • setDictionaryIsBinary

      @OptionMetadata(displayName="Dictionary is binary", description="Dictionary file contains a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2) public void setDictionaryIsBinary(boolean binary)
      Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one
      Parameters:
      binary - true if the dictionary is a binary serialized one
    • getDictionaryIsBinary

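      public boolean getDictionaryIsBinary()
      Get whether the dictionary file contains a binary serialized dictionary, rather than a plain text one
      Returns:
      true if the dictionary is a binary serialized one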
    • getOutputWordCounts

      public boolean getOutputWordCounts()
      Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Returns:
      true if word counts should be output.
    • setOutputWordCounts

      @OptionMetadata(displayName="Output word counts", description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word", commandLineParamName="C", commandLineParamSynopsis="-C", commandLineParamIsFlag=true, displayOrder=3) public void setOutputWordCounts(boolean outputWordCounts)
      Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Parameters:
      outputWordCounts - true if word counts should be output.
    • getAttributeIndices

      public String getAttributeIndices()
      Gets the current range selection.
      Returns:
      a string containing a comma separated list of ranges
    • setAttributeIndices

      @OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4) public void setAttributeIndices(String rangeList)
      Sets which attributes are to be worked on.
      Parameters:
      rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
      e.g.: first-3,5,6-last
      Throws:
      IllegalArgumentException - if an invalid range list is supplied
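
      How a 1-based range list such as first-3,5,6-last might resolve to attribute positions can be sketched as follows. This is not Weka's range parser: RangeListSketch and its methods are hypothetical, and the strict bounds checking is an assumption made for the illustration. The invert flag mirrors the -V option by returning the attributes that were not listed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not Weka code. Class and method names are hypothetical.
final class RangeListSketch {

    /** Returns the 0-based positions selected by a 1-based range list. */
    static List<Integer> select(String rangeList, int numAttributes, boolean invert) {
        boolean[] selected = new boolean[numAttributes];
        for (String part : rangeList.split(",")) {
            String range = part.trim();
            int dash = range.indexOf('-');
            int from = parseIndex(dash < 0 ? range : range.substring(0, dash), numAttributes);
            int to = dash < 0 ? from : parseIndex(range.substring(dash + 1), numAttributes);
            if (from > to) {
                throw new IllegalArgumentException("Invalid range: " + range);
            }
            for (int i = from; i <= to; i++) {
                selected[i - 1] = true; // convert 1-based to 0-based
            }
        }

        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            // With invert set, keep exactly the attributes that were NOT listed.
            if (selected[i] != invert) {
                result.add(i);
            }
        }
        return result;
    }

    /** Parses "first", "last", or a 1-based number (bounds checking is an assumption). */
    private static int parseIndex(String text, int numAttributes) {
        String trimmed = text.trim();
        int index;
        if (trimmed.equalsIgnoreCase("first")) {
            index = 1;
        } else if (trimmed.equalsIgnoreCase("last")) {
            index = numAttributes;
        } else {
            try {
                index = Integer.parseInt(trimmed);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid index: " + text, e);
            }
        }
        if (index < 1 || index > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + text);
        }
        return index;
    }

    public static void main(String[] args) {
        // prints [0, 1, 2, 4, 5, 6, 7]
        System.out.println(select("first-3,5,6-last", 8, false));
        // prints [3]
        System.out.println(select("first-3,5,6-last", 8, true));
    }
}
```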
    • getInvertSelection

      public boolean getInvertSelection()
      Gets whether the supplied columns are to be processed or skipped.
      Returns:
      true if the selection is inverted, i.e. the supplied columns are skipped and only the non-selected columns are processed
    • setInvertSelection

      @OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5) public void setInvertSelection(boolean invert)
      Sets whether selected columns should be processed or skipped.
      Parameters:
      invert - the new invert setting
    • getAttributeNamePrefix

      public String getAttributeNamePrefix()
      Get the attribute name prefix.
      Returns:
      The current attribute name prefix.
    • setAttributeNamePrefix

      @OptionMetadata(displayName="Prefix for created attribute names", description="Specify a prefix for the created attribute names (default: \"\")", commandLineParamName="P", commandLineParamSynopsis="-P <attribute name prefix>", displayOrder=6) public void setAttributeNamePrefix(String newPrefix)
      Set the attribute name prefix.
      Parameters:
      newPrefix - String to use as the attribute name prefix.
    • getTFTransform

      public boolean getTFTransform()
      Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Returns:
      true if word frequencies are to be transformed.
    • setTFTransform

      @OptionMetadata(displayName="TFT transform", description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.", commandLineParamName="T", commandLineParamSynopsis="-T", commandLineParamIsFlag=true, displayOrder=7) public void setTFTransform(boolean TFTransform)
      Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Parameters:
      TFTransform - true if word frequencies are to be transformed.
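
      The log(1+fij) transform can be illustrated with a few lines of plain Java. This is a sketch rather than Weka's code: TfTransformSketch is a hypothetical name, and the natural logarithm is assumed because the description only says "log".

```java
// Illustrative only: not Weka code. Class and method names are hypothetical.
final class TfTransformSketch {

    /** Applies log(1 + f) to every word frequency of one document. */
    static double[] tfTransform(double[] frequencies) {
        double[] transformed = new double[frequencies.length];
        for (int i = 0; i < frequencies.length; i++) {
            // Assumption: natural logarithm. A frequency of 0 stays 0 because log(1) = 0.
            transformed[i] = Math.log(1.0 + frequencies[i]);
        }
        return transformed;
    }

    public static void main(String[] args) {
        double[] result = tfTransform(new double[] {0.0, 1.0, 9.0});
        for (double value : result) {
            System.out.println(value);
        }
    }
}
```

      The transform dampens large counts: a word seen 9 times is weighted about 3.3 times as heavily as a word seen once, not 9 times.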
    • getIDFTransform

      public boolean getIDFTransform()
      Gets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document (instance) j.
      Returns:
      true if the word frequencies are to be transformed.
    • setIDFTransform

      @OptionMetadata(displayName="IDF transform", description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.", commandLineParamName="I", commandLineParamSynopsis="-I", commandLineParamIsFlag=true, displayOrder=8) public void setIDFTransform(boolean IDFTransform)
      Sets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document (instance) j.
      Parameters:
      IDFTransform - true if the word frequencies are to be transformed
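
      The IDF formula above can be sketched in plain Java. This is not Weka's code: IdfTransformSketch and its parameter names are hypothetical, the natural logarithm is assumed, and treating a word that occurs in no document as 0 is a choice made here to avoid dividing by zero.

```java
// Illustrative only: not Weka code. Class and method names are hypothetical.
final class IdfTransformSketch {

    /**
     * Computes fij * log(numDocs / docsWithWord[i]) for one document.
     *
     * @param frequencies  fij, the frequency of each word i in this document
     * @param docsWithWord number of documents that contain word i
     * @param numDocs      total number of documents
     */
    static double[] idfTransform(double[] frequencies, int[] docsWithWord, int numDocs) {
        double[] transformed = new double[frequencies.length];
        for (int i = 0; i < frequencies.length; i++) {
            if (docsWithWord[i] == 0) {
                // Assumption: a word seen in no document gets weight 0.
                transformed[i] = 0.0;
            } else {
                // Assumption: natural logarithm.
                transformed[i] = frequencies[i] * Math.log((double) numDocs / docsWithWord[i]);
            }
        }
        return transformed;
    }

    public static void main(String[] args) {
        double[] result = idfTransform(new double[] {3.0, 0.0, 1.0}, new int[] {2, 0, 4}, 4);
        for (double value : result) {
            System.out.println(value);
        }
    }
}
```

      A word that occurs in every document has log(1) = 0 as its factor, so it contributes nothing after the transform regardless of its frequency.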
    • setNormalizeDocLength

      @OptionMetadata(displayName="Normalize word frequencies", description="Whether to normalize to average length of documents seen during dictionary construction", commandLineParamName="N", commandLineParamSynopsis="-N", commandLineParamIsFlag=true, displayOrder=9) public void setNormalizeDocLength(boolean normalize)
      Sets whether the word frequencies for a document (instance) should be normalized.
      Parameters:
      normalize - true if word frequencies are to be normalized.
    • getNormalizeDocLength

      public boolean getNormalizeDocLength()
      Gets whether the word frequencies for a document (instance) should be normalized.
      Returns:
      true if word frequencies are to be normalized.
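
      One plausible reading of "normalize to average length of documents" is sketched below. It is an assumption, not a statement of what Weka does: the sketch takes document length to mean the Euclidean (L2) norm of the word-value vector and rescales the vector so that its norm equals a given average. NormalizeSketch and its parameters are hypothetical names.

```java
// Illustrative only: not Weka code. Class and method names are hypothetical.
final class NormalizeSketch {

    /**
     * Rescales a document vector so that its Euclidean length equals avgDocLength.
     * Assumption: "length" means the L2 norm of the word values.
     */
    static double[] normalize(double[] values, double avgDocLength) {
        double sumOfSquares = 0.0;
        for (double value : values) {
            sumOfSquares += value * value;
        }
        double docLength = Math.sqrt(sumOfSquares);

        double[] normalized = values.clone();
        if (docLength == 0.0) {
            return normalized; // an all-zero document cannot be rescaled
        }
        for (int i = 0; i < normalized.length; i++) {
            normalized[i] = normalized[i] * avgDocLength / docLength;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // The vector [3, 4] has length 5; scaling to length 10 doubles each value.
        double[] result = normalize(new double[] {3.0, 4.0}, 10.0);
        System.out.println(result[0] + " " + result[1]);
    }
}
```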
    • getLowerCaseTokens

      public boolean getLowerCaseTokens()
      Gets whether the tokens are to be converted to lower case.
      Returns:
      true if the tokens are to be converted to lower case.
    • setLowerCaseTokens

      @OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10) public void setLowerCaseTokens(boolean downCaseTokens)
      Sets whether the tokens are to be converted to lower case. (Doesn't affect non-alphabetic characters in tokens.)
      Parameters:
      downCaseTokens - true if all tokens are to be converted to lower case.
    • setStemmer

      @OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11) public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • setStopwordsHandler

      @OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12) public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler; if null, Null is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • setTokenizer

      @OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13) public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • globalInfo

      public String globalInfo()
      Description copied from class: SimpleFilter
      Returns a string describing this filter.
      Specified by:
      globalInfo in class SimpleFilter
      Returns:
      a description of the filter suitable for displaying in the explorer/experimenter gui
    • setEnvironment

      public void setEnvironment(Environment env)
      Description copied from interface: EnvironmentHandler
      Set environment variables to use.
      Specified by:
      setEnvironment in interface EnvironmentHandler
      Parameters:
      env - the environment variables to use
    • main

      public static void main(String[] args)