Class FixedDictionaryStringToWordVector

All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, EnvironmentHandler, OptionHandler, RevisionHandler, WeightedInstancesHandler, StreamableFilter, UnsupervisedFilter

public class FixedDictionaryStringToWordVector extends SimpleStreamFilter implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler
Converts String attributes into a set of attributes representing word occurrence information (depending on the tokenizer) from the text contained in the strings. The set of words (attributes) is taken from a user-supplied dictionary, provided either as plain text or as a serialized Java object.

Valid options are:

  -dictionary <path to dictionary file>
  The path to the dictionary to use
  -binary-dict
  Dictionary file contains a binary serialized dictionary
  -C
  Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word)
  -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.
  -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed
  -P <attribute name prefix>
  Specify a prefix for the created attribute names (default: "")
  -T
  Set whether the word frequencies should be transformed into
  log(1+fij), where fij is the frequency of word i in document (instance) j.
  -I
  Set whether the word frequencies in a document should be transformed into
  fij*log(num of Docs/num of docs with word i), where fij is the frequency
  of word i in document (instance) j.
  -N
  Whether to normalize to average length of documents seen during dictionary construction
  -L
  Convert all tokens to lowercase when matching against dictionary entries.
  -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
  -stopwords-handler <spec>
  The stopwords handler to use (default = Null)
  -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
  -output-debug-info
  If set, filter is run in debug mode and
  may output additional info to the console
  -do-not-check-capabilities
  If set, filter capabilities are not checked before filter is built
  (use with caution).
Version:
$Revision: 15573 $
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com)
  • Constructor Details

    • FixedDictionaryStringToWordVector

      public FixedDictionaryStringToWordVector()
  • Method Details

    • getCapabilities

      public Capabilities getCapabilities()
      Returns the Capabilities of this filter.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      getCapabilities in class Filter
      Returns:
      the capabilities of this object
    • getDictionaryHandler

      public DictionaryBuilder getDictionaryHandler()
      Get the dictionary builder used to manage the dictionary and perform the actual vectorization.
      Returns:
      the DictionaryBuilder in use
    • setDictionarySource

      public void setDictionarySource(InputStream source)
      Set an input stream from which to load a binary serialized dictionary, rather than sourcing it from a file.
      Parameters:
      source - the input stream to read the dictionary from
    • setDictionarySource

      public void setDictionarySource(Reader source)
      Set a reader from which to load a textual dictionary, rather than sourcing it from a file.
      Parameters:
      source - the reader to read the dictionary from
    • setDictionaryFile

      @OptionMetadata(displayName="Dictionary file", description="The path to the dictionary to use", commandLineParamName="dictionary", commandLineParamSynopsis="-dictionary <path to dictionary file>", displayOrder=1) @FilePropertyMetadata(fileChooserDialogType=0, directoriesOnly=false) public void setDictionaryFile(File file)
      Set the dictionary file to read from.
      Parameters:
      file - the file to read from
    • getDictionaryFile

      public File getDictionaryFile()
      Get the dictionary file to read from.
      Returns:
      the dictionary file to read from
    • setDictionaryIsBinary

      @OptionMetadata(displayName="Dictionary is binary", description="Dictionary file contains a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2) public void setDictionaryIsBinary(boolean binary)
      Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one.
      Parameters:
      binary - true if the dictionary is a binary serialized one
    • getDictionaryIsBinary

      public boolean getDictionaryIsBinary()
      Get whether the dictionary file contains a binary serialized dictionary, rather than a plain text one.
      Returns:
      true if the dictionary is a binary serialized one
    • getOutputWordCounts

      public boolean getOutputWordCounts()
      Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Returns:
      true if word counts should be output.
    • setOutputWordCounts

      @OptionMetadata(displayName="Output word counts", description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word", commandLineParamName="C", commandLineParamSynopsis="-C", commandLineParamIsFlag=true, displayOrder=3) public void setOutputWordCounts(boolean outputWordCounts)
      Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Parameters:
      outputWordCounts - true if word counts should be output.
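      The difference between presence flags and word counts can be sketched in plain Java. This is a standalone illustration, not Weka's implementation: the class name, the method name, and the whitespace tokenization are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

final class WordVectorSketch {

  /**
   * Maps a text onto a fixed dictionary. Each position holds 0/1 (presence)
   * or, if outputWordCounts is true, the number of occurrences of that word.
   */
  static double[] vectorize(String text, List<String> dictionary, boolean outputWordCounts) {
    double[] vector = new double[dictionary.size()];
    // Assumption: splitting on whitespace stands in for the configured tokenizer.
    for (String token : text.trim().split("\\s+")) {
      int index = dictionary.indexOf(token);
      if (index >= 0) {
        // Words that are not in the dictionary are ignored.
        vector[index] = outputWordCounts ? vector[index] + 1 : 1;
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    List<String> dictionary = Arrays.asList("cat", "dog", "fish");
    String text = "cat dog cat bird";
    System.out.println(Arrays.toString(vectorize(text, dictionary, false))); // [1.0, 1.0, 0.0]
    System.out.println(Arrays.toString(vectorize(text, dictionary, true)));  // [2.0, 1.0, 0.0]
  }
}
```

      In this sketch "bird" produces no output value because it is absent from the fixed dictionary.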
    • getAttributeIndices

      public String getAttributeIndices()
      Gets the current range selection.
      Returns:
      a string containing a comma separated list of ranges
    • setAttributeIndices

      @OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4) public void setAttributeIndices(String rangeList)
      Sets which attributes are to be worked on.
      Parameters:
      rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
      e.g.: first-3,5,6-last
      Throws:
      IllegalArgumentException - if an invalid range list is supplied
    • getInvertSelection

      public boolean getInvertSelection()
      Gets whether the supplied columns are to be processed or skipped.
      Returns:
      true if the selection is inverted, i.e. only attributes outside the supplied range are processed
    • setInvertSelection

      @OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5) public void setInvertSelection(boolean invert)
      Sets whether selected columns should be processed or skipped.
      Parameters:
      invert - true to process only the non-selected attributes, false to process only the selected ones
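      How a 1-based range list such as first-3,5,6-last and the invert setting combine can be sketched in standalone Java. This is only an illustration of the semantics described on this page; the class and method names are hypothetical and the code is not Weka's own range parser.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeListSketch {

  /** Resolves "first", "last", or a 1-based number to a 0-based index. */
  static int resolve(String token, int numAttributes) {
    String t = token.trim();
    if (t.equalsIgnoreCase("first")) {
      return 0;
    }
    if (t.equalsIgnoreCase("last")) {
      return numAttributes - 1;
    }
    // Integer.parseInt throws NumberFormatException, a subclass of IllegalArgumentException.
    int index = Integer.parseInt(t) - 1;
    if (index < 0 || index >= numAttributes) {
      throw new IllegalArgumentException("Index out of range: " + token);
    }
    return index;
  }

  /** Returns the 0-based indices to process, honouring the invert flag. */
  static List<Integer> select(String rangeList, int numAttributes, boolean invert) {
    boolean[] inRange = new boolean[numAttributes];
    for (String part : rangeList.split(",")) {
      String[] bounds = part.split("-");
      if (bounds.length < 1 || bounds.length > 2) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      int a = resolve(bounds[0], numAttributes);
      int b = bounds.length == 2 ? resolve(bounds[1], numAttributes) : a;
      for (int i = Math.min(a, b); i <= Math.max(a, b); i++) {
        inRange[i] = true;
      }
    }
    List<Integer> selected = new ArrayList<>();
    for (int i = 0; i < numAttributes; i++) {
      // invert == false keeps attributes inside the range, invert == true keeps the rest.
      if (inRange[i] != invert) {
        selected.add(i);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    System.out.println(select("first-3,5,6-last", 8, false)); // [0, 1, 2, 4, 5, 6, 7]
    System.out.println(select("first-3,5,6-last", 8, true));  // [3]
  }
}
```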
    • getAttributeNamePrefix

      public String getAttributeNamePrefix()
      Get the attribute name prefix.
      Returns:
      The current attribute name prefix.
    • setAttributeNamePrefix

      @OptionMetadata(displayName="Prefix for created attribute names", description="Specify a prefix for the created attribute names (default: \"\")", commandLineParamName="P", commandLineParamSynopsis="-P <attribute name prefix>", displayOrder=6) public void setAttributeNamePrefix(String newPrefix)
      Set the attribute name prefix.
      Parameters:
      newPrefix - String to use as the attribute name prefix.
    • getTFTransform

      public boolean getTFTransform()
      Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Returns:
      true if word frequencies are to be transformed.
    • setTFTransform

      @OptionMetadata(displayName="TFT transform", description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.", commandLineParamName="T", commandLineParamSynopsis="-T", commandLineParamIsFlag=true, displayOrder=7) public void setTFTransform(boolean TFTransform)
      Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Parameters:
      TFTransform - true if word frequencies are to be transformed.
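      As a standalone illustration of the log(1+fij) formula (not Weka's own code), the transform can be sketched as follows. The class and method names are hypothetical, and the natural logarithm is an assumption, since the base is not stated here.

```java
final class TfTransformSketch {

  /** Applies log(1 + fij) to a raw word frequency fij. */
  static double tfTransform(double frequency) {
    // Math.log1p(x) computes ln(1 + x) accurately, including for small x.
    return Math.log1p(frequency);
  }

  public static void main(String[] args) {
    double[] rawFrequencies = {0, 1, 9, 99};
    for (double f : rawFrequencies) {
      System.out.println(f + " -> " + tfTransform(f));
    }
  }
}
```

      A frequency of 0 stays 0, and large counts are compressed, so a word that occurs 99 times does not dominate one that occurs 9 times by a factor of eleven.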
    • getIDFTransform

      public boolean getIDFTransform()
      Gets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document (instance) j.
      Returns:
      true if the word frequencies are to be transformed.
    • setIDFTransform

      @OptionMetadata(displayName="IDF transform", description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.", commandLineParamName="I", commandLineParamSynopsis="-I", commandLineParamIsFlag=true, displayOrder=8) public void setIDFTransform(boolean IDFTransform)
      Sets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document (instance) j.
      Parameters:
      IDFTransform - true if the word frequencies are to be transformed
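      The formula fij*log(num of Docs/num of Docs with word i) can be sketched in standalone Java as below. This is an illustration only, not Weka's implementation: the names numDocs and docsWithWord are hypothetical parameters, and the natural logarithm is an assumption.

```java
final class IdfTransformSketch {

  /**
   * Applies fij * log(numDocs / docsWithWord) to a word frequency fij.
   * numDocs is the total number of documents, docsWithWord the number containing word i.
   */
  static double idfTransform(double frequency, int numDocs, int docsWithWord) {
    if (docsWithWord <= 0 || docsWithWord > numDocs) {
      throw new IllegalArgumentException("docsWithWord must be in 1..numDocs");
    }
    // Cast to double first so the division is not truncated to an integer.
    return frequency * Math.log((double) numDocs / docsWithWord);
  }

  public static void main(String[] args) {
    // A word found in every document carries no weight.
    System.out.println(idfTransform(3, 100, 100));
    // A word found in few documents is weighted up.
    System.out.println(idfTransform(3, 100, 5));
  }
}
```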
    • setNormalizeDocLength

      @OptionMetadata(displayName="Normalize word frequencies", description="Whether to normalize to average length of documents seen during dictionary construction", commandLineParamName="N", commandLineParamSynopsis="-N", commandLineParamIsFlag=true, displayOrder=9) public void setNormalizeDocLength(boolean normalize)
      Sets whether the word frequencies for a document (instance) should be normalized.
      Parameters:
      normalize - true if word frequencies are to be normalized.
    • getNormalizeDocLength

      public boolean getNormalizeDocLength()
      Gets whether the word frequencies for a document (instance) should be normalized.
      Returns:
      true if word frequencies are to be normalized.
    • getLowerCaseTokens

      public boolean getLowerCaseTokens()
      Gets whether the tokens are to be converted to lower case.
      Returns:
      true if the tokens are to be converted to lower case.
    • setLowerCaseTokens

      @OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10) public void setLowerCaseTokens(boolean downCaseTokens)
      Sets whether the tokens are to be converted to lower case. (Doesn't affect non-alphabetic characters in tokens.)
      Parameters:
      downCaseTokens - true if tokens are to be converted to lower case.
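      A minimal standalone sketch of lower-casing tokens before a dictionary lookup follows. The class and method names are hypothetical, and the use of Locale.ROOT is an assumption made so the example behaves the same on every machine; it is not taken from Weka's code.

```java
import java.util.Locale;

final class LowerCaseTokenSketch {

  /** Returns the token as it would be matched against the dictionary. */
  static String normalize(String token, boolean lowerCaseTokens) {
    // Only letters change; digits, hyphens and other symbols pass through untouched.
    return lowerCaseTokens ? token.toLowerCase(Locale.ROOT) : token;
  }

  public static void main(String[] args) {
    System.out.println(normalize("Wi-Fi_2", true));  // wi-fi_2
    System.out.println(normalize("Wi-Fi_2", false)); // Wi-Fi_2
  }
}
```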
    • setStemmer

      @OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11) public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • setStopwordsHandler

      @OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12) public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler; if null, the Null handler is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • setTokenizer

      @OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13) public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • globalInfo

      public String globalInfo()
      Description copied from class: SimpleFilter
      Returns a string describing this filter.
      Specified by:
      globalInfo in class SimpleFilter
      Returns:
      a description of the filter suitable for displaying in the explorer/experimenter GUI
    • setEnvironment

      public void setEnvironment(Environment env)
      Description copied from interface: EnvironmentHandler
      Set environment variables to use.
      Specified by:
      setEnvironment in interface EnvironmentHandler
      Parameters:
      env - the environment variables to use
    • main

      public static void main(String[] args)