weka.filters.unsupervised.attribute.FixedDictionaryStringToWordVector

All Implemented Interfaces: Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, EnvironmentHandler, OptionHandler, RevisionHandler, WeightedInstancesHandler, StreamableFilter, UnsupervisedFilter

public class FixedDictionaryStringToWordVector extends SimpleStreamFilter implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler

Converts String attributes into a set of attributes representing word occurrence information (depending on the tokenizer) from the text contained in the strings. The set of words (attributes) is taken from a user-supplied dictionary, either in plain text form or as a serialized Java object.
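The idea of mapping text onto a fixed word list can be sketched in plain Java. This is not Weka's implementation: whitespace splitting stands in for the configured tokenizer, and the names `FixedDictionarySketch` and `vectorize` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative only, not Weka code: maps a text onto a fixed word list. */
final class FixedDictionarySketch {

    /**
     * Returns one value per dictionary word: a count if outputCounts is true,
     * otherwise 1.0 for presence and 0.0 for absence.
     */
    static double[] vectorize(String text, List<String> dictionary,
                              boolean outputCounts, boolean lowerCase) {
        double[] vector = new double[dictionary.size()];
        // Whitespace splitting is an assumption standing in for a real tokenizer.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String key = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            int index = dictionary.indexOf(key);
            if (index < 0) {
                continue; // words outside the dictionary are ignored
            }
            vector[index] = outputCounts ? vector[index] + 1.0 : 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("cat", "dog", "fish");
        System.out.println(Arrays.toString(
            vectorize("The cat saw a Cat and a dog", dictionary, true, true)));
    }
}
```

Because the dictionary is fixed up front, the output width never depends on the data being filtered, which is what allows the filter to work on a stream of instances.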

Valid options are:

  -dictionary <path to dictionary file>
  The path to the dictionary to use

  -binary-dict
  Dictionary file contains a binary serialized dictionary

  -C
  Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word)

  -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.

  -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed

  -P <attribute name prefix>
  Specify a prefix for the created attribute names (default: "")

  -T
  Set whether the word frequencies should be transformed into
  log(1+fij), where fij is the frequency of word i in document (instance) j.

  -I
  Set whether the word frequencies in a document should be transformed into
  fij*log(num of Docs/num of docs with word i), where fij is the frequency
  of word i in document (instance) j.

  -N
  Whether to normalize to average length of documents seen during dictionary construction

  -L
  Convert all tokens to lowercase when matching against dictionary entries.

  -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.

  -stopwords-handler <spec>
  The stopwords handler to use (default = Null)

  -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)

  -output-debug-info
  If set, filter is run in debug mode and
  may output additional info to the console

  -do-not-check-capabilities
  If set, filter capabilities are not checked before filter is built
  (use with caution).
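The two formulas quoted for -T and -I can be written out directly. The sketch below is plain Java, not Weka code; the class and method names are hypothetical, and the use of the natural logarithm is an assumption, since the option text only says "log".

```java
/** Illustrative only: plain-Java versions of the formulas quoted for -T and -I. */
final class TfIdfSketch {

    /** TF transform: log(1 + fij). The natural logarithm is an assumption. */
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    /** IDF transform: fij * log(numDocs / numDocsWithWord). */
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        if (numDocs <= 0 || numDocsWithWord <= 0 || numDocsWithWord > numDocs) {
            throw new IllegalArgumentException("document counts out of range");
        }
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        // A word seen 3 times in this document and in 10 of 100 documents overall.
        System.out.println(tfTransform(3.0));
        System.out.println(idfTransform(3.0, 100, 10));
    }
}
```

A word that occurs in every document gets an IDF-transformed value of zero, since log(1) is 0.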

Version:

$Revision: 15573 $

Author:

Mark Hall (mhall{[at]}pentaho{[dot]}com)

See Also:

Serialized Form

Constructor Summary

Constructors

Constructor

Description

FixedDictionaryStringToWordVector()

Method Summary

Modifier and Type

Method

Description

String

getAttributeIndices()

Gets the current range selection.

String

getAttributeNamePrefix()

Get the attribute name prefix.

Capabilities

getCapabilities()

Returns the Capabilities of this filter.

File

getDictionaryFile()

Get the dictionary file to read from

DictionaryBuilder

getDictionaryHandler()

Get the dictionary builder used to manage the dictionary and perform the actual vectorization

boolean

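getDictionaryIsBinary()

Get whether the dictionary file contains a binary serialized dictionary, rather than a plain text one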

boolean

getIDFTransform()

Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

boolean

getInvertSelection()

Gets whether the supplied columns are to be processed or skipped.

boolean

getLowerCaseTokens()

Gets whether the tokens are to be downcased.

boolean

getNormalizeDocLength()

Gets whether the word frequencies for a document (instance) should be normalized.

boolean

getOutputWordCounts()

Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Stemmer

getStemmer()

Returns the current stemming algorithm, null if none is used.

StopwordsHandler

getStopwordsHandler()

Gets the stopwords handler.

boolean

getTFTransform()

Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

Tokenizer

getTokenizer()

Returns the current tokenizer algorithm.

String

globalInfo()

Returns a string describing this filter.

static void

main(String[] args)

void

setAttributeIndices(String rangeList)

Sets which attributes are to be worked on.

void

setAttributeNamePrefix(String newPrefix)

Set the attribute name prefix.

void

setDictionaryFile(File file)

Set the dictionary file to read from

void

setDictionaryIsBinary(boolean binary)

Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one

void

setDictionarySource(InputStream source)

Set an input stream to load a binary serialized dictionary from, rather than source it from a file

void

setDictionarySource(Reader source)

Set an input reader to load a textual dictionary from, rather than source it from a file

void

setEnvironment(Environment env)

Set environment variables to use.

void

setIDFTransform(boolean IDFTransform)

Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

void

setInvertSelection(boolean invert)

Sets whether selected columns should be processed or skipped.

void

setLowerCaseTokens(boolean downCaseTokens)

Sets whether the tokens are to be downcased.

void

setNormalizeDocLength(boolean normalize)

Sets whether the word frequencies for a document (instance) should be normalized.

void

setOutputWordCounts(boolean outputWordCounts)

Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

void

setStemmer(Stemmer value)

Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).

void

setStopwordsHandler(StopwordsHandler value)

Sets the stopwords handler to use.

void

setTFTransform(boolean TFTransform)

Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

void

setTokenizer(Tokenizer value)

Sets the tokenizer algorithm to use.

Methods inherited from class weka.filters.SimpleStreamFilter
batchFinished, input

Methods inherited from class weka.filters.SimpleFilter
setInputFormat

Methods inherited from class weka.filters.Filter
batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOptions, getOutputFormat, getRevision, isFirstBatchDone, isNewBatch, isOutputFormatDefined, listOptions, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, setOptions, toString, useFilter, wekaStaticWrapper

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Details
- FixedDictionaryStringToWordVector
  
  public FixedDictionaryStringToWordVector()

Method Details
- getCapabilities
  
  public Capabilities getCapabilities()
  
  Returns the Capabilities of this filter.
  
  Specified by:
  
  getCapabilities in interface CapabilitiesHandler
  
  Overrides:
  
  getCapabilities in class Filter
  
  Returns:
  
  the capabilities of this object
  
  See Also:
  
  Capabilities
- getDictionaryHandler
  
  public DictionaryBuilder getDictionaryHandler()
  
  Get the dictionary builder used to manage the dictionary and perform the actual vectorization
  
  Returns:
  
  the DictionaryBuilder in use
- setDictionarySource
  
  public void setDictionarySource(InputStream source)
  
  Set an input stream to load a binary serialized dictionary from, rather than source it from a file
  
  Parameters:
  
  source - the input stream to read the dictionary from
- setDictionarySource
  
  public void setDictionarySource(Reader source)
  
  Set an input reader to load a textual dictionary from, rather than source it from a file
  
  Parameters:
  
  source - the input reader to read the dictionary from
- setDictionaryFile
  
  @OptionMetadata(displayName="Dictionary file", description="The path to the dictionary to use", commandLineParamName="dictionary", commandLineParamSynopsis="-dictionary <path to dictionary file>", displayOrder=1)
  @FilePropertyMetadata(fileChooserDialogType=0, directoriesOnly=false)
  public void setDictionaryFile(File file)
  
  Set the dictionary file to read from
  
  Parameters:
  
  file - the file to read from
- getDictionaryFile
  
  public File getDictionaryFile()
  
  Get the dictionary file to read from
  
  Returns:
  
  the dictionary file to read from
- setDictionaryIsBinary
  
  @OptionMetadata(displayName="Dictionary is binary", description="Dictionary file contains a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
  public void setDictionaryIsBinary(boolean binary)
  
  Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one
  
  Parameters:
  
  binary - true if the dictionary is a binary serialized one
- getDictionaryIsBinary
  
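  public boolean getDictionaryIsBinary()
  
  Get whether the dictionary file contains a binary serialized dictionary, rather than a plain text one
  
  Returns:
  
  true if the dictionary is a binary serialized one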
- getOutputWordCounts
  
  public boolean getOutputWordCounts()
  
  Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
  
  Returns:
  
  true if word counts should be output.
- setOutputWordCounts
  
  @OptionMetadata(displayName="Output word counts", description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word", commandLineParamName="C", commandLineParamSynopsis="-C", commandLineParamIsFlag=true, displayOrder=3)
  public void setOutputWordCounts(boolean outputWordCounts)
  
  Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
  
  Parameters:
  
  outputWordCounts - true if word counts should be output.
- getAttributeIndices
  
  public String getAttributeIndices()
  
  Gets the current range selection.
  
  Returns:
  
  a string containing a comma separated list of ranges
- setAttributeIndices
  
  @OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
  public void setAttributeIndices(String rangeList)
  
  Sets which attributes are to be worked on.
  
  Parameters:
  
  rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
  Example: first-3,5,6-last
  
  Throws:
  
  IllegalArgumentException - if an invalid range list is supplied
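How a 1-based range list such as first-3,5,6-last expands into attribute positions can be sketched as follows. This is a simplified stand-in written for illustration, not the parser Weka uses; the class and method names are hypothetical, and rejecting descending ranges is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: expands a 1-based range list into zero-based indices. */
final class RangeListSketch {

    static List<Integer> expand(String rangeList, int numAttributes) {
        List<Integer> result = new ArrayList<>();
        for (String part : rangeList.split(",")) {
            String item = part.trim();
            int dash = item.indexOf('-');
            int start;
            int end;
            if (dash < 0) {
                start = resolve(item, numAttributes);
                end = start;
            } else {
                start = resolve(item.substring(0, dash).trim(), numAttributes);
                end = resolve(item.substring(dash + 1).trim(), numAttributes);
            }
            if (start > end) {
                throw new IllegalArgumentException("Invalid range: " + item);
            }
            for (int i = start; i <= end; i++) {
                int zeroBased = i - 1; // user-facing indices start at 1
                if (!result.contains(zeroBased)) {
                    result.add(zeroBased);
                }
            }
        }
        return result;
    }

    /** Resolves "first", "last", or a 1-based number to a 1-based index. */
    private static int resolve(String token, int numAttributes) {
        if (token.equals("first")) {
            return 1;
        }
        if (token.equals("last")) {
            return numAttributes;
        }
        int value;
        try {
            value = Integer.parseInt(token);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid index: " + token, e);
        }
        if (value < 1 || value > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + token);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(expand("first-3,5,6-last", 8));
    }
}
```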
- getInvertSelection
  
  public boolean getInvertSelection()
  
  Gets whether the supplied columns are to be processed or skipped.
  
  Returns:
  
  true if the supplied columns will be kept
- setInvertSelection
  
  @OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
  public void setInvertSelection(boolean invert)
  
  Sets whether selected columns should be processed or skipped.
  
  Parameters:
  
  invert - the new invert setting
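The effect of the invert setting on which attributes get processed can be illustrated with a small plain-Java sketch. The names are hypothetical and the code only models the rule stated in the option description: with invert false the selected attributes are processed, with invert true the non-selected ones are.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: models the selection rule described for -V. */
final class InvertSelectionSketch {

    /** Returns the zero-based indices that would be processed. */
    static List<Integer> attributesToProcess(Set<Integer> selected,
                                             int numAttributes,
                                             boolean invert) {
        List<Integer> toProcess = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            // Process an attribute when its membership differs from the invert flag.
            if (selected.contains(i) != invert) {
                toProcess.add(i);
            }
        }
        return toProcess;
    }
}
```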
- getAttributeNamePrefix
  
  public String getAttributeNamePrefix()
  
  Get the attribute name prefix.
  
  Returns:
  
  The current attribute name prefix.
- setAttributeNamePrefix
  
  @OptionMetadata(displayName="Prefix for created attribute names", description="Specify a prefix for the created attribute names (default: \"\")", commandLineParamName="P", commandLineParamSynopsis="-P <attribute name prefix>", displayOrder=6)
  public void setAttributeNamePrefix(String newPrefix)
  
  Set the attribute name prefix.
  
  Parameters:
  
  newPrefix - String to use as the attribute name prefix.
- getTFTransform
  
  public boolean getTFTransform()
  
  Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
  
  Returns:
  
  true if word frequencies are to be transformed.
- setTFTransform
  
  @OptionMetadata(displayName="TFT transform", description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.", commandLineParamName="T", commandLineParamSynopsis="-T", commandLineParamIsFlag=true, displayOrder=7)
  public void setTFTransform(boolean TFTransform)
  
  Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
  
  Parameters:
  
  TFTransform - true if word frequencies are to be transformed.
- getIDFTransform
  
  public boolean getIDFTransform()
  
  Gets whether the word frequencies in a document should be transformed into:
  fij*log(num of Docs/num of Docs with word i)
  where fij is the frequency of word i in document (instance) j.
  
  Returns:
  
  true if the word frequencies are to be transformed.
- setIDFTransform
  
  @OptionMetadata(displayName="IDF transform", description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.", commandLineParamName="I", commandLineParamSynopsis="-I", commandLineParamIsFlag=true, displayOrder=8)
  public void setIDFTransform(boolean IDFTransform)
  
  Sets whether the word frequencies in a document should be transformed into:
  fij*log(num of Docs/num of Docs with word i)
  where fij is the frequency of word i in document (instance) j.
  
  Parameters:
  
  IDFTransform - true if the word frequencies are to be transformed
- setNormalizeDocLength
  
  @OptionMetadata(displayName="Normalize word frequencies", description="Whether to normalize to average length of documents seen during dictionary construction", commandLineParamName="N", commandLineParamSynopsis="-N", commandLineParamIsFlag=true, displayOrder=9)
  public void setNormalizeDocLength(boolean normalize)
  
  Sets whether the word frequencies for a document (instance) should be normalized.
  
  Parameters:
  
  normalize - true if word frequencies are to be normalized.
- getNormalizeDocLength
  
  public boolean getNormalizeDocLength()
  
  Gets whether the word frequencies for a document (instance) should be normalized.
  
  Returns:
  
  true if word frequencies are to be normalized.
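One plausible reading of "normalize to average length of documents" is to rescale each document vector so that its length matches the average length recorded with the dictionary. The sketch below assumes the Euclidean norm as the measure of document length; that choice, along with the class and method names, is an assumption made for illustration and is not confirmed by this page.

```java
/** Illustrative only: rescales a document vector to a target average length. */
final class DocLengthNormalizationSketch {

    /**
     * Returns a copy of values scaled so that its Euclidean norm equals
     * avgDocLength. An all-zero vector is returned unchanged.
     */
    static double[] normalize(double[] values, double avgDocLength) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double docLength = Math.sqrt(sumOfSquares); // assumed length measure
        double[] scaled = values.clone();
        if (docLength == 0.0) {
            return scaled; // nothing to scale, and avoids division by zero
        }
        for (int i = 0; i < scaled.length; i++) {
            scaled[i] = values[i] * avgDocLength / docLength;
        }
        return scaled;
    }
}
```

For example, a vector (3, 4) has Euclidean length 5, so with an average length of 10 every value is doubled.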
- getLowerCaseTokens
  
  public boolean getLowerCaseTokens()
  
  Gets whether the tokens are to be downcased.
  
  Returns:
  
  true if the tokens are to be downcased.
- setLowerCaseTokens
  
  @OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
  public void setLowerCaseTokens(boolean downCaseTokens)
  
  Sets whether the tokens are to be downcased. (Doesn't affect non-alphabetic characters in tokens.)
  
  Parameters:
  
  downCaseTokens - should be true if only lower case tokens are to be formed.
- setStemmer
  
  @OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
  public void setStemmer(Stemmer value)
  
  Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
  
  Parameters:
  
  value - the configured stemming algorithm, or null
  
  See Also:
  
  NullStemmer
- getStemmer
  
  public Stemmer getStemmer()
  
  Returns the current stemming algorithm, null if none is used.
  
  Returns:
  
  the current stemming algorithm, null if none set
- setStopwordsHandler
  
  @OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
  public void setStopwordsHandler(StopwordsHandler value)
  
  Sets the stopwords handler to use.
  
  Parameters:
  
  value - the stopwords handler; if null, Null is used
- getStopwordsHandler
  
  public StopwordsHandler getStopwordsHandler()
  
  Gets the stopwords handler.
  
  Returns:
  
  the stopwords handler
- setTokenizer
  
  @OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
  public void setTokenizer(Tokenizer value)
  
  Sets the tokenizer algorithm to use.
  
  Parameters:
  
  value - the configured tokenizing algorithm
- getTokenizer
  
  public Tokenizer getTokenizer()
  
  Returns the current tokenizer algorithm.
  
  Returns:
  
  the current tokenizer algorithm
- globalInfo
  
  public String globalInfo()
  
  Description copied from class: SimpleFilter
  
  Returns a string describing this filter.
  
  Specified by:
  
  globalInfo in class SimpleFilter
  
  Returns:
  
  a description of the filter suitable for displaying in the explorer/experimenter gui
- setEnvironment
  
  public void setEnvironment(Environment env)
  
  Description copied from interface: EnvironmentHandler
  
  Set environment variables to use.
  
  Specified by:
  
  setEnvironment in interface EnvironmentHandler
  
  Parameters:
  
  env - the environment variables to use
- main
  
  public static void main(String[] args)
