Class FixedDictionaryStringToWordVector
java.lang.Object
weka.filters.Filter
weka.filters.SimpleFilter
weka.filters.SimpleStreamFilter
weka.filters.unsupervised.attribute.FixedDictionaryStringToWordVector
- All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, EnvironmentHandler, OptionHandler, RevisionHandler, WeightedInstancesHandler, StreamableFilter, UnsupervisedFilter
public class FixedDictionaryStringToWordVector
extends SimpleStreamFilter
implements UnsupervisedFilter, EnvironmentHandler, WeightedInstancesHandler
Converts String attributes into a set of attributes
representing word occurrence (depending on the tokenizer) information from
the text contained in the strings. The set of words (attributes) is taken
from a user-supplied dictionary, either in plain text form or as a serialized
Java object.
Valid options are:
-dictionary <path to dictionary file> The path to the dictionary to use
-binary-dict Dictionary file contains a binary serialized dictionary
-C Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word).
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set the attribute selection mode. If false, only selected attributes in the range will be worked on. If true, only non-selected attributes will be processed.
-P <attribute name prefix> Specify a prefix for the created attribute names (default: "")
-T Set whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
-I Set whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of docs with word i), where fij is the frequency of word i in document (instance) j.
-N Whether to normalize to average length of documents seen during dictionary construction
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-output-debug-info If set, filter is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, filter capabilities are not checked before filter is built (use with caution).
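To make the effect of a fixed dictionary concrete, the following is a minimal sketch in plain Java that does not use the Weka API. It assumes whitespace tokenization, and the class and method names (FixedDictionarySketch, index, vectorize) are hypothetical. The two boolean parameters are intended to mirror the -L and -C options described above; the filter's real tokenizer, stemmer and stopwords handling are not modelled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not Weka's implementation.
final class FixedDictionarySketch {

    // Maps each dictionary word to its output column, in dictionary order.
    static Map<String, Integer> index(List<String> dictionary) {
        Map<String, Integer> idx = new LinkedHashMap<>();
        for (String word : dictionary) {
            idx.putIfAbsent(word, idx.size());
        }
        return idx;
    }

    // Turns one string into a row with one value per dictionary word.
    // lowerCase mirrors -L, outputCounts mirrors -C (assumed behaviour).
    static double[] vectorize(String text, Map<String, Integer> index,
                              boolean lowerCase, boolean outputCounts) {
        double[] row = new double[index.size()];
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String key = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            Integer col = index.get(key);
            if (col == null) {
                continue; // words outside the dictionary produce no attribute
            }
            row[col] = outputCounts ? row[col] + 1 : 1;
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Integer> idx = index(Arrays.asList("cat", "dog", "fish"));
        // Counts: prints [2.0, 0.0, 0.0]
        System.out.println(Arrays.toString(vectorize("The Cat saw a cat", idx, true, true)));
        // Presence only: prints [1.0, 0.0, 0.0]
        System.out.println(Arrays.toString(vectorize("The Cat saw a cat", idx, true, false)));
    }
}
```

Because the set of output columns depends only on the dictionary, every input string yields a row of the same width regardless of which words it contains.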
- Version:
- $Revision: 15573 $
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com)
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
String | getAttributeIndices() | Gets the current range selection.
String | getAttributeNamePrefix() | Get the attribute name prefix.
Capabilities | getCapabilities() | Returns the Capabilities of this filter.
File | getDictionaryFile() | Get the dictionary file to read from.
DictionaryBuilder | getDictionaryHandler() | Get the dictionary builder used to manage the dictionary and perform the actual vectorization.
boolean | getDictionaryIsBinary() |
boolean | getIDFTransform() | Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
boolean | getInvertSelection() | Gets whether the supplied columns are to be processed or skipped.
boolean | getLowerCaseTokens() | Gets whether the tokens are to be downcased or not.
boolean | getNormalizeDocLength() | Gets whether the word frequencies for a document (instance) should be normalized or not.
boolean | getOutputWordCounts() | Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
Stemmer | getStemmer() | Returns the current stemming algorithm, null if none is used.
StopwordsHandler | getStopwordsHandler() | Gets the stopwords handler.
boolean | getTFTransform() | Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
Tokenizer | getTokenizer() | Returns the current tokenizer algorithm.
String | globalInfo() | Returns a string describing this filter.
static void | main(String[] args) |
void | setAttributeIndices(String rangeList) | Sets which attributes are to be worked on.
void | setAttributeNamePrefix(String newPrefix) | Set the attribute name prefix.
void | setDictionaryFile(File file) | Set the dictionary file to read from.
void | setDictionaryIsBinary(boolean binary) | Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one.
void | setDictionarySource(InputStream source) | Set an input stream to load a binary serialized dictionary from, rather than sourcing it from a file.
void | setDictionarySource(Reader source) | Set an input reader to load a textual dictionary from, rather than sourcing it from a file.
void | setEnvironment(Environment env) | Set environment variables to use.
void | setIDFTransform(boolean IDFTransform) | Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
void | setInvertSelection(boolean invert) | Sets whether selected columns should be processed or skipped.
void | setLowerCaseTokens(boolean downCaseTokens) | Sets whether the tokens are to be downcased or not.
void | setNormalizeDocLength(boolean normalize) | Sets whether the word frequencies for a document (instance) should be normalized or not.
void | setOutputWordCounts(boolean outputWordCounts) | Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
void | setStemmer(Stemmer value) | Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void | setStopwordsHandler(StopwordsHandler value) | Sets the stopwords handler to use.
void | setTFTransform(boolean TFTransform) | Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
void | setTokenizer(Tokenizer value) | Sets the tokenizer algorithm to use.
Methods inherited from class weka.filters.SimpleStreamFilter
batchFinished, input
Methods inherited from class weka.filters.SimpleFilter
setInputFormat
Methods inherited from class weka.filters.Filter
batchFilterFile, debugTipText, doNotCheckCapabilitiesTipText, filterFile, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getOptions, getOutputFormat, getRevision, isFirstBatchDone, isNewBatch, isOutputFormatDefined, listOptions, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputPeek, postExecution, preExecution, run, runFilter, setDebug, setDoNotCheckCapabilities, setOptions, toString, useFilter, wekaStaticWrapper
-
Constructor Details
-
FixedDictionaryStringToWordVector
public FixedDictionaryStringToWordVector()
-
-
Method Details
-
getCapabilities
public Capabilities getCapabilities()
Returns the Capabilities of this filter.
- Specified by:
getCapabilities in interface CapabilitiesHandler
- Overrides:
getCapabilities in class Filter
- Returns:
- the capabilities of this object
-
getDictionaryHandler
public DictionaryBuilder getDictionaryHandler()
Get the dictionary builder used to manage the dictionary and perform the actual vectorization.
- Returns:
- the DictionaryBuilder in use
-
setDictionarySource
public void setDictionarySource(InputStream source)
Set an input stream to load a binary serialized dictionary from, rather than sourcing it from a file.
- Parameters:
source
- the input stream to read the dictionary from
-
setDictionarySource
public void setDictionarySource(Reader source)
Set an input reader to load a textual dictionary from, rather than sourcing it from a file.
- Parameters:
source
- the input reader to read the dictionary from
-
setDictionaryFile
@OptionMetadata(displayName="Dictionary file", description="The path to the dictionary to use", commandLineParamName="dictionary", commandLineParamSynopsis="-dictionary <path to dictionary file>", displayOrder=1)
@FilePropertyMetadata(fileChooserDialogType=0, directoriesOnly=false)
public void setDictionaryFile(File file)
Set the dictionary file to read from.
- Parameters:
file
- the file to read from
-
getDictionaryFile
public File getDictionaryFile()
Get the dictionary file to read from.
- Returns:
- the dictionary file to read from
-
setDictionaryIsBinary
@OptionMetadata(displayName="Dictionary is binary", description="Dictionary file contains a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
public void setDictionaryIsBinary(boolean binary)
Set whether the dictionary file contains a binary serialized dictionary, rather than a plain text one.
- Parameters:
binary
- true if the dictionary is a binary serialized one
-
getDictionaryIsBinary
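public boolean getDictionaryIsBinary()
Get whether the dictionary file contains a binary serialized dictionary, rather than a plain text one.
-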
getOutputWordCounts
public boolean getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
- Returns:
- true if word counts should be output.
-
setOutputWordCounts
@OptionMetadata(displayName="Output word counts", description="Output word counts rather than boolean 0 or 1 (indicating presence or absence of a word", commandLineParamName="C", commandLineParamSynopsis="-C", commandLineParamIsFlag=true, displayOrder=3)
public void setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
- Parameters:
outputWordCounts
- true if word counts should be output.
-
getAttributeIndices
public String getAttributeIndices()
Gets the current range selection.
- Returns:
- a string containing a comma separated list of ranges
-
setAttributeIndices
@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
public void setAttributeIndices(String rangeList)
Sets which attributes are to be worked on.
- Parameters:
rangeList
- a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1, e.g. first-3,5,6-last
- Throws:
IllegalArgumentException
- if an invalid range list is supplied
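As an illustration of the range list syntax, here is a minimal sketch in plain Java of how a string such as first-3,5,6-last could be expanded into 0-based attribute positions. It is not the parser the filter uses; the class name RangeListSketch, the numAttributes parameter, and the exact rules for rejecting input are assumptions made for the example.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not the parser used by the filter.
final class RangeListSketch {

    // Expands a 1-based range list into sorted 0-based attribute positions.
    static SortedSet<Integer> parse(String rangeList, int numAttributes) {
        SortedSet<Integer> selected = new TreeSet<>();
        for (String part : rangeList.split(",")) {
            String item = part.trim();
            int dash = item.indexOf('-');
            int lo;
            int hi;
            if (dash < 0) {
                lo = bound(item, numAttributes);
                hi = lo;
            } else {
                lo = bound(item.substring(0, dash).trim(), numAttributes);
                hi = bound(item.substring(dash + 1).trim(), numAttributes);
            }
            if (lo > hi) {
                throw new IllegalArgumentException("Invalid range: " + item);
            }
            for (int i = lo; i <= hi; i++) {
                selected.add(i - 1); // user-facing indices start at 1
            }
        }
        return selected;
    }

    // Resolves "first", "last" or a 1-based number to a 1-based index.
    private static int bound(String s, int numAttributes) {
        if (s.equals("first")) {
            return 1;
        }
        if (s.equals("last")) {
            return numAttributes;
        }
        int value;
        try {
            value = Integer.parseInt(s);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Invalid range item: " + s, e);
        }
        if (value < 1 || value > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + s);
        }
        return value;
    }
}
```

Under these assumptions, parsing first-3,5,6-last for a dataset with 8 attributes selects the 0-based positions 0, 1, 2, 4, 5, 6 and 7.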
-
getInvertSelection
public boolean getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
- Returns:
- true if the selection is inverted, so that only non-selected attributes are processed
-
setInvertSelection
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
public void setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
- Parameters:
invert
- the new invert setting
-
getAttributeNamePrefix
public String getAttributeNamePrefix()
Get the attribute name prefix.
- Returns:
- The current attribute name prefix.
-
setAttributeNamePrefix
@OptionMetadata(displayName="Prefix for created attribute names", description="Specify a prefix for the created attribute names (default: \"\")", commandLineParamName="P", commandLineParamSynopsis="-P <attribute name prefix>", displayOrder=6)
public void setAttributeNamePrefix(String newPrefix)
Set the attribute name prefix.
- Parameters:
newPrefix
- String to use as the attribute name prefix.
-
getTFTransform
public boolean getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
- Returns:
- true if word frequencies are to be transformed.
-
setTFTransform
@OptionMetadata(displayName="TFT transform", description="Set whether the word frequencies should be transformed into\nlog(1+fij), where fij is the frequency of word i in document (instance) j.", commandLineParamName="T", commandLineParamSynopsis="-T", commandLineParamIsFlag=true, displayOrder=7)
public void setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
- Parameters:
TFTransform
- true if word frequencies are to be transformed.
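The TF transform formula can be written as a one-line function. This is a sketch in plain Java, not the filter's code; the class name TfTransformSketch is hypothetical, and the use of the natural logarithm (Math.log) is an assumption, since the description only says log.

```java
// Hypothetical illustration only; the logarithm base is an assumption.
final class TfTransformSketch {

    // Applies log(1 + fij) to a raw frequency fij.
    static double tf(double frequency) {
        return Math.log(1.0 + frequency);
    }

    public static void main(String[] args) {
        // A word that does not occur stays at 0 after the transform.
        System.out.println(tf(0.0)); // prints 0.0
    }
}
```

Adding 1 inside the logarithm keeps absent words at exactly 0 and dampens the influence of very frequent words.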
-
getIDFTransform
public boolean getIDFTransform()
Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
- Returns:
- true if the word frequencies are to be transformed.
-
setIDFTransform
@OptionMetadata(displayName="IDF transform", description="Set whether the word frequencies in a document should be transformed into\nfij*log(num of Docs/num of docs with word i), where fij is the frequency\nof word i in document (instance) j.", commandLineParamName="I", commandLineParamSynopsis="-I", commandLineParamIsFlag=true, displayOrder=8)
public void setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
- Parameters:
IDFTransform
- true if the word frequencies are to be transformed
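The IDF formula above can be sketched as follows in plain Java. This is not the filter's code: the class name IdfTransformSketch and the parameter names are hypothetical, the natural logarithm is an assumption, and where the two document counts come from (presumably statistics stored with the dictionary) is not specified here.

```java
// Hypothetical illustration only; the logarithm base is an assumption.
final class IdfTransformSketch {

    // Computes fij * log(numDocs / docsWithWord) for one word in one document.
    static double idf(double frequency, int numDocs, int docsWithWord) {
        if (numDocs <= 0 || docsWithWord <= 0) {
            throw new IllegalArgumentException("document counts must be positive");
        }
        return frequency * Math.log((double) numDocs / docsWithWord);
    }

    public static void main(String[] args) {
        // A word that appears in every document carries no weight.
        System.out.println(idf(2.0, 100, 100)); // prints 0.0
    }
}
```

Words that occur in few documents receive a larger multiplier than words that occur in most of them.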
-
setNormalizeDocLength
@OptionMetadata(displayName="Normalize word frequencies", description="Whether to normalize to average length of documents seen during dictionary construction", commandLineParamName="N", commandLineParamSynopsis="-N", commandLineParamIsFlag=true, displayOrder=9)
public void setNormalizeDocLength(boolean normalize)
Sets whether the word frequencies for a document (instance) should be normalized or not.
- Parameters:
normalize
- true if word frequencies are to be normalized.
-
getNormalizeDocLength
public boolean getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be normalized or not.
- Returns:
- true if word frequencies are to be normalized.
-
getLowerCaseTokens
public boolean getLowerCaseTokens()
Gets whether the tokens are to be downcased or not.
- Returns:
- true if the tokens are to be downcased.
-
setLowerCaseTokens
@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens).
- Parameters:
downCaseTokens
- should be true if only lower case tokens are to be formed.
-
setStemmer
@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
public void setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value
- the configured stemming algorithm, or null
-
getStemmer
public Stemmer getStemmer()
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
setStopwordsHandler
@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
- Parameters:
value
- the stopwords handler; if null, the Null handler is used
-
getStopwordsHandler
public StopwordsHandler getStopwordsHandler()
Gets the stopwords handler.
- Returns:
- the stopwords handler
-
setTokenizer
@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
public void setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
- Parameters:
value
- the configured tokenizing algorithm
-
getTokenizer
public Tokenizer getTokenizer()
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
-
globalInfo
public String globalInfo()
Description copied from class: SimpleFilter
Returns a string describing this filter.
- Specified by:
globalInfo in class SimpleFilter
- Returns:
- a description of the filter suitable for displaying in the explorer/experimenter gui
-
setEnvironment
public void setEnvironment(Environment env)
Description copied from interface: EnvironmentHandler
Set environment variables to use.
- Specified by:
setEnvironment in interface EnvironmentHandler
- Parameters:
env
- the environment variables to use
-
main
-