Package weka.core
Class DictionaryBuilder
java.lang.Object
weka.core.DictionaryBuilder
- All Implemented Interfaces:
Serializable, Aggregateable<DictionaryBuilder>, OptionHandler
public class DictionaryBuilder
extends Object
implements Aggregateable<DictionaryBuilder>, OptionHandler, Serializable
Class for building and maintaining a dictionary of terms. Has methods for
loading, saving and aggregating dictionaries. Supports loading and saving in
binary and textual formats. The textual format is expected to have one or two
comma-separated values per line, in the format
term[,doc_count]
where doc_count is the number of documents in which the term has occurred.
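The textual layout described above can be read with a few lines of standard Java. What follows is a minimal sketch and not the loader used by DictionaryBuilder: the class name TextDictionarySketch, the method parse, the assumption that terms contain no commas, and the choice to store a missing doc_count as 0 are all hypothetical and made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is not the parser used by DictionaryBuilder.
class TextDictionarySketch {

  // Parses "term[,doc_count]" lines into a term -> doc_count map.
  // Assumptions: terms contain no commas; a missing doc_count is stored as 0.
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> dict = new LinkedHashMap<>();
    for (String rawLine : text.split("\\R")) {
      String line = rawLine.trim();
      if (line.isEmpty()) {
        continue; // ignore blank lines
      }
      String[] parts = line.split(",", 2);
      String term = parts[0].trim();
      int docCount = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : 0;
      dict.put(term, docCount);
    }
    return dict;
  }

  public static void main(String[] args) {
    System.out.println(parse("weka,12\nmining\nlearning,7\n"));
  }
}
```

A LinkedHashMap is used so that the terms keep the order in which they appear in the text.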
- Version:
- $Revision: 15573 $
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com)
-
Constructor Summary
DictionaryBuilder()
-
Method Summary
Modifier and Type, Method - Description
aggregate(DictionaryBuilder toAgg) - Aggregate an object with this one.
attributeIndicesTipText() - Returns the tip text for this property.
attributeNamePrefixTipText() - Returns the tip text for this property.
doNotOperateOnPerClassBasisTipText() - Returns the tip text for this property.
void finalizeAggregation() - Call to complete the aggregation process.
finalizeDictionary() - Performs final pruning and consolidation according to the number of words to keep property.
getAttributeIndices() - Gets the current range selection.
getAttributeNamePrefix() - Get the attribute name prefix.
double getAverageDocLength() - Get the average document length to use when normalizing.
getDictionaries(boolean minFrequencyPrune) - Get the current dictionary(s) (one per class for nominal class, if set).
boolean getDoNotOperateOnPerClassBasis() - Get the DoNotOperateOnPerClassBasis value.
boolean getIDFTransform() - Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
getInputFormat() - Gets the currently set input format.
boolean getInvertSelection() - Gets whether the supplied columns are to be processed or skipped.
boolean getLowerCaseTokens() - Gets whether the tokens are to be downcased or not.
int getMinTermFreq() - Get the MinTermFreq value.
boolean getNormalize() - Get whether word frequencies for a document should be normalized.
String[] getOptions() - Gets the current settings of the DictionaryBuilder.
boolean getOutputWordCounts() - Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
long getPeriodicPruning() - Gets the rate (number of instances) at which the dictionary is periodically pruned.
getSelectedRange() - Get the value of m_SelectedRange.
boolean getSortDictionary() - Get whether to keep the dictionary sorted alphabetically as it is built.
getStemmer() - Returns the current stemming algorithm, null if none is used.
getStopwordsHandler() - Gets the stopwords handler.
boolean getTFTransform() - Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
getTokenizer() - Returns the current tokenizer algorithm.
getVectorizedFormat() - Get the output format.
int getWordsToKeep() - Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
IDFTransformTipText() - Returns the tip text for this property.
invertSelectionTipText() - Returns the tip text for this property.
listOptions() - Returns an enumeration describing the available options.
void loadDictionary(File toLoad, boolean plainText) - Load a dictionary from a file.
void loadDictionary(InputStream is) - Load a binary dictionary from an input stream.
void loadDictionary(Reader reader) - Load a textual dictionary from a reader.
void loadDictionary(String filename, boolean plainText) - Load a dictionary from a file.
lowerCaseTokensTipText() - Returns the tip text for this property.
minTermFreqTipText() - Returns the tip text for this property.
normalizeDocLengthTipText() - Returns the tip text for this property.
normalizeTipText() - Tip text for this property.
outputWordCountsTipText() - Returns the tip text for this property.
periodicPruningTipText() - Returns the tip text for this property.
void processInstance(Instance inst) - Process an instance by tokenizing string attributes and updating the dictionary.
boolean readyToVectorize() - Returns true if this DictionaryBuilder is ready to vectorize incoming instances.
void reset() - Clear the dictionary(s).
void saveDictionary(File toSave, boolean plainText) - Save a dictionary.
void saveDictionary(OutputStream os) - Save the dictionary in binary form.
void saveDictionary(Writer writer) - Save the dictionary in textual format.
void saveDictionary(String filename, boolean plainText) - Save the dictionary.
void setAttributeIndices(String rangeList) - Sets which attributes are to be worked on.
void setAttributeIndicesArray(int[] attributes) - Sets which attributes are to be processed.
void setAttributeNamePrefix(String newPrefix) - Set the attribute name prefix.
void setAverageDocLength(double averageDocLength) - Set the average document length to use when normalizing.
void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) - Set the DoNotOperateOnPerClassBasis value.
void setIDFTransform(boolean IDFTransform) - Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j.
void setInvertSelection(boolean invert) - Sets whether selected columns should be processed or skipped.
void setLowerCaseTokens(boolean downCaseTokens) - Sets whether the tokens are to be downcased or not.
void setMinTermFreq(int newMinTermFreq) - Set the MinTermFreq value.
void setNormalize(boolean n) - Set whether word frequencies for a document should be normalized.
void setOptions(String[] options) - Parses a given list of options.
void setOutputWordCounts(boolean outputWordCounts) - Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
void setPeriodicPruning(long newPeriodicPruning) - Sets the rate (number of instances) at which the dictionary is periodically pruned.
void setSelectedRange(String newSelectedRange) - Set the value of m_SelectedRange.
void setSortDictionary(boolean sortDictionary) - Set whether to keep the dictionary sorted alphabetically as it is built.
void setStemmer(Stemmer value) - Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void setStopwordsHandler(StopwordsHandler value) - Sets the stopwords handler to use.
void setTFTransform(boolean TFTransform) - Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
void setTokenizer(Tokenizer value) - Sets the tokenizer algorithm to use.
void setup
void setWordsToKeep(int newWordsToKeep) - Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
sortDictionaryTipText() - Tip text for this property.
stemmerTipText() - Returns the tip text for this property.
stopwordsHandlerTipText() - Returns the tip text for this property.
TFTransformTipText() - Returns the tip text for this property.
tokenizerTipText() - Returns the tip text for this property.
vectorizeBatch(Instances batch, boolean setAvgDocLength) - Convert a batch of instances.
vectorizeInstance(Instance input) - Convert an input instance.
Instance vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory) - Convert an input instance.
wordsToKeepTipText() - Returns the tip text for this property.
-
Constructor Details
-
DictionaryBuilder
public DictionaryBuilder()
-
-
Method Details
-
setAverageDocLength
Set the average document length to use when normalizing.
- Parameters:
averageDocLength
- the average document length to use
-
getAverageDocLength
public double getAverageDocLength()
Get the average document length to use when normalizing.
- Returns:
- the average document length
-
sortDictionaryTipText
Tip text for this property.
- Returns:
- the tip text for this property
-
setSortDictionary
public void setSortDictionary(boolean sortDictionary)
Set whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
- Parameters:
sortDictionary
- true to keep the dictionary sorted alphabetically
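The difference between the two map types named above shows up in iteration order. The following stand-alone sketch uses only java.util and does not touch DictionaryBuilder itself; the class SortedDictionarySketch and its keyOrder method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: contrasts TreeMap and LinkedHashMap key order.
class SortedDictionarySketch {

  // Counts the given terms in the chosen map type and returns the key order.
  static List<String> keyOrder(boolean sorted, String... terms) {
    Map<String, Integer> dict =
        sorted ? new TreeMap<String, Integer>() : new LinkedHashMap<String, Integer>();
    for (String term : terms) {
      dict.merge(term, 1, Integer::sum);
    }
    return new ArrayList<>(dict.keySet());
  }

  public static void main(String[] args) {
    System.out.println(keyOrder(true, "pear", "apple", "fig"));
    System.out.println(keyOrder(false, "pear", "apple", "fig"));
  }
}
```

A TreeMap keeps its keys ordered at the cost of logarithmic-time insertion, whereas a LinkedHashMap preserves insertion order with constant-time insertion on average, which is consistent with the speed remark above.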
-
getSortDictionary
public boolean getSortDictionary()
Get whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
- Returns:
- true to keep the dictionary sorted alphabetically
-
getOutputWordCounts
public boolean getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
- Returns:
- true if word counts should be output.
-
setOutputWordCounts
public void setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
- Parameters:
outputWordCounts
- true if word counts should be output.
-
outputWordCountsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getSelectedRange
Get the value of m_SelectedRange.
- Returns:
- Value of m_SelectedRange.
-
setSelectedRange
Set the value of m_SelectedRange.
- Parameters:
newSelectedRange
- Value to assign to m_SelectedRange.
-
attributeIndicesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getAttributeIndices
Gets the current range selection.
- Returns:
- a string containing a comma separated list of ranges
-
setAttributeIndices
Sets which attributes are to be worked on.
- Parameters:
rangeList
- a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1, e.g. first-3,5,6-last
- Throws:
IllegalArgumentException
- if an invalid range list is supplied
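To make the 1-based range-list convention concrete, here is a small stand-alone sketch that expands a list such as first-3,5,6-last into 0-based indices. It is not Weka's range parser: the class RangeListSketch, the method toZeroBased, and the numAttributes parameter are hypothetical, and the real parser may accept syntax that this sketch rejects.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the range parser used by Weka.
class RangeListSketch {

  // Expands a 1-based range list into 0-based attribute indices.
  static List<Integer> toZeroBased(String rangeList, int numAttributes) {
    List<Integer> indices = new ArrayList<>();
    for (String part : rangeList.split(",")) {
      String[] ends = part.trim().split("-", 2);
      int lo = resolve(ends[0], numAttributes);
      int hi = ends.length == 2 ? resolve(ends[1], numAttributes) : lo;
      if (lo < 1 || hi > numAttributes || lo > hi) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      for (int i = lo; i <= hi; i++) {
        indices.add(i - 1); // shift from 1-based to 0-based
      }
    }
    return indices;
  }

  // Maps "first", "last" or a number to a 1-based position.
  private static int resolve(String token, int numAttributes) {
    String t = token.trim();
    if (t.equals("first")) {
      return 1;
    }
    if (t.equals("last")) {
      return numAttributes;
    }
    return Integer.parseInt(t); // NumberFormatException is an IllegalArgumentException
  }

  public static void main(String[] args) {
    System.out.println(toZeroBased("first-3,5,6-last", 8));
  }
}
```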
-
setAttributeIndicesArray
public void setAttributeIndicesArray(int[] attributes)
Sets which attributes are to be processed.
- Parameters:
attributes
- an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
- Throws:
IllegalArgumentException
- if an invalid set of ranges is supplied
-
invertSelectionTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getInvertSelection
public boolean getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
- Returns:
- true if the supplied columns will be kept
-
setInvertSelection
public void setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
- Parameters:
invert
- the new invert setting
-
getWordsToKeep
public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Returns:
- the target number of words in the output vector (per class if assigned).
-
setWordsToKeep
public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Parameters:
newWordsToKeep
- the target number of words in the output vector (per class if assigned).
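The phrase "attempt to keep" suggests that the target is approximate. One plausible way to get that behaviour is a frequency threshold: find the count of the N-th most frequent term and keep every term at or above it, so that ties can push the result past N. The sketch below illustrates that idea only; it is an assumption about the pruning strategy, not something stated on this page, and the names WordsToKeepSketch and prune are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: threshold-based pruning to roughly N terms.
class WordsToKeepSketch {

  // Keeps every term whose count reaches the count of the N-th most
  // frequent term, but never less than minTermFreq.
  static Map<String, Integer> prune(
      Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
    int[] sorted =
        counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
    int threshold = minTermFreq;
    if (wordsToKeep > 0 && sorted.length >= wordsToKeep) {
      threshold = Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    }
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("a", 5);
    counts.put("b", 3);
    counts.put("c", 3);
    counts.put("d", 1);
    System.out.println(prune(counts, 2, 1));
  }
}
```

With a target of two words and a tie at count 3, this sketch keeps three terms, which is why the target is best read as approximate.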
-
wordsToKeepTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getPeriodicPruning
public long getPeriodicPruning()
Gets the rate (number of instances) at which the dictionary is periodically pruned.
- Returns:
- the rate at which the dictionary is periodically pruned
-
setPeriodicPruning
public void setPeriodicPruning(long newPeriodicPruning)
Sets the rate (number of instances) at which the dictionary is periodically pruned.
- Parameters:
newPeriodicPruning
- the rate at which the dictionary is periodically pruned
-
periodicPruningTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getTFTransform
public boolean getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
- Returns:
- true if word frequencies are to be transformed.
-
setTFTransform
public void setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
- Parameters:
TFTransform
- true if word frequencies are to be transformed.
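The transform log(1+fij) can be written directly with Math.log. The snippet below is a minimal sketch that assumes the natural logarithm, which this page does not state explicitly; TfTransformSketch and tf are hypothetical names.

```java
// Illustrative sketch only: the log(1 + fij) term-frequency transform.
class TfTransformSketch {

  // Dampens a raw term frequency; assumes the natural logarithm.
  static double tf(double fij) {
    return Math.log(1.0 + fij);
  }

  public static void main(String[] args) {
    for (double f : new double[] {0, 1, 10, 100}) {
      System.out.println(f + " -> " + tf(f));
    }
  }
}
```

A frequency of 0 stays at 0, and large frequencies grow only logarithmically, so very frequent words in a document dominate less.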
-
TFTransformTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getAttributeNamePrefix
Get the attribute name prefix.
- Returns:
- The current attribute name prefix.
-
setAttributeNamePrefix
Set the attribute name prefix.
- Parameters:
newPrefix
- String to use as the attribute name prefix.
-
attributeNamePrefixTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getIDFTransform
public boolean getIDFTransform()
Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
- Returns:
- true if the word frequencies are to be transformed.
-
setIDFTransform
public void setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
- Parameters:
IDFTransform
- true if the word frequencies are to be transformed
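The IDF formula above translates into a one-line computation. The following is a sketch under the assumption that the logarithm is natural and that document counts are supplied by the caller; IdfTransformSketch, tfIdf and its parameter names are hypothetical.

```java
// Illustrative sketch only: fij * log(numDocs / docsWithWord).
class IdfTransformSketch {

  // Weights a term frequency by the inverse document frequency of the word.
  static double tfIdf(double fij, int numDocs, int docsWithWord) {
    if (numDocs <= 0 || docsWithWord <= 0) {
      throw new IllegalArgumentException("document counts must be positive");
    }
    return fij * Math.log((double) numDocs / docsWithWord);
  }

  public static void main(String[] args) {
    System.out.println(tfIdf(3, 10, 10)); // word occurs in every document
    System.out.println(tfIdf(2, 100, 10)); // word occurs in a tenth of them
  }
}
```

A word that occurs in every document receives weight 0 regardless of its frequency, while rarer words are weighted up.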
-
IDFTransformTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNormalize
public boolean getNormalize()
Get whether word frequencies for a document should be normalized.
- Returns:
- true if word frequencies should be normalized
-
setNormalize
public void setNormalize(boolean n)
Set whether word frequencies for a document should be normalized.
- Parameters:
n
- true if word frequencies should be normalized
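This page does not say how document length is measured when normalizing. As a sketch of one common choice, the code below assumes Euclidean length and rescales a document's word values so that its length equals the average document length. Both that assumption and the names NormalizeSketch and normalize are hypothetical.

```java
// Illustrative sketch only: rescale a document vector to a target length.
class NormalizeSketch {

  // Assumes Euclidean length; returns a rescaled copy of the input.
  static double[] normalize(double[] values, double averageDocLength) {
    double sumOfSquares = 0.0;
    for (double v : values) {
      sumOfSquares += v * v;
    }
    double docLength = Math.sqrt(sumOfSquares);
    double[] out = values.clone();
    if (docLength == 0.0) {
      return out; // an empty document cannot be rescaled
    }
    for (int i = 0; i < out.length; i++) {
      out[i] = values[i] * averageDocLength / docLength;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] scaled = normalize(new double[] {3, 4}, 10);
    System.out.println(scaled[0] + ", " + scaled[1]);
  }
}
```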
-
normalizeTipText
Tip text for this property.
- Returns:
- the tip text for this property
-
normalizeDocLengthTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getLowerCaseTokens
public boolean getLowerCaseTokens()
Gets whether the tokens are to be downcased or not.
- Returns:
- true if the tokens are to be downcased.
-
setLowerCaseTokens
public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens.)
- Parameters:
downCaseTokens
- should be true if only lower case tokens are to be formed.
-
lowerCaseTokensTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
doNotOperateOnPerClassBasisTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getDoNotOperateOnPerClassBasis
public boolean getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
- Returns:
- the DoNotOperateOnPerClassBasis value.
-
setDoNotOperateOnPerClassBasis
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
- Parameters:
newDoNotOperateOnPerClassBasis
- The new DoNotOperateOnPerClassBasis value.
-
minTermFreqTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getMinTermFreq
public int getMinTermFreq()
Get the MinTermFreq value.
- Returns:
- the MinTermFreq value.
-
setMinTermFreq
public void setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
- Parameters:
newMinTermFreq
- The new MinTermFreq value.
-
getStemmer
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
setStemmer
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value
- the configured stemming algorithm, or null
-
stemmerTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getStopwordsHandler
Gets the stopwords handler.
- Returns:
- the stopwords handler
-
setStopwordsHandler
Sets the stopwords handler to use.
- Parameters:
value
- the stopwords handler; if null, Null is used
-
stopwordsHandlerTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getTokenizer
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
-
setTokenizer
Sets the tokenizer algorithm to use.
- Parameters:
value
- the configured tokenizing algorithm
-
tokenizerTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
listOptions
Returns an enumeration describing the available options.
- Specified by:
listOptions
in interface OptionHandler
- Returns:
- an enumeration of all the available options
-
getOptions
Gets the current settings of the DictionaryBuilder.
- Specified by:
getOptions
in interface OptionHandler
- Returns:
- an array of strings suitable for passing to setOptions
-
setOptions
Parses a given list of options. Valid options are:
-C Output word counts rather than boolean word presence.
-R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
-V Invert matching sense of column indexes.
-P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
-W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
-prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
-T Transform the word frequencies into log(1+fij), where fij is the frequency of word i in the jth document (instance).
-I Transform each word frequency into fij*log(num of Documents/num of documents containing word i), where fij is the frequency of word i in the jth document (instance).
-N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
-L Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-M <int> The minimum term frequency (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
- Specified by:
setOptions
in interface OptionHandler
- Parameters:
options
- the list of options as an array of strings
- Throws:
Exception
- if an option is not supported
-
setup
- Throws:
Exception
-
getInputFormat
Gets the currently set input format.
- Returns:
- the current input format
-
readyToVectorize
public boolean readyToVectorize()
Returns true if this DictionaryBuilder is ready to vectorize incoming instances.
- Returns:
- true if we can vectorize incoming instances
-
getVectorizedFormat
Get the output format.
- Returns:
- the output format
- Throws:
Exception
- if there is no input format set and/or the dictionary has not been constructed yet.
-
vectorizeBatch
Convert a batch of instances.
- Parameters:
batch
- the batch to convert.
setAvgDocLength
- true to compute and set the average document length for this DictionaryBuilder from the batch - this uses the final pruned dictionary when computing doc lengths. When vectorizing non-training batches, and normalization has been turned on, this should be set to false.
- Returns:
- the converted batch
- Throws:
Exception
- if there is no input format set and/or the dictionary has not been constructed yet.
-
vectorizeInstance
Convert an input instance. Any string attributes not being vectorized do not have their values retained in memory (i.e. only the string values for the instance being vectorized are held in memory).
- Parameters:
input
- the input instance
- Returns:
- a converted instance
- Throws:
Exception
- if there is no input format set and/or the dictionary has not been constructed yet.
-
vectorizeInstance
public Instance vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory) throws Exception
Convert an input instance.
- Parameters:
input
- the input instance
retainStringAttValuesInMemory
- true if the values of string attributes not being vectorized should be retained in memory
- Returns:
- a converted instance
- Throws:
Exception
- if there is no input format set and/or the dictionary has not been constructed yet
-
processInstance
Process an instance by tokenizing string attributes and updating the dictionary.
- Parameters:
inst
- the instance to process
-
reset
public void reset()
Clear the dictionary(s).
-
getDictionaries
Get the current dictionary(s) (one per class for nominal class, if set). These are the dictionaries that are built/updated when processInstance() is called. The finalized dictionary (used for vectorization) can be obtained by calling finalizeDictionary() - this returns a consolidated (over classes) and pruned final dictionary.
- Parameters:
minFrequencyPrune
- prune the dictionaries of low frequency terms before returning them
- Returns:
- the dictionaries
- Throws:
WekaException
-
aggregate
Description copied from interface: Aggregateable
Aggregate an object with this one.
- Specified by:
aggregate
in interface Aggregateable<DictionaryBuilder>
- Parameters:
toAgg
- the object to aggregate
- Returns:
- the result of aggregation
- Throws:
Exception
- if the supplied object can't be aggregated for some reason
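Aggregating dictionaries amounts to merging term statistics from separately built partial dictionaries. The sketch below shows the idea on plain maps. It is not the implementation of aggregate(): the class AggregateSketch, the method merge, and the two-element value layout (term count, then document count) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: merging two partial term dictionaries.
class AggregateSketch {

  // Assumed value layout: [0] = term count, [1] = document count.
  static Map<String, int[]> merge(Map<String, int[]> a, Map<String, int[]> b) {
    Map<String, int[]> out = new LinkedHashMap<>();
    addAll(out, a);
    addAll(out, b);
    return out;
  }

  // Adds the statistics of every term in source onto target.
  private static void addAll(Map<String, int[]> target, Map<String, int[]> source) {
    for (Map.Entry<String, int[]> e : source.entrySet()) {
      int[] acc = target.computeIfAbsent(e.getKey(), k -> new int[2]);
      acc[0] += e.getValue()[0];
      acc[1] += e.getValue()[1];
    }
  }

  public static void main(String[] args) {
    Map<String, int[]> a = new LinkedHashMap<>();
    a.put("x", new int[] {2, 1});
    Map<String, int[]> b = new LinkedHashMap<>();
    b.put("x", new int[] {3, 2});
    b.put("y", new int[] {1, 1});
    Map<String, int[]> merged = merge(a, b);
    System.out.println(merged.get("x")[0] + " " + merged.get("x")[1]);
  }
}
```

Because merge copies counts into fresh arrays, neither input map is modified.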
-
finalizeAggregation
Description copied from interface: Aggregateable
Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
- Specified by:
finalizeAggregation
in interface Aggregateable<DictionaryBuilder>
- Throws:
Exception
- if the aggregation can't be finalized for some reason
-
finalizeDictionary
Performs final pruning and consolidation according to the number of words to keep property. Finalization is performed just once; subsequent calls to this method return the finalized dictionary computed on the first call (unless reset() has been called in between).
- Returns:
- the consolidated and pruned final dictionary, or null if the input format did not contain any string attributes within the selected range to process
- Throws:
Exception
- if a problem occurs
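The build, finalize, vectorize sequence and the "finalize once until reset" rule described above can be illustrated with a toy class. This is a stand-alone sketch on plain strings, not DictionaryBuilder: TwoPhaseSketch and its methods are hypothetical, tokenization is a simple whitespace split, and pruning uses only a minimum frequency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: accumulate counts, finalize once, then vectorize.
class TwoPhaseSketch {
  private final Map<String, Integer> counts = new LinkedHashMap<>();
  private List<String> finalTerms; // null until finalized

  // Phase 1: update the working dictionary from one document.
  void process(String document) {
    for (String token : document.toLowerCase().split("\\s+")) {
      if (!token.isEmpty()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
  }

  // Computes the final term list on the first call and caches it afterwards.
  List<String> finalizeDictionary(int minTermFreq) {
    if (finalTerms == null) {
      finalTerms = new ArrayList<>();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() >= minTermFreq) {
          finalTerms.add(e.getKey());
        }
      }
    }
    return finalTerms;
  }

  // Phase 2: count occurrences of each final term in a document.
  int[] vectorize(String document) {
    if (finalTerms == null) {
      throw new IllegalStateException("dictionary has not been finalized");
    }
    int[] vector = new int[finalTerms.size()];
    for (String token : document.toLowerCase().split("\\s+")) {
      int index = finalTerms.indexOf(token);
      if (index >= 0) {
        vector[index]++;
      }
    }
    return vector;
  }

  // Clears both the working counts and the cached final dictionary.
  void reset() {
    counts.clear();
    finalTerms = null;
  }

  public static void main(String[] args) {
    TwoPhaseSketch sketch = new TwoPhaseSketch();
    sketch.process("the cat sat");
    sketch.process("The dog");
    System.out.println(sketch.finalizeDictionary(2));
    System.out.println(sketch.vectorize("the the bird")[0]);
  }
}
```

In this sketch a second call to finalizeDictionary with a different threshold still returns the cached result, mirroring the behaviour described for repeated calls without an intervening reset().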
-
loadDictionary
Load a dictionary from a file.
- Parameters:
filename
- the file to load from
plainText
- true if the dictionary is in text format
- Throws:
IOException
- if a problem occurs
-
loadDictionary
Load a dictionary from a file.
- Parameters:
toLoad
- the file to load from
plainText
- true if the dictionary is in text format
- Throws:
IOException
- if a problem occurs
-
loadDictionary
Load a textual dictionary from a reader.
- Parameters:
reader
- the reader to read from
- Throws:
IOException
- if a problem occurs
-
loadDictionary
Load a binary dictionary from an input stream.
- Parameters:
is
- the input stream to read from
- Throws:
IOException
- if a problem occurs
-
saveDictionary
Save the dictionary.
- Parameters:
filename
- the file to save to
plainText
- true if the dictionary should be saved in text format
- Throws:
IOException
- if a problem occurs
-
saveDictionary
Save a dictionary.
- Parameters:
toSave
- the file to save to
plainText
- true if the dictionary should be saved in text format
- Throws:
IOException
- if a problem occurs
-
saveDictionary
Save the dictionary in textual format.
- Parameters:
writer
- the writer to write to
- Throws:
IOException
- if a problem occurs
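For orientation, the textual format described in the class overview (one term per line, optionally followed by a comma and a document count) can be produced as shown below. This is only a sketch of that layout, not the output routine of saveDictionary: the class TextDictionaryWriterSketch, the method toText, and the decision to always emit the count are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: renders term -> doc_count as "term,doc_count" lines.
class TextDictionaryWriterSketch {

  // Assumes terms contain no commas or line breaks.
  static String toText(Map<String, Integer> docCounts) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
      sb.append(e.getKey()).append(',').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, Integer> dict = new LinkedHashMap<>();
    dict.put("weka", 12);
    dict.put("mining", 3);
    System.out.print(toText(dict));
  }
}
```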
-
saveDictionary
Save the dictionary in binary form.
- Parameters:
os
- the output stream to write to
- Throws:
IOException
- if a problem occurs
-