weka.core.DictionaryBuilder

All Implemented Interfaces: Serializable, Aggregateable<DictionaryBuilder>, OptionHandler

public class DictionaryBuilder extends Object implements Aggregateable<DictionaryBuilder>, OptionHandler, Serializable

Class for building and maintaining a dictionary of terms. Has methods for loading, saving and aggregating dictionaries. Supports loading and saving in binary and textual formats. The textual format is expected to have one or two comma-separated values per line, in the form

 term [,doc_count]

where doc_count is the number of documents in which the term has occurred.
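As an illustration of the textual format, the sketch below parses term[,doc_count] lines in plain Java. It is not Weka's loader: the class DictionaryTextSketch and its parse method are hypothetical names, and the sketch assumes that terms contain no commas and that a missing doc_count can be stored as 0.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Weka.
final class DictionaryTextSketch {

  /** Parses "term[,doc_count]" lines into a term -> document count map. */
  static Map<String, Integer> parse(List<String> lines) {
    Map<String, Integer> docCounts = new LinkedHashMap<>();
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty()) {
        continue; // ignore blank lines
      }
      int comma = line.lastIndexOf(',');
      if (comma < 0) {
        docCounts.put(line, 0); // assumption: no count given, store 0
      } else {
        String term = line.substring(0, comma).trim();
        int count = Integer.parseInt(line.substring(comma + 1).trim());
        docCounts.put(term, count);
      }
    }
    return docCounts;
  }
}
```

Calling parse on the lines "apple,3" and "banana" would give a map with apple mapped to 3 and banana mapped to 0.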

Version:

$Revision: 15573 $

Author:

Mark Hall (mhall{[at]}pentaho{[dot]}com)

See Also:

Serialized Form

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| DictionaryBuilder() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| DictionaryBuilder | aggregate(DictionaryBuilder toAgg) | Aggregate an object with this one |
| String | attributeIndicesTipText() | Returns the tip text for this property. |
| String | attributeNamePrefixTipText() | Returns the tip text for this property. |
| String | doNotOperateOnPerClassBasisTipText() | Returns the tip text for this property. |
| void | finalizeAggregation() | Call to complete the aggregation process. |
| Map<String,int[]> | finalizeDictionary() | Performs final pruning and consolidation according to the number of words to keep property. |
| String | getAttributeIndices() | Gets the current range selection. |
| String | getAttributeNamePrefix() | Get the attribute name prefix. |
| double | getAverageDocLength() | Get the average document length to use when normalizing |
| Map<String,int[]>[] | getDictionaries(boolean minFrequencyPrune) | Get the current dictionary(s) (one per class for nominal class, if set). |
| boolean | getDoNotOperateOnPerClassBasis() | Get the DoNotOperateOnPerClassBasis value. |
| boolean | getIDFTransform() | Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
| Instances | getInputFormat() | Gets the currently set input format |
| boolean | getInvertSelection() | Gets whether the supplied columns are to be processed or skipped. |
| boolean | getLowerCaseTokens() | Gets whether the tokens are to be downcased or not. |
| int | getMinTermFreq() | Get the MinTermFreq value. |
| boolean | getNormalize() | Get whether word frequencies for a document should be normalized |
| String[] | getOptions() | Gets the current settings of the DictionaryBuilder |
| boolean | getOutputWordCounts() | Gets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| long | getPeriodicPruning() | Gets the rate (number of instances) at which the dictionary is periodically pruned. |
| Range | getSelectedRange() | Get the value of m_SelectedRange. |
| boolean | getSortDictionary() | Get whether to keep the dictionary sorted alphabetically as it is built. |
| Stemmer | getStemmer() | Returns the current stemming algorithm, null if none is used. |
| StopwordsHandler | getStopwordsHandler() | Gets the stopwords handler. |
| boolean | getTFTransform() | Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
| Tokenizer | getTokenizer() | Returns the current tokenizer algorithm. |
| Instances | getVectorizedFormat() | Get the output format |
| int | getWordsToKeep() | Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| String | IDFTransformTipText() | Returns the tip text for this property. |
| String | invertSelectionTipText() | Returns the tip text for this property. |
| Enumeration<Option> | listOptions() | Returns an enumeration describing the available options. |
| void | loadDictionary(File toLoad, boolean plainText) | Load a dictionary from a file |
| void | loadDictionary(InputStream is) | Load a binary dictionary from an input stream |
| void | loadDictionary(Reader reader) | Load a textual dictionary from a reader |
| void | loadDictionary(String filename, boolean plainText) | Load a dictionary from a file |
| String | lowerCaseTokensTipText() | Returns the tip text for this property. |
| String | minTermFreqTipText() | Returns the tip text for this property. |
| String | normalizeDocLengthTipText() | Returns the tip text for this property. |
| String | normalizeTipText() | Tip text for this property |
| String | outputWordCountsTipText() | Returns the tip text for this property. |
| String | periodicPruningTipText() | Returns the tip text for this property. |
| void | processInstance(Instance inst) | Process an instance by tokenizing string attributes and updating the dictionary. |
| boolean | readyToVectorize() | Returns true if this DictionaryBuilder is ready to vectorize incoming instances |
| void | reset() | Clear the dictionary(s) |
| void | saveDictionary(File toSave, boolean plainText) | Save a dictionary |
| void | saveDictionary(OutputStream os) | Save the dictionary in binary form |
| void | saveDictionary(Writer writer) | Save the dictionary in textual format |
| void | saveDictionary(String filename, boolean plainText) | Save the dictionary |
| void | setAttributeIndices(String rangeList) | Sets which attributes are to be worked on. |
| void | setAttributeIndicesArray(int[] attributes) | Sets which attributes are to be processed. |
| void | setAttributeNamePrefix(String newPrefix) | Set the attribute name prefix. |
| void | setAverageDocLength(double averageDocLength) | Set the average document length to use when normalizing |
| void | setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) | Set the DoNotOperateOnPerClassBasis value. |
| void | setIDFTransform(boolean IDFTransform) | Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
| void | setInvertSelection(boolean invert) | Sets whether selected columns should be processed or skipped. |
| void | setLowerCaseTokens(boolean downCaseTokens) | Sets whether the tokens are to be downcased or not. |
| void | setMinTermFreq(int newMinTermFreq) | Set the MinTermFreq value. |
| void | setNormalize(boolean n) | Set whether word frequencies for a document should be normalized |
| void | setOptions(String[] options) | Parses a given list of options. |
| void | setOutputWordCounts(boolean outputWordCounts) | Sets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| void | setPeriodicPruning(long newPeriodicPruning) | Sets the rate (number of instances) at which the dictionary is periodically pruned |
| void | setSelectedRange(String newSelectedRange) | Set the value of m_SelectedRange. |
| void | setSortDictionary(boolean sortDictionary) | Set whether to keep the dictionary sorted alphabetically as it is built. |
| void | setStemmer(Stemmer value) | Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
| void | setStopwordsHandler(StopwordsHandler value) | Sets the stopwords handler to use. |
| void | setTFTransform(boolean TFTransform) | Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
| void | setTokenizer(Tokenizer value) | Sets the tokenizer algorithm to use. |
| void | setup(Instances inputFormat) | |
| void | setWordsToKeep(int newWordsToKeep) | Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| String | sortDictionaryTipText() | Tip text for this property |
| String | stemmerTipText() | Returns the tip text for this property. |
| String | stopwordsHandlerTipText() | Returns the tip text for this property. |
| String | TFTransformTipText() | Returns the tip text for this property. |
| String | tokenizerTipText() | Returns the tip text for this property. |
| Instances | vectorizeBatch(Instances batch, boolean setAvgDocLength) | Convert a batch of instances |
| Instance | vectorizeInstance(Instance input) | Convert an input instance. |
| Instance | vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory) | Convert an input instance. |
| String | wordsToKeepTipText() | Returns the tip text for this property. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DictionaryBuilder
  
  public DictionaryBuilder()
Method Details
- setAverageDocLength
  
  @ProgrammaticProperty public void setAverageDocLength(double averageDocLength)
  
  Set the average document length to use when normalizing
  
  Parameters:
  
  averageDocLength - the average document length to use
- getAverageDocLength
  
  public double getAverageDocLength()
  
  Get the average document length to use when normalizing
  
  Returns:
  
  the average document length
- sortDictionaryTipText
  
  public String sortDictionaryTipText()
  
  Tip text for this property
  
  Returns:
  
  the tip text for this property
- setSortDictionary
  
  public void setSortDictionary(boolean sortDictionary)
  
  Set whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
  
  Parameters:
  
  sortDictionary - true to keep the dictionary sorted alphabetically
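To make the trade-off concrete, here is a small stand-alone sketch (not Weka code; SortDictionarySketch is a hypothetical name) that counts tokens with either a TreeMap or a LinkedHashMap and returns the resulting term order. TreeMap iterates in sorted key order, whereas LinkedHashMap iterates in insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the two map types named above; not part of Weka.
final class SortDictionarySketch {

  /** Counts the tokens and returns the terms in the map's iteration order. */
  static List<String> termOrder(List<String> tokens, boolean sortDictionary) {
    Map<String, Integer> counts;
    if (sortDictionary) {
      counts = new TreeMap<String, Integer>(); // sorted by term
    } else {
      counts = new LinkedHashMap<String, Integer>(); // order of first occurrence
    }
    for (String token : tokens) {
      counts.merge(token, 1, Integer::sum);
    }
    return new ArrayList<>(counts.keySet());
  }
}
```

For the tokens pear, apple, pear, fig, the sorted variant yields apple, fig, pear, while the unsorted variant keeps pear, apple, fig.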
- getSortDictionary
  
  public boolean getSortDictionary()
  
  Get whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
  
  Returns:
  
  true if the dictionary is kept sorted alphabetically
- getOutputWordCounts
  
  public boolean getOutputWordCounts()
  
  Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
  
  Returns:
  
  true if word counts should be output.
- setOutputWordCounts
  
  public void setOutputWordCounts(boolean outputWordCounts)
  
  Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
  
  Parameters:
  
  outputWordCounts - true if word counts should be output.
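The difference between the two output modes can be sketched as follows. This is an illustration in plain Java under assumed names (OutputWordCountsSketch, vectorize), with the dictionary reduced to an ordered list of terms; it is not how DictionaryBuilder stores or emits instances.

```java
import java.util.List;

// Hypothetical illustration of presence versus count output; not part of Weka.
final class OutputWordCountsSketch {

  /** Builds one value per dictionary term: a count, or 0/1 for presence. */
  static double[] vectorize(List<String> tokens, List<String> dictionary,
      boolean outputWordCounts) {
    double[] vector = new double[dictionary.size()];
    for (String token : tokens) {
      int index = dictionary.indexOf(token);
      if (index < 0) {
        continue; // token is not in the dictionary
      }
      vector[index] = outputWordCounts ? vector[index] + 1 : 1;
    }
    return vector;
  }
}
```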
- outputWordCountsTipText
  
  public String outputWordCountsTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getSelectedRange
  
  public Range getSelectedRange()
  
  Get the value of m_SelectedRange.
  
  Returns:
  
  Value of m_SelectedRange.
- setSelectedRange
  
  public void setSelectedRange(String newSelectedRange)
  
  Set the value of m_SelectedRange.
  
  Parameters:
  
  newSelectedRange - Value to assign to m_SelectedRange.
- attributeIndicesTipText
  
  public String attributeIndicesTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getAttributeIndices
  
  public String getAttributeIndices()
  
  Gets the current range selection.
  
  Returns:
  
  a string containing a comma separated list of ranges
- setAttributeIndices
  
  public void setAttributeIndices(String rangeList)
  
  Sets which attributes are to be worked on.
  
  Parameters:
  
  rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
  e.g. first-3,5,6-last
  
  Throws:
  
  IllegalArgumentException - if an invalid range list is supplied
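Weka has its own range handling (getSelectedRange() returns a Range), so the following is only a hypothetical stand-alone illustration of how a 1-based range list such as first-3,5,6-last maps to the 0-based indices that setAttributeIndicesArray expects. The class and method names are invented for the example, and descending ranges are rejected here purely as a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of 1-based range lists; not Weka's Range class.
final class RangeListSketch {

  /** Converts a 1-based range list to 0-based attribute indices. */
  static List<Integer> toZeroBased(String rangeList, int numAttributes) {
    List<Integer> indices = new ArrayList<>();
    for (String part : rangeList.split(",")) {
      String[] ends = part.trim().split("-");
      if (ends.length == 0 || ends.length > 2) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      int lo = resolve(ends[0], numAttributes);
      int hi = ends.length == 2 ? resolve(ends[1], numAttributes) : lo;
      if (lo > hi) {
        throw new IllegalArgumentException("Descending range: " + part);
      }
      for (int i = lo; i <= hi; i++) {
        indices.add(i - 1); // shift from 1-based to 0-based
      }
    }
    return indices;
  }

  /** Resolves "first", "last" or a number to a 1-based index. */
  private static int resolve(String token, int numAttributes) {
    String t = token.trim();
    int value;
    if (t.equals("first")) {
      value = 1;
    } else if (t.equals("last")) {
      value = numAttributes;
    } else {
      try {
        value = Integer.parseInt(t);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Invalid index: " + token, e);
      }
    }
    if (value < 1 || value > numAttributes) {
      throw new IllegalArgumentException("Index out of range: " + token);
    }
    return value;
  }
}
```

With seven attributes, first-3,5,6-last resolves to the 0-based indices 0, 1, 2, 4, 5, 6.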
- setAttributeIndicesArray
  
  public void setAttributeIndicesArray(int[] attributes)
  
  Sets which attributes are to be processed.
  
  Parameters:
  
  attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
  
  Throws:
  
  IllegalArgumentException - if an invalid set of ranges is supplied
- invertSelectionTipText
  
  public String invertSelectionTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getInvertSelection
  
  public boolean getInvertSelection()
  
  Gets whether the supplied columns are to be processed or skipped.
  
  Returns:
  
  true if the supplied columns will be kept
- setInvertSelection
  
  public void setInvertSelection(boolean invert)
  
  Sets whether selected columns should be processed or skipped.
  
  Parameters:
  
  invert - the new invert setting
- getWordsToKeep
  
  public int getWordsToKeep()
  
  Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
  
  Returns:
  
  the target number of words in the output vector (per class if assigned).
- setWordsToKeep
  
  public void setWordsToKeep(int newWordsToKeep)
  
  Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
  
  Parameters:
  
  newWordsToKeep - the target number of words in the output vector (per class if assigned).
- wordsToKeepTipText
  
  public String wordsToKeepTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getPeriodicPruning
  
  public long getPeriodicPruning()
  
  Gets the rate (number of instances) at which the dictionary is periodically pruned.
  
  Returns:
  
  the rate at which the dictionary is periodically pruned
- setPeriodicPruning
  
  public void setPeriodicPruning(long newPeriodicPruning)
  
  Sets the rate (number of instances) at which the dictionary is periodically pruned
  
  Parameters:
  
  newPeriodicPruning - the rate at which the dictionary is periodically pruned
- periodicPruningTipText
  
  public String periodicPruningTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getTFTransform
  
  public boolean getTFTransform()
  
  Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
  
  Returns:
  
  true if word frequencies are to be transformed.
- setTFTransform
  
  public void setTFTransform(boolean TFTransform)
  
  Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
  
  Parameters:
  
  TFTransform - true if word frequencies are to be transformed.
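The TF transform itself is a one-line formula. A minimal sketch, assuming the natural logarithm (the documentation above does not state the base) and using a hypothetical class name:

```java
// Hypothetical illustration of the log(1+fij) transform; not part of Weka.
final class TfTransformSketch {

  /** Applies log(1 + fij) to each raw word frequency (natural log assumed). */
  static double[] transform(int[] counts) {
    double[] transformed = new double[counts.length];
    for (int i = 0; i < counts.length; i++) {
      transformed[i] = Math.log(1.0 + counts[i]);
    }
    return transformed;
  }
}
```

A frequency of 0 stays at 0, and larger counts grow only logarithmically, which dampens the influence of very frequent words.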
- TFTransformTipText
  
  public String TFTransformTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getAttributeNamePrefix
  
  public String getAttributeNamePrefix()
  
  Get the attribute name prefix.
  
  Returns:
  
  The current attribute name prefix.
- setAttributeNamePrefix
  
  public void setAttributeNamePrefix(String newPrefix)
  
  Set the attribute name prefix.
  
  Parameters:
  
  newPrefix - String to use as the attribute name prefix.
- attributeNamePrefixTipText
  
  public String attributeNamePrefixTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getIDFTransform
  
  public boolean getIDFTransform()
  
  Gets whether the word frequencies in a document should be transformed into:
  fij*log(num of Docs/num of Docs with word i)
  where fij is the frequency of word i in document(instance) j.
  
  Returns:
  
  true if the word frequencies are to be transformed.
- setIDFTransform
  
  public void setIDFTransform(boolean IDFTransform)
  
  Sets whether the word frequencies in a document should be transformed into:
  fij*log(num of Docs/num of Docs with word i)
  where fij is the frequency of word i in document(instance) j.
  
  Parameters:
  
  IDFTransform - true if the word frequencies are to be transformed
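The IDF formula can be written out directly. The sketch below is a plain-Java illustration with hypothetical names; it assumes the natural logarithm and rejects non-positive document counts, neither of which is specified in the documentation above.

```java
// Hypothetical illustration of fij*log(num of Docs/num of Docs with word i).
final class IdfTransformSketch {

  /** Weights a word frequency by the log of the inverse document frequency. */
  static double transform(double fij, int numDocs, int numDocsWithWord) {
    if (numDocs <= 0 || numDocsWithWord <= 0) {
      throw new IllegalArgumentException("Document counts must be positive");
    }
    return fij * Math.log((double) numDocs / numDocsWithWord);
  }
}
```

A word that occurs in every document gets a weight of 0, whereas a word found in only a few documents is scaled up.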
- IDFTransformTipText
  
  public String IDFTransformTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getNormalize
  
  public boolean getNormalize()
  
  Get whether word frequencies for a document should be normalized
  
  Returns:
  
  true if word frequencies should be normalized
- setNormalize
  
  public void setNormalize(boolean n)
  
  Set whether word frequencies for a document should be normalized
  
  Parameters:
  
  n - true if word frequencies should be normalized
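One plausible reading of normalization is to rescale each document vector so that its length equals the average document length. The sketch below assumes that "length" means the Euclidean norm of the word-frequency vector; that is an assumption made for illustration, not something the documentation above states, and the names are hypothetical.

```java
// Hypothetical illustration of length normalization; not part of Weka.
final class NormalizeSketch {

  /** Rescales the values so their Euclidean length equals averageDocLength. */
  static double[] normalize(double[] values, double averageDocLength) {
    double sumOfSquares = 0.0;
    for (double v : values) {
      sumOfSquares += v * v;
    }
    double docLength = Math.sqrt(sumOfSquares);
    double[] normalized = values.clone();
    if (docLength == 0.0) {
      return normalized; // empty document: nothing to rescale
    }
    for (int i = 0; i < normalized.length; i++) {
      normalized[i] = values[i] * averageDocLength / docLength;
    }
    return normalized;
  }
}
```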
- normalizeTipText
  
  public String normalizeTipText()
  
  Tip text for this property
  
  Returns:
  
  the tip text for this property
- normalizeDocLengthTipText
  
  public String normalizeDocLengthTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getLowerCaseTokens
  
  public boolean getLowerCaseTokens()
  
  Gets whether the tokens are to be downcased or not.
  
  Returns:
  
  true if the tokens are to be downcased.
- setLowerCaseTokens
  
  public void setLowerCaseTokens(boolean downCaseTokens)
  
  Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens).
  
  Parameters:
  
  downCaseTokens - should be true if only lower case tokens are to be formed.
- lowerCaseTokensTipText
  
  public String lowerCaseTokensTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- doNotOperateOnPerClassBasisTipText
  
  public String doNotOperateOnPerClassBasisTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getDoNotOperateOnPerClassBasis
  
  public boolean getDoNotOperateOnPerClassBasis()
  
  Get the DoNotOperateOnPerClassBasis value.
  
  Returns:
  
  the DoNotOperateOnPerClassBasis value.
- setDoNotOperateOnPerClassBasis
  
  public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
  
  Set the DoNotOperateOnPerClassBasis value.
  
  Parameters:
  
  newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
- minTermFreqTipText
  
  public String minTermFreqTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getMinTermFreq
  
  public int getMinTermFreq()
  
  Get the MinTermFreq value.
  
  Returns:
  
  the MinTermFreq value.
- setMinTermFreq
  
  public void setMinTermFreq(int newMinTermFreq)
  
  Set the MinTermFreq value.
  
  Parameters:
  
  newMinTermFreq - The new MinTermFreq value.
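The effect of a minimum term frequency can be sketched with a plain map. The example assumes dictionary values are int arrays whose first element is the term count and whose second element is the document count; that layout, like the class name, is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of minimum-frequency pruning; not part of Weka.
final class MinTermFreqSketch {

  /** Returns a copy that keeps only terms whose count reaches minTermFreq. */
  static Map<String, int[]> prune(Map<String, int[]> dictionary, int minTermFreq) {
    Map<String, int[]> kept = new LinkedHashMap<>();
    for (Map.Entry<String, int[]> entry : dictionary.entrySet()) {
      int termCount = entry.getValue()[0]; // assumed layout: [termCount, docCount]
      if (termCount >= minTermFreq) {
        kept.put(entry.getKey(), entry.getValue().clone());
      }
    }
    return kept;
  }
}
```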
- getStemmer
  
  public Stemmer getStemmer()
  
  Returns the current stemming algorithm, null if none is used.
  
  Returns:
  
  the current stemming algorithm, null if none set
- setStemmer
  
  public void setStemmer(Stemmer value)
  
  Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
  
  Parameters:
  
  value - the configured stemming algorithm, or null
  
  See Also:
  
  NullStemmer
- stemmerTipText
  
  public String stemmerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getStopwordsHandler
  
  public StopwordsHandler getStopwordsHandler()
  
  Gets the stopwords handler.
  
  Returns:
  
  the stopwords handler
- setStopwordsHandler
  
  public void setStopwordsHandler(StopwordsHandler value)
  
  Sets the stopwords handler to use.
  
  Parameters:
  
  value - the stopwords handler; if null, the Null stopwords handler is used
- stopwordsHandlerTipText
  
  public String stopwordsHandlerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- getTokenizer
  
  public Tokenizer getTokenizer()
  
  Returns the current tokenizer algorithm.
  
  Returns:
  
  the current tokenizer algorithm
- setTokenizer
  
  public void setTokenizer(Tokenizer value)
  
  Sets the tokenizer algorithm to use.
  
  Parameters:
  
  value - the configured tokenizing algorithm
- tokenizerTipText
  
  public String tokenizerTipText()
  
  Returns the tip text for this property.
  
  Returns:
  
  tip text for this property suitable for displaying in the explorer/experimenter gui
- listOptions
  
  public Enumeration<Option> listOptions()
  
  Returns an enumeration describing the available options.
  
  Specified by:
  
  listOptions in interface OptionHandler
  
  Returns:
  
  an enumeration of all the available options
- getOptions
  
  public String[] getOptions()
  
  Gets the current settings of the DictionaryBuilder
  
  Specified by:
  
  getOptions in interface OptionHandler
  
  Returns:
  
  an array of strings suitable for passing to setOptions
- setOptions
  
  public void setOptions(String[] options) throws Exception
  
  Parses a given list of options.
  
  Valid options are:
  
  -C Output word counts rather than boolean word presence.
  
  -R <index1,index2-index4,...> Specify list of string attributes to convert to words (as weka Range). (default: select all string attributes)
  
  -V Invert matching sense of column indexes.
  
  -P <attribute name prefix> Specify a prefix for the created attribute names. (default: "")
  
  -W <number of words to keep> Specify approximate number of word fields to create. Surplus words will be discarded. (default: 1000)
  
  -prune-rate <rate as a percentage of dataset> Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary. -W prunes after creating a full dictionary. You may not have enough memory for this approach. (default: no periodic pruning)
  
  -T Transform the word frequencies into log(1+fij) where fij is the frequency of word i in jth document(instance).
  
  -I Transform each word frequency into: fij*log(num of Documents/num of documents containing word i) where fij is the frequency of word i in jth document(instance)
  
  -N Whether to 0=not normalize/1=normalize all data/2=normalize test data only to average length of training documents (default 0=don't normalize).
  
  -L Convert all tokens to lowercase before adding to the dictionary.
  
  -stopwords-handler The stopwords handler to use (default Null).
  
  -stemmer <spec> The stemming algorithm (classname plus parameters) to use.
  
  -M <int> The minimum term frequency (default = 1).
  
  -O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but based on the documents in all the classes (even if a class attribute is set).
  
  -tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
  
  Specified by:
  
  setOptions in interface OptionHandler
  
  Parameters:
  
  options - the list of options as an array of strings
  
  Throws:
  
  Exception - if an option is not supported
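Options are passed as an array of strings, with each flag and each flag value as a separate element. The following stand-alone sketch assembles such an array from the -C, -W, -L and -M flags listed above; the helper class and method are hypothetical, and the resulting array is what one would hand to setOptions(String[]).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that assembles an options array; not part of Weka.
final class DictionaryOptionsSketch {

  /** Builds an options array using flags from the list of valid options. */
  static String[] buildOptions(boolean outputWordCounts, int wordsToKeep,
      boolean lowerCaseTokens, int minTermFreq) {
    List<String> options = new ArrayList<>();
    if (outputWordCounts) {
      options.add("-C"); // word counts instead of boolean presence
    }
    options.add("-W"); // approximate number of words to keep
    options.add(Integer.toString(wordsToKeep));
    if (lowerCaseTokens) {
      options.add("-L"); // convert tokens to lowercase
    }
    options.add("-M"); // minimum term frequency
    options.add(Integer.toString(minTermFreq));
    return options.toArray(new String[0]);
  }
}
```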
- setup
  
  public void setup(Instances inputFormat) throws Exception
  
  Throws:
  
  Exception
- getInputFormat
  
  public Instances getInputFormat()
  
  Gets the currently set input format
  
  Returns:
  
  the current input format
- readyToVectorize
  
  public boolean readyToVectorize()
  
  Returns true if this DictionaryBuilder is ready to vectorize incoming instances
  
  Returns:
  
  true if we can vectorize incoming instances
- getVectorizedFormat
  
  public Instances getVectorizedFormat() throws Exception
  
  Get the output format
  
  Returns:
  
  the output format
  
  Throws:
  
  Exception - if there is no input format set and/or the dictionary has not been constructed yet.
- vectorizeBatch
  
  public Instances vectorizeBatch(Instances batch, boolean setAvgDocLength) throws Exception
  
  Convert a batch of instances
  
  Parameters:
  
  batch - the batch to convert.
  
  setAvgDocLength - true to compute and set the average document length for this DictionaryBuilder from the batch - this uses the final pruned dictionary when computing doc lengths. When vectorizing non-training batches, and normalization has been turned on, this should be set to false.
  
  Returns:
  
  the converted batch
  
  Throws:
  
  Exception - if there is no input format set and/or the dictionary has not been constructed yet.
- vectorizeInstance
  
  public Instance vectorizeInstance(Instance input) throws Exception
  
  Convert an input instance. Any string attributes not being vectorized do not have their values retained in memory (i.e. only the string values for the instance being vectorized are held in memory).
  
  Parameters:
  
  input - the input instance
  
  Returns:
  
  a converted instance
  
  Throws:
  
  Exception - if there is no input format set and/or the dictionary has not been constructed yet.
- vectorizeInstance
  
  public Instance vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory) throws Exception
  
  Convert an input instance.
  
  Parameters:
  
  input - the input instance
  
  retainStringAttValuesInMemory - true if the values of string attributes not being vectorized should be retained in memory
  
  Returns:
  
  a converted instance
  
  Throws:
  
  Exception - if there is no input format set and/or the dictionary has not been constructed yet
- processInstance
  
  public void processInstance(Instance inst)
  
  Process an instance by tokenizing string attributes and updating the dictionary.
  
  Parameters:
  
  inst - the instance to process
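Conceptually, processing a document means tokenizing its text and updating per-term statistics. The toy sketch below does this for a single string. It is not Weka's implementation: whitespace splitting stands in for the configurable Tokenizer, the names are hypothetical, and the value layout (term count at index 0, document count at index 1) is an assumption.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy version of a dictionary update; not part of Weka.
final class ProcessInstanceSketch {

  /** Tokenizes one document on whitespace and updates the dictionary. */
  static void processDocument(String text, boolean lowerCaseTokens,
      Map<String, int[]> dictionary) {
    Set<String> seenInThisDocument = new HashSet<>();
    for (String token : text.split("\\s+")) {
      if (token.isEmpty()) {
        continue; // leading whitespace produces an empty first token
      }
      String term = lowerCaseTokens ? token.toLowerCase(Locale.ROOT) : token;
      // Assumed layout: [0] = term count, [1] = document count.
      int[] counts = dictionary.computeIfAbsent(term, k -> new int[2]);
      counts[0]++;
      if (seenInThisDocument.add(term)) {
        counts[1]++; // first occurrence of the term in this document
      }
    }
  }
}
```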
- reset
  
  public void reset()
  
  Clear the dictionary(s)
- getDictionaries
  
  public Map<String,int[]>[] getDictionaries(boolean minFrequencyPrune) throws WekaException
  
  Get the current dictionary(s) (one per class for nominal class, if set). These are the dictionaries that are built/updated when processInstance() is called. The finalized dictionary (used for vectorization) can be obtained by calling finalizeDictionary() - this returns a consolidated (over classes) and pruned final dictionary.
  
  Parameters:
  
  minFrequencyPrune - prune the dictionaries of low frequency terms before returning them
  
  Returns:
  
  the dictionaries
  
  Throws:
  
  WekaException
- aggregate
  
  public DictionaryBuilder aggregate(DictionaryBuilder toAgg) throws Exception
  
  Description copied from interface: Aggregateable
  
  Aggregate an object with this one
  
  Specified by:
  
  aggregate in interface Aggregateable<DictionaryBuilder>
  
  Parameters:
  
  toAgg - the object to aggregate
  
  Returns:
  
  the result of aggregation
  
  Throws:
  
  Exception - if the supplied object can't be aggregated for some reason
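The idea behind aggregation can be illustrated by merging two term maps and summing their statistics. This is a hedged sketch rather than the library's logic: it assumes each value is an int array of term count and document count, that the two dictionaries were built from disjoint sets of documents, and it uses invented names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of merging two dictionaries; not part of Weka.
final class AggregateSketch {

  /** Returns a new dictionary with the counts of both inputs summed. */
  static Map<String, int[]> aggregate(Map<String, int[]> first,
      Map<String, int[]> second) {
    Map<String, int[]> merged = new LinkedHashMap<>();
    addAll(merged, first);
    addAll(merged, second);
    return merged;
  }

  /** Adds every count in source to the matching entry in target. */
  private static void addAll(Map<String, int[]> target, Map<String, int[]> source) {
    for (Map.Entry<String, int[]> entry : source.entrySet()) {
      int[] counts = target.computeIfAbsent(entry.getKey(), k -> new int[2]);
      counts[0] += entry.getValue()[0]; // term count
      counts[1] += entry.getValue()[1]; // document count
    }
  }
}
```

Summing document counts is only valid when no document contributed to both inputs, which is why the disjointness assumption matters.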
- finalizeAggregation
  
  public void finalizeAggregation() throws Exception
  
  Description copied from interface: Aggregateable
  
  Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
  
  Specified by:
  
  finalizeAggregation in interface Aggregateable<DictionaryBuilder>
  
  Throws:
  
  Exception - if the aggregation can't be finalized for some reason
- finalizeDictionary
  
  public Map<String,int[]> finalizeDictionary() throws Exception
  
  Performs final pruning and consolidation according to the number of words to keep property. Finalization is performed just once, subsequent calls to this method return the finalized dictionary computed on the first call (unless reset() has been called in between).
  
  Returns:
  
  the consolidated and pruned final dictionary, or null if the input format did not contain any string attributes within the selected range to process
  
  Throws:
  
  Exception - if a problem occurs
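Pruning to a target number of words can be pictured as ranking terms by count and keeping the top entries. The sketch below keeps exactly the requested number and breaks ties alphabetically; both choices are simplifying assumptions (the property is described as a number to attempt to keep), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of keeping the most frequent terms; not part of Weka.
final class WordsToKeepSketch {

  /** Returns up to wordsToKeep terms, most frequent first. */
  static List<String> topTerms(Map<String, Integer> termCounts, int wordsToKeep) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(termCounts.entrySet());
    entries.sort((x, y) -> {
      int byCount = Integer.compare(y.getValue(), x.getValue()); // descending count
      return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
    });
    List<String> kept = new ArrayList<>();
    int limit = Math.min(wordsToKeep, entries.size());
    for (int i = 0; i < limit; i++) {
      kept.add(entries.get(i).getKey());
    }
    return kept;
  }
}
```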
- loadDictionary
  
  public void loadDictionary(String filename, boolean plainText) throws IOException
  
  Load a dictionary from a file
  
  Parameters:
  
  filename - the file to load from
  
  plainText - true if the dictionary is in text format
  
  Throws:
  
  IOException - if a problem occurs
- loadDictionary
  
  public void loadDictionary(File toLoad, boolean plainText) throws IOException
  
  Load a dictionary from a file
  
  Parameters:
  
  toLoad - the file to load from
  
  plainText - true if the dictionary is in text format
  
  Throws:
  
  IOException - if a problem occurs
- loadDictionary
  
  public void loadDictionary(Reader reader) throws IOException
  
  Load a textual dictionary from a reader
  
  Parameters:
  
  reader - the reader to read from
  
  Throws:
  
  IOException - if a problem occurs
- loadDictionary
  
  public void loadDictionary(InputStream is) throws IOException
  
  Load a binary dictionary from an input stream
  
  Parameters:
  
  is - the input stream to read from
  
  Throws:
  
  IOException - if a problem occurs
- saveDictionary
  
  public void saveDictionary(String filename, boolean plainText) throws IOException
  
  Save the dictionary
  
  Parameters:
  
  filename - the file to save to
  
  plainText - true if the dictionary should be saved in text format
  
  Throws:
  
  IOException - if a problem occurs
- saveDictionary
  
  public void saveDictionary(File toSave, boolean plainText) throws IOException
  
  Save a dictionary
  
  Parameters:
  
  toSave - the file to save to
  
  plainText - true if the dictionary should be saved in text format
  
  Throws:
  
  IOException - if a problem occurs
- saveDictionary
  
  public void saveDictionary(Writer writer) throws IOException
  
  Save the dictionary in textual format
  
  Parameters:
  
  writer - the writer to write to
  
  Throws:
  
  IOException - if a problem occurs
- saveDictionary
  
  public void saveDictionary(OutputStream os) throws IOException
  
  Save the dictionary in binary form
  
  Parameters:
  
  os - the output stream to write to
  
  Throws:
  
  IOException - if a problem occurs
