Package weka.core

Class DictionaryBuilder

java.lang.Object
weka.core.DictionaryBuilder
All Implemented Interfaces:
Serializable, Aggregateable<DictionaryBuilder>, OptionHandler

public class DictionaryBuilder extends Object implements Aggregateable<DictionaryBuilder>, OptionHandler, Serializable
Class for building and maintaining a dictionary of terms. Has methods for loading, saving and aggregating dictionaries. Supports loading/saving in binary and textual format. The textual format is expected to have one or two comma-separated values per line, in the format:

 term [,doc_count]

where doc_count is the number of documents in which the term has occurred.
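As an illustration of the textual format, the following is a minimal sketch of a line parser written in plain Java. It is not the class's own loader: the class name `DictionaryLineSketch`, the choice to split on the last comma, and the default count of 0 when `doc_count` is absent are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DictionaryLineSketch {

  /** Parses "term[,doc_count]" lines into a term -> doc_count map. */
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> result = new LinkedHashMap<>();
    for (String rawLine : text.split("\\R")) {
      String line = rawLine.trim();
      if (line.isEmpty()) {
        continue; // skip blank lines
      }
      String term = line;
      int docCount = 0; // hypothetical default when no count is given
      int comma = line.lastIndexOf(',');
      if (comma > 0) {
        String tail = line.substring(comma + 1).trim();
        // Only treat the tail as a count if it is a plain non-negative integer.
        if (tail.matches("\\d+")) {
          term = line.substring(0, comma).trim();
          docCount = Integer.parseInt(tail);
        }
      }
      result.put(term, docCount);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parse("apple,3\nbanana\ncherry, 12\n"));
  }
}
```

A line without a count (such as `banana` above) is kept as a term on its own, which matches the "one or two values per line" wording.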
Version:
$Revision: 15573 $
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com)
  • Constructor Details

    • DictionaryBuilder

      public DictionaryBuilder()
  • Method Details

    • setAverageDocLength

      @ProgrammaticProperty public void setAverageDocLength(double averageDocLength)
      Set the average document length to use when normalizing
      Parameters:
      averageDocLength - the average document length to use
    • getAverageDocLength

      public double getAverageDocLength()
      Get the average document length to use when normalizing
      Returns:
      the average document length
    • sortDictionaryTipText

      public String sortDictionaryTipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setSortDictionary

      public void setSortDictionary(boolean sortDictionary)
      Set whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
      Parameters:
      sortDictionary - true to keep the dictionary sorted alphabetically
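The difference between the two map types mentioned above can be seen in a small standalone sketch. It only demonstrates the ordering behaviour of `TreeMap` and `LinkedHashMap` from the Java standard library; the class `DictionaryOrderSketch` and the use of an `int[2]` value are illustrative assumptions, not the class's internals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DictionaryOrderSketch {

  /** Returns the term order produced by the map type selected by the flag. */
  static List<String> termOrder(boolean sortDictionary, String... terms) {
    Map<String, int[]> dict;
    if (sortDictionary) {
      dict = new TreeMap<String, int[]>(); // keys kept in alphabetical order
    } else {
      dict = new LinkedHashMap<String, int[]>(); // keys kept in insertion order
    }
    for (String term : terms) {
      dict.computeIfAbsent(term, k -> new int[2])[0]++;
    }
    return new ArrayList<>(dict.keySet());
  }

  public static void main(String[] args) {
    System.out.println(termOrder(true, "pear", "apple", "fig"));
    System.out.println(termOrder(false, "pear", "apple", "fig"));
  }
}
```

The sorted variant pays a logarithmic cost per insertion, which is the slowdown the description refers to.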
    • getSortDictionary

      public boolean getSortDictionary()
      Get whether to keep the dictionary sorted alphabetically as it is built. Setting this to true uses a TreeMap internally (which is slower than the default unsorted LinkedHashMap).
      Returns:
      true to keep the dictionary sorted alphabetically
    • getOutputWordCounts

      public boolean getOutputWordCounts()
      Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Returns:
      true if word counts should be output.
    • setOutputWordCounts

      public void setOutputWordCounts(boolean outputWordCounts)
      Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
      Parameters:
      outputWordCounts - true if word counts should be output.
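To make the presence/count distinction concrete, here is a hedged sketch that encodes a tokenized document against a fixed vocabulary. The class `WordPresenceSketch` and its `vectorize` helper are hypothetical and operate on plain lists rather than on Weka instances.

```java
import java.util.Arrays;
import java.util.List;

final class WordPresenceSketch {

  /** Encodes tokens against an ordered vocabulary as 0/1 flags or as counts. */
  static int[] vectorize(List<String> vocabulary, List<String> tokens,
      boolean outputWordCounts) {
    int[] vector = new int[vocabulary.size()];
    for (String token : tokens) {
      int index = vocabulary.indexOf(token);
      if (index < 0) {
        continue; // token is not in the dictionary
      }
      vector[index] = outputWordCounts ? vector[index] + 1 : 1;
    }
    return vector;
  }

  public static void main(String[] args) {
    List<String> vocab = Arrays.asList("a", "b", "c");
    List<String> doc = Arrays.asList("a", "c", "a", "x");
    System.out.println(Arrays.toString(vectorize(vocab, doc, true)));
    System.out.println(Arrays.toString(vectorize(vocab, doc, false)));
  }
}
```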
    • outputWordCountsTipText

      public String outputWordCountsTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getSelectedRange

      public Range getSelectedRange()
      Get the value of m_SelectedRange.
      Returns:
      Value of m_SelectedRange.
    • setSelectedRange

      public void setSelectedRange(String newSelectedRange)
      Set the value of m_SelectedRange.
      Parameters:
      newSelectedRange - Value to assign to m_SelectedRange.
    • attributeIndicesTipText

      public String attributeIndicesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getAttributeIndices

      public String getAttributeIndices()
      Gets the current range selection.
      Returns:
      a string containing a comma separated list of ranges
    • setAttributeIndices

      public void setAttributeIndices(String rangeList)
      Sets which attributes are to be worked on.
      Parameters:
      rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
      eg: first-3,5,6-last
      Throws:
      IllegalArgumentException - if an invalid range list is supplied
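The 1-based range syntax (for example `first-3,5,6-last`) can be illustrated with a small standalone parser that expands a range list into 0-based indices. This is a sketch under assumptions, not the parsing code used by the class: `RangeListSketch` is a hypothetical name, and the number of attributes is passed in explicitly.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeListSketch {

  /** Expands a 1-based range list such as "first-3,5,6-last" to 0-based indices. */
  static List<Integer> toZeroBased(String rangeList, int numAttributes) {
    List<Integer> selected = new ArrayList<>();
    for (String part : rangeList.split(",")) {
      String[] bounds = part.trim().split("-", 2);
      int lo = parseIndex(bounds[0], numAttributes);
      int hi = bounds.length == 2 ? parseIndex(bounds[1], numAttributes) : lo;
      if (lo > hi) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      for (int i = lo; i <= hi; i++) {
        selected.add(i - 1); // convert the user-facing index to 0-based
      }
    }
    return selected;
  }

  private static int parseIndex(String text, int numAttributes) {
    String s = text.trim();
    if (s.equals("first")) {
      return 1;
    }
    if (s.equals("last")) {
      return numAttributes;
    }
    // NumberFormatException is itself an IllegalArgumentException.
    int value = Integer.parseInt(s);
    if (value < 1 || value > numAttributes) {
      throw new IllegalArgumentException("Index out of bounds: " + s);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(toZeroBased("first-3,5,6-last", 8));
  }
}
```

The subtraction of 1 reflects the convention stated above: strings coming from a user are indexed from 1, whereas arrays coming from a program are indexed from 0.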
    • setAttributeIndicesArray

      public void setAttributeIndicesArray(int[] attributes)
      Sets which attributes are to be processed.
      Parameters:
      attributes - an array containing indexes of attributes to process. Since the array will typically come from a program, attributes are indexed from 0.
      Throws:
      IllegalArgumentException - if an invalid set of ranges is supplied
    • invertSelectionTipText

      public String invertSelectionTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getInvertSelection

      public boolean getInvertSelection()
      Gets whether the supplied columns are to be processed or skipped.
      Returns:
      true if the supplied columns will be kept
    • setInvertSelection

      public void setInvertSelection(boolean invert)
      Sets whether selected columns should be processed or skipped.
      Parameters:
      invert - the new invert setting
    • getWordsToKeep

      public int getWordsToKeep()
      Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      Returns:
      the target number of words in the output vector (per class if assigned).
    • setWordsToKeep

      public void setWordsToKeep(int newWordsToKeep)
      Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      Parameters:
      newWordsToKeep - the target number of words in the output vector (per class if assigned).
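One plausible reading of "attempt to keep" is that pruning is done by a frequency threshold, so that ties at the cut-off can leave more words than requested. The sketch below illustrates that idea only; the threshold rule, the class `WordsToKeepSketch`, and the way the minimum term frequency is combined with it are assumptions and are not confirmed by this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WordsToKeepSketch {

  /** Keeps every term whose count reaches the count of the N-th most frequent term. */
  static Map<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep,
      int minTermFreq) {
    int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
    int threshold = minTermFreq;
    if (wordsToKeep > 0 && sorted.length > wordsToKeep) {
      // Count of the N-th most frequent term (array is sorted ascending).
      threshold = Math.max(minTermFreq, sorted[sorted.length - wordsToKeep]);
    }
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("a", 5);
    counts.put("b", 3);
    counts.put("c", 3);
    counts.put("d", 1);
    System.out.println(prune(counts, 2, 1).keySet());
  }
}
```

With a target of 2 words, the tie between `b` and `c` in the example means three terms survive, which is why the target is best treated as approximate.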
    • wordsToKeepTipText

      public String wordsToKeepTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getPeriodicPruning

      public long getPeriodicPruning()
      Gets the rate (number of instances) at which the dictionary is periodically pruned.
      Returns:
      the rate at which the dictionary is periodically pruned
    • setPeriodicPruning

      public void setPeriodicPruning(long newPeriodicPruning)
      Sets the rate (number of instances) at which the dictionary is periodically pruned
      Parameters:
      newPeriodicPruning - the rate at which the dictionary is periodically pruned
    • periodicPruningTipText

      public String periodicPruningTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getTFTransform

      public boolean getTFTransform()
      Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Returns:
      true if word frequencies are to be transformed.
    • setTFTransform

      public void setTFTransform(boolean TFTransform)
      Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
      Parameters:
      TFTransform - true if word frequencies are to be transformed.
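The TF transform formula can be written directly in Java. This is a minimal sketch; the class name `TfTransformSketch` is hypothetical, and the use of the natural logarithm is an assumption, since the base of the logarithm is not stated here.

```java
final class TfTransformSketch {

  /** Computes log(1 + fij) for a raw term frequency fij (natural log assumed). */
  static double tf(double fij) {
    return Math.log(1.0 + fij);
  }

  public static void main(String[] args) {
    for (int f = 0; f <= 3; f++) {
      System.out.println(f + " -> " + tf(f));
    }
  }
}
```

A frequency of 0 stays at 0 under this transform, so sparse vectors remain sparse.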
    • TFTransformTipText

      public String TFTransformTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getAttributeNamePrefix

      public String getAttributeNamePrefix()
      Get the attribute name prefix.
      Returns:
      The current attribute name prefix.
    • setAttributeNamePrefix

      public void setAttributeNamePrefix(String newPrefix)
      Set the attribute name prefix.
      Parameters:
      newPrefix - String to use as the attribute name prefix.
    • attributeNamePrefixTipText

      public String attributeNamePrefixTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getIDFTransform

      public boolean getIDFTransform()
      Gets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document(instance) j.
      Returns:
      true if the word frequencies are to be transformed.
    • setIDFTransform

      public void setIDFTransform(boolean IDFTransform)
      Sets whether the word frequencies in a document should be transformed into:
      fij*log(num of Docs/num of Docs with word i)
      where fij is the frequency of word i in document(instance) j.
      Parameters:
      IDFTransform - true if the word frequencies are to be transformed
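The IDF formula above, fij*log(num of Docs/num of Docs with word i), can be sketched as follows. The class `IdfTransformSketch`, the argument checks, and the choice of natural logarithm are assumptions made for illustration.

```java
final class IdfTransformSketch {

  /** Computes fij * log(numDocs / docsWithTerm), using the natural log. */
  static double idf(double fij, int numDocs, int docsWithTerm) {
    if (docsWithTerm <= 0 || docsWithTerm > numDocs) {
      throw new IllegalArgumentException("docsWithTerm must be in [1, numDocs]");
    }
    return fij * Math.log((double) numDocs / docsWithTerm);
  }

  public static void main(String[] args) {
    System.out.println(idf(2, 100, 1));   // rare term, large weight
    System.out.println(idf(2, 100, 100)); // term in every document
  }
}
```

A term that occurs in every document gets a weight of zero, since log(1) is 0.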
    • IDFTransformTipText

      public String IDFTransformTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNormalize

      public boolean getNormalize()
      Get whether word frequencies for a document should be normalized
      Returns:
      true if word frequencies should be normalized
    • setNormalize

      public void setNormalize(boolean n)
      Set whether word frequencies for a document should be normalized
      Parameters:
      n - true if word frequencies should be normalized
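Normalization to an average document length can be pictured as rescaling each document vector. The sketch below assumes that a document's length is the Euclidean norm of its frequency vector; that definition, and the class name `NormalizeSketch`, are assumptions not stated on this page.

```java
import java.util.Arrays;

final class NormalizeSketch {

  /** Rescales freqs so that its Euclidean length equals averageDocLength. */
  static double[] normalize(double[] freqs, double averageDocLength) {
    double sumOfSquares = 0.0;
    for (double f : freqs) {
      sumOfSquares += f * f;
    }
    double docLength = Math.sqrt(sumOfSquares);
    double[] out = freqs.clone();
    if (docLength == 0.0) {
      return out; // empty document: nothing to scale
    }
    for (int i = 0; i < out.length; i++) {
      out[i] = freqs[i] * averageDocLength / docLength;
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(normalize(new double[] {3, 4}, 10)));
  }
}
```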
    • normalizeTipText

      public String normalizeTipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • normalizeDocLengthTipText

      public String normalizeDocLengthTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getLowerCaseTokens

      public boolean getLowerCaseTokens()
      Gets whether the tokens are to be downcased or not.
      Returns:
      true if the tokens are to be downcased.
    • setLowerCaseTokens

      public void setLowerCaseTokens(boolean downCaseTokens)
      Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens).
      Parameters:
      downCaseTokens - should be true if only lower case tokens are to be formed.
    • lowerCaseTokensTipText

      public String lowerCaseTokensTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • doNotOperateOnPerClassBasisTipText

      public String doNotOperateOnPerClassBasisTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getDoNotOperateOnPerClassBasis

      public boolean getDoNotOperateOnPerClassBasis()
      Get the DoNotOperateOnPerClassBasis value.
      Returns:
      the DoNotOperateOnPerClassBasis value.
    • setDoNotOperateOnPerClassBasis

      public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
      Set the DoNotOperateOnPerClassBasis value.
      Parameters:
      newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
    • minTermFreqTipText

      public String minTermFreqTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getMinTermFreq

      public int getMinTermFreq()
      Get the MinTermFreq value.
      Returns:
      the MinTermFreq value.
    • setMinTermFreq

      public void setMinTermFreq(int newMinTermFreq)
      Set the MinTermFreq value.
      Parameters:
      newMinTermFreq - The new MinTermFreq value.
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • setStemmer

      public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • stemmerTipText

      public String stemmerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • setStopwordsHandler

      public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler; if null, Null is used
    • stopwordsHandlerTipText

      public String stopwordsHandlerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • setTokenizer

      public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • tokenizerTipText

      public String tokenizerTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Returns:
      an enumeration of all the available options
    • getOptions

      public String[] getOptions()
      Gets the current settings of the DictionaryBuilder
      Specified by:
      getOptions in interface OptionHandler
      Returns:
      an array of strings suitable for passing to setOptions
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Valid options are:

       -C
        Output word counts rather than boolean word presence.
       
       -R <index1,index2-index4,...>
        Specify list of string attributes to convert to words (as weka Range).
        (default: select all string attributes)
       
       -V
        Invert matching sense of column indexes.
       
       -P <attribute name prefix>
        Specify a prefix for the created attribute names.
        (default: "")
       
       -W <number of words to keep>
        Specify approximate number of word fields to create.
        Surplus words will be discarded.
        (default: 1000)
       
       -prune-rate <rate as a percentage of dataset>
        Specify the rate (e.g., every 10% of the input dataset) at which to periodically prune the dictionary.
        -W prunes after creating a full dictionary. You may not have enough memory for this approach.
        (default: no periodic pruning)
       
       -T
        Transform the word frequencies into log(1+fij)
        where fij is the frequency of word i in jth document(instance).
       
       -I
        Transform each word frequency into:
        fij*log(num of Documents/num of documents containing word i)
          where fij is the frequency of word i in jth document(instance)
       
       -N
        Whether to 0=not normalize/1=normalize all data/2=normalize test data only
        to average length of training documents (default 0=don't normalize).
       
       -L
        Convert all tokens to lowercase before adding to the dictionary.
       
       -stopwords-handler
        The stopwords handler to use (default Null).
       
       -stemmer <spec>
        The stemming algorithm (classname plus parameters) to use.
       
       -M <int>
        The minimum term frequency (default = 1).
       
       -O
        If this is set, the maximum number of words and the
        minimum term frequency is not enforced on a per-class
        basis but based on the documents in all the classes
        (even if a class attribute is set).
       
       -tokenizer <spec>
        The tokenizing algorithm (classname plus parameters) to use.
        (default: weka.core.tokenizers.WordTokenizer)
       
      Specified by:
      setOptions in interface OptionHandler
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
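As a rough illustration of how an options array might be assembled, the sketch below builds a `String[]` from a few of the flags listed above (`-C`, `-R`, `-W`, `-L`). The helper class `OptionsSketch` and its parameters are hypothetical; only the flag names come from the option list.

```java
import java.util.ArrayList;
import java.util.List;

final class OptionsSketch {

  /** Builds an options array from a handful of settings (flags taken from the list above). */
  static String[] buildOptions(boolean outputWordCounts, String range, int wordsToKeep,
      boolean lowerCaseTokens) {
    List<String> options = new ArrayList<>();
    if (outputWordCounts) {
      options.add("-C");
    }
    if (range != null) {
      options.add("-R");
      options.add(range);
    }
    options.add("-W");
    options.add(Integer.toString(wordsToKeep));
    if (lowerCaseTokens) {
      options.add("-L");
    }
    return options.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ", buildOptions(true, "first-3", 500, true)));
  }
}
```

An array built this way has the shape that `setOptions(String[])` expects: each flag is its own element, followed by its value as a separate element where one is required.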
    • setup

      public void setup(Instances inputFormat) throws Exception
      Throws:
      Exception
    • getInputFormat

      public Instances getInputFormat()
      Gets the currently set input format
      Returns:
      the current input format
    • readyToVectorize

      public boolean readyToVectorize()
      Returns true if this DictionaryBuilder is ready to vectorize incoming instances
      Returns:
      true if we can vectorize incoming instances
    • getVectorizedFormat

      public Instances getVectorizedFormat() throws Exception
      Get the output format
      Returns:
      the output format
      Throws:
      Exception - if there is no input format set and/or the dictionary has not been constructed yet.
    • vectorizeBatch

      public Instances vectorizeBatch(Instances batch, boolean setAvgDocLength) throws Exception
      Convert a batch of instances
      Parameters:
      batch - the batch to convert.
      setAvgDocLength - true to compute and set the average document length for this DictionaryBuilder from the batch - this uses the final pruned dictionary when computing doc lengths. When vectorizing non-training batches, and normalization has been turned on, this should be set to false.
      Returns:
      the converted batch
      Throws:
      Exception - if there is no input format set and/or the dictionary has not been constructed yet.
    • vectorizeInstance

      public Instance vectorizeInstance(Instance input) throws Exception
      Convert an input instance. Any string attributes not being vectorized do not have their values retained in memory (i.e. only the string values for the instance being vectorized are held in memory).
      Parameters:
      input - the input instance
      Returns:
      a converted instance
      Throws:
      Exception - if there is no input format set and/or the dictionary has not been constructed yet.
    • vectorizeInstance

      public Instance vectorizeInstance(Instance input, boolean retainStringAttValuesInMemory) throws Exception
      Convert an input instance.
      Parameters:
      input - the input instance
      retainStringAttValuesInMemory - true if the values of string attributes not being vectorized should be retained in memory
      Returns:
      a converted instance
      Throws:
      Exception - if there is no input format set and/or the dictionary has not been constructed yet
    • processInstance

      public void processInstance(Instance inst)
      Process an instance by tokenizing string attributes and updating the dictionary.
      Parameters:
      inst - the instance to process
    • reset

      public void reset()
      Clear the dictionary(s)
    • getDictionaries

      public Map<String,int[]>[] getDictionaries(boolean minFrequencyPrune) throws WekaException
      Get the current dictionary(s) (one per class for nominal class, if set). These are the dictionaries that are built/updated when processInstance() is called. The finalized dictionary (used for vectorization) can be obtained by calling finalizeDictionary() - this returns a consolidated (over classes) and pruned final dictionary.
      Parameters:
      minFrequencyPrune - prune the dictionaries of low frequency terms before returning them
      Returns:
      the dictionaries
      Throws:
      WekaException
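The dictionary type `Map<String,int[]>` can be pictured with a toy builder that updates a map once per tokenized document. This is a simplified model only: it assumes each `int[]` holds {total occurrences, document count}, which this page does not spell out, and the class `DictionaryBuildSketch` works on token lists rather than on Weka instances.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DictionaryBuildSketch {

  // Assumed layout: value[0] = total occurrences, value[1] = number of documents.
  private final Map<String, int[]> dictionary = new LinkedHashMap<>();

  /** Updates the dictionary with the tokens of one document. */
  void processDocument(List<String> tokens) {
    Set<String> seenInThisDocument = new HashSet<>();
    for (String token : tokens) {
      int[] counts = dictionary.computeIfAbsent(token, k -> new int[2]);
      counts[0]++;
      if (seenInThisDocument.add(token)) {
        counts[1]++; // first time this term appears in the current document
      }
    }
  }

  Map<String, int[]> getDictionary() {
    return dictionary;
  }

  public static void main(String[] args) {
    DictionaryBuildSketch builder = new DictionaryBuildSketch();
    builder.processDocument(Arrays.asList("a", "b", "a"));
    builder.processDocument(Arrays.asList("a", "c"));
    builder.getDictionary().forEach(
        (term, counts) -> System.out.println(term + " " + Arrays.toString(counts)));
  }
}
```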
    • aggregate

      public DictionaryBuilder aggregate(DictionaryBuilder toAgg) throws Exception
      Description copied from interface: Aggregateable
      Aggregate an object with this one
      Specified by:
      aggregate in interface Aggregateable<DictionaryBuilder>
      Parameters:
      toAgg - the object to aggregate
      Returns:
      the result of aggregation
      Throws:
      Exception - if the supplied object can't be aggregated for some reason
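Conceptually, aggregating two dictionaries amounts to combining their per-term counts. The sketch below merges two `Map<String,int[]>` instances by element-wise addition; it is an assumption about what aggregation involves, written with a hypothetical `AggregateSketch` class, and it ignores any configuration checks a real implementation might perform.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class AggregateSketch {

  /** Returns a new map whose counts are the element-wise sums of both inputs. */
  static Map<String, int[]> merge(Map<String, int[]> left, Map<String, int[]> right) {
    Map<String, int[]> merged = new LinkedHashMap<>();
    addAll(merged, left);
    addAll(merged, right);
    return merged;
  }

  private static void addAll(Map<String, int[]> target, Map<String, int[]> source) {
    for (Map.Entry<String, int[]> e : source.entrySet()) {
      int[] incoming = e.getValue();
      int[] sum = target.computeIfAbsent(e.getKey(), k -> new int[incoming.length]);
      for (int i = 0; i < sum.length; i++) {
        sum[i] += incoming[i];
      }
    }
  }

  public static void main(String[] args) {
    Map<String, int[]> left = new LinkedHashMap<>();
    left.put("a", new int[] {2, 1});
    Map<String, int[]> right = new LinkedHashMap<>();
    right.put("a", new int[] {1, 1});
    right.put("b", new int[] {4, 2});
    merge(left, right).forEach(
        (term, counts) -> System.out.println(term + " " + Arrays.toString(counts)));
  }
}
```

Because the merge allocates fresh arrays, neither input map is modified.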
    • finalizeAggregation

      public void finalizeAggregation() throws Exception
      Description copied from interface: Aggregateable
      Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
      Specified by:
      finalizeAggregation in interface Aggregateable<DictionaryBuilder>
      Throws:
      Exception - if the aggregation can't be finalized for some reason
    • finalizeDictionary

      public Map<String,int[]> finalizeDictionary() throws Exception
      Performs final pruning and consolidation according to the number of words to keep property. Finalization is performed just once, subsequent calls to this method return the finalized dictionary computed on the first call (unless reset() has been called in between).
      Returns:
      the consolidated and pruned final dictionary, or null if the input format did not contain any string attributes within the selected range to process
      Throws:
      Exception - if a problem occurs
    • loadDictionary

      public void loadDictionary(String filename, boolean plainText) throws IOException
      Load a dictionary from a file
      Parameters:
      filename - the file to load from
      plainText - true if the dictionary is in text format
      Throws:
      IOException - if a problem occurs
    • loadDictionary

      public void loadDictionary(File toLoad, boolean plainText) throws IOException
      Load a dictionary from a file
      Parameters:
      toLoad - the file to load from
      plainText - true if the dictionary is in text format
      Throws:
      IOException - if a problem occurs
    • loadDictionary

      public void loadDictionary(Reader reader) throws IOException
      Load a textual dictionary from a reader
      Parameters:
      reader - the reader to read from
      Throws:
      IOException - if a problem occurs
    • loadDictionary

      public void loadDictionary(InputStream is) throws IOException
      Load a binary dictionary from an input stream
      Parameters:
      is - the input stream to read from
      Throws:
      IOException - if a problem occurs
    • saveDictionary

      public void saveDictionary(String filename, boolean plainText) throws IOException
      Save the dictionary
      Parameters:
      filename - the file to save to
      plainText - true if the dictionary should be saved in text format
      Throws:
      IOException - if a problem occurs
    • saveDictionary

      public void saveDictionary(File toSave, boolean plainText) throws IOException
      Save a dictionary
      Parameters:
      toSave - the file to save to
      plainText - true if the dictionary should be saved in text format
      Throws:
      IOException - if a problem occurs
    • saveDictionary

      public void saveDictionary(Writer writer) throws IOException
      Save the dictionary in textual format
      Parameters:
      writer - the writer to write to
      Throws:
      IOException - if a problem occurs
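Writing the textual format to a `Writer` can be sketched in a few lines. The example emits one `term,doc_count` line per entry, following the format described in the class overview; the class `DictionaryWriteSketch` and its input type (a map from term to document count) are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

final class DictionaryWriteSketch {

  /** Writes one "term,doc_count" line per entry. */
  static void write(Map<String, Integer> docCounts, Writer writer) throws IOException {
    for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
      writer.write(e.getKey() + "," + e.getValue() + "\n");
    }
    writer.flush();
  }

  /** Convenience wrapper that returns the textual form as a String. */
  static String toText(Map<String, Integer> docCounts) {
    StringWriter out = new StringWriter();
    try {
      write(docCounts, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, Integer> docCounts = new LinkedHashMap<>();
    docCounts.put("apple", 3);
    docCounts.put("banana", 1);
    System.out.print(toText(docCounts));
  }
}
```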
    • saveDictionary

      public void saveDictionary(OutputStream os) throws IOException
      Save the dictionary in binary form
      Parameters:
      os - the output stream to write to
      Throws:
      IOException - if a problem occurs