Class NGramTokenizer

All Implemented Interfaces:
Serializable, Enumeration<String>, OptionHandler, RevisionHandler

public class NGramTokenizer extends CharacterDelimitedTokenizer
Splits a string into n-grams whose sizes range from the min to the max setting.

Valid options are:

 -delimiters <value>
  The delimiters to use
  (default ' \r\n\t.,;:'"()?!').
 
 -max <int>
  The max size of the Ngram (default = 3).
 
 -min <int>
  The min size of the Ngram (default = 1).
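
As a rough illustration of how the -min, -max, and -delimiters options interact, the following is a minimal sketch in plain Java. It does not use Weka: the class NGramSketch and the helper ngrams are hypothetical names, and the ordering of the results (largest n-grams first) is an assumption, not documented behaviour. The delimiter string is the default quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NGramSketch {

    // Default delimiter set, copied from the -delimiters option description.
    public static final String DEFAULT_DELIMITERS = " \r\n\t.,;:'\"()?!";

    /** Hypothetical helper: lists word n-grams of every size from max down to min. */
    public static List<String> ngrams(String text, String delimiters, int min, int max) {
        // Split the input into words on any of the delimiter characters.
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }

        List<String> result = new ArrayList<>();
        int smallest = Math.max(min, 1);              // sizes below 1 make no sense
        int largest = Math.min(max, words.size());    // cannot exceed the word count
        // Assumption: larger n-grams are listed before smaller ones.
        for (int n = largest; n >= smallest; n--) {
            for (int start = 0; start + n <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Defaults from the option list: min = 1, max = 3.
        for (String gram : ngrams("the quick brown fox", DEFAULT_DELIMITERS, 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

Under these assumptions, "the quick brown fox" with min = 1 and max = 3 gives nine tokens: two 3-grams, three 2-grams, and four 1-grams.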
 
Version:
$Revision: 10971 $
Author:
Sebastian Germesin (sebastian.germesin@dfki.de), FracPete (fracpete at waikato dot ac dot nz)
  • Constructor Details

    • NGramTokenizer

      public NGramTokenizer()
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing the tokenizer.
      Specified by:
      globalInfo in class Tokenizer
      Returns:
      a description suitable for displaying in the explorer/experimenter gui
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration of all the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class CharacterDelimitedTokenizer
      Returns:
      an enumeration of all available options.
    • getOptions

      public String[] getOptions()
      Gets the current option settings for the OptionHandler.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class CharacterDelimitedTokenizer
      Returns:
      the list of current option settings as an array of strings
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Valid options are:

       -delimiters <value>
        The delimiters to use
        (default ' \r\n\t.,;:'"()?!').
       
       -max <int>
        The max size of the Ngram (default = 3).
       
       -min <int>
        The min size of the Ngram (default = 1).
       
      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class CharacterDelimitedTokenizer
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
    • getNGramMaxSize

      public int getNGramMaxSize()
      Gets the max N of the NGram.
      Returns:
      the size (N) of the NGram.
    • setNGramMaxSize

      public void setNGramMaxSize(int value)
      Sets the max size of the Ngram.
      Parameters:
      value - the max size (N) of the NGram.
    • NGramMaxSizeTipText

      public String NGramMaxSizeTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNGramMinSize

      public void setNGramMinSize(int value)
      Sets the min size of the Ngram.
      Parameters:
      value - the min size (N) of the NGram.
    • getNGramMinSize

      public int getNGramMinSize()
      Gets the min N of the NGram.
      Returns:
      the size (N) of the NGram.
    • NGramMinSizeTipText

      public String NGramMinSizeTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • hasMoreElements

      public boolean hasMoreElements()
      Returns true if there are more elements available.
      Specified by:
      hasMoreElements in interface Enumeration<String>
      Specified by:
      hasMoreElements in class Tokenizer
      Returns:
      true if there are more elements available
    • nextElement

      public String nextElement()
      Returns the next n-gram. The enumeration covers N-grams, (N-1)-grams, and so on down to the min size (1-grams by default).
      Specified by:
      nextElement in interface Enumeration<String>
      Specified by:
      nextElement in class Tokenizer
      Returns:
      the next element
    • tokenize

      public void tokenize(String s)
      Sets the string to tokenize. Tokenization happens immediately.
      Specified by:
      tokenize in class Tokenizer
      Parameters:
      s - the string to tokenize
    • getRevision

      public String getRevision()
      Returns the revision string.
      Returns:
      the revision
    • main

      public static void main(String[] args)
      Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
      Parameters:
      args - the command-line options and strings to tokenize
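
The tokenize, hasMoreElements, and nextElement methods above follow the usual java.util.Enumeration pattern: set the input once, then pull tokens until none remain. The sketch below imitates that call pattern with a hypothetical stand-in class, EnumerationPatternSketch. It is not the Weka class: it splits on whitespace only, and the largest-first ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in showing the tokenize/hasMoreElements/nextElement call pattern. */
public class EnumerationPatternSketch implements Enumeration<String> {

    private final int min;
    private final int max;
    private final List<String> grams = new ArrayList<>();
    private int position = 0;

    public EnumerationPatternSketch(int min, int max) {
        this.min = Math.max(min, 1);
        this.max = max;
    }

    /** Sets the string to tokenize; all n-grams are computed immediately. */
    public void tokenize(String s) {
        grams.clear();
        position = 0;
        // Simplification: split on whitespace only, not on a configurable delimiter set.
        String trimmed = s.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        // Assumption: larger n-grams are produced before smaller ones.
        for (int n = Math.min(max, words.length); n >= min; n--) {
            for (int start = 0; start + n <= words.length; start++) {
                grams.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
            }
        }
    }

    @Override
    public boolean hasMoreElements() {
        return position < grams.size();
    }

    @Override
    public String nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        return grams.get(position++);
    }

    public static void main(String[] args) {
        EnumerationPatternSketch tokenizer = new EnumerationPatternSketch(1, 2);
        tokenizer.tokenize("to be or");
        // Pull tokens one at a time and print them to stdout.
        while (tokenizer.hasMoreElements()) {
            System.out.println(tokenizer.nextElement());
        }
    }
}
```

With min = 1 and max = 2, the input "to be or" yields the tokens "to be", "be or", "to", "be", "or" in that order under the sketch's assumptions. Calling tokenize again resets the enumeration for the new input.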