Package weka.core.tokenizers
Class CharacterNGramTokenizer
java.lang.Object
weka.core.tokenizers.Tokenizer
weka.core.tokenizers.CharacterNGramTokenizer
- All Implemented Interfaces:
Serializable, Enumeration<String>, OptionHandler, RevisionHandler
Splits a string into character n-grams whose sizes range from a minimum to a maximum number of characters.
Valid options are:
-max <int> The max size of the n-gram (default = 3).
-min <int> The min size of the n-gram (default = 1).
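As a rough illustration of what extraction with a min and max size means, the standalone Java sketch below lists every character substring whose length lies between the two bounds. It does not use Weka: the class name `CharNGramSketch`, the method `charNGrams`, and the emission order (by start position, then by increasing length) are assumptions made for illustration, and the tokenizer itself may order its output differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: lists the character n-grams of a string.
final class CharNGramSketch {

    // Returns all substrings of s whose length lies in [min, max].
    // Assumed order: by start position, then by increasing length.
    static List<String> charNGrams(String s, int min, int max) {
        if (min < 1 || max < min) {
            throw new IllegalArgumentException("require 1 <= min <= max");
        }
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            // Stop growing the n-gram once it would run past the end of s.
            for (int n = min; n <= max && start + n <= s.length(); n++) {
                grams.add(s.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Uses the documented defaults: min = 1, max = 3.
        System.out.println(charNGrams("weka", 1, 3));
    }
}
```

With the documented defaults, a four-character input yields nine n-grams under this sketch: three starting at each of the first two positions, two at the third, and one at the last.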
- Version:
- $Revision: 10971 $
- Author:
- Sebastian Germesin (sebastian.germesin@dfki.de), Eibe Frank (eibe@cs.waikato.ac.nz)
-
Constructor Summary
CharacterNGramTokenizer()
-
Method Summary
Modifier and Type     Method                        Description
int                   getNGramMaxSize()             Gets the max N of the n-gram.
int                   getNGramMinSize()             Gets the min N of the n-gram.
String[]              getOptions()                  Gets the current option settings for the OptionHandler.
String                getRevision()                 Returns the revision string.
String                globalInfo()                  Returns a string describing the tokenizer.
boolean               hasMoreElements()             Returns true if there are more elements available.
Enumeration<Option>   listOptions()                 Returns an enumeration of all the available options.
static void           main(String[] args)           Runs the tokenizer with the given options and strings to tokenize.
String                nextElement()                 Returns N-grams and also (N-1)-grams and so on.
String                NGramMaxSizeTipText()         Returns the tip text for this property.
String                NGramMinSizeTipText()         Returns the tip text for this property.
void                  setNGramMaxSize(int value)    Sets the max size of the n-gram.
void                  setNGramMinSize(int value)    Sets the min size of the n-gram.
void                  setOptions(String[] options)  Parses a given list of options.
void                  tokenize(String s)            Sets the string to tokenize.
Methods inherited from class weka.core.tokenizers.Tokenizer
runTokenizer, tokenize
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Enumeration
asIterator
-
Constructor Details
-
CharacterNGramTokenizer
public CharacterNGramTokenizer()
-
-
Method Details
-
globalInfo
public String globalInfo()
Returns a string describing the tokenizer.
- Specified by:
globalInfo in class Tokenizer
- Returns:
- a description suitable for displaying in the explorer/experimenter GUI
-
listOptions
public Enumeration<Option> listOptions()
Returns an enumeration of all the available options.
- Specified by:
listOptions in interface OptionHandler
- Overrides:
listOptions in class Tokenizer
- Returns:
- an enumeration of all available options.
-
getOptions
public String[] getOptions()
Gets the current option settings for the OptionHandler.
- Specified by:
getOptions in interface OptionHandler
- Overrides:
getOptions in class Tokenizer
- Returns:
- the list of current option settings as an array of strings
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:
-max <int> The max size of the n-gram (default = 3).
-min <int> The min size of the n-gram (default = 1).
- Specified by:
setOptions in interface OptionHandler
- Overrides:
setOptions in class Tokenizer
- Parameters:
options - the list of options as an array of strings
- Throws:
Exception - if an option is not supported
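To make the option format concrete, here is a minimal standalone sketch of how a `-min`/`-max` option array could be interpreted. It is not Weka's parsing code, which this page does not show: the class `NGramOptionsSketch` and its `parse` method are hypothetical, and rejecting unknown flags with an `IllegalArgumentException` is an assumption chosen to mirror the documented "if an option is not supported" case.

```java
// Hypothetical stand-in for option parsing; not Weka's implementation.
final class NGramOptionsSketch {
    int min = 1; // documented default for -min
    int max = 3; // documented default for -max

    // Reads flag/value pairs such as {"-min", "2", "-max", "4"}.
    static NGramOptionsSketch parse(String[] options) {
        NGramOptionsSketch parsed = new NGramOptionsSketch();
        for (int i = 0; i < options.length; i++) {
            String flag = options[i];
            if (!flag.equals("-min") && !flag.equals("-max")) {
                // Assumption: unknown flags are rejected outright.
                throw new IllegalArgumentException("Unsupported option: " + flag);
            }
            if (i + 1 >= options.length) {
                throw new IllegalArgumentException("Missing value for " + flag);
            }
            int value = Integer.parseInt(options[++i]);
            if (flag.equals("-min")) {
                parsed.min = value;
            } else {
                parsed.max = value;
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        NGramOptionsSketch o = parse(new String[] {"-min", "2", "-max", "4"});
        System.out.println("min=" + o.min + ", max=" + o.max);
    }
}
```

Options that are omitted keep their documented defaults, so passing only `-max 5` leaves the minimum size at 1.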
-
getNGramMaxSize
public int getNGramMaxSize()
Gets the max N of the n-gram.
- Returns:
- the maximum size (N) of the n-gram.
-
setNGramMaxSize
public void setNGramMaxSize(int value)
Sets the max size of the n-gram.
- Parameters:
value - the maximum size of the n-gram.
-
NGramMaxSizeTipText
public String NGramMaxSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
-
setNGramMinSize
public void setNGramMinSize(int value)
Sets the min size of the n-gram.
- Parameters:
value - the minimum size of the n-gram.
-
getNGramMinSize
public int getNGramMinSize()
Gets the min N of the n-gram.
- Returns:
- the minimum size (N) of the n-gram.
-
NGramMinSizeTipText
public String NGramMinSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter GUI
-
hasMoreElements
public boolean hasMoreElements()
Returns true if there are more elements available.
- Specified by:
hasMoreElements in interface Enumeration<String>
- Specified by:
hasMoreElements in class Tokenizer
- Returns:
- true if there are more elements available
-
nextElement
public String nextElement()
Returns the next token: N-grams and also (N-1)-grams and so on.
- Specified by:
nextElement in interface Enumeration<String>
- Specified by:
nextElement in class Tokenizer
- Returns:
- the next element
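The `tokenize`, `hasMoreElements`, and `nextElement` methods together form a stateful enumeration: set the input once, then pull tokens until none remain. The sketch below imitates that contract in plain Java without depending on Weka. The class `NGramEnumerationSketch`, its cursor fields, and the order in which n-grams are returned are assumptions for illustration, not a description of the actual implementation.

```java
import java.util.Enumeration;
import java.util.NoSuchElementException;

// Hypothetical stand-in, not Weka code: a stateful character n-gram enumeration.
final class NGramEnumerationSketch implements Enumeration<String> {
    private final int min;
    private final int max;
    private String text = "";
    private int start; // start index of the next n-gram
    private int n;     // length of the next n-gram

    NGramEnumerationSketch(int min, int max) {
        if (min < 1 || max < min) {
            throw new IllegalArgumentException("require 1 <= min <= max");
        }
        this.min = min;
        this.max = max;
        this.n = min;
    }

    // Sets the string to tokenize and resets the cursor.
    void tokenize(String s) {
        text = s;
        start = 0;
        n = min;
    }

    @Override
    public boolean hasMoreElements() {
        // The pending n-gram must still fit inside the text.
        return start + n <= text.length();
    }

    @Override
    public String nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        String gram = text.substring(start, start + n);
        n++;
        // Move to the next start position once the size limit or text end is hit.
        if (n > max || start + n > text.length()) {
            n = min;
            start++;
        }
        return gram;
    }

    public static void main(String[] args) {
        NGramEnumerationSketch t = new NGramEnumerationSketch(1, 2);
        t.tokenize("abc");
        while (t.hasMoreElements()) {
            System.out.println(t.nextElement());
        }
    }
}
```

Calling `tokenize` again resets the cursor, so one instance can be reused for several input strings in this sketch.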
-
tokenize
public void tokenize(String s)
Sets the string to tokenize.
-
getRevision
public String getRevision()
Returns the revision string.
- Returns:
- the revision
-
main
public static void main(String[] args)
Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
- Parameters:
args - the commandline options and strings to tokenize
-