Package weka.core.converters
Class DictionarySaver
java.lang.Object
weka.core.converters.AbstractSaver
weka.core.converters.AbstractFileSaver
weka.core.converters.DictionarySaver
- All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, BatchConverter, FileSourcedConverter, IncrementalConverter, Saver, EnvironmentHandler, OptionHandler, RevisionHandler
public class DictionarySaver
extends AbstractFileSaver
implements BatchConverter, IncrementalConverter
Writes a dictionary constructed from string
attributes in incoming instances to a destination.
Valid options are:
-binary-dict Save as a binary serialized dictionary
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set the attribute selection mode. If false, only the selected attributes in the range are processed. If true, only the non-selected attributes are processed.
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-P <integer> Prune the dictionary every x instances (default = 0 - i.e. no periodic pruning)
-W <integer> The number of words (per class if there is a class attribute assigned) to attempt to keep.
-M <integer> The minimum term frequency to use when pruning the dictionary (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).
-sort Sort the dictionary alphabetically
-i <the input file> The input file
-o <the output file> The output file
- Version:
- $Revision: 12690 $
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com)
-
Field Summary
Fields inherited from interface weka.core.converters.Saver
BATCH, INCREMENTAL, NONE
-
Constructor Summary
-
Method Summary
Modifier and Type Method - Description
String getAttributeIndices() - Gets the current range selection.
Capabilities getCapabilities() - Returns the Capabilities of this saver.
boolean getDoNotOperateOnPerClassBasis() - Get the DoNotOperateOnPerClassBasis value.
String getFileDescription() - To be overridden.
boolean getInvertSelection() - Gets whether the supplied columns are to be processed or skipped.
boolean getKeepDictionarySorted() - Get whether to keep the dictionary sorted alphabetically or not.
boolean getLowerCaseTokens() - Gets whether the tokens are to be downcased or not.
int getMinTermFreq() - Get the MinTermFreq value.
long getPeriodicPruning() - Gets the rate at which the dictionary is periodically pruned, in number of instances (see the -P option).
String getRevision() - Returns the revision string.
boolean getSaveBinaryDictionary() - Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
Stemmer getStemmer() - Returns the current stemming algorithm, null if none is used.
StopwordsHandler getStopwordsHandler() - Gets the stopwords handler.
Tokenizer getTokenizer() - Returns the current tokenizer algorithm.
int getWordsToKeep() - Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
String globalInfo() - Returns a string describing this Saver.
static void main(String[] args)
void resetOptions() - Resets the options.
void resetWriter() - Sets the writer to null.
void setAttributeIndices(String rangeList) - Sets which attributes are to be worked on.
void setDestination(OutputStream output) - Sets the destination output stream.
void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) - Set the DoNotOperateOnPerClassBasis value.
void setInvertSelection(boolean invert) - Sets whether selected columns should be processed or skipped.
void setKeepDictionarySorted(boolean sorted) - Set whether to keep the dictionary sorted alphabetically or not.
void setLowerCaseTokens(boolean downCaseTokens) - Sets whether the tokens are to be downcased or not.
void setMinTermFreq(int newMinTermFreq) - Set the MinTermFreq value.
void setPeriodicPruning(long newPeriodicPruning) - Sets the rate at which the dictionary is periodically pruned, in number of instances (see the -P option).
void setSaveBinaryDictionary(boolean binary) - Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
void setStemmer(Stemmer value) - Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
void setStopwordsHandler(StopwordsHandler value) - Sets the stopwords handler to use.
void setTokenizer(Tokenizer value) - Sets the tokenizer algorithm to use.
void setWordsToKeep(int newWordsToKeep) - Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
void writeBatch() - Writes to a file in batch mode. To be overridden.
void writeIncremental(Instance inst) - Method for incremental saving.
Methods inherited from class weka.core.converters.AbstractFileSaver
cancel, filePrefix, getFileExtension, getFileExtensions, getOptions, getUseRelativePath, getWriter, listOptions, retrieveDir, retrieveFile, runFileSaver, setDestination, setDir, setDirAndPrefix, setEnvironment, setFile, setFilePrefix, setOptions, setUseRelativePath, useRelativePathTipText
Methods inherited from class weka.core.converters.AbstractSaver
doNotCheckCapabilitiesTipText, getDoNotCheckCapabilities, getInstances, getWriteMode, resetStructure, setDoNotCheckCapabilities, setInstances, setRetrieval, setStructure
-
Constructor Details
-
DictionarySaver
public DictionarySaver()
-
-
Method Details
-
globalInfo
Returns a string describing this Saver.
- Returns:
- a description of the Saver suitable for displaying in the explorer/experimenter gui
-
setSaveBinaryDictionary
@OptionMetadata(displayName="Save dictionary in binary form", description="Save as a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
public void setSaveBinaryDictionary(boolean binary)
Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
- Parameters:
binary
- true if the dictionary is to be saved as binary rather than plain text
-
getSaveBinaryDictionary
public boolean getSaveBinaryDictionary()
Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one.
- Returns:
- true if the dictionary is to be saved as binary rather than plain text
-
getAttributeIndices
Gets the current range selection.
- Returns:
- a string containing a comma separated list of ranges
-
setAttributeIndices
@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
public void setAttributeIndices(String rangeList)
Sets which attributes are to be worked on.
- Parameters:
rangeList
- a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1, e.g.: first-3,5,6-last
- Throws:
IllegalArgumentException
- if an invalid range list is supplied
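
The range syntax described above (1-based indices, "first" and "last" keywords, comma-separated ranges) and the effect of the -V flag can be illustrated with a small standalone sketch. This is not Weka's own range-parsing code: the class RangeListSketch and its methods are hypothetical names introduced here, and details such as rejecting a reversed range are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of the range-list syntax; not part of Weka. */
public class RangeListSketch {

    /** Converts one 1-based token ("first", "last", or a number) to a 0-based index. */
    public static int toIndex(String token, int numAttributes) {
        String t = token.trim().toLowerCase(Locale.ROOT);
        int oneBased;
        if (t.equals("first")) {
            oneBased = 1;
        } else if (t.equals("last")) {
            oneBased = numAttributes;
        } else {
            try {
                oneBased = Integer.parseInt(t);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid index: " + token);
            }
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + token);
        }
        return oneBased - 1;
    }

    /**
     * Returns the sorted 0-based indices selected by rangeList.
     * If invert is true (compare the -V flag), the non-selected indices are returned instead.
     */
    public static List<Integer> parse(String rangeList, int numAttributes, boolean invert) {
        Set<Integer> selected = new HashSet<>();
        for (String part : rangeList.split(",")) {
            String[] bounds = part.split("-");
            if (bounds.length < 1 || bounds.length > 2) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            int lo = toIndex(bounds[0], numAttributes);
            int hi = bounds.length == 2 ? toIndex(bounds[1], numAttributes) : lo;
            if (lo > hi) {
                // Assumption of this sketch: reversed ranges are rejected.
                throw new IllegalArgumentException("Reversed range: " + part);
            }
            for (int i = lo; i <= hi; i++) {
                selected.add(i);
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            if (selected.contains(i) != invert) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The example from the parameter description, applied to 8 attributes.
        System.out.println(parse("first-3,5,6-last", 8, false)); // [0, 1, 2, 4, 5, 6, 7]
        System.out.println(parse("first-3,5,6-last", 8, true));  // [3]
    }
}
```

With 8 attributes, the example string selects every attribute except the fourth, so inverting the selection leaves only that one.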
-
getInvertSelection
public boolean getInvertSelection()
Gets whether the supplied columns are to be processed or skipped.
- Returns:
- true if the supplied columns will be kept
-
setInvertSelection
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
public void setInvertSelection(boolean invert)
Sets whether selected columns should be processed or skipped.
- Parameters:
invert
- the new invert setting
-
getLowerCaseTokens
public boolean getLowerCaseTokens()
Gets whether the tokens are to be downcased or not.
- Returns:
- true if the tokens are to be downcased.
-
setLowerCaseTokens
@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased or not. (Doesn't affect non-alphabetic characters in tokens.)
- Parameters:
downCaseTokens
- should be true if only lower case tokens are to be formed.
-
setStemmer
@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
public void setStemmer(Stemmer value)
Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
- Parameters:
value
- the configured stemming algorithm, or null
-
getStemmer
Returns the current stemming algorithm, null if none is used.
- Returns:
- the current stemming algorithm, null if none set
-
setStopwordsHandler
@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)
Sets the stopwords handler to use.
- Parameters:
value
- the stopwords handler, if null, Null is used
-
getStopwordsHandler
Gets the stopwords handler.
- Returns:
- the stopwords handler
-
setTokenizer
@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
public void setTokenizer(Tokenizer value)
Sets the tokenizer algorithm to use.
- Parameters:
value
- the configured tokenizing algorithm
-
getTokenizer
Returns the current tokenizer algorithm.
- Returns:
- the current tokenizer algorithm
-
getPeriodicPruning
public long getPeriodicPruning()
Gets the rate at which the dictionary is periodically pruned. Per the -P option, this is a number of instances: the dictionary is pruned every x instances, and 0 means no periodic pruning.
- Returns:
- the rate at which the dictionary is periodically pruned
-
setPeriodicPruning
@OptionMetadata(displayName="Periodic pruning rate", description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)", commandLineParamName="P", commandLineParamSynopsis="-P <integer>", displayOrder=14)
public void setPeriodicPruning(long newPeriodicPruning)
Sets the rate at which the dictionary is periodically pruned. Per the -P option, this is a number of instances: the dictionary is pruned every x instances, and 0 means no periodic pruning.
- Parameters:
newPeriodicPruning
- the rate at which the dictionary is periodically pruned
-
getWordsToKeep
public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Returns:
- the target number of words in the output vector (per class if assigned).
-
setWordsToKeep
@OptionMetadata(displayName="Number of words to attempt to keep", description="The number of words (per class if there is a class attribute assigned) to attempt to keep.", commandLineParamName="W", commandLineParamSynopsis="-W <integer>", displayOrder=15)
public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
- Parameters:
newWordsToKeep
- the target number of words in the output vector (per class if assigned).
-
getMinTermFreq
public int getMinTermFreq()
Get the MinTermFreq value.
- Returns:
- the MinTermFreq value.
-
setMinTermFreq
@OptionMetadata(displayName="Minimum term frequency", description="The minimum term frequency to use when pruning the dictionary\n(default = 1).", commandLineParamName="M", commandLineParamSynopsis="-M <integer>", displayOrder=16)
public void setMinTermFreq(int newMinTermFreq)
Set the MinTermFreq value.
- Parameters:
newMinTermFreq
- The new MinTermFreq value.
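
How a minimum term frequency and a words-to-keep target might interact when a dictionary is pruned can be sketched as follows. This is a simplified illustration under stated assumptions, not the pruning logic DictionarySaver actually uses: the class DictionaryPruneSketch and its prune method are hypothetical, per-class handling is ignored (roughly the situation the -O flag describes), and the cut-off at wordsToKeep is strict, whereas the documentation only promises an attempt to keep that many words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of frequency-based dictionary pruning; not part of Weka. */
public class DictionaryPruneSketch {

    /**
     * Keeps at most wordsToKeep words whose count is at least minTermFreq,
     * preferring higher counts and breaking ties alphabetically.
     */
    public static Map<String, Integer> prune(Map<String, Integer> counts,
                                             int minTermFreq, int wordsToKeep) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count, then by word, so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (kept.size() >= wordsToKeep) {
                break; // target size reached
            }
            if (e.getValue() >= minTermFreq) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("cat", 2);
        counts.put("dog", 2);
        counts.put("emu", 1);
        // "emu" falls below the minimum frequency; "dog" loses the alphabetical tie-break.
        System.out.println(prune(counts, 2, 2)); // {the=5, cat=2}
    }
}
```

Raising wordsToKeep to 10 in the same call would keep "dog" as well, while "emu" would still be dropped by the frequency threshold.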
-
getDoNotOperateOnPerClassBasis
public boolean getDoNotOperateOnPerClassBasis()
Get the DoNotOperateOnPerClassBasis value.
- Returns:
- the DoNotOperateOnPerClassBasis value.
-
setDoNotOperateOnPerClassBasis
@OptionMetadata(displayName="Do not operate on a per-class basis", description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).", commandLineParamName="O", commandLineParamSynopsis="-O", commandLineParamIsFlag=true, displayOrder=17)
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Set the DoNotOperateOnPerClassBasis value.
- Parameters:
newDoNotOperateOnPerClassBasis
- The new DoNotOperateOnPerClassBasis value.
-
setKeepDictionarySorted
@OptionMetadata(displayName="Sort dictionary", description="Sort the dictionary alphabetically", commandLineParamName="sort", commandLineParamSynopsis="-sort", commandLineParamIsFlag=true, displayOrder=18)
public void setKeepDictionarySorted(boolean sorted)
Set whether to keep the dictionary sorted alphabetically or not.
- Parameters:
sorted
- true to keep the dictionary sorted
-
getKeepDictionarySorted
public boolean getKeepDictionarySorted()
Get whether to keep the dictionary sorted alphabetically or not.
- Returns:
- true if the dictionary is kept sorted
-
getCapabilities
Returns the Capabilities of this saver.
- Specified by:
getCapabilities
in interface CapabilitiesHandler
- Overrides:
getCapabilities
in class AbstractSaver
- Returns:
- the capabilities of this object
-
getFileDescription
Description copied from class: AbstractFileSaver
To be overridden.
- Specified by:
getFileDescription
in interface FileSourcedConverter
- Specified by:
getFileDescription
in class AbstractFileSaver
- Returns:
- the file type description.
-
writeIncremental
Description copied from class: AbstractSaver
Method for incremental saving. Standard behaviour: no incremental saving is possible, therefore throw an IOException. An incremental saving process is stopped by calling this method with null.
- Specified by:
writeIncremental
in interface Saver
- Overrides:
writeIncremental
in class AbstractSaver
- Parameters:
inst
- the instance to be saved
- Throws:
IOException
- if the instance cannot be written to the specified destination
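
The calling convention described above, where items are passed one at a time and a final null call ends the process, can be sketched without any Weka dependency. Everything in the block is an assumption made for illustration: the class IncrementalProtocolSketch is hypothetical, plain strings stand in for Instance objects, tokens are split on whitespace and lowercased, and the "word,count" output layout is invented here rather than taken from DictionarySaver's file format.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of a null-terminated incremental saving protocol; not part of Weka. */
public class IncrementalProtocolSketch {

    private final Writer out;
    // TreeMap keeps the words sorted alphabetically (compare the -sort option).
    private final Map<String, Integer> counts = new TreeMap<>();
    private boolean finished = false;

    public IncrementalProtocolSketch(Writer out) {
        this.out = out;
    }

    /** Accumulates token counts; a null document ends the process and writes the dictionary. */
    public void writeIncremental(String document) throws IOException {
        if (finished) {
            throw new IOException("Incremental saving has already been stopped");
        }
        if (document == null) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.write(e.getKey() + "," + e.getValue() + "\n");
            }
            out.flush();
            finished = true;
            return;
        }
        for (String token : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Runs the protocol on two small documents and returns what was written. */
    public static String demo() {
        StringWriter sink = new StringWriter();
        IncrementalProtocolSketch saver = new IncrementalProtocolSketch(sink);
        try {
            saver.writeIncremental("the cat sat");
            saver.writeIncremental("The dog sat");
            saver.writeIncremental(null); // signals the end of the stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Nothing is written until the null call arrives, which is why a caller that forgets the final call would end up with an empty destination in this sketch.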
-
writeBatch
Description copied from class: AbstractSaver
Writes to a file in batch mode. To be overridden.
- Specified by:
writeBatch
in interface Saver
- Specified by:
writeBatch
in class AbstractSaver
- Throws:
IOException
- if writing is not possible
-
resetOptions
public void resetOptions()
Description copied from class: AbstractFileSaver
Resets the options.
- Overrides:
resetOptions
in class AbstractFileSaver
-
resetWriter
public void resetWriter()
Description copied from class: AbstractFileSaver
Sets the writer to null.
- Overrides:
resetWriter
in class AbstractFileSaver
-
setDestination
Description copied from class: AbstractFileSaver
Sets the destination output stream.
- Specified by:
setDestination
in interface Saver
- Overrides:
setDestination
in class AbstractFileSaver
- Parameters:
output
- the output stream.
- Throws:
IOException
- if the destination cannot be set
-
getRevision
Description copied from interface: RevisionHandler
Returns the revision string.
- Specified by:
getRevision
in interface RevisionHandler
- Returns:
- the revision
-
main
-