Package weka.core.converters
Class CSVLoader
java.lang.Object
weka.core.converters.AbstractLoader
weka.core.converters.AbstractFileLoader
weka.core.converters.CSVLoader
- All Implemented Interfaces:
Serializable, BatchConverter, FileSourcedConverter, IncrementalConverter, Loader, EnvironmentHandler, OptionHandler, RevisionHandler
public class CSVLoader
extends AbstractFileLoader
implements BatchConverter, IncrementalConverter, OptionHandler
Reads a source that is in comma-separated format (the default). The column separator can be changed from comma to tab or another character, and you can specify string enclosures, whether a header row is present, and which attributes are to be forced to be nominal or date. Can operate in batch or incremental mode.
In batch mode, a buffer is used to process a fixed number of rows in memory at any one time, and the data is dumped to a temporary file. This allows the legal values for nominal attributes to be determined automatically. The final ARFF file is produced in a second pass over the temporary file using the structure determined on the first pass.
In incremental mode, the first buffer full of rows is used to determine the structure automatically. After this, all rows are read and output incrementally. An error occurs if a row contains a nominal value not seen in the initial buffer. In that case, either increase the size of the initial buffer or explicitly provide the legal values of all nominal attributes using the -L (setNominalLabelSpecs) option.
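The difference between the two modes can be sketched without Weka. The following is a minimal illustration, not CSVLoader's actual implementation: the class and method names (NominalScanSketch, collectLabels, encodeWithInitialBuffer) are hypothetical, and a single column of string values stands in for the CSV rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; this is not Weka code. */
final class NominalScanSketch {

    /** Batch idea: a full first pass sees every value, so all labels are known. */
    static Set<String> collectLabels(List<String> column) {
        return new LinkedHashSet<>(column); // keeps first-seen order, drops repeats
    }

    /**
     * Incremental idea: labels are fixed after the first bufferSize rows.
     * A value that appears only later cannot be encoded and is an error.
     */
    static List<Integer> encodeWithInitialBuffer(List<String> column, int bufferSize) {
        int seen = Math.min(bufferSize, column.size());
        List<String> legal = new ArrayList<>(collectLabels(column.subList(0, seen)));
        List<Integer> encoded = new ArrayList<>();
        for (String value : column) {
            int index = legal.indexOf(value);
            if (index < 0) {
                throw new IllegalStateException("Label not seen in initial buffer: " + value);
            }
            encoded.add(index);
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("red", "green", "red", "blue");
        System.out.println(collectLabels(column));
        System.out.println(encodeWithInitialBuffer(column, 4));
    }
}
```

With a buffer of only 2 rows in this sketch, "blue" is never seen while the labels are being determined, so encoding fails. A larger buffer or labels supplied up front avoids this, which mirrors the advice in the description above.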
Valid options are:
-H No header row present in the data.
-N <range> The range of attributes to force type to be NOMINAL. 'first' and 'last' are accepted as well. Examples: "first-last", "1,4,5-27,50-last" (default: -none-)
-L <nominal label spec> Optional specification of legal labels for nominal attributes. May be specified multiple times. Batch mode can determine this automatically (and so can incremental mode if the first in-memory buffer load of instances contains an example of each legal value). The spec contains two parts separated by a ":". The first part can be a range of attribute indexes or a comma-separated list of attribute names; the second part is a comma-separated list of labels. E.g. "1,2,4-6:red,green,blue" or "att1,att2:red,green,blue"
-S <range> The range of attributes to force type to be STRING. 'first' and 'last' are accepted as well. Examples: "first-last", "1,4,5-27,50-last" (default: -none-)
-D <range> The range of attributes to force type to be DATE. 'first' and 'last' are accepted as well. Examples: "first-last", "1,4,5-27,50-last" (default: -none-)
-format <date format> The date formatting string to use to parse date values. (default: "yyyy-MM-dd'T'HH:mm:ss")
-R <range> The range of attributes to force type to be NUMERIC. 'first' and 'last' are accepted as well. Examples: "first-last", "1,4,5-27,50-last" (default: -none-)
-M <str> The string representing a missing value. (default: ?)
-F <separator> The field separator to be used. '\t' can be used as well. (default: ',')
-E <enclosures> The enclosure character(s) to use for strings. Specify as a comma separated list (e.g. ",') (default: ",')
-B <num> The size of the in memory buffer (in rows). (default: 100)
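The range syntax accepted by -N, -S, -D and -R uses 1-based indexes plus the keywords 'first' and 'last'. The sketch below shows one way such a string could be expanded into zero-based column indexes. It is an illustration with hypothetical names (RangeSketch, expand), not Weka's own range parser, and it performs no validation or de-duplication.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Weka code. */
final class RangeSketch {

    /** Expands a 1-based range string such as "1,4,5-27,50-last" to 0-based indexes. */
    static List<Integer> expand(String range, int numAttributes) {
        List<Integer> result = new ArrayList<>();
        for (String part : range.split(",")) {
            String[] ends = part.trim().split("-");
            int low = resolve(ends[0], numAttributes);
            int high = ends.length > 1 ? resolve(ends[1], numAttributes) : low;
            for (int i = low; i <= high; i++) {
                result.add(i - 1); // convert to 0-based
            }
        }
        return result;
    }

    /** Maps 'first' and 'last' to 1 and numAttributes; otherwise parses the number. */
    private static int resolve(String token, int numAttributes) {
        String t = token.trim();
        if (t.equals("first")) {
            return 1;
        }
        if (t.equals("last")) {
            return numAttributes;
        }
        return Integer.parseInt(t);
    }

    public static void main(String[] args) {
        System.out.println(expand("1,4,5-27,50-last", 60));
    }
}
```

Here the number of attributes has to be known before 'last' can be resolved, which is why the sketch takes it as a parameter.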
- Version:
- $Revision: 14115 $
- Author:
- Mark Hall (mhall{[at]}pentaho{[dot]}com)
-
Nested Class Summary
Nested classes/interfaces inherited from interface weka.core.converters.Loader
Loader.StructureNotReadyException
-
Field Summary
Fields inherited from class weka.core.converters.AbstractFileLoader
FILE_EXTENSION_COMPRESSED
Fields inherited from interface weka.core.converters.Loader
BATCH, INCREMENTAL, NONE
-
Constructor Summary
CSVLoader()  Default constructor.
-
Method Summary
Modifier and Type  Method  Description
String  bufferSizeTipText()  Returns the tip text for this property.
String  dateAttributesTipText()  Returns the tip text for this property.
String  dateFormatTipText()  Returns the tip text for this property.
String  enclosureCharactersTipText()  Returns the tip text for this property.
String  fieldSeparatorTipText()  Returns the tip text for this property.
int  getBufferSize()  Get the buffer size to use, i.e. the number of rows to load and process in memory at any one time.
Instances  getDataSet()  Return the full data set.
String  getDateAttributes()  Returns the current attribute range to be forced to type date.
String  getDateFormat()  Get the format to use for parsing date values.
String  getEnclosureCharacters()  Get the character(s) to use/recognize as string enclosures.
String  getFieldSeparator()  Returns the character used as column separator.
String  getFileDescription()  Get a one line description of the type of file.
String  getFileExtension()  Get the file extension used for this type of file.
String[]  getFileExtensions()  Gets all the file extensions used for this type of file.
String  getMissingValue()  Returns the current placeholder for missing values.
Instance  getNextInstance(Instances structure)  Read the data set incrementally: get the next instance in the data set, or return null if there are no more instances to get.
boolean  getNoHeaderRowPresent()  Get whether there is no header row in the data.
String  getNominalAttributes()  Returns the current attribute range to be forced to type nominal.
Object[]  getNominalLabelSpecs()  Get label specifications for nominal attributes.
String  getNumericAttributes()  Gets the attribute range to be forced to type numeric.
String[]  getOptions()  Gets the current option settings for the OptionHandler.
String  getRevision()  Returns the revision string.
String  getStringAttributes()  Returns the current attribute range to be forced to type string.
Instances  getStructure()  Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
String  globalInfo()  Returns a string describing this loader.
Enumeration<Option>  listOptions()  Returns an enumeration of all the available options.
static void  main(String[] args)  Main method.
String  missingValueTipText()  Returns the tip text for this property.
String  noHeaderRowPresentTipText()  Returns the tip text for this property.
String  nominalAttributesTipText()  Returns the tip text for this property.
String  nominalLabelSpecsTipText()  Returns the tip text for this property.
String  numericAttributesTipText()  Returns the tip text for this property.
void  reset()  Resets the loader ready to read a new data set.
void  setBufferSize(int buff)  Set the buffer size to use, i.e. the number of rows to load and process in memory at any one time.
void  setDateAttributes(String value)  Set the attribute range to be forced to type date.
void  setDateFormat(String value)  Set the format to use for parsing date values.
void  setEnclosureCharacters(String enclosure)  Set the character(s) to use/recognize as string enclosures.
void  setFieldSeparator(String value)  Sets the character used as column separator.
void  setMissingValue(String value)  Sets the placeholder for missing values.
void  setNoHeaderRowPresent(boolean b)  Set whether there is no header row in the data.
void  setNominalAttributes(String value)  Sets the attribute range to be forced to type nominal.
void  setNominalLabelSpecs(Object[] specs)  Set label specifications for nominal attributes.
void  setNumericAttributes(String value)  Sets the attribute range to be forced to type numeric.
void  setOptions(String[] options)  Sets the OptionHandler's options using the given list.
void  setSource(File file)  Resets the Loader object and sets the source of the data set to be the supplied File object.
void  setSource(InputStream input)  Resets the Loader object and sets the source of the data set to be the supplied Stream object.
void  setStringAttributes(String value)  Sets the attribute range to be forced to type string.
String  stringAttributesTipText()  Returns the tip text for this property.
Methods inherited from class weka.core.converters.AbstractFileLoader
getUseRelativePath, retrieveFile, runFileLoader, setEnvironment, setFile, setUseRelativePath, useRelativePathTipText
Methods inherited from class weka.core.converters.AbstractLoader
setRetrieval
-
Field Details
-
FILE_EXTENSION
the file extension.
-
-
Constructor Details
-
CSVLoader
public CSVLoader()
Default constructor.
-
-
Method Details
-
main
Main method.
- Parameters:
args
- should contain the name of an input file.
-
globalInfo
Returns a string describing this loader.
- Returns:
- a description of the loader suitable for displaying in the explorer/experimenter gui
-
getFileExtension
Description copied from interface: FileSourcedConverter
Get the file extension used for this type of file.
- Specified by:
getFileExtension
in interface FileSourcedConverter
- Returns:
- the file extension
-
getFileExtensions
Description copied from interface: FileSourcedConverter
Gets all the file extensions used for this type of file.
- Specified by:
getFileExtensions
in interface FileSourcedConverter
- Returns:
- the file extensions
-
getFileDescription
Description copied from interface: FileSourcedConverter
Get a one line description of the type of file.
- Specified by:
getFileDescription
in interface FileSourcedConverter
- Returns:
- a description of the file type
-
getRevision
Description copied from interface: RevisionHandler
Returns the revision string.
- Specified by:
getRevision
in interface RevisionHandler
- Returns:
- the revision
-
noHeaderRowPresentTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNoHeaderRowPresent
public boolean getNoHeaderRowPresent()
Get whether there is no header row in the data.
- Returns:
- true if there is no header row in the data
-
setNoHeaderRowPresent
public void setNoHeaderRowPresent(boolean b)
Set whether there is no header row in the data.
- Parameters:
b
- true if there is no header row in the data
-
getMissingValue
Returns the current placeholder for missing values.
- Returns:
- the placeholder
-
setMissingValue
Sets the placeholder for missing values.
- Parameters:
value
- the placeholder
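As a rough illustration of what a missing-value placeholder does, the sketch below converts the cells of a numeric column and maps the placeholder to a marker value. The names (MissingValueSketch, parseNumericColumn) are hypothetical, and the use of Double.NaN as the marker is an assumption made for this example rather than something this page states.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not Weka code. */
final class MissingValueSketch {

    /** Parses cells as doubles; a cell equal to the placeholder becomes NaN (assumed marker). */
    static double[] parseNumericColumn(String[] cells, String missingPlaceholder) {
        double[] values = new double[cells.length];
        for (int i = 0; i < cells.length; i++) {
            String cell = cells[i].trim();
            values[i] = cell.equals(missingPlaceholder) ? Double.NaN : Double.parseDouble(cell);
        }
        return values;
    }

    public static void main(String[] args) {
        // "?" is the documented default placeholder.
        System.out.println(Arrays.toString(parseNumericColumn(new String[] {"1.5", "?", "3"}, "?")));
    }
}
```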
-
missingValueTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getStringAttributes
Returns the current attribute range to be forced to type string.
- Returns:
- the range
-
setStringAttributes
Sets the attribute range to be forced to type string.
- Parameters:
value
- the range
-
stringAttributesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNominalAttributes
Returns the current attribute range to be forced to type nominal.
- Returns:
- the range
-
setNominalAttributes
Sets the attribute range to be forced to type nominal.
- Parameters:
value
- the range
-
nominalAttributesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNumericAttributes
Gets the attribute range to be forced to type numeric.
- Returns:
- the range
-
setNumericAttributes
Sets the attribute range to be forced to type numeric.
- Parameters:
value
- the range
-
numericAttributesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getDateFormat
Get the format to use for parsing date values.
- Returns:
- the format to use for parsing date values.
-
setDateFormat
Set the format to use for parsing date values.
- Parameters:
value
- the format to use.
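The documented default format, "yyyy-MM-dd'T'HH:mm:ss", has the shape of a java.text.SimpleDateFormat pattern. Assuming the format string follows those conventions (an assumption, since this page does not say so), the sketch below shows how a value would be parsed with such a pattern. The class and method names are hypothetical, and fixing the time zone to UTC is a choice made here only so the result is reproducible.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

/** Hypothetical illustration only; this is not Weka code. */
final class DateFormatSketch {

    /** The default pattern listed for the -format option. */
    static final String DEFAULT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    /** Parses value with the given pattern and returns epoch milliseconds (UTC assumed). */
    static long toEpochMillisUtc(String value, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject impossible dates instead of rolling them over
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(value).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Value does not match pattern: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillisUtc("2000-01-01T00:00:00", DEFAULT_PATTERN));
    }
}
```

A value such as "01/02/2000" does not match the default pattern and is rejected in this sketch, which is the kind of mismatch a custom format string is meant to resolve.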
-
dateFormatTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getDateAttributes
Returns the current attribute range to be forced to type date.
- Returns:
- the range.
-
setDateAttributes
Set the attribute range to be forced to type date.
- Parameters:
value
- the range
-
dateAttributesTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
enclosureCharactersTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getEnclosureCharacters
Get the character(s) to use/recognize as string enclosures.
- Returns:
- the characters to use as string enclosures
-
setEnclosureCharacters
Set the character(s) to use/recognize as string enclosures.
- Parameters:
enclosure
- the characters to use as string enclosures
-
getFieldSeparator
Returns the character used as column separator.
- Returns:
- the character to use
-
setFieldSeparator
Sets the character used as column separator.
- Parameters:
value
- the character to use
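To show how a field separator and enclosure characters interact, the sketch below splits one line into fields while treating separators inside an enclosure as ordinary text. It is a simplified illustration with hypothetical names (FieldSplitSketch, split), it ignores escape sequences, and it is not how CSVLoader itself tokenizes input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Weka code. */
final class FieldSplitSketch {

    /**
     * Splits line on separator, except inside a pair of matching enclosure
     * characters. Every character of enclosures is treated as a possible quote.
     */
    static List<String> split(String line, char separator, String enclosures) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char open = 0; // 0 means "not inside an enclosure"
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (open != 0) {
                if (c == open) {
                    open = 0; // closing quote
                } else {
                    current.append(c); // separators are literal here
                }
            } else if (enclosures.indexOf(c) >= 0) {
                open = c; // opening quote
            } else if (c == separator) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",'d e'", ',', "\"'"));
    }
}
```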
-
fieldSeparatorTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getBufferSize
public int getBufferSize()
Get the buffer size to use, i.e. the number of rows to load and process in memory at any one time.
- Returns:
- the buffer size (number of rows)
-
setBufferSize
public void setBufferSize(int buff)
Set the buffer size to use, i.e. the number of rows to load and process in memory at any one time.
- Parameters:
buff
- the buffer size (number of rows)
-
bufferSizeTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
getNominalLabelSpecs
Get label specifications for nominal attributes.
- Returns:
- an array of label specifications
-
setNominalLabelSpecs
Set label specifications for nominal attributes.
- Parameters:
specs
- an array of label specifications
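The -L option describes a label specification as two parts separated by a ":", for example "1,2,4-6:red,green,blue". The sketch below splits such a string into its attribute part and its list of labels. The names (LabelSpecSketch, parse) are hypothetical, and splitting at the first colon is an assumption of this example; it says nothing about what element type setNominalLabelSpecs expects in its Object[] argument.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Weka code. */
final class LabelSpecSketch {

    final String attributePart; // range of indexes or comma-separated names
    final List<String> labels;  // legal labels, in the order given

    private LabelSpecSketch(String attributePart, List<String> labels) {
        this.attributePart = attributePart;
        this.labels = labels;
    }

    /** Splits "attributes:label1,label2,..." at the first colon (assumed rule). */
    static LabelSpecSketch parse(String spec) {
        int colon = spec.indexOf(':');
        if (colon <= 0 || colon == spec.length() - 1) {
            throw new IllegalArgumentException("Expected <attributes>:<labels> but got: " + spec);
        }
        List<String> labels = new ArrayList<>();
        for (String label : spec.substring(colon + 1).split(",")) {
            labels.add(label.trim());
        }
        return new LabelSpecSketch(spec.substring(0, colon).trim(), labels);
    }

    public static void main(String[] args) {
        LabelSpecSketch spec = parse("att1,att2:red,green,blue");
        System.out.println(spec.attributePart + " -> " + spec.labels);
    }
}
```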
-
nominalLabelSpecsTipText
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
listOptions
Description copied from interface: OptionHandler
Returns an enumeration of all the available options.
- Specified by:
listOptions
in interface OptionHandler
- Returns:
- an enumeration of all available options.
-
getOptions
Description copied from interface: OptionHandler
Gets the current option settings for the OptionHandler.
- Specified by:
getOptions
in interface OptionHandler
- Returns:
- the list of current option settings as an array of strings
-
setOptions
Description copied from interface: OptionHandler
Sets the OptionHandler's options using the given list. All options will be set (or reset) during this call (i.e. incremental setting of options is not possible).
- Specified by:
setOptions
in interface OptionHandler
- Parameters:
options
- the list of options as an array of strings
- Throws:
Exception
- if an option is not supported
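Because every option is set or reset in a single call, the complete option list has to be assembled before it is handed over. The sketch below builds such an array from the flags listed under "Valid options are:" (-H, -F, -M, -B). The helper and class names are hypothetical, and the call to setOptions appears only in a comment so that the example does not depend on Weka being on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not Weka code. */
final class CsvOptionsSketch {

    /** Builds a complete option array; flags not added here would revert to defaults. */
    static String[] build(boolean noHeader, String separator, String missing, int bufferRows) {
        List<String> options = new ArrayList<>();
        if (noHeader) {
            options.add("-H"); // flag without an argument
        }
        options.add("-F");
        options.add(separator);
        options.add("-M");
        options.add(missing);
        options.add("-B");
        options.add(Integer.toString(bufferRows));
        return options.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] options = build(true, ";", "NA", 500);
        // With Weka available, this array would be passed to setOptions(String[]).
        System.out.println(Arrays.toString(options));
    }
}
```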
-
getNextInstance
Description copied from interface: Loader
Read the data set incrementally: get the next instance in the data set, or return null if there are no more instances to get. If the structure hasn't yet been determined by a call to getStructure then the method should do so before returning the next instance in the data set. If it is not possible to read the data set incrementally (i.e. in cases where the data set structure cannot be fully established before all instances have been seen) then an exception should be thrown.
- Specified by:
getNextInstance
in interface Loader
- Specified by:
getNextInstance
in class AbstractLoader
- Parameters:
structure
- the dataset header information, will get updated in case of string or relational attributes
- Returns:
- the next instance in the data set as an Instance object or null if there are no more instances to be read
- Throws:
IOException
- if there is an error during parsing or if getDataSet has been called on this source (either incremental or batch loading can be used, not both).
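The rule that a source can be read either incrementally or in batch, but not both, can be pictured as a small guard around the retrieval mode. The sketch below is one hypothetical way to express that rule. The class RetrievalGuardSketch and its methods are invented for illustration; only the mode names NONE, BATCH and INCREMENTAL echo the Loader fields listed in the Field Summary.

```java
import java.io.IOException;

/** Hypothetical illustration only; this is not Weka code. */
final class RetrievalGuardSketch {

    enum Mode { NONE, BATCH, INCREMENTAL }

    private Mode mode = Mode.NONE;

    /** Records the requested mode (BATCH or INCREMENTAL); fails if the other is already in use. */
    void claim(Mode requested) throws IOException {
        if (mode != Mode.NONE && mode != requested) {
            throw new IOException("Cannot mix " + mode + " and " + requested + " loading on one source");
        }
        mode = requested;
    }

    /** Forgets the mode, as a reset before reading a new data set would. */
    void reset() {
        mode = Mode.NONE;
    }

    Mode current() {
        return mode;
    }

    public static void main(String[] args) throws IOException {
        RetrievalGuardSketch guard = new RetrievalGuardSketch();
        guard.claim(Mode.INCREMENTAL); // e.g. before reading the next instance
        guard.claim(Mode.INCREMENTAL); // repeated incremental reads are fine
        System.out.println(guard.current());
    }
}
```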
-
getDataSet
Description copied from interface: Loader
Return the full data set. If the structure hasn't yet been determined by a call to getStructure then the method should do so before processing the rest of the data set.
- Specified by:
getDataSet
in interface Loader
- Specified by:
getDataSet
in class AbstractLoader
- Returns:
- the full data set as an Instances object
- Throws:
IOException
- if there is an error during parsing or if getNextInstance has been called on this source (either incremental or batch loading can be used, not both).
public_normal_behavior
  requires: model_sourceSupplied == true && (* successful parse *);
  modifiable: model_structureDetermined;
  ensures: \result != null && \result.numInstances() >= 0 && model_structureDetermined == true;
also public_exceptional_behavior
  requires: model_sourceSupplied == false || (* unsuccessful parse *);
  signals: (IOException);
-
setSource
Resets the Loader object and sets the source of the data set to be the supplied Stream object.
- Specified by:
setSource
in interface Loader
- Overrides:
setSource
in class AbstractLoader
- Parameters:
input
- the input stream
- Throws:
IOException
- if an error occurs
-
setSource
Resets the Loader object and sets the source of the data set to be the supplied File object.
- Specified by:
setSource
in interface Loader
- Overrides:
setSource
in class AbstractFileLoader
- Parameters:
file
- the source file.
- Throws:
IOException
- if an error occurs
-
getStructure
Description copied from interface: Loader
Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
- Specified by:
getStructure
in interface Loader
- Specified by:
getStructure
in class AbstractLoader
- Returns:
- the structure of the data set as an empty set of Instances
- Throws:
IOException
- if there is no source or parsing fails
public_normal_behavior
  requires: model_sourceSupplied == true && model_structureDetermined == false && (* successful parse *);
  modifiable: model_structureDetermined;
  ensures: \result != null && \result.numInstances() == 0 && model_structureDetermined == true;
also public_exceptional_behavior
  requires: model_sourceSupplied == false || (* unsuccessful parse *);
  signals: (IOException);
-
reset
Description copied from class: AbstractFileLoader
Resets the loader ready to read a new data set.
- Specified by:
reset
in interface Loader
- Overrides:
reset
in class AbstractFileLoader
- Throws:
IOException
- if something goes wrong
-