Class CSVLoader

All Implemented Interfaces:
Serializable, BatchConverter, FileSourcedConverter, IncrementalConverter, Loader, EnvironmentHandler, OptionHandler, RevisionHandler

public class CSVLoader extends AbstractFileLoader implements BatchConverter, IncrementalConverter, OptionHandler
Reads a source that is in comma-separated format (the default). One can also change the column separator from comma to tab or another character, specify string enclosures, specify whether a header row is present or not, and specify which attributes are to be forced to be nominal or date. Can operate in batch or incremental mode. In batch mode, a buffer is used to process a fixed number of rows in memory at any one time and the data is dumped to a temporary file. This allows the legal values for nominal attributes to be automatically determined. The final ARFF file is produced in a second pass over the temporary file using the structure determined on the first pass. In incremental mode, the first buffer full of rows is used to determine the structure automatically. Following this, all rows are read and output incrementally. An error will occur if a row containing nominal values not seen in the initial buffer is encountered. In this case, the size of the initial buffer can be increased, or the user can explicitly provide the legal values of all nominal attributes using the -L (setNominalLabelSpecs) option.
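
The incremental-mode failure described above can be illustrated with a small, self-contained sketch. It does not use or reproduce CSVLoader's implementation; the class IncrementalLabelSketch and its methods are hypothetical names, and every column is treated as nominal for simplicity. It learns the legal labels from the first bufferSize rows and then reports the first later row that carries a value the buffer never showed.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: learn labels from an initial buffer, then check later rows. */
class IncrementalLabelSketch {

    /** Collects the distinct values seen per column in the first bufferSize rows. */
    static List<Set<String>> labelsFromBuffer(List<String[]> rows, int bufferSize) {
        List<Set<String>> labels = new ArrayList<>();
        int limit = Math.max(0, Math.min(bufferSize, rows.size()));
        for (int r = 0; r < limit; r++) {
            String[] row = rows.get(r);
            for (int c = 0; c < row.length; c++) {
                if (labels.size() <= c) {
                    labels.add(new LinkedHashSet<>());
                }
                labels.get(c).add(row[c]);
            }
        }
        return labels;
    }

    /** Returns the index of the first row after the buffer with an unseen value, or -1. */
    static int firstUnseenRow(List<String[]> rows, int bufferSize) {
        List<Set<String>> labels = labelsFromBuffer(rows, bufferSize);
        int start = Math.max(0, Math.min(bufferSize, rows.size()));
        for (int r = start; r < rows.size(); r++) {
            String[] row = rows.get(r);
            for (int c = 0; c < row.length; c++) {
                if (c >= labels.size() || !labels.get(c).contains(row[c])) {
                    return r;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"red", "yes"},
            new String[] {"green", "no"},
            new String[] {"blue", "yes"});
        // With a 2-row buffer, "blue" in the third row was never seen.
        System.out.println(firstUnseenRow(rows, 2));
        // With a 3-row buffer, every label is known up front.
        System.out.println(firstUnseenRow(rows, 3));
    }
}
```

In the sketch, enlarging the buffer removes the problem, which mirrors the two remedies named in the description: a larger initial buffer, or labels supplied up front.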

Valid options are:

 -H
  No header row present in the data.
 -N <range>
  The range of attributes to force type to be NOMINAL.
  'first' and 'last' are accepted as well.
  Examples: "first-last", "1,4,5-27,50-last"
  (default: -none-)
 -L <nominal label spec>
  Optional specification of legal labels for nominal
  attributes. May be specified multiple times.
  Batch mode can determine this
  automatically (and so can incremental mode if
  the first in memory buffer load of instances
  contains an example of each legal value). The
  spec contains two parts separated by a ":". The
  first part can be a range of attribute indexes or
  a comma-separated list of attribute names; the
  second part is a comma-separated list of labels. E.g.
  "1,2,4-6:red,green,blue" or "att1,att2:red,green,blue"
 -S <range>
  The range of attributes to force type to be STRING.
  'first' and 'last' are accepted as well.
  Examples: "first-last", "1,4,5-27,50-last"
  (default: -none-)
 -D <range>
  The range of attributes to force type to be DATE.
  'first' and 'last' are accepted as well.
  Examples: "first-last", "1,4,5-27,50-last"
  (default: -none-)
 -format <date format>
  The date formatting string to use to parse date values.
  (default: "yyyy-MM-dd'T'HH:mm:ss")
 -R <range>
  The range of attributes to force type to be NUMERIC.
  'first' and 'last' are accepted as well.
  Examples: "first-last", "1,4,5-27,50-last"
  (default: -none-)
 -M <str>
  The string representing a missing value.
  (default: ?)
 -F <separator>
  The field separator to be used.
  '\t' can be used as well.
  (default: ',')
 -E <enclosures>
  The enclosure character(s) to use for strings.
  Specify as a comma separated list (e.g. ",').
  (default: ",')
 -B <num>
  The size of the in memory buffer (in rows).
  (default: 100)
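
The range syntax shared by -N, -S, -D and -R ("first-last", "1,4,5-27,50-last") can be read as a list of positions and inclusive spans. The following is a minimal sketch of that reading, not the parser CSVLoader uses. It assumes positions are 1-based, as the examples suggest, and that 'first' and 'last' are written in lower case; RangeSpecSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: expands a range spec into 0-based attribute indices. */
class RangeSpecSketch {

    /** Expands a spec such as "1,4,5-27,50-last" for a data set with numAttributes columns. */
    static List<Integer> expand(String spec, int numAttributes) {
        List<Integer> indices = new ArrayList<>();
        for (String part : spec.split(",")) {
            String p = part.trim();
            if (p.isEmpty()) {
                continue;
            }
            int dash = p.indexOf('-');
            int lo;
            int hi;
            if (dash < 0) {
                lo = position(p, numAttributes);
                hi = lo;
            } else {
                lo = position(p.substring(0, dash), numAttributes);
                hi = position(p.substring(dash + 1), numAttributes);
            }
            if (lo > hi) {
                throw new IllegalArgumentException("Descending span: " + p);
            }
            for (int i = lo; i <= hi; i++) {
                indices.add(i - 1); // convert 1-based position to 0-based index
            }
        }
        return indices;
    }

    /** Resolves "first", "last" or a 1-based number to a 1-based position. */
    private static int position(String token, int numAttributes) {
        String t = token.trim();
        int pos;
        if (t.equals("first")) {
            pos = 1;
        } else if (t.equals("last")) {
            pos = numAttributes;
        } else {
            pos = Integer.parseInt(t);
        }
        if (pos < 1 || pos > numAttributes) {
            throw new IllegalArgumentException("Position out of bounds: " + token);
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(expand("1,3-5,7-last", 8));
        System.out.println(expand("first-last", 3));
    }
}
```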
Version:
$Revision: 14115 $
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com)
  • Field Details

    • FILE_EXTENSION

      public static String FILE_EXTENSION
      the file extension.
  • Constructor Details

    • CSVLoader

      public CSVLoader()
      default constructor.
  • Method Details

    • main

      public static void main(String[] args)
      Main method.
      Parameters:
      args - should contain the name of an input file.
    • globalInfo

      public String globalInfo()
      Returns a string describing this loader.
      Returns:
      a description of the loader suitable for displaying in the explorer/experimenter gui
    • getFileExtension

      public String getFileExtension()
      Description copied from interface: FileSourcedConverter
      Get the file extension used for this type of file
      Specified by:
      getFileExtension in interface FileSourcedConverter
      Returns:
      the file extension
    • getFileExtensions

      public String[] getFileExtensions()
      Description copied from interface: FileSourcedConverter
      Gets all the file extensions used for this type of file
      Specified by:
      getFileExtensions in interface FileSourcedConverter
      Returns:
      the file extensions
    • getFileDescription

      public String getFileDescription()
      Description copied from interface: FileSourcedConverter
      Get a one line description of the type of file
      Specified by:
      getFileDescription in interface FileSourcedConverter
      Returns:
      a description of the file type
    • getRevision

      public String getRevision()
      Description copied from interface: RevisionHandler
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Returns:
      the revision
    • noHeaderRowPresentTipText

      public String noHeaderRowPresentTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNoHeaderRowPresent

      public boolean getNoHeaderRowPresent()
      Get whether there is no header row in the data.
      Returns:
      true if there is no header row in the data
    • setNoHeaderRowPresent

      public void setNoHeaderRowPresent(boolean b)
      Set whether there is no header row in the data.
      Parameters:
      b - true if there is no header row in the data
    • getMissingValue

      public String getMissingValue()
      Returns the current placeholder for missing values.
      Returns:
      the placeholder
    • setMissingValue

      public void setMissingValue(String value)
      Sets the placeholder for missing values.
      Parameters:
      value - the placeholder
    • missingValueTipText

      public String missingValueTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getStringAttributes

      public String getStringAttributes()
      Returns the current attribute range to be forced to type string.
      Returns:
      the range
    • setStringAttributes

      public void setStringAttributes(String value)
      Sets the attribute range to be forced to type string.
      Parameters:
      value - the range
    • stringAttributesTipText

      public String stringAttributesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNominalAttributes

      public String getNominalAttributes()
      Returns the current attribute range to be forced to type nominal.
      Returns:
      the range
    • setNominalAttributes

      public void setNominalAttributes(String value)
      Sets the attribute range to be forced to type nominal.
      Parameters:
      value - the range
    • nominalAttributesTipText

      public String nominalAttributesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNumericAttributes

      public String getNumericAttributes()
      Gets the attribute range to be forced to type numeric
      Returns:
      the range
    • setNumericAttributes

      public void setNumericAttributes(String value)
      Sets the attribute range to be forced to type numeric
      Parameters:
      value - the range
    • numericAttributesTipText

      public String numericAttributesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getDateFormat

      public String getDateFormat()
      Get the format to use for parsing date values.
      Returns:
      the format to use for parsing date values.
    • setDateFormat

      public void setDateFormat(String value)
      Set the format to use for parsing date values.
      Parameters:
      value - the format to use.
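
The documented default, "yyyy-MM-dd'T'HH:mm:ss", reads as a java.text.SimpleDateFormat pattern. The sketch below assumes that interpretation and shows how such a pattern turns a cell value into a timestamp. It is independent of CSVLoader: DateFormatSketch is a hypothetical name, and fixing the time zone to UTC is a choice made here only to keep the result deterministic.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

/** Illustrative only: parses a date string with a SimpleDateFormat-style pattern. */
class DateFormatSketch {

    /** The default pattern listed for the -format option. */
    static final String DEFAULT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    /** Parses value with pattern and returns milliseconds since the epoch (UTC). */
    static long toEpochMillisUtc(String value, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject values such as month 13
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(value).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + value + "' with " + pattern, e);
        }
    }

    public static void main(String[] args) {
        // One day after the epoch, written in the default format.
        System.out.println(toEpochMillisUtc("1970-01-02T00:00:00", DEFAULT_PATTERN));
        // The same instant with a custom day/month/year pattern.
        System.out.println(toEpochMillisUtc("02/01/1970", "dd/MM/yyyy"));
    }
}
```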
    • dateFormatTipText

      public String dateFormatTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getDateAttributes

      public String getDateAttributes()
      Returns the current attribute range to be forced to type date.
      Returns:
      the range.
    • setDateAttributes

      public void setDateAttributes(String value)
      Set the attribute range to be forced to type date.
      Parameters:
      value - the range
    • dateAttributesTipText

      public String dateAttributesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • enclosureCharactersTipText

      public String enclosureCharactersTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getEnclosureCharacters

      public String getEnclosureCharacters()
      Get the character(s) to use/recognize as string enclosures
      Returns:
      the characters to use as string enclosures
    • setEnclosureCharacters

      public void setEnclosureCharacters(String enclosure)
      Set the character(s) to use/recognize as string enclosures
      Parameters:
      enclosure - the characters to use as string enclosures
    • getFieldSeparator

      public String getFieldSeparator()
      Returns the character used as column separator.
      Returns:
      the character to use
    • setFieldSeparator

      public void setFieldSeparator(String value)
      Sets the character used as column separator.
      Parameters:
      value - the character to use
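
To make the interplay of field separator and enclosure characters concrete, here is a minimal sketch of splitting one line so that separators inside an enclosed string are kept as data. It is not CSVLoader's tokenizer and ignores details such as escaped enclosures. FieldSplitSketch is a hypothetical name, and the enclosure characters are passed as a plain string of characters rather than the comma separated list the -E option expects.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a line on a separator while honouring enclosures. */
class FieldSplitSketch {

    /** Splits line on separator; text between matching enclosure characters is kept intact. */
    static List<String> split(String line, char separator, String enclosures) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char open = 0; // the enclosure currently open, or 0 when outside any enclosure
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (open != 0) {
                if (ch == open) {
                    open = 0; // closing enclosure: drop it
                } else {
                    current.append(ch); // separators in here are ordinary data
                }
            } else if (enclosures.indexOf(ch) >= 0) {
                open = ch; // opening enclosure: drop it
            } else if (ch == separator) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Comma separator with both default enclosure characters.
        System.out.println(split("a,'b,c',\"d\"", ',', "\"'"));
        // Tab separator, as selected by '\t' on the command line.
        System.out.println(split("x\ty", '\t', "\"'"));
    }
}
```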
    • fieldSeparatorTipText

      public String fieldSeparatorTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getBufferSize

      public int getBufferSize()
      Get the buffer size to use - i.e. the number of rows to load and process in memory at any one time
      Returns:
      the buffer size (number of rows)
    • setBufferSize

      public void setBufferSize(int buff)
      Set the buffer size to use - i.e. the number of rows to load and process in memory at any one time
      Parameters:
      buff - the buffer size (number of rows)
    • bufferSizeTipText

      public String bufferSizeTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNominalLabelSpecs

      public Object[] getNominalLabelSpecs()
      Get label specifications for nominal attributes.
      Returns:
      an array of label specifications
    • setNominalLabelSpecs

      public void setNominalLabelSpecs(Object[] specs)
      Set label specifications for nominal attributes.
      Parameters:
      specs - an array of label specifications
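
The -L option describes each label specification as two parts separated by a ":", for example "1,2,4-6:red,green,blue". The sketch below shows one way to take such a string apart. It is not CSVLoader's own parsing: NominalLabelSpecSketch is a hypothetical name, and splitting at the last colon is an assumption made here so that the attribute part stays intact if it happens to contain a colon.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits "attributes:label1,label2,..." into its two parts. */
class NominalLabelSpecSketch {

    /** The attribute part, either a range or a comma-separated list of names. */
    final String attributes;
    /** The legal labels, in the order given. */
    final List<String> labels;

    private NominalLabelSpecSketch(String attributes, List<String> labels) {
        this.attributes = attributes;
        this.labels = labels;
    }

    /** Parses a spec such as "1,2,4-6:red,green,blue". */
    static NominalLabelSpecSketch parse(String spec) {
        int colon = spec.lastIndexOf(':'); // assumption: labels contain no ':'
        if (colon <= 0 || colon == spec.length() - 1) {
            throw new IllegalArgumentException("Expected '<attributes>:<labels>' but got: " + spec);
        }
        String attributePart = spec.substring(0, colon).trim();
        List<String> labelList = new ArrayList<>();
        for (String label : spec.substring(colon + 1).split(",")) {
            String trimmed = label.trim();
            if (!trimmed.isEmpty()) {
                labelList.add(trimmed);
            }
        }
        return new NominalLabelSpecSketch(attributePart, labelList);
    }

    public static void main(String[] args) {
        NominalLabelSpecSketch byIndex = parse("1,2,4-6:red,green,blue");
        System.out.println(byIndex.attributes + " -> " + byIndex.labels);
        NominalLabelSpecSketch byName = parse("att1,att2:red,green,blue");
        System.out.println(byName.attributes + " -> " + byName.labels);
    }
}
```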
    • nominalLabelSpecsTipText

      public String nominalLabelSpecsTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • listOptions

      public Enumeration<Option> listOptions()
      Description copied from interface: OptionHandler
      Returns an enumeration of all the available options.
      Specified by:
      listOptions in interface OptionHandler
      Returns:
      an enumeration of all available options.
    • getOptions

      public String[] getOptions()
      Description copied from interface: OptionHandler
      Gets the current option settings for the OptionHandler.
      Specified by:
      getOptions in interface OptionHandler
      Returns:
      the list of current option settings as an array of strings
    • setOptions

      public void setOptions(String[] options) throws Exception
      Description copied from interface: OptionHandler
      Sets the OptionHandler's options using the given list. All options will be set (or reset) during this call (i.e. incremental setting of options is not possible).
      Specified by:
      setOptions in interface OptionHandler
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
    • getNextInstance

      public Instance getNextInstance(Instances structure) throws IOException
      Description copied from interface: Loader
      Reads the data set incrementally: gets the next instance in the data set, or returns null if there are no more instances to get. If the structure hasn't yet been determined by a call to getStructure, then the method should do so before returning the next instance in the data set. If it is not possible to read the data set incrementally (i.e. in cases where the data set structure cannot be fully established before all instances have been seen), then an exception should be thrown.
      Specified by:
      getNextInstance in interface Loader
      Specified by:
      getNextInstance in class AbstractLoader
      Parameters:
      structure - the dataset header information, will get updated in case of string or relational attributes
      Returns:
      the next instance in the data set as an Instance object or null if there are no more instances to be read
      Throws:
      IOException - if there is an error during parsing or if getDataSet has been called on this source (either incremental or batch loading can be used, not both).
    • getDataSet

      public Instances getDataSet() throws IOException
      Description copied from interface: Loader
      Return the full data set. If the structure hasn't yet been determined by a call to getStructure then the method should do so before processing the rest of the data set.
      Specified by:
      getDataSet in interface Loader
      Specified by:
      getDataSet in class AbstractLoader
      Returns:
      the full data set as an Instances object
      Throws:
      IOException - if there is an error during parsing or if getNextInstance has been called on this source (either incremental or batch loading can be used, not both).
       
          public_normal_behavior
            requires: model_sourceSupplied == true
                      && (* successful parse *);
            modifiable: model_structureDetermined;
            ensures: \result != null
                     && \result.numInstances() >= 0
                     && model_structureDetermined == true;
        also
          public_exceptional_behavior
            requires: model_sourceSupplied == false
                      || (* unsuccessful parse *);
            signals: (IOException);
       
       
    • setSource

      public void setSource(InputStream input) throws IOException
      Resets the Loader object and sets the source of the data set to be the supplied Stream object.
      Specified by:
      setSource in interface Loader
      Overrides:
      setSource in class AbstractLoader
      Parameters:
      input - the input stream
      Throws:
      IOException - if an error occurs
    • setSource

      public void setSource(File file) throws IOException
      Resets the Loader object and sets the source of the data set to be the supplied File object.
      Specified by:
      setSource in interface Loader
      Overrides:
      setSource in class AbstractFileLoader
      Parameters:
      file - the source file.
      Throws:
      IOException - if an error occurs
    • getStructure

      public Instances getStructure() throws IOException
      Description copied from interface: Loader
      Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
      Specified by:
      getStructure in interface Loader
      Specified by:
      getStructure in class AbstractLoader
      Returns:
      the structure of the data set as an empty set of Instances
      Throws:
      IOException - if there is no source or parsing fails
       
          public_normal_behavior
            requires: model_sourceSupplied == true
                      && model_structureDetermined == false
                      && (* successful parse *);
            modifiable: model_structureDetermined;
            ensures: \result != null
                     && \result.numInstances() == 0
                     && model_structureDetermined == true;
        also
          public_exceptional_behavior
            requires: model_sourceSupplied == false
                      || (* unsuccessful parse *);
            signals: (IOException);
       
       
    • reset

      public void reset() throws IOException
      Description copied from class: AbstractFileLoader
      Resets the loader ready to read a new data set
      Specified by:
      reset in interface Loader
      Overrides:
      reset in class AbstractFileLoader
      Throws:
      IOException - if something goes wrong