CsvDataContext

The Apache MetaModel CSV module is one of the most advanced implementations around, considering how simple a file format CSV is. The implementation's main features are:

  • Full implementation of DataContext and UpdateableDataContext.
  • Streaming query support without memory leaks, tested on billion-record data sets.
  • Support for parallelized row parsing when multiline values are turned OFF. In that case the Row objects served for queries are not parsed up front, so consuming the data can itself be parallelized.
  • Support for sample-based COUNT queries when the query's COUNT select item has the "allow function approximation" flag set. This lets applications get a quick approximation of the number of rows, even in a really big file.
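
To illustrate the last point, here is a sketch of an approximate COUNT query. The class name and sample file are made up for the example; the setFunctionApproximationAllowed call is the "allow function approximation" flag referred to above:

```java
import java.io.File;
import java.io.PrintWriter;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.query.Query;
import org.apache.metamodel.schema.Table;

public class ApproximateCountExample {
    public static void main(String[] args) throws Exception {
        // Write a small sample file (a stand-in for a really big one)
        File file = new File("sample.csv");
        try (PrintWriter out = new PrintWriter(file, "UTF-8")) {
            out.println("name,age");
            out.println("John Doe,42");
            out.println("Jane Doe,42");
        }

        DataContext dataContext = new CsvDataContext(file);
        Table table = dataContext.getDefaultSchema().getTable(0);

        Query query = dataContext.query().from(table).selectCount().toQuery();
        // Opting in to approximation allows the CSV module to estimate
        // the row count from a sample instead of scanning the whole file.
        query.getSelectClause().getItem(0).setFunctionApproximationAllowed(true);

        try (DataSet ds = dataContext.executeQuery(query)) {
            ds.next();
            System.out.println("Approximate row count: " + ds.getRow().getValue(0));
        }
    }
}
```

For a file this small the approximation will simply match the exact count; the sampling pays off on large files.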

Creating from plain old Java code - CsvDataContext

This is really simple:

Resource csvResource = new FileResource("/path/to/my/file.csv");
CsvConfiguration configuration = new CsvConfiguration(
  // arguments here to fit the resource
);
 
DataContext dataContext = new CsvDataContext(csvResource, configuration);
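
The arguments to CsvConfiguration depend on your file. A typical setup for a comma-separated, UTF-8 encoded file with column names on line 1 might look like this sketch (the class name and file path are just examples):

```java
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvConfiguration;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.util.FileResource;
import org.apache.metamodel.util.Resource;

public class CsvConfigurationExample {
    public static void main(String[] args) {
        Resource csvResource = new FileResource("/path/to/my/file.csv");
        // Column names on line 1, UTF-8 encoding, ',' separator,
        // '"' quote character and '\' escape character.
        CsvConfiguration configuration = new CsvConfiguration(
                1, "UTF-8", ',', '"', '\\');
        DataContext dataContext = new CsvDataContext(csvResource, configuration);
        System.out.println("Header line: " + configuration.getColumnNameLineNumber());
    }
}
```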

Creating from properties - CsvDataContextFactory

If you wish to construct your CSV DataContext from properties, this is also possible. For instance:

final DataContextPropertiesImpl properties = new DataContextPropertiesImpl();
properties.put("type", "csv");
properties.put("resource", "/path/to/my/file.csv");

DataContext dataContext = DataContextFactoryRegistryImpl.getDefaultInstance().createDataContext(properties);

The relevant properties for this type of instantiation are:

  • type (required, e.g. 'csv'): Must be set to 'csv' or else another type of DataContext will be constructed.
  • resource (required, e.g. '/data/stuff.csv'): Must reference the resource path to read/write CSV data from/to.
  • quote-char (e.g. "): The enclosing quote character to use for values in the CSV file.
  • separator-char (e.g. ,): The separator character to use for separating values in the CSV file.
  • escape-char (e.g. \): The escape character to use for escaping CSV parsing of special characters.
  • encoding (e.g. UTF-8): The character set encoding of the data.
  • column-name-line-number (e.g. 1): The line number which holds the column names / headers.
  • fail-on-inconsistent-row-length (e.g. true): Whether or not to fail (throw an exception) on inconsistent row lengths, or to suppress these parsing issues.
  • multiline-values (e.g. false): Whether or not the data contains values spanning multiple lines (if this never happens, a faster parsing approach can be applied).
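
For example, to read a semicolon-separated, Latin-1 encoded file, the properties approach shown above might be extended like this (the class name and file path are just examples):

```java
import org.apache.metamodel.DataContext;
import org.apache.metamodel.factory.DataContextFactoryRegistryImpl;
import org.apache.metamodel.factory.DataContextPropertiesImpl;

public class CsvPropertiesExample {
    public static void main(String[] args) {
        DataContextPropertiesImpl properties = new DataContextPropertiesImpl();
        properties.put("type", "csv");
        properties.put("resource", "/data/stuff.csv");
        // Non-default parsing options for a semicolon-separated file
        properties.put("separator-char", ";");
        properties.put("encoding", "ISO-8859-1");
        properties.put("column-name-line-number", "1");
        properties.put("multiline-values", "false");

        DataContext dataContext = DataContextFactoryRegistryImpl
                .getDefaultInstance().createDataContext(properties);
        System.out.println("Created: " + (dataContext != null));
    }
}
```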

Updating CSV data

Modifying CSV data is done just like with any other MetaModel module - by implementing an update script that is then submitted to the UpdateableDataContext's executeUpdate(...) method. This approach guarantees isolation and coherence in all update operations. Here is a simple example:

File myFile = new File("unexisting_file.csv");

UpdateableDataContext dataContext = DataContextFactory.createCsvDataContext(myFile);
final Schema schema = dataContext.getDefaultSchema();
dataContext.executeUpdate(new UpdateScript() {
  public void run(UpdateCallback callback) {

    // CREATING A TABLE
    Table table = callback.createTable(schema, "my_table")
      .withColumn("name").ofType(VARCHAR)
      .withColumn("gender").ofType(CHAR)
      .withColumn("age").ofType(INTEGER)
      .execute();
 
    // INSERTING SOME ROWS
    callback.insertInto(table).value("name","John Doe").value("gender",'M').value("age",42).execute();
    callback.insertInto(table).value("name","Jane Doe").value("gender",'F').value("age",42).execute();
  }
});

If you just want to insert or update a single record, you can skip the UpdateScript implementation and use the pre-built InsertInto, Update or DeleteFrom classes. Beware, though, that you then don't have any transaction boundaries or isolation between those calls:

Table table = schema.getTableByName("my_table");
dataContext.executeUpdate(new InsertInto(table).value("name", "Polly the Sheep").value("age", -1));
dataContext.executeUpdate(new Update(table).where("name").eq("Polly the Sheep").value("age", 10));
dataContext.executeUpdate(new DeleteFrom(table).where("name").eq("Polly the Sheep"));

... And just to go full circle, here's how you can continue to explore the data:  

System.out.println("Columns: " + Arrays.toString(table.getColumnNames()));
DataSet ds = dataContext.query().from(table).select(table.getColumns()).orderBy(table.getColumnByName("name")).execute();
while (ds.next()) {
   System.out.println("Row: " + Arrays.toString(ds.getRow().getValues()));
}
ds.close();
This snippet will print out:  

Columns: [name, gender, age]
Row: [Jane Doe,F,42]
Row: [John Doe,M,42]