The Kaptain on … stuff

01 Jul, 2010

Groovy and CSV: How to Get Your Data Out?

Posted by: TheKaptain In: Development

I don’t know exactly how many CSV files I’ve read/written to date, but I’m willing to bet it’s a lot. These kind of files are a simple and common way to exchange data and are interoperable with spreadsheet programs as well, making them more easily accessible to non-programmer types. There is some excellent support out there for reading and writing CSV files and so here, from simplest to most complex, are a few different ways to load and query the data trapped inside your CSV files. But first off…

CSV is Not XML

Or any other suitably constrained data format for that matter. There are some pretty clear ‘rules’ out there for transporting data in this format, but of course hardly anybody listens to them. I found this and this in the first page of a quick Google search.
However, while looking for available sample material on the web to demonstrate with, I found many great examples of how to screw it up: not quoting empty strings, not escaping embedded line breaks, commas embedded in unquoted text, etc. All of these are things that prevent automation from properly interacting with the data. There are also, of course, the CSV ‘formats’ that insist on embedding comments and descriptions, trying to present more than one matrix of information or any of the other crazy things you can do when writing to a format with no real metadata layer.

< \rant>

Unfortunately, there’s nothing as convenient as an xml schema available for describing what information is encoded in the document and, in general, the only potentially useful metadata in an individual file is the header line. The good news is that all of these deficiencies can be overcome if you know they exist. And there’s some pretty nifty tools you can use to take out much of the guesswork. I’ll suggest a couple of ideas for how to deal with files that don’t adhere to the ‘rules’, but the main example I’ve decided on is a straightforward and well formed set representing the periodic table of elements, provided generously by Akiscode. This file is directly machine readable(has no headers or embedded data), quotes all values and just follows all the ‘rules’. Here’s the first five lines as an example.
[table id=4 /]

Groovy All by Itself

Groovy makes dealing with String objects pretty painless. By adding facilities like easy casting and find()/findAll() for quickly turning raw Strings into real data, you can program some fairly complex questions. In this simple example I’m depending on all values being quoted and I am not protecting against casting problems. I’m running this through Maven which really helps to keep test data organized; all you have to do is drop your test file in /src/test/resources and it’s automagically available on the test classpath.
No specific handling is done for different types here; everything is read in as a String and cast to a more specific type as needed. In a ‘real’ application you’re very likely going to encounter mixed alphanumeric data in any given column of data – ‘UNKNOWN’ instead of a number in a column labeled ‘Quantity’ for instance – and be more careful about casting.

[groovy language=”true”]
//load and split the file
InputStream inputFile = getClass().classLoader.getResourceAsStream(TEST_FILE_NAME)
String[] lines = inputFile.text.split(‘\n’)
List<String[]> rows = lines.collect {it.split(‘,’)}

/**
* A little helper method to get rid of the quotes in the input
* and cast values so they can be compared properly.
*/
private double castToDouble(string)
{
return string.replaceAll(‘"’, ”).toDouble()
}

//OK, it’s parsed – let’s ask some questions
private static final int ATOMIC_MASS = 1;

def elementsOver200Mass = rows.findAll {castToDouble(it[ATOMIC_MASS]) > 200}
def elementsBetween10And20 = rows.findAll { row ->
double mass = castToDouble(row[ATOMIC_MASS])
mass <= 20 && mass >= 10
}
[/groovy]

So this approach works fine for well formed input, but falls apart quickly in other cases. For instance, if any columns in the file are missing values the split() function treats them as nulls, leading to different size arrays being stored in the list. Similarly you need to deal with quoted VS non-quoted content, embedded line breaks and other issues individually. Fortunately, some nice guys have done that for us.

OpenCSV

Glen Smith’s OpenCSV library nicely abstracts away a lot of the minutiae of dealing with CSV. It properly deals with escaped characters, multiline input and a host of other issues for you. Give it pretty much anything that satisfies the Reader interface as an input and it’s good to go. The return type of the CSVReader.readAll() method also satisfies the same contract as the plain Groovy version shown above, so we can interact with the parsed results in exactly the same way as in the previous example.
All values are still typed as Strings, so we need to cast in order to compare numeric values.

[groovy language=”true”]
import au.com.bytecode.opencsv.CSVReader

List<String[]> rows = new CSVReader(
new InputStreamReader(getClass().classLoader.getResourceAsStream(TEST_FILE_NAME)))
.readAll()

//same finders as in the Groovy version
def elementsOver200Mass = rows.findAll {it[ATOMIC_MASS].toDouble() > 200}
def elementsBetween10And20 = rows.findAll { row ->
double mass = castToDouble(row[ATOMIC_MASS])
mass <= 20 && mass >= 10
}
[/groovy]

HsqlDB

HsqlDB allows for using flat files as Text Tables, effectively turning a CSV file into a database table. Data values are cast to the types specified in your table definition, so any potential type cast failures happen right up front when loading the data. The benefit is that once the data is read in no further manual processing is necessary. Note: by default the csv file and the database file must be located in the same directory, as a security precaution (can be overridden through configuration).

In addition, leveraging sql makes querying the data extremely easy. Groovy brings in the Sql class to abstract away a lot of the normal Java boilerplate you encounter when working with a ResultSet. I didn’t do any in-depth testing to prove it out, but this is also the only one of the strategies described here that doesn’t require holding the entire data in memory in order to do arbitrary queries. That can be mitigated in the previously shown methods by processing files line by line rather than in bulk if memory usage is a concern.

[groovy language=”true”]
//create the table definition to insert
String tableName = ‘elements’
String tableDefinition = """CREATE TEXT TABLE $tableName (
atomic_number INTEGER PRIMARY KEY,
atomic_mass NUMERIC,
name VARCHAR(255),
symbol VARCHAR(3)
);"""

//create a new file database and a table corresponding to the csv file
Sql sql = Sql.newInstance("jdbc:hsqldb:file:${testdbDir.absolutePath}/testdb", ‘sa’, ”
,’org.hsqldb.jdbcDriver’)
sql.execute(tableDefinition)

//set the source to the csv file
sql.execute("SET TABLE elements SOURCE ‘${TEST_FILE_NAME};all_quoted=true’".toString())

//querying the database that’s wrapping our CSV file
def elementsOver200Mass = sql.rows("SELECT * FROM $tableName WHERE atomic_mass > ?", [200])
def elementsBetween10And20 = sql.rows(
"SELECT * FROM $tableName WHERE atomic_mass <= ? AND atomic_mass >= ?", [20, 10])

//simple db aggregates
def count = 0
sql.eachRow("SELECT count(1) FROM $tableName WHERE atomic_mass <= ?", [20]){row->
count = row[0]
}
def avg = 0
sql.eachRow("SELECT avg(atomic_mass) FROM $tableName".toString()){row->
avg = row[0]
}
[/groovy]

Which One do I Use???

Pick the one that fits best for your use case is the real answer. I didn’t know about the HsqlDB option until recently, but OpenCSV has been a personal standby for years. Then again, if you’re writing something quick and dirty at the script level, the simplicity of just applying a couple of split() operations is pretty appealing. Here’s how I generally decide.

[table id=5 /]

Enhanced by Zemanta

13 Responses to "Groovy and CSV: How to Get Your Data Out?"

1 | Tweets that mention Groovy and CSV: How to Get Your Data Out? | The Kaptain on ... stuff -- Topsy.com

July 1st, 2010 at 12:40 pm

Avatar

[…] This post was mentioned on Twitter by kellyrob99, groovyblogs.org. groovyblogs.org said: Groovy and CSV: How to Get Your Data Out? — http://bit.ly/aBP7EQ — The Kaptain on … stuff […]

2 | Chris Hane

July 1st, 2010 at 5:11 pm

Avatar

I like the the csvjdbc library:

http://csvjdbc.sourceforge.net/

We read lots of files and it makes it very easy to open a file, iterate over it’s contents and grab just the fields I want.

I have not explored all of the weird cases that are out there for formatting csv files, just the ones we’ve run into.

Anyway, food for thought…

4 | TheKaptain

July 2nd, 2010 at 3:25 pm

Avatar

Nice! That csvread function sounds just about perfect. Thanks much for pointing it out.

5 | TheKaptain

July 2nd, 2010 at 3:41 pm

Avatar

And I’m happy to report it works almost perfectly as a swap-in replacement for the HsqlDB example. Looks like it automatically assumes the first line contains the header values though? Do you happen to know if that’s configurable or not? Nothing jumped out at me in the page you linked or the javadocs.

6 | CK

July 2nd, 2010 at 5:05 am

Avatar

Really nice, thanks for sharing.

Quick question: what’s with the .toString() call on the query string on line 16 of the HsqlDB example?

7 | TheKaptain

July 2nd, 2010 at 3:24 pm

Avatar

The jdbc driver was throwing an ‘index out of range’ exception. The Sql.execute() method can take a GString as a param, but apparently it’s not getting evaluated properly. Should probably log a bug(mental note).

8 | Blog bookmarks 07/03/2010 « My Diigo bookmarks

July 2nd, 2010 at 7:32 pm

Avatar

[…] Groovy and CSV: How to Get Your Data Out? | The Kaptain on … stuff […]

9 | Random Links #219 | YASDW - yet another software developer weblog

July 5th, 2010 at 11:44 am

Avatar

[…] Groovy and CSV: How to Get Your Data Out? csv … omg […]

10 | Leonard Axelsson/@xlson

July 5th, 2010 at 10:36 pm

Avatar

Thanks for putting together this overview. I wasn’t aware that HSQLDB could parse csv. Seems like a good way to go when it comes to very large/complicated datasets.

As it happens I started working on a csv-lib for Groovy last week. Nothing fancy, just trying to add a little of the “Groovy” feeling to csv parsing by allowing data-access by column name.

More about csv-parser here:
http://xlson.github.com/csv-parser/

11 | TheKaptain

July 7th, 2010 at 4:29 pm

Avatar

That project looks interesting Leonard. And you gotta love git for making it easy to share.

12 | Leonard Axelsson

July 9th, 2010 at 3:37 am

Avatar

Really like having Git and Github, makes everything a lot easier. Esp. when they added their own issue-tracker as well 🙂

13 | Wanderson Santos

September 3rd, 2010 at 1:58 pm

Avatar

Mr. WHUT? rs

Just do this GROOOOOVY code!

def csv = new File(“c:\test\file.csv”)

csv.splitEachLine(‘,’) { campo ->

def id = field[0] ?: 0

if (id > 0) {
def user = User.get(id)
user.name= field[1] ?: “Nameless”
user.house = field[2] ?: “Houseless”
}

}

Comment Form