The Kaptain on … stuff

18 Mar, 2012

JFreeChart with Groovy and Apache POI

Posted by: TheKaptain In: Development

The point of this article is to show you how to parse data from an Excel spreadsheet that looks like this:

[singlepic id=78 w=700 h=400 mode=watermark]

and turn it into a series of graphs that look like this:

[singlepic id=79 w=700 h=400 mode=watermark]

Recently I was looking for an opportunity to get some practice with JFreeChart and ended up looking at a dataset released by the Canadian government as part of their ‘Open Data’ initiative.

The particular set of data is entitled ‘Number of Seedlings Planted by Ownership, Species’ and is delivered as an Excel spreadsheet, hence the need for the Apache POI library in order to read the data in. As is fairly usual, at least in my experience, the Excel spreadsheet is designed primarily for human consumption which adds a degree of complexity to the parsing. Fortunately the spreadsheet does follow a repetitive pattern that can be accounted for fairly easily, so this is not insurmountable. Still, we want to get the data out of Excel to make it more approachable for machine consumption so the first step is to convert it to a JSON representation. Once it is in this much more transportable form we can readily convert the data into graph visualizations using JFreeChart.

The spreadsheet format

Excel as a workplace tool is very well established, can increase individual productivity and is definitely a boon to your average office worker. The problem is that once the data is there it’s often trapped there. Data tends to be laid out based on human aesthetics and not on parsability, meaning that unless you want to use Excel itself to do further analysis, there’s not a lot of options. Exports to more neutral formats like csv suffer from the same problems- namely that there’s no way to read in the data coherently without designing a custom parser. In this particular case, parsing the spreadsheet has to take into account the following:

  • Merged cells where one column is meant to represent a fixed value for a number of sequential rows.
  • Column headers that do not represent all of the actual columns. Here we have a ‘notes’ column for each province that immediately follows its’ data column. As the header cells are merged across both of these columns, they cannot be used directly to parse the data.
  • Data is broken down into several domains that lead to repetitions in the format.
  • The data contains a mix of numbers where results are available and text where they are not. The meanings of the text entries are described in a table at the end of the spreadsheet.
  • Section titles and headers are repeated throughout the document, apparently trying to match some print layout, or perhaps just trying to provide some assistance to those scrolling through the long document.

Data in the spreadsheet is first divided into reporting by Provincial crown land, private land, Federal land, and finally a total for all of them.

Within each of these sections, data is reported for each tree species on a yearly basis across all Provinces and Territories along with aggregate totals of these figures across Canada.

Each of these species data-tables has an identical row/column structure which allows us to create a single parsing structure sufficient for reading in data from each of them separately.

Converting the spreadsheet to JSON

For parsing the Excel document, I’m using the Apache POI library and a Groovy wrapper class to assist in processing. The wrapper class is very simple but allows us to abstract most of the mechanics of dealing with the Excel document away. The full source is available on this blog post from author Goran Ehrsson. The key benefit is the ability to specify a window of the file to process based on ‘offset’ and ‘max’ parameters provided in a simple map. Here’s an example for reading data for the text symbols table at the end of the spreadsheet.

We define a Map which states which sheet to read from, which line to start on(offset) and how many lines to process. The ExcelBuilder class(which isn’t really a builder at all) takes in the path to a File object and under the hood reads that into a POI HSSFWorkbook which is then referenced by the call to the eachLine method.

[groovy language=”language="true”]
public static final Map SYMBOLS = [sheet: SHEET1, offset: 910, max: 8]

final ExcelBuilder excelReader = new ExcelBuilder(data.absolutePath)
Map<String, String> symbolTable = [:]
excelReader.eachLine(SYMBOLS) { HSSFRow row ->
symbolTable[row.getCell(0).stringCellValue] = row.getCell(1).stringCellValue
}

[/groovy]

Eventually when we turn this into JSON, it will look like this:

[plain language=”true”]
"Symbols": {
"…": "Figures not appropriate or not applicable",
"..": "Figures not available",
"–": "Amount too small to be expressed",
"-": "Nil or zero",
"p": "Preliminary figures",
"r": "Revised figures",
"e": "Estimated by provincial or territorial forestry agency",
"E": "Estimated by the Canadian Forest Service or by Statistics Canada"
}
[/plain]

Now processing the other data blocks gets a little bit trickier. The first column consists of 2 merged cells, and all but one of the other headers actually represents two columns of information: a count and an optional notation. The merged column is handled by a simple EMPTY placeholder and the extra columns by processing the list of headers;.

[groovy language=”language="true”]
public static final List<String> HEADERS = [‘Species’, ‘EMPTY’, ‘Year’, ‘NL’, ‘PE’, ‘NS’, ‘NB’, ‘QC’, ‘ON’, ‘MB’, ‘SK’, ‘AB’,
‘BC’, ‘YT’, ‘NT *a’, ‘NU’, ‘CA’]
/**
* For each header add a second following header for a ‘notes’ column
* @param strings
* @return expanded list of headers
*/
private List<String> expandHeaders(List<String> strings)
{
strings.collect {[it, "${it}_notes"]}.flatten()
}
[/groovy]
Each data block corresponds to a particular species of tree, broken down by year and Province or Territory. Each species is represented by a map which defines where in the document that information is contained so we can iterate over a collection of these maps and aggregate data quite easily. This set of constants and code is sufficient for parsing all of the data in the document.
[groovy language=”language="true”]
public static final int HEADER_OFFSET = 3
public static final int YEARS = 21
public static final Map PINE = [sheet: SHEET1, offset: 6, max: YEARS, species: ‘Pine’]
public static final Map SPRUCE = [sheet: SHEET1, offset: 29, max: YEARS, species: ‘Spruce’]
public static final Map FIR = [sheet: SHEET1, offset: 61, max: YEARS, species: ‘Fir’]
public static final Map DOUGLAS_FIR = [sheet: SHEET1, offset: 84, max: YEARS, species: ‘Douglas-fir’]
public static final Map MISCELLANEOUS_SOFTWOODS = [sheet: SHEET1, offset: 116, max: YEARS, species: ‘Miscellaneous softwoods’]
public static final Map MISCELLANEOUS_HARDWOODS = [sheet: SHEET1, offset: 139, max: YEARS, species: ‘Miscellaneous hardwoods’]
public static final Map UNSPECIFIED = [sheet: SHEET1, offset: 171, max: YEARS, species: ‘Unspecified’]
public static final Map TOTAL_PLANTING = [sheet: SHEET1, offset: 194, max: YEARS, species: ‘Total planting’]
public static final List<Map> PROVINCIAL = [PINE, SPRUCE, FIR, DOUGLAS_FIR, MISCELLANEOUS_SOFTWOODS, MISCELLANEOUS_HARDWOODS, UNSPECIFIED, TOTAL_PLANTING]
public static final List<String> AREAS = HEADERS[HEADER_OFFSET..-1]

final Closure collector = { Map species ->
Map speciesMap = [name: species.species]
excelReader.eachLine(species) {HSSFRow row ->
//ensure that we are reading from the correct place in the file
if (row.rowNum == species.offset)
{
assert row.getCell(0).stringCellValue == species.species
}
//process rows
if (row.rowNum > species.offset)
{
final int year = row.getCell(HEADERS.indexOf(‘Year’)).stringCellValue as int
Map yearMap = [:]
expandHeaders(AREAS).eachWithIndex {String header, int index ->
final HSSFCell cell = row.getCell(index + HEADER_OFFSET)
yearMap[header] = cell.cellType == HSSFCell.CELL_TYPE_STRING ? cell.stringCellValue : cell.numericCellValue
}
speciesMap[year] = yearMap.asImmutable()
}
}
speciesMap.asImmutable()
}
[/groovy]

The defined collector Closure returns a map of all species data for one of the four groupings(Provincial, private land, Federal and totals). The only thing that differentiates these groups is their offset in the file so we can define maps for the structure of each simply by updating the offsets of the first.

[groovy language=”true”]
public static final List<Map> PROVINCIAL = [PINE, SPRUCE, FIR, DOUGLAS_FIR, MISCELLANEOUS_SOFTWOODS, MISCELLANEOUS_HARDWOODS, UNSPECIFIED, TOTAL_PLANTING]
public static final List<Map> PRIVATE_LAND = offset(PROVINCIAL, 220)
public static final List<Map> FEDERAL = offset(PROVINCIAL, 441)
public static final List<Map> TOTAL = offset(PROVINCIAL, 662)

private static List<Map> offset(List<Map> maps, int offset)
{
maps.collect { Map map ->
Map offsetMap = new LinkedHashMap(map)
offsetMap.offset = offsetMap.offset + offset
offsetMap
}
}
[/groovy]

Finally, we can iterate over these simple map structures applying the collector Closure and we end up with a single map representing all of the data.

[groovy language=”true”]
def parsedSpreadsheet = [PROVINCIAL, PRIVATE_LAND, FEDERAL, TOTAL].collect {
it.collect(collector)
}
Map resultsMap = [:]
GROUPINGS.eachWithIndex {String groupName, int index ->
resultsMap[groupName] = parsedSpreadsheet[index]
}
resultsMap[‘Symbols’] = symbolTable
[/groovy]

And the JsonBuilder class provides an easy way to convert any map to a JSON document ready to write out the results.

[groovy language=”true”]
Map map = new NaturalResourcesCanadaExcelParser().convertToMap(data)
new File(‘src/test/resources/NaturalResourcesCanadaNewSeedlings.json’).withWriter {Writer writer ->
writer << new JsonBuilder(map).toPrettyString()
}
[/groovy]

Parsing JSON into JFreeChart line charts

All right, so now that we’ve turned the data into a slightly more consumable format, it’s time to visualize it. For this case I’m using a combination of the JFreeChart library and the GroovyChart project which provides a nice DSL syntax for working with the JFreeChart API. It doesn’t look to be under development presently, but aside from the fact that the jar isn’t published to an available repository it was totally up to this task.

We’re going to create four charts for each of the fourteen areas represented for a total of 56 graphs overall. All of these graphs contain plotlines for each of the eight tree species tracked. This means that overall we need to create 448 distinct time series. I didn’t do any formal timings of how long this takes, but in general it came in somewhere under ten seconds to generate all of these. Just for fun, I added GPars to the mix to parallelize creation of the charts, but since writing the images to disk is going to be the most expensive part of this process, I don’t imagine it’s speeding things up terribly much.

First, reading in the JSON data from a file is simple with JsonSlurper.

[groovy language=”true”]
def data
new File(jsonFilename).withReader {Reader reader ->
data = new JsonSlurper().parse(reader)
}
assert data
[/groovy]

Here’s a sample of what the JSON data looks like for one species over a single year, broken down first by one of the four major groups, then by tree species, then by year and finally by Province or Territory.

[plain language=”true”]
{
"Provincial": [
{
"name": "Pine",
"1990": {
"NL": 583.0,
"NL_notes": "",
"PE": 52.0,
"PE_notes": "",
"NS": 4.0,
"NS_notes": "",
"NB": 4715.0,
"NB_notes": "",
"QC": 33422.0,
"QC_notes": "",
"ON": 51062.0,
"ON_notes": "",
"MB": 2985.0,
"MB_notes": "",
"SK": 4671.0,
"SK_notes": "",
"AB": 8130.0,
"AB_notes": "",
"BC": 89167.0,
"BC_notes": "e",
"YT": "-",
"YT_notes": "",
"NT *a": 15.0,
"NT *a_notes": "",
"NU": "..",
"NU_notes": "",
"CA": 194806.0,
"CA_notes": "e"
},

[/plain]

Building the charts is a simple matter of iterating over the resulting map of parsed data. In this case we’re ignoring the ‘notes’ data but have included it in the dataset in case we want to use it later. We’re also just ignoring any non-numeric values.

[groovy language=”true”]
GROUPINGS.each { group ->
withPool {
AREAS.eachParallel { area ->
ChartBuilder builder = new ChartBuilder();
String title = sanitizeName("$group-$area")
TimeseriesChart chart = builder.timeserieschart(title: group,
timeAxisLabel: ‘Year’,
valueAxisLabel: ‘Number of Seedlings(1000s)’,
legend: true,
tooltips: false,
urls: false
) {
timeSeriesCollection {
data."$group".each { species ->
Set years = (species.keySet() – ‘name’).collect {it as int}
timeSeries(name: species.name, timePeriodClass: ‘org.jfree.data.time.Year’) {
years.sort().each { year ->
final value = species."$year"."$area"
//check that it’s a numeric value
if (!(value instanceof String))
{
add(period: new Year(year), value: value)
}
}
}
}
}
}

}
[/groovy]

Then we apply some additional formatting to the JFreeChart to enhance the output styling, insert an image into the background, and fix the plot color schemes.

[groovy language=”true”]
JFreeChart innerChart = chart.chart
String longName = PROVINCE_SHORT_FORM_MAPPINGS.find {it.value == area}.key
innerChart.addSubtitle(new TextTitle(longName))
innerChart.setBackgroundPaint(Color.white)
innerChart.plot.setBackgroundPaint(Color.lightGray.brighter())
innerChart.plot.setBackgroundImageAlignment(Align.TOP_RIGHT)
innerChart.plot.setBackgroundImage(logo)
[Color.BLUE, Color.GREEN, Color.ORANGE, Color.CYAN, Color.MAGENTA, Color.BLACK, Color.PINK, Color.RED].eachWithIndex { color, int index ->
innerChart.XYPlot.renderer.setSeriesPaint(index, color)
}
[/groovy]

And we write out each of the charts to a formulaically named png file.

[groovy language=”true”]
def fileTitle = "$FILE_PREFIX-${title}.png"
File outputDir = new File(outputDirectory)
if (!outputDir.exists())
{
outputDir.mkdirs()
}
File file = new File(outputDir, fileTitle)
if (file.exists())
{
file.delete()
}
ChartUtilities.saveChartAsPNG(file, innerChart, 550, 300)
[/groovy]

To tie it all together, an html page is created using MarkupBuilder to showcase all of the results, organized by Province or Territory.

[groovy language=”true”]
def buildHtml(inputDirectory)
{
File inputDir = new File(inputDirectory)
assert inputDir.exists()
Writer writer = new StringWriter()
MarkupBuilder builder = new MarkupBuilder(writer)
builder.html {
head {
title(‘Number of Seedlings Planted by Ownership, Species’)
style(type: "text/css") {
mkp.yield(CSS)
}
}
body {
ul {
AREAS.each { area ->
String areaName = sanitizeName(area)
div(class: ‘area rounded-corners’, id: areaName) {
h2(PROVINCE_SHORT_FORM_MAPPINGS.find {it.value == area}.key)
inputDir.eachFileMatch(~/.*$areaName\.png/) {
img(src: it.name)
}
}
}
}
script(type: ‘text/javascript’, src: ‘https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js’, ”)
script(type: ‘text/javascript’) {
mkp.yield(JQUERY_FUNCTION)
}
}
}
writer.toString()
}
[/groovy]

The generated html page assumes that all images are co-located in the same folder, presents four images per Province/Territory and, just for fun, uses JQuery to attach a click handler to each of the headers. Click on a header and the images in that div will animate into the background. I’m sure the actual JQuery being used could be improved upon, but it serves its purpose. Here’s a sample of the html output:

[xml language=”true”]
<ul>
<div class=’area rounded-corners’ id=’NL’>
<h2>Newfoundland and Labrador</h2>
<img src=’naturalResourcesCanadaNewSeedlings-Federal-NL.png’ />
<img src=’naturalResourcesCanadaNewSeedlings-PrivateLand-NL.png’ />
<img src=’naturalResourcesCanadaNewSeedlings-Provincial-NL.png’ />
<img src=’naturalResourcesCanadaNewSeedlings-Total-NL.png’ />
</div>

[/xml]

The resulting page looks like this in Firefox.

[singlepic id=80 w=700 h=400 mode=watermark]

Source code and Links

The source code is available on GitHub. So is the final resulting html page. The entire source required to go from Excel to charts embedded in an html page comes in at slightly under 300 lines of code and I don’t think the results are too bad for the couple of hours effort involved. Finally, the JSON results are also hosted on the GitHub pages for the project for anyone else who might want to delve into the data.

Some reading related to this topic:

[nggallery id=15]

2 Responses to "JFreeChart with Groovy and Apache POI"

1 | 2012년 3월 21일 it 기술 동향 |

March 20th, 2012 at 4:36 pm

Avatar

[…] Groovy 와 Apache POI 와 JFreeChart의 활용 […]

2 | Dave Cherry

November 16th, 2013 at 5:54 am

Avatar

This is a great usage of the groovychart project. The binary distribution had been lost during java.net moving to new infrastructure. I’ve put a snapshot build back out just today in the downloads section of groovychart. I’ll link this as another example usage from the groovychart project.

Comment Form