Exploratory Data Visualisation with R, ggplot and ggobi Part 1

Error message

  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Use of undefined constant filter - assumed 'filter' in preg_replace() (line 1 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module(363) : regexp code).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _filter_drutex_process() (line 315 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_create_regex() (line 363 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Deprecated function: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in _drutex_unhide_all() (line 479 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).
  • Notice: Undefined index: en in drutex_node_view() (line 81 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).

Here's a video which summarises some data visualisation I've been doing recently. The rest of this post explains the story behind it.

== Background ==

I'm doing some work with [http://www.sruc.ac.uk/asoteriades Andreas Diomedes Soteriades] and [http://www.ed.ac.uk/schools-departments/geosciences/people?indv=1578&cw_xml=person.html Marc Metzger], looking at global responses to climate change. I can't be more specific until we've finished the paper, but the general overview is that we're taking current data, classifying it, and then trying to classify the results of a bunch of climate models.

We're working with gridded global data, at 5 minute resolution, so we have 2 million cells, covering the land surface of the globe. We have:
* 4 climatic variables for each cell
* a classification for each cell into one of 125 classes
These make up our training set, as the variables used here are not the same variables as were used to do the classification.

We also have:
* 4 climatic variables for each cell, for a number of future scenarios, based on a bunch of different climate models
It's these data that we now want to classify according to the original schema. So, we need to train a classifier on the training set, and then run the classifier over the future scenarios to generate our output. This is a fairly big dataset - 2M points, 125 classes - so we need something simple and fast.

We started with a [http://en.wikipedia.org/wiki/Naive_bayes Naive Bayes] approach - it's fast and quite robust, and doesn't need tuning - and got 76% classification accuracy on current data. Not perfect, but good enough to be going on with. The rest of this post is about how we discovered that it was limited, and why.

When we looked at the initial outputs, it seemed fairly reasonable - classes moved around in the future in ways we were expecting. However, for some outputs, we realised that there were huge regions of a single class, in a place where we really wouldn't expect to find it:

Whole World Classification

What's all that blue stuff doing in the middle of Africa - that looks weird! Note: it only really looks weird if you have some idea what the classes are, but it's still the same colour over the middle of africa as over bits of Northern Europe, which is a bit worrying.

So, we needed to find out why this was happening, is it a problem with the classification, or a natural effect of the data? It's DataViz time!

== Setup ==

For this, I used three main tools:

* [http://cran.r-project.org/ R], the statistical language
* [http://had.co.nz/ggplot2/ ggplot2], a really lovely graph package
* [http://www.ggobi.org/ ggobi], a multidimensional visualisation program, which can talk to R

If you're following along at home I'm going to assume that you've installed them all, and done library(ggplot2);library(rggobi), so it's all ready to go. I will put up links to data files once it's published, so for now you'll just have to follow the code and use your own data.

First up, I read the original data and the future scenario data into data.frames, and made sure the format matched:

> class <- read.csv("data/output.csv")
> head( origA )
          V1           V2           V3         V4       LON      LAT Class ID
1      9927       1707        279      11967 -33.37559 83.62503    11         11
2      9927       1707        279      11959 -33.20892 83.62503    11         13
3      9896       1708        279      11965 -33.12559 83.62503    11         14

Now, since the datasets are very large, I needed to create a subset to explore. To get a good range of points, I took:
* 100 points from each class, randomly scattered over the globe
* all the points from a region around Africa where weird things were happening
* another 1000 points sampled from the class which was looking strange (class 80)
(see snippet at bottom)

I did this both for the baseline data and the classified future data, so I could compare the samples. Each one looked a bit like this:

> clasSamp <- getRegionAndStrata( class )
> ggplot( clasSamp, aes( x=LON,y=LAT,color=Class ) ) + geom_point(size=0.8)

World Stratified Sample

Then, I added a column to say which set each point was from, and combined them:

> clasSamp<em class="error">TeX Embedding failed!</em>Scenario <- "Baseline"
> d <- rbind( clasSamp, basSamp )

== Initial Exploration ==

Next, the data goes into GGobi. GGobi is designed to help with visualising and understanding large, multivariate datasets. There are a bunch of different tools it has for displaying the data, but to start with I used a 2D tour, along with some automatic brushing to highlight certain points. The 2D tour takes multidimensional data, and projects it down into 2D, but it does it by mapping a combination of input variables onto X and Y, and then changes that combination over time. As ggobi moves through different mappings of each variable, you get a constantly changing view of what's happening, as if you were rotating an object and looking at it from different angles. If that's confusing, have a look at the video further down the page...

Using the R package makes it all very easy:

gg <- ggobi( d )

This opens up a new GGobi window, with the points loaded in. It might look a bit like this:
Ggobi Initial

It's not very helpful to start with. The first thing I did is chose a new colour scheme. I'm going to use 6 different classes, so I need one with at least 6 different colours, and for this I'm using "dark2 6":

GGobi Choose Colour

Now, we can go back to R, and get it to colour the points according to their class. (See snippet at the bottom)

This makes the default colour 1, and other colours depending on whether the point is in the region, whether is is in the problem class, and whether it is baseline or predicted data. The last 3 lines hide all the data except those in the selected region or in the problem class. It looks like this for me:

GGobi Coloured Hidden

Now, we're starting to see a bit of structure. The key is:
* Orange points are in the region
* Mauve/purple points are predicted points in the problem class
* Magenta points are predicted points in the problem class, in the selected region - the most likely wierdos
* Green points are the problem class in baseline
* Yellow points are the problem class in baseline in the region (I don't think there are any of these)

Now, I choose Display -> 2D Tour, and select the input variables I want to explore. All of a sudden, everything starts moving!

GGobi Tour Start

What I'm interested in is the problem points in the predicted set compared to the baseline, so I can hide everything else for now:

	shadowed(gg[1]) <- T
	shadowed(gg[1])[d<em class="error">TeX Embedding failed!</em>Class, sample, size=nsamples,replace=TRUE)
	stratified <- d[ melt( unlist(stratIndex) )<em class="error">TeX Embedding failed!</em>Type <- "Stratified"
	region <- d[d<em class="error">TeX Embedding failed!</em>LON < MAXLON & d<em class="error">TeX Embedding failed!</em>LAT > MINLAT,]
	region<em class="error">TeX Embedding failed!</em>Class==80,]
	e80 <- c80[sample( nrow(c80), size=extra80, replace=TRUE),]
	e80<em class="error">TeX Embedding failed!</em>Type <- factor( res<em class="error">TeX Embedding failed!</em>Type == "Region"] <- 2
	glyph_colour(gg[1])[d<em class="error">TeX Embedding failed!</em>Scenario == "2080"] <- 3
	glyph_colour(gg[1])[d<em class="error">TeX Embedding failed!</em>Class == 80 & d<em class="error">TeX Embedding failed!</em>Class == 80 & d<em class="error">TeX Embedding failed!</em>Type == "Region" & d<em class="error">TeX Embedding failed!</em>Scenario == "Baseline"] <- 6
 
	shadowed(gg[1]) <- T
	shadowed(gg[1])[d<em class="error">TeX Embedding failed!</em>Class == 80] <- F