Here's a video which summarises some data visualisation I've been doing recently. The rest of this post explains the story behind it.
== Background ==
I'm doing some work with [http://www.sruc.ac.uk/asoteriades Andreas Diomedes Soteriades] and [http://www.ed.ac.uk/schools-departments/geosciences/people?indv=1578&cw_xml=person.html Marc Metzger], looking at global responses to climate change. I can't be more specific until we've finished the paper, but the general overview is that we're taking current data, classifying it, and then trying to classify the results of a bunch of climate models.
We're working with gridded global data, at 5 minute resolution, so we have 2 million cells, covering the land surface of the globe. We have:
* 4 climatic variables for each cell
* a classification for each cell into one of 125 classes
These make up our training set, as the variables used here are not the same variables as were used to do the classification.
We also have:
* 4 climatic variables for each cell, for a number of future scenarios, based on a bunch of different climate models
It's these data that we now want to classify according to the original schema. So, we need to train a classifier on the training set, and then run the classifier over the future scenarios to generate our output. This is a fairly big dataset - 2M points, 125 classes - so we need something simple and fast.
We started with a [http://en.wikipedia.org/wiki/Naive_bayes Naive Bayes] approach - it's fast and quite robust, and doesn't need tuning - and got 76% classification accuracy on current data. Not perfect, but good enough to be going on with. The rest of this post is about how we discovered that it was limited, and why.
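If you want to try something similar, a naive Bayes classifier needs very little code in R. Here's a minimal sketch using naiveBayes() from the e1071 package on made-up stand-in data — the package choice, column names and toy data here are illustrative assumptions, not the real dataset:

```r
# Sketch of the classification step: train on labelled baseline cells,
# then predict classes for unlabelled (future-scenario) cells.
# e1071 is one common CRAN implementation of naive Bayes.
library(e1071)

set.seed(1)
# Toy stand-in: 4 climatic variables plus a class label per cell
train <- data.frame(
  V1 = rnorm(500), V2 = rnorm(500), V3 = rnorm(500), V4 = rnorm(500),
  Class = factor(sample(1:5, 500, replace = TRUE))
)
# Future-scenario cells have the same 4 variables but no class
future <- train[, c("V1", "V2", "V3", "V4")]

model <- naiveBayes(Class ~ V1 + V2 + V3 + V4, data = train)
pred  <- predict(model, newdata = future)   # one predicted class per cell

# Training-set accuracy (on the real baseline data we saw about 76%)
mean(pred == train$Class)
```

Naive Bayes scales linearly with the number of points, which is what makes it usable on a 2M-cell grid.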
When we looked at the initial outputs, it seemed fairly reasonable - classes moved around in the future in ways we were expecting. However, for some outputs, we realised that there were huge regions of a single class, in a place where we really wouldn't expect to find it:
What's all that blue stuff doing in the middle of Africa? That looks weird! Note: it only really looks weird if you have some idea what the classes are, but it's still the same colour over the middle of Africa as over parts of Northern Europe, which is a bit worrying.
So, we needed to find out why this was happening: is it a problem with the classification, or a natural effect of the data? It's DataViz time!
== Setup ==
For this, I used three main tools:
* [http://cran.r-project.org/ R], the statistical language
* [http://had.co.nz/ggplot2/ ggplot2], a really lovely graph package
* [http://www.ggobi.org/ ggobi], a multidimensional visualisation program, which can talk to R
If you're following along at home I'm going to assume that you've installed them all, and done library(ggplot2);library(rggobi), so it's all ready to go. I will put up links to data files once it's published, so for now you'll just have to follow the code and use your own data.
First up, I read the original data and the future scenario data into data.frames, and made sure the format matched:
 > class <- read.csv("data/output.csv")
 > head( origA )
     V1   V2  V3    V4       LON      LAT Class ID
 1  9927 1707 279 11967 -33.37559 83.62503    11 11
 2  9927 1707 279 11959 -33.20892 83.62503    11 13
 3  9896 1708 279 11965 -33.12559 83.62503    11 14
Now, since the datasets are very large, I needed to create a subset to explore. To get a good range of points, I took:
* 100 points from each class, randomly scattered over the globe
* all the points from a region around Africa where weird things were happening
* another 1000 points sampled from the class which was looking strange (class 80)
(see snippet at bottom)
I did this both for the baseline data and the classified future data, so I could compare the samples. Each one looked a bit like this:
 > clasSamp <- getRegionAndStrata( class )
 > ggplot( clasSamp, aes( x=LON,y=LAT,color=Class ) ) + geom_point(size=0.8)
Then, I added a column to say which set each point was from, and combined them:
 > clasSamp$Scenario <- "Baseline"
 > basSamp$Scenario <- "2080"
 > d <- rbind( clasSamp, basSamp )
== Initial Exploration ==
Next, the data goes into GGobi. GGobi is designed to help with visualising and understanding large, multivariate datasets. It has a bunch of different tools for displaying the data, but to start with I used a 2D tour, along with some automatic brushing to highlight certain points. The 2D tour takes multidimensional data and projects it down into 2D, mapping a combination of input variables onto X and Y, and then changing that combination over time. As ggobi moves through different mappings of each variable, you get a constantly changing view of what's happening, as if you were rotating an object and looking at it from different angles. If that's confusing, have a look at the video further down the page...
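To get an intuition for what a single frame of the tour is doing, here's a sketch in plain R — this is just an illustration of the idea, not ggobi's actual code. Each view is the data multiplied by an orthonormal 4×2 projection basis; the tour smoothly interpolates between such bases over time:

```r
# One frame of a 2D tour: project 4-D points onto a random 2-D plane.
set.seed(42)
X <- matrix(rnorm(400), ncol = 4)   # 100 points, 4 climatic variables

# Random 4x2 basis, orthonormalised via QR decomposition,
# so the projection doesn't stretch or shear the data
B <- qr.Q(qr(matrix(rnorm(8), ncol = 2)))

P <- X %*% B                        # the 2-D view for this frame
dim(P)                              # 100 points, 2 coordinates each
```

Animating between many such bases is what makes clusters and outliers "pop out" as the view rotates.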
Using the R package makes it all very easy:
gg <- ggobi( d )
It's not very helpful to start with. The first thing I did was choose a new colour scheme. I'm going to use 6 different classes, so I need one with at least 6 different colours, and for this I'm using "dark2 6":
Now, we can go back to R, and get it to colour the points according to their class. (See snippet at the bottom)
This makes the default colour 1, and assigns other colours depending on whether the point is in the region, whether it is in the problem class, and whether it is baseline or predicted data. The last 3 lines hide all the data except those in the selected region or in the problem class. It looks like this for me:
Now, we're starting to see a bit of structure. The key is:
* Orange points are in the region
* Mauve/purple points are predicted points in the problem class
* Magenta points are predicted points in the problem class, in the selected region - the most likely weirdos
* Green points are the problem class in baseline
* Yellow points are the problem class in baseline in the region (I don't think there are any of these)
Now, I choose Display -> 2D Tour, and select the input variables I want to explore. All of a sudden, everything starts moving!
What I'm interested in is the problem points in the predicted set compared to the baseline, so I can hide everything else for now:
 > shadowed(gg) <- T
 > shadowed(gg)[d$Class == 80] <- F
== Code snippets ==
The stratified / regional sampling function (MINLON, MAXLON, MINLAT and MAXLAT are the bounds of the region of interest):
 getRegionAndStrata <- function( d, nsamples=100, extra80=1000 ) {
   stratIndex <- tapply( 1:nrow(d), d$Class, sample, size=nsamples, replace=TRUE )
   stratified <- d[ melt( unlist(stratIndex) )$value, ]
   stratified$Type <- "Stratified"
   region <- d[ d$LON > MINLON & d$LON < MAXLON & d$LAT > MINLAT & d$LAT < MAXLAT, ]
   region$Type <- "Region"
   c80 <- d[ d$Class == 80, ]
   e80 <- c80[ sample( nrow(c80), size=extra80, replace=TRUE ), ]
   e80$Type <- "Extra80"
   res <- rbind( stratified, region, e80 )
   res$Type <- factor( res$Type )
   res
 }
The colouring and shadowing commands:
 glyph_colour(gg) <- 1
 glyph_colour(gg)[d$Type == "Region"] <- 2
 glyph_colour(gg)[d$Class == 80 & d$Scenario == "2080"] <- 3
 glyph_colour(gg)[d$Class == 80 & d$Type == "Region" & d$Scenario == "2080"] <- 4
 glyph_colour(gg)[d$Class == 80 & d$Scenario == "Baseline"] <- 5
 glyph_colour(gg)[d$Class == 80 & d$Type == "Region" & d$Scenario == "Baseline"] <- 6
 shadowed(gg) <- T
 shadowed(gg)[d$Type == "Region"] <- F
 shadowed(gg)[d$Class == 80] <- F