Submitted by dave on
In the last post, talking about my work with Andreas and Marc, we saw how a bit of data visualisation helped to understand why some output was looking funny, and how the choice of classifier contributed to some strange behaviour. In that case, we were really lucky to spot it. What we'd like now is some better ways to spot and avoid these problems in the future, so this post is about another way to have a look at this data, and make sure it looks reasonable - some "defensive visualisation".
Class Means
What was the real "problem" with the original classification? Really, it was that points were getting assigned to classes which were far away (in the space of input variables). Staying with visualisation (rather than statistics), it might be useful to look at the distance between points and the mean of the class they are assigned to. So, again with R, ggplot and ggobi, what we can do is:
- take the baseline data
- compute the means of each class
- display the distance between each point and the mean of its class
I'm going to use "error" here to talk about the difference between a point and the mean of the class it is assigned to.
This should give us some idea of the structure of the classes. It's quite hard stuff to visualise, because there are 125 classes and 4 variables, so there's a lot of data to see.
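To pin down what "error" means here, the quantity can be written out as below. The symbols are just notation introduced for this note, not anything from the original analysis: x_{i,v} is the value of variable v at point i, c(i) is the class assigned to point i, and the sum runs over the N_k points of class k in whichever dataset the means are computed from.

```latex
\mu_{k,v} = \frac{1}{N_k} \sum_{j \,:\, c(j) = k} x_{j,v},
\qquad
e_{i,v} = \left| x_{i,v} - \mu_{c(i),v} \right|
```

So each point gets one error value per variable (four here), rather than a single combined distance.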
Computing Means and Differences
First, we want to add differences from class means to a set of datapoints, so we can do the comparisons:
#Using the data from last time, and the function below
means <- calculateClassMeans( origA )
#Read a file with 500 points from each class
data <- read.csv("data/Current500Strat.csv")
# and reformat it to match output data
data <- alterBaseline( data )
#For each point, look up the row of means for its class. match() keeps the rows in the
#same order as data, whereas merge() sorts its result by Class by default
pointMeans <- means[ match( as.character(data$Class), means$Class ), ]
#For each variable, make a column in the data with the difference between that point and the class mean
for( v in vars )
  data[[paste("DIFF",v,sep="_")]] <- abs( data[[v]] - pointMeans[[v]] )
So, now we have a data.frame with all the data, and the differences between each point and the mean.
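A quick sanity check that is worth doing at this point: the subtraction only makes sense if row i of pointMeans holds the means for the class of row i of data. The sketch below uses made-up data (pts and mu are stand-ins, not the real files) to show that merge() sorts its result by the key column by default, while a lookup with match() keeps the original row order.

```r
# Made-up points, deliberately not sorted by class
pts <- data.frame(Class = c("b", "a", "b"), x = c(10, 1, 12))
# Made-up class means
mu <- data.frame(Class = c("a", "b"), x = c(1, 11))

# merge() sorts its result by the key column by default
merged <- merge(data.frame(Class = pts$Class), mu, by = "Class", all.x = TRUE)

# match() gives, for each point, the row of mu that holds its class mean
lookedUp <- mu[match(pts$Class, mu$Class), ]

# TRUE only if the class on every row agrees with the original point
mergeAligned <- all(as.character(merged$Class) == as.character(pts$Class))
matchAligned <- all(as.character(lookedUp$Class) == as.character(pts$Class))

# Errors computed against correctly aligned means
err <- abs(pts$x - lookedUp$x)
```

If the real data happens to be sorted by class already, both approaches should agree, but the check is cheap enough to run anyway.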
Displaying Baseline Data
To display the baseline errors per class, pull out just the necessary data (this is also the function getDiffData() below), and send it in to ggplot using melt():
library(ggplot2)
library(reshape2) #for melt()
diffData <- data[,c("Class","newpointid","DIFF_AI_AGG_5M", "DIFF_PET_AGG_5M", "DIFF_TAB0_AGG_5", "DIFF_TSD_AGG_5M")]
diffDataM <- melt( diffData,id.vars=c("newpointid","Class" ) )
#this has "newpointid","Class","variable","value" columns, ready for ggplot
ggplot( diffDataM, aes(x=Class,y=value,alpha=0.002) ) + geom_point() +
  facet_grid( variable ~ ., scales="free_y" ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0))
This looks fairly reasonable for the baseline data. There are a few disconnected groups of points, but nothing which looks too strange:
Figure: Point differences from class means
Comparing with Predicted Data
Now, let's compare the baseline errors with the errors in the predicted data. Differences from the means of the predicted data are computed in exactly the same way as for the baseline data above. There are now two data.frames: "baseline" holds the baseline differences and "classified" holds the errors on the predicted data. These can be combined into a single data.frame:
baseDisp <- getDiffData( baseline )
baseDisp$Type <- "Baseline"
classDisp <- getDiffData( classified )
classDisp$Type <- "Classified"
displayData <- rbind( baseDisp, classDisp )
This can be fed into ggplot to make a comparative graph:
ggplot( melt( displayData,id.vars=c("newpointid","Class","Type" ) ), aes(x=Class,y=value,colour=Type) ) +
  #Plot points with jitter, to avoid overplotting
  geom_jitter(position=position_jitter(height=0,width=0.3),alpha=0.2,size=0.5) +
  #Rotate text so it is readable
  theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
  #Do a plot for each variable, stacked on top of each other, with a different y axis for each plot
  facet_grid( variable ~ ., scales="free_y" )
Figure: Comparison of error in baseline and future data (baseline points are red, predicted points are blue)
NOTE: It is really hard to avoid overplotting, and at the same time have visible points in sparse areas. Twiddling the alpha and size values above can make a huge difference. Unfortunately, it is very slow to plot, so experimentation is a bit painful.
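One way to make the experimentation less painful is to tune alpha and size on a subsample first. This is only a sketch: subsamplePerClass is a made-up helper, the toy data.frame stands in for displayData, and the choice of n is arbitrary.

```r
# Hypothetical helper: keep at most n rows per (Class, Type) group
subsamplePerClass <- function(d, n = 50) {
  groups <- split(seq_len(nrow(d)), list(d$Class, d$Type), drop = TRUE)
  keep <- unlist(lapply(groups, function(idx) {
    # sample.int avoids the surprise of sample() on a length-1 vector
    if (length(idx) > n) idx[sample.int(length(idx), n)] else idx
  }), use.names = FALSE)
  d[sort(keep), , drop = FALSE]
}

# Made-up stand-in for displayData: 3 classes, 100 rows of each Type per class
set.seed(1)
toy <- data.frame(Class = rep(1:3, each = 200),
                  Type = rep(c("Baseline", "Classified"), times = 300),
                  DIFF_X = runif(600))
small <- subsamplePerClass(toy, n = 20)
```

The result can be passed to melt() and ggplot in place of displayData while twiddling the settings. The final plot should still use the full data, since a subsample can easily hide the small stray clusters we are looking for.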
Interpretation
What can we tell from this?
- bars which are all red are classes which don't exist in the classified data, i.e. they disappear in the future (e.g. class 12)
- bars which are all blue mean that there is very low error in the baseline data, and more in the classified data. This is not particularly worrying on its own.
- bars which are tall have a lot of error. For example, classes 1 and 2 both have a lot of error in the baseline *and* the classified data for the first variable (but very little for the third!).
- bars which start red and turn blue indicate that there is more error in the classified data (e.g. class 40, variable 2). Again, this is not necessarily a problem - we probably expect an increase in error, especially since variables can get more extreme in the future, meaning that there are no classes that represent them accurately.
- bars where the red and blue are separated, however, are a big problem! This means that in the predictions, the class has much more error than in the baseline, and the more distinct the two groups are, the more it starts to feel like something strange is going on.
So from this, we can see lots of warning signs. The most pronounced is class 80 - the problem class from last week! - which has a huge increase in overall error, and some really strong clusters of points for variables 1 and 3. Also, 67 looks quite dodgy, as does 95, and 16 and 18 are a little bit worrying too.
So, although this is not a formal statistical analysis, if we'd done this in the first place, we would have caught the fact that some of the classes are behaving strangely in the predicted data, and saved a bunch of time in our analysis.
Now, when we try new classification methods, we can compare the error distribution, and have an immediate sense of whether anything funny is going on, and whether we are doing better than the original classification.
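To go with the visual check, a rough numeric screen can suggest which classes to look at first. This is only a sketch and not part of the original analysis: it compares the median error per class between the two datasets and lists the classes with the largest ratio. The name flagClasses and the threshold of 2 are arbitrary assumptions, and the toy data.frames stand in for the output of getDiffData().

```r
# Hypothetical helper: ratio of median error (classified / baseline) per class
flagClasses <- function(baseline, classified, diffVar, threshold = 2) {
  b <- vapply(split(baseline[[diffVar]], baseline$Class), median, numeric(1))
  k <- vapply(split(classified[[diffVar]], classified$Class), median, numeric(1))
  common <- intersect(names(b), names(k))   # classes present in both datasets
  ratio <- k[common] / b[common]
  # Drop undefined ratios (e.g. 0/0), keep classes above the threshold
  ratio <- ratio[!is.na(ratio) & ratio > threshold]
  sort(ratio, decreasing = TRUE)
}

# Made-up data: class "2" has much larger errors once classified
base <- data.frame(Class = rep(c("1", "2"), each = 4),
                   DIFF_X = c(1, 2, 3, 4, 1, 2, 3, 4))
cls  <- data.frame(Class = rep(c("1", "2"), each = 4),
                   DIFF_X = c(1, 2, 3, 5, 10, 20, 30, 40))
flagged <- flagClasses(base, cls, "DIFF_X")
```

A ratio of medians will miss a class where only a small cluster of points has gone astray, so this is a way to order the classes for inspection rather than a replacement for the plot.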
Code Snippets
Calculate the means of each class in the data.frame data. mvars is a list of variable names to compute the means for. (It's a bit ugly, and I'm sure there's a more R-ish way to do this, but it works.)
calculateClassMeans <- function( data, mvars=vars )
{
  data$Class <- factor(data$Class)
  means <- NULL
  for( c in levels(data$Class) )
    means <- rbind( means, colMeans( data[data$Class == c,mvars] ) )
  means <- as.data.frame( means )
  means$Class <- levels(data$Class)
  means
}
Get just the difference data, and plot a comparison between baseline and classified data:
compareDifferences <- function( baseline, classified )
{
  baseDisp <- getDiffData( baseline )
  baseDisp$Type <- "Baseline"
  classDisp <- getDiffData( classified )
  classDisp$Type <- "Classified"
  displayData <- rbind( baseDisp, classDisp )
  ggplot( melt( displayData,id.vars=c("newpointid","Class","Type" ) ), aes(x=Class,y=value,colour=Type) ) +
    geom_jitter(position=position_jitter(height=0,width=0.3),alpha=0.1,size=1.3) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
    facet_wrap( ~ variable, scales="free_y" )
}
getDiffData <- function(data)
{
  data[,c("Class","newpointid","DIFF_AI_AGG_5M", "DIFF_PET_AGG_5M", "DIFF_TAB0_AGG_5", "DIFF_TSD_AGG_5M")]
}
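On the class means function above: one possibly more R-ish option, offered as a sketch rather than a drop-in replacement, is base R's aggregate(). The name calculateClassMeans2 is made up here, and the toy data.frame stands in for the real data.

```r
# Hypothetical alternative to calculateClassMeans(), using aggregate()
calculateClassMeans2 <- function(data, mvars) {
  # One row per class, one column of means per variable in mvars
  aggregate(data[, mvars, drop = FALSE], by = list(Class = data$Class), FUN = mean)
}

# Made-up data with two classes and two variables
toy <- data.frame(Class = c(1, 1, 2, 2),
                  A = c(1, 3, 10, 20),
                  B = c(0, 2, 4, 8))
toyMeans <- calculateClassMeans2(toy, c("A", "B"))
```

One difference to be aware of: here the Class column keeps its original type, whereas the loop version stores it as the character levels of the factor, so any later lookup by class should compare like with like.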