OpenFlights and MongoDB

I often find that I want to do more advanced queries of airports/cities than search engines allow, e.g. find me large cities which I can fly to from both these places and not have to travel more than 50km at the end. I'm using this as an excuse to learn:

  • Scala - it sounds cool, the concurrency looks really nice, could work nicely for agent interaction and it's JVM based so I feel at home
  • MongoDB - seems interesting, has geospatial support
  • Lift - it's about time I got into a web framework again (since handrolling stuff in perl years ago...)

I found the very exciting OpenFlight data, which should form the main body of data to query.

Importing data into MongoDB

OpenFlights provides .dat files, which are CSV files with all the info in. I've added headers to these in line with their description, e.g.

ID, Name, City, Country, IATA-FAA, ICAO, Latitude, Longitude, Altitude, Timezone, DST
1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.081689,145.391881,5282,10,"U"
2,"Madang","Madang","Papua New Guinea","MAG","AYMD",-5.207083,145.7887,20,10,"U"

Assuming that you have mongodb running, these can be loaded into collections as follows:

>mongoimport -d airportdb -c airports --type csv --headerline airports.dat then

> db.airports.find().limit(2).forEach( printjson )
{
	"_id" : ObjectId("4d933b7f732fd2527b726f6c"),
	"ID" : 1,
	"Name" : "Goroka",
	"City" : "Goroka",
	"Country" : "Papua New Guinea",
	"IATA-FAA" : "GKA",
	"ICAO" : "AYGA",
	"Latitude" : -6.081689,
	"Longitude" : 145.391881,
	"Altitude" : 5282,
	"Timezone" : 10,
	"DST" : "U"
}
 
...

I've now got 3 databases: airports, airlines and routes.

In order to use the geospatial side of MongoDB, we need each object to have a subobject with lat/long in. This turns out to be a little more involved than I was expecting, but a good way to get the hang of map/reduce:

m = function() { emit( this._id, this ) }
r = function(key, values){ values[0].loc = { lat: values[0].Latitude, lon: values[0].Longitude }; return values[0] }
db.airports.mapReduce( m, r, {out: "airports"});

This is a very simple version of map reduce, where:

  • the map function just returns the whole object
  • the reduce function adds the subfield, and then returns the whole object
  • mapReduce uses both of these, and stores the result in the collection "airports", which is where it came from. This overwrites the original data with the original data + additions.

Slightly convoluted, but it does the job. Also, since this is only really happening on import, speed isn't too much of an issue.

  • UPDATE* This returns objects as { key: ObjectID, value: (actual object) }, so it's not a direct replacement for the original collection. I could put it into a temporary collection and iterate (we're not supposed to do db access in map/reduce any more). But, in the spirit of learning, here's an alternative approach:

makeCoord = function(e) { e.loc={ lat: e.Latitude, lon: e.Longitude }; db.airports.save( e ); }
db.airports.find().forEach( makeCoord )

This feels a bit closer to what I originally had in mind, but it seems to be slower; I think it has to pull every object in, modify it and send it back. Now I can find the airports close to Edinburgh:

edin = { lat: 55.57, lon:-3.13 }
db.airports.find( { loc: { $near : edin, $maxDistance:1 } } )
{ "_id" : ObjectId("4d9359ed732fd2527b74a278"), "ID" : 7388, "Name" : "Edinburgh Waverly Station", "City" : "Edinburgh", "Country" : "United Kingdom", "IATA-FAA" : "", "ICAO" : "\N", "Latitude" : 55.952, "Longitude" : -3.189, "Altitude" : 0, "Timezone" : 0, "DST" : "U", "field11" : 75.1487722, "field12" : 1, "field13" : -5, "field14" : "A", "field15" : 389, "field16" : -55.829722, "field17" : 1873, "field18" : -3, "field19" : "U", "loc" : { "lat" : 55.952, "lon" : -3.189 } }
{ "_id" : ObjectId("4d9359ed732fd2527b748d3f"), "ID" : 535, "Name" : "Edinburgh", "City" : "Edinburgh", "Country" : "United Kingdom", "IATA-FAA" : "EDI", "ICAO" : "EGPH", "Latitude" : 55.95, "Longitude" : -3.3725, "Altitude" : 135, "Timezone" : 0, "DST" : "E", "loc" : { "lat" : 55.95, "lon" : -3.3725 } }

OK, so the database seems to contain Waverley station, but it's generally working.


Now we can start getting into actual data manipulation. First, I can make a collection of destination arrays - the airports you can reach from each airport listed:

makeDest = function (source) {
    var dst = db.routes.distinct("Destination", {Source:source});
    var o = {Source:source, Dests:dst};
    db.destinations.save(o);
}
db.routes.distinct( "Source" ).forEach( makeDest )

This creates a collection of objects with an airport name and an array of destinations reachable. Given this, I can compute the airports which you can reach from both startpoints:

fromBoth = function (a, b) {
    ad = db.destinations.findOne({Source:a}).Dests;
    bd = db.destinations.findOne({Source:b}).Dests;
    return ad.filter(function (n) {return bd.indexOf(n) != -1;});
}
fromBoth( "EDI", "CPH" );
[
	"AGP",
	"ALC",
	"AMS",
...

This is quite basic, but it's a good start. Next, I'll look at putting this into a Lift webapp...

Powered by Drupal