OpenFlights and MongoDB

Error message

Notice: Undefined index: en in drutex_node_view() (line 85 of /home/davemr/mo-seph.com/sites/all/modules/drutex/drutex.module).

I often find that I want to do more advanced queries of airports/cities than search engines allow, e.g. find me large cities which I can fly to from both these places and not have to travel more than 50km at the end. I'm using this as an excuse to learn:

  • Scala - it sounds cool, the concurrency looks really nice, could work nicely for agent interaction and it's JVM based so I feel at home
  • MongoDB - seems interesting, has geospatial support
  • Lift - it's about time I got into a web framework again (since handrolling stuff in perl years ago...)

I found the very exciting OpenFlight data, which should form the main body of data to query.

Importing data into MongoDB[edit | edit source]

OpenFlights provides .dat files, which are CSV files with all the info in. I've added headers to these in line with their description, e.g.

ID, Name, City, Country, IATA-FAA, ICAO, Latitude, Longitude, Altitude, Timezone, DST
1,"Goroka","Goroka","Papua New Guinea","GKA","AYGA",-6.081689,145.391881,5282,10,"U"
2,"Madang","Madang","Papua New Guinea","MAG","AYMD",-5.207083,145.7887,20,10,"U"

Assuming that you have mongodb running, these can be loaded into collections as follows:

>mongoimport -d airportdb -c airports --type csv --headerline airports.dat then

> db.airports.find().limit(2).forEach( printjson )
{
 "_id" : ObjectId("4d933b7f732fd2527b726f6c"),
 "ID" : 1,
 "Name" : "Goroka",
 "City" : "Goroka",
 "Country" : "Papua New Guinea",
 "IATA-FAA" : "GKA",
 "ICAO" : "AYGA",
 "Latitude" : -6.081689,
 "Longitude" : 145.391881,
 "Altitude" : 5282,
 "Timezone" : 10,
 "DST" : "U"
}
 
...

I've now got 3 databases: airports, airlines and routes.

In order to use the geospatial side of MongoDB, we need each object to have a subobject with lat/long in. This turns out to be a little more involved than I was expecting, but a good way to get the hang of map/reduce:

m = function() { emit( this._id, this ) }
r = function(key, values){ values[0].loc = { lat: values[0].Latitude, lon: values[0].Longitude }; return values[0] }
db.airports.mapReduce( m, r, {out: "airports"});

This is a very simple version of map reduce, where:

  • the map function just returns the whole object
  • the reduce function adds the subfield, and then returns the whole object
  • mapReduce uses both of these, and stores the result in the collection "airports", which is where it came from. This overwrites the original data with the original data + additions.

Slightly convoluted, but it does the job. Also, since this is only really happening on import, speed isn't too much of an issue.

  • UPDATE* This returns objects as { key: ObjectID, value: (actual object) }, so it's not a direct replacement for the original collection. I could put it into a temporary collection and iterate (we're not supposed to do db access in map/reduce any more). But, in the spirit of learning, here's an alternative approach:

makeCoord = function(e) { e.loc={ lat: e.Latitude, lon: e.Longitude }; db.airports.save( e ); }
db.airports.find().forEach( makeCoord )

This feels a bit closer to what I originally had in mind, but it seems to be slower; I think it has to pull every object in, modify it and send it back. Now I can find the airports close to Edinburgh:

edin = { lat: 55.57, lon:-3.13 }
db.airports.find( { loc: { $near : edin, $maxDistance:1 } } )
{ "_id" : ObjectId("4d9359ed732fd2527b74a278"), "ID" : 7388, "Name" : "Edinburgh Waverly Station", "City" : "Edinburgh", "Country" : "United Kingdom", "IATA-FAA" : "", "ICAO" : "\\N", "Latitude" : 55.952, "Longitude" : -3.189, "Altitude" : 0, "Timezone" : 0, "DST" : "U", "field11" : 75.1487722, "field12" : 1, "field13" : -5, "field14" : "A", "field15" : 389, "field16" : -55.829722, "field17" : 1873, "field18" : -3, "field19" : "U", "loc" : { "lat" : 55.952, "lon" : -3.189 } }
{ "_id" : ObjectId("4d9359ed732fd2527b748d3f"), "ID" : 535, "Name" : "Edinburgh", "City" : "Edinburgh", "Country" : "United Kingdom", "IATA-FAA" : "EDI", "ICAO" : "EGPH", "Latitude" : 55.95, "Longitude" : -3.3725, "Altitude" : 135, "Timezone" : 0, "DST" : "E", "loc" : { "lat" : 55.95, "lon" : -3.3725 } }

OK, so the database seems to contain Waverley station, but it's generally working.

Now we can start getting into actual data manipulation. First, I can make a collection of destination arrays - the airports you can reach from each airport listed:

makeDest = function (source) {
 var dst = db.routes.distinct("Destination", {Source:source});
 var o = {Source:source, Dests:dst};
 db.destinations.save(o);
}
db.routes.distinct( "Source" ).forEach( makeDest )

This creates a collection of objects with an airport name and an array of destinations reachable. Given this, I can compute the airports which you can reach from both startpoints:

fromBoth = function (a, b) {
 ad = db.destinations.findOne({Source:a}).Dests;
 bd = db.destinations.findOne({Source:b}).Dests;
 return ad.filter(function (n) {return bd.indexOf(n) != -1;});
}
fromBoth( "EDI", "CPH" );
[
 "AGP",
 "ALC",
 "AMS",
...

This is quite basic, but it's a good start. Next, I'll look at putting this into a Lift webapp...

Project Type: