Map-Reduce Ingest of GDELT

This tutorial shows how to:

  1. Ingest big GDELT data files into a GeoMesa Accumulo table via a Hadoop Map/Reduce job.
  2. Leverage GeoServer to query and visualize the data.

Prerequisites

Warning

You will need access to a Hadoop 2.2 or better installation as well as an Accumulo 1.7 or 1.8 database.

You will also need:

  • access to an Accumulo user that has both create-table and write permissions,
  • an instance of GeoServer 2.9.1 with the GeoMesa plugin installed,
  • Java JDK 8,
  • Apache Maven 3.2.2 or better, and
  • a git client.

Obtaining GDELT data

The GDELT Event database provides a comprehensive time- and location-indexed archive of events reported in broadcast, print, and web news media worldwide from 1979 to today. You can download raw GDELT data files at http://data.gdeltproject.org/events/index.html. For example, the following Bash commands use wget to download all event files for 2013 and 2014 (5 GB compressed, 30 GB uncompressed) and md5sum to verify each file against its MD5 hash:

$ mkdir gdelt && cd gdelt
$ wget http://data.gdeltproject.org/events/md5sums
$ for file in `cat md5sums | cut -d' ' -f3 | grep '^201[34]'` ; do wget http://data.gdeltproject.org/events/$file ; done
$ md5sum -c md5sums 2>&1 | grep '^201[34]'

Note: the whole GDELT data set is very large: over 100 GB compressed and over 1 TB uncompressed. We download all data for 2013-2014 because the Analyze section below looks at events in Ukraine during this period. You may load a smaller data set for demonstration purposes (a typical month is about 2 GB uncompressed), but you will see less data in the analysis below. In any case, omit the aggregated, reduced GDELT data in GDELT.MASTERREDUCEDV2.1979-2013.zip, which has fewer columns than the fully described event data in the other files and is not supported by the ingest code described below.
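
Where md5sum is not available, the same verification can be sketched in Java. This is a minimal illustration and not part of the tutorial code: the class Md5Check and its methods are hypothetical, and it assumes each line of md5sums has the usual md5sum layout of a hash followed by a file name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the tutorial code.
final class Md5Check {

    /** Returns the lowercase hex MD5 digest of everything read from the stream. */
    static String md5Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // MD5 ships with standard JDKs, so this is not expected in practice
            throw new IllegalStateException(e);
        }
    }

    /** Compares a stream against one "hash  filename" line taken from md5sums. */
    static boolean matches(String md5sumsLine, InputStream in) {
        String expected = md5sumsLine.trim().split("\\s+", 2)[0];
        return expected.equalsIgnoreCase(md5Hex(in));
    }
}
```

A caller would open each downloaded file, for example with Files.newInputStream, and pass it to matches together with the corresponding line from md5sums.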

In the directory with the *.zip data files, use the following command to unpack and aggregate them into a file called gdelt.tsv on the Hadoop file system:

$ (ls -1 *.zip | xargs -n 1 zcat) | hadoop fs -put - /gdelt/uncompressed/gdelt.tsv

Note: the individual files, when uncompressed, have *.csv extensions, but the data contained within them are tab-separated values, hence the *.tsv extension. See the GDELT raw data file documentation for more information on the format of these files.

Download and Build the Tutorial

Pick a reasonable directory on your machine, and run:

$ git clone https://github.com/geomesa/geomesa-tutorials.git
$ cd geomesa-tutorials

Note

You may need to download a particular release of the tutorials project to target a particular GeoMesa release. See About Tutorial Versions.

To build, run:

$ mvn clean install -pl geomesa-examples-gdelt

Note

Ensure that the versions of Accumulo, Hadoop, etc. in the root pom.xml match your environment.

Note

Depending on the version, you may also need to build GeoMesa locally. Instructions can be found under Installation.

Running the Ingest

Use hadoop jar to launch the Map/Reduce ingest job:

$ hadoop jar geomesa-examples-gdelt/target/geomesa-examples-gdelt-<version>.jar \
    com.example.geomesa.gdelt.GDELTIngest            \
    -instanceId <accumulo-instance-id>               \
    -zookeepers <zookeeper-hosts-string>             \
    -user <username> -password <password>            \
    -auths <comma-separated-authorization-string>    \
    -tableName gdelt -featureName event              \
    -ingestFile hdfs:///gdelt/uncompressed/gdelt.tsv

Note that authorizations are optional. Unless you know that your table already exists with explicit authorizations, or that it will be created with default authorizations, you probably want to omit this parameter.
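
To make the optional nature of -auths concrete, the sketch below collects "-name value" pairs into a map and requires every parameter from the command above except auths. It uses plain Java only; the class IngestArgs is hypothetical, and GDELTIngest's own argument handling may differ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; GDELTIngest's own argument handling may differ.
final class IngestArgs {

    // Parameters from the command line above; "auths" is deliberately absent.
    static final List<String> REQUIRED = Arrays.asList(
        "instanceId", "zookeepers", "user", "password",
        "tableName", "featureName", "ingestFile");

    /** Collects "-name value" pairs and rejects input that lacks a required one. */
    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Every option needs a value");
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected an option name, got: " + args[i]);
            }
            params.put(args[i].substring(1), args[i + 1]);
        }
        for (String name : REQUIRED) {
            if (!params.containsKey(name)) {
                throw new IllegalArgumentException("Missing required parameter: -" + name);
            }
        }
        return params;
    }
}
```

With this shape, code that connects to the data store can check params.containsKey("auths") and pass authorizations only when the user supplied them.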

DataStore Initialization

GeoTools uses a SimpleFeatureType to represent the schema for individual SimpleFeatures created from the GDELT data. We can easily create a schema for the GDELT SimpleFeatureType using the GeoTools DataUtilities class. The schema string is a comma-separated list of attribute descriptors of the form “name:class”, e.g. “Year:Integer”. Some attributes may have a third term with an appended “hint”, e.g. “geom:Point:srid=4326”, and the default geometry attribute is often prefixed with an asterisk. For example, a complete schema string for a SimpleFeatureType describing a city with a latitude/longitude point, a name, and a population might be “*geom:Point:srid=4326,cityname:String,population:Integer”.
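
The structure of a schema string can be seen by taking the city example apart with plain Java. This is only a sketch of the format; it is not how GeoTools parses specifications, and the class SpecSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the spec format; GeoTools does its own parsing.
final class SpecSketch {

    static final class Attribute {
        final String name;
        final String type;
        final String hint;               // empty when there is no third term
        final boolean isDefaultGeometry; // true when the name had a leading asterisk

        Attribute(String name, String type, String hint, boolean isDefaultGeometry) {
            this.name = name;
            this.type = type;
            this.hint = hint;
            this.isDefaultGeometry = isDefaultGeometry;
        }
    }

    /** Splits a comma-separated spec into its attribute descriptors. */
    static List<Attribute> parse(String spec) {
        List<Attribute> result = new ArrayList<>();
        for (String descriptor : spec.split(",")) {
            // At most three terms: name, class, and an optional hint
            String[] terms = descriptor.trim().split(":", 3);
            if (terms.length < 2) {
                throw new IllegalArgumentException("Expected name:class in: " + descriptor);
            }
            boolean isDefault = terms[0].startsWith("*");
            String name = isDefault ? terms[0].substring(1) : terms[0];
            String hint = terms.length == 3 ? terms[2] : "";
            result.add(new Attribute(name, terms[1], hint, isDefault));
        }
        return result;
    }
}
```

Parsing the city example yields three attributes, the first of which is the default geometry named geom with the hint srid=4326.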

This is how GDELTIngest.java creates the SimpleFeatureType for the GDELT event data:

static List<String> attributes = Lists.newArrayList(
    "GLOBALEVENTID:Integer",
    "SQLDATE:Date",
    "MonthYear:Integer",
    "Year:Integer",
    "FractionDate:Float",
    //...
    "*geom:Point:srid=4326"
);
String spec = Joiner.on(",").join(attributes);
SimpleFeatureType featureType = DataUtilities.createType(name, spec);

After we create the GDELT feature type, we have to tell GeoMesa which field to use for the date index. We specify this field using the SimpleFeatureType’s user data.

// This tells GeoMesa to use this attribute as the indexed date field
featureType.getUserData().put(SimpleFeatureTypes.DEFAULT_DATE_KEY, "SQLDATE");

Finally, we create the new feature type in GeoMesa as follows.

ds.createSchema(featureType);

Mapper

In the setup method of the Mapper class, we create a FeatureBuilder for the GDELT SimpleFeatureType we created in the initialization.

String featureName = context.getConfiguration().get(GDELTIngest.FEATURE_NAME);
SimpleFeatureType featureType = GDELTIngest.buildGDELTFeatureType(featureName);
featureBuilder = new SimpleFeatureBuilder(featureType);

The input to the map method is a single line of the GDELT TSV file. We split the line on tabs and extract the attributes of the data. We parse the latitude and longitude fields to set the default geometry of our SimpleFeature.

GeoTools provides common conversions for most data types and some date formats. However, any attribute strings that will not convert automatically into the specified class need to be explicitly set on the SimpleFeature. See “SQLDATE” below.

featureBuilder.reset();
featureBuilder.addAll(attributes);

double lat = Double.parseDouble(attributes[LATITUDE_COL_IDX]);
double lon = Double.parseDouble(attributes[LONGITUDE_COL_IDX]);
// JTS coordinates are (x, y), i.e. (longitude, latitude)
Geometry geom = geometryFactory.createPoint(new Coordinate(lon, lat));
SimpleFeature simpleFeature = featureBuilder.buildFeature(attributes[ID_COL_IDX]);
simpleFeature.setAttribute("SQLDATE", formatter.parse(attributes[DATE_COL_IDX]));
simpleFeature.setDefaultGeometry(geom);
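
The parsing steps can be tried without Hadoop or GeoTools. The sketch below splits a tab-separated line, parses the date, and reads the coordinate pair. The class GdeltLineSketch, the four-column layout, the column indices, and the yyyyMMdd date pattern are all assumptions made for illustration; the real indices and formatter are defined in the tutorial's mapper class.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical stand-in for the mapper's parsing logic; the indices are made up.
final class GdeltLineSketch {
    static final int ID_COL_IDX = 0;
    static final int DATE_COL_IDX = 1;
    static final int LATITUDE_COL_IDX = 2;
    static final int LONGITUDE_COL_IDX = 3;

    static final class Event {
        final String id;
        final Date date;
        final double lon;
        final double lat;

        Event(String id, Date date, double lon, double lat) {
            this.id = id;
            this.date = date;
            this.lon = lon;
            this.lat = lat;
        }
    }

    /** Returns the parsed event, or null if the line lacks a usable date or coordinate. */
    static Event parse(String line) {
        // A limit of -1 keeps trailing empty columns instead of dropping them
        String[] attributes = line.split("\t", -1);
        if (attributes.length <= LONGITUDE_COL_IDX) {
            return null;
        }
        try {
            // SimpleDateFormat is not thread-safe, so this sketch creates one per call
            SimpleDateFormat formatter = new SimpleDateFormat("yyyyMMdd");
            formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
            formatter.setLenient(false);
            Date date = formatter.parse(attributes[DATE_COL_IDX]);
            double lat = Double.parseDouble(attributes[LATITUDE_COL_IDX]);
            double lon = Double.parseDouble(attributes[LONGITUDE_COL_IDX]);
            return new Event(attributes[ID_COL_IDX], date, lon, lat);
        } catch (ParseException | NumberFormatException e) {
            // Empty or malformed fields end up here
            return null;
        }
    }
}
```

Returning null for rows without a usable coordinate is a choice made for this sketch. Whatever policy the mapper uses, Double.parseDouble throws a NumberFormatException on an empty string, so rows with blank latitude or longitude need explicit handling.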

We leverage the GeoMesaOutputFormat in order to write SimpleFeatures to Accumulo. Once we have created the SimpleFeature, all we have to do is write it to the output context:

context.write(new Text(), simpleFeature);

Analyze

GeoServer Setup

First, follow Installing GeoMesa Accumulo in GeoServer to set up the GeoMesa GeoServer plugin if you haven’t done so.

Register the GeoMesa DataStore with GeoServer

Click “Stores” and “Add new Store”. If you do not see the Accumulo Feature Data Store listed under Vector Data Sources, ensure the plugin is in the right directory and restart GeoServer.

Registering new Data Store

Register the newly created Accumulo table using the same parameters specified in the command line above. (If you use a workspace:layer name other than “geomesa:gdelt”, you will need to change the WMS requests that follow.)

Registering new Accumulo Feature Data Store

Publish layer

After registering the DataStore, click to publish the layer. You will be taken to the Edit Layer screen. In the Data pane, enter values for the bounding boxes. For the whole world, use [-180,-90,180,90]. In the Dimensions pane, enable Time and select the SQLDATE field. You will also need to specify a presentation for time - use List as a default.

Enable Time for the Layer

Query

Let’s look at events in Ukraine from January 1, 2013 until April 30, 2014. The red squares are the default styling that GeoServer applies to point data.

http://localhost:8080/geoserver/wms?service=WMS&version=1.1.0&request=GetMap&layers=geomesa:gdelt&styles=&bbox=31.6,44,37.4,47.75&width=1200&height=600&srs=EPSG:4326&format=application/openlayers&TIME=2013-01-01T00:00:00.000Z/2014-04-30T23:00:00.000Z

Showing all GDELT events from Jan 1, 2013 to April 30, 2014

The above map is using the Stamen Toner layer as a base layer. For more information about adding multiple layers into one group see the GeoServer documentation.
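
The TIME parameter in the request above is an interval written as start/end, with both ends as ISO 8601 timestamps in UTC. A minimal sketch of producing that value with java.time follows; the class WmsTime is hypothetical, and the millisecond formatting simply mirrors the URL above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of GeoServer or GeoMesa.
final class WmsTime {

    // Matches the timestamp layout used in the request above
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");

    /** Formats a start/end pair, both interpreted as UTC, as a WMS TIME interval. */
    static String interval(LocalDateTime startUtc, LocalDateTime endUtc) {
        if (endUtc.isBefore(startUtc)) {
            throw new IllegalArgumentException("End of interval precedes its start");
        }
        return FORMAT.format(startUtc) + "/" + FORMAT.format(endUtc);
    }
}
```

Calling WmsTime.interval(LocalDateTime.of(2013, 1, 1, 0, 0), LocalDateTime.of(2014, 4, 30, 23, 0)) reproduces the TIME value used in the request.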

Filter

Let’s narrow our results. GDELT labels events with CAMEO (Conflict and Mediation Event Observations) codes. The CAMEO code for events of type ‘THREATEN’ starts with ‘13’. We can filter down to these events by opening the options toolbar in GeoServer’s OpenLayers preview and entering a CQL filter such as EventRootCode=13, the same filter used in the WMS request below.

Open GeoServer Toggle Options Toolbar

Enter CQL Filter into Toolbar

Let’s use a custom icon to display THREATEN events by adding an SLD style to the layer. Add the SLD file threat.sld to GeoServer (see the GeoServer documentation for more information about adding SLD files). For the ExternalGraphic in the SLD to work, move the image file to the specified location in your GeoServer installation.

http://localhost:8080/geoserver/wms?service=WMS&version=1.1.0&request=GetMap&layers=geomesa:gdelt&CQL_FILTER=EventRootCode=13&styles=threat&bbox=31.6,44,37.4,47.75&width=1200&height=600&srs=EPSG:4326&format=application/openlayers&TIME=2013-01-01T00:00:00.000Z/2014-04-30T23:00:00.000Z

Showing GDELT events with CAMEO root code THREATEN from Jan 1, 2013 to April 30, 2014

Heatmaps

Use a heatmap to visualize more clearly multiple events at the same location, or a high volume of data in general.

Note

The heatmap style requires that geomesa-process-wps be installed in your GeoServer, as described in GeoMesa Process.

Add the SLD file heatmap.sld to GeoServer.

In the request below, the heatmap layer is listed before the points layer so that the points are drawn on top of the heatmap rather than hidden beneath it. Notice the “&env=radiusPixels:30” in the URL; this is SLD variable substitution, which overrides the default value assigned in the SLD.

http://localhost:8080/geoserver/wms?service=WMS&version=1.1.0&request=GetMap&layers=geomesa:gdelt,geomesa:gdelt&CQL_FILTER=include;EventRootCode=13&styles=heatmap,threat&bbox=31.6,44,37.4,47.75&width=1200&height=600&srs=EPSG:4326&format=application/openlayers&TIME=2013-01-01T00:00:00.000Z/2014-04-30T23:00:00.000Z&env=radiusPixels:30

Showing heatmap with event overlay of GDELT events with CAMEO root code THREATEN from Jan 1, 2013 to April 30, 2014
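
The layers, styles, and CQL_FILTER parameters in the heatmap request are parallel lists: the n-th style and the n-th filter apply to the n-th layer. Layers and styles are separated by commas, filters by semicolons, and layers are drawn in the order listed. The sketch below builds all three values from a single sequence of entries so that they cannot drift out of step. The class LayerStack is hypothetical and does no URL encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it performs no URL encoding.
final class LayerStack {
    private final List<String> layers = new ArrayList<>();
    private final List<String> styles = new ArrayList<>();
    private final List<String> filters = new ArrayList<>();

    /** Adds a layer drawn on top of those already added; "include" keeps all features. */
    LayerStack add(String layer, String style, String cqlFilter) {
        layers.add(layer);
        styles.add(style);
        filters.add(cqlFilter);
        return this;
    }

    /** Returns the three aligned query parameters, joined with '&'. */
    String toQuery() {
        return "layers=" + String.join(",", layers)
            + "&CQL_FILTER=" + String.join(";", filters)
            + "&styles=" + String.join(",", styles);
    }
}
```

For example, new LayerStack().add("geomesa:gdelt", "heatmap", "include").add("geomesa:gdelt", "threat", "EventRootCode=13").toQuery() yields the same three parameters as the request. Filters that contain spaces or quotes would need to be URL-encoded before use.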
