10.8. Deploying GeoMesa Spark with Jupyter Notebook

Jupyter Notebook is a web-based application for creating interactive documents containing runnable code, visualizations, and text. Via the Apache Toree kernel, Jupyter can be used for preparing spatio-temporal analyses in Scala and submitting them in Spark. The guide below describes how to configure Jupyter with Spark 2.2.x, 2.3.x or 2.4.x, Scala 2.11, and GeoMesa.

Note

GeoMesa support for PySpark provides access to GeoMesa Accumulo data stores through the Spark Python API using Jupyter’s built in Python kernel. See GeoMesa PySpark.

10.8.1. Prerequisites

Spark 2.2.x, 2.3.x or 2.4.x should be installed, and the environment variable SPARK_HOME should be set. Spark 2.0 and above requires Scala version 2.11.

Python 2.7 or 3.x should be installed, along with the Python development tools. For example, for Python 2.7 on Ubuntu:

$ sudo apt-get install build-essential python-dev

For Python 2.7 on CentOS:

$ sudo yum groupinstall 'Development Tools'
$ sudo yum install python-devel

Building the Toree kernel for Spark 2.2.x, 2.3.x or 2.4.x requires Git, GNU Make, Docker, and SBT.

10.8.2. Installing Jupyter

Jupyter may be installed via pip (for Python 2.7) or pip3 (for Python 3.x):

$ pip install --upgrade jupyter

or

$ pip3 install --upgrade jupyter

10.8.3. Installing the Toree Kernel

Toree version 0.2.0-dev1 supports Spark 2.0 and above, and should be built from source as described via its README file. Check out the project from GitHub:

$ git clone https://github.com/apache/incubator-toree.git
$ cd incubator-toree
$ git checkout v0.2.0-dev1

Build the project using make; as described in Prerequisites, Docker and SBT are required:

$ make clean
$ make release

The built Toree distribution can then be installed via pip (or pip3):

$ pip install --upgrade ./dist/toree-pip/toree-0.2.0.dev1.tar.gz
$ pip freeze | grep toree
toree==0.2.0.dev1

10.8.4. Configure Toree and GeoMesa

If you have the GeoMesa Accumulo distribution installed at GEOMESA_ACCUMULO_HOME as described in Setting up the Accumulo Command Line Tools, you can run the following example script to configure Toree with GeoMesa version VERSION:

#!/bin/sh

# bundled GeoMesa Accumulo Spark and Spark SQL runtime JAR
# (contains geomesa-accumulo-spark, geomesa-spark-core, geomesa-spark-sql, and dependencies)
jars="file://$GEOMESA_ACCUMULO_HOME/dist/spark/geomesa-accumulo-spark-runtime_2.11-$VERSION.jar"

# uncomment to use the converter RDD provider
#jars="$jars,file://$GEOMESA_ACCUMULO_HOME/lib/geomesa-spark-converter_2.11-$VERSION.jar"

# uncomment to work with shapefiles (requires $GEOMESA_ACCUMULO_HOME/bin/install-jai.sh)
#jars="$jars,file://$GEOMESA_ACCUMULO_HOME/lib/jai_codec-1.1.3.jar"
#jars="$jars,file://$GEOMESA_ACCUMULO_HOME/lib/jai_core-1.1.3.jar"
#jars="$jars,file;//$GEOMESA_ACCUMULO_HOME/lib/jai_imageio-1.1.jar"

jupyter toree install \
    --replace \
    --user \
    --kernel_name "GeoMesa Spark $VERSION" \
    --spark_home=${SPARK_HOME} \
    --spark_opts="--master yarn --jars $jars"

Note

The JARs specified will be in the respective target directory of each module of the source distribution if you built GeoMesa from source.

You may also consider adding geomesa-tools-2.11-$VERSION-data.jar to include prepackaged converters for publicly available data sources (as described in Prepackaged Converter Definitions), geomesa-jupyter-leaflet-2.11-$VERSION.jar to include an interface for the Leaflet spatial visualization library (see Leaflet for Visualization, next), and/or geomesa-jupyter-vegas-2.11-$VERSION.jar to use the Vegas data plotting library (see Vegas for Plotting, below).

10.8.5. Leaflet for Visualization

The following sample notebook shows how you can use Leaflet for data visualization:

classpath.addRepository("http://download.osgeo.org/webdav/geotools")
classpath.addRepository("http://central.maven.org/maven2")
classpath.addRepository("https://repo.locationtech.org/content/repositories/geomesa-releases")
classpath.addRepository("file:///home/username/.m2/repository")
classpath.add("org.locationtech.jts" % "jts" % "1.13")
classpath.add("org.locationtech.geomesa" % "geomesa-accumulo-datastore" % "1.3.0")
classpath.add("org.apache.accumulo" % "accumulo-core" % "1.6.4")
classpath.add("org.locationtech.geomesa" % "geomesa-jupyter" % "1.3.0")

import org.locationtech.geomesa.jupyter.Jupyter._

implicit val displayer: String => Unit = display.html(_)

import scala.collection.JavaConversions._
import org.locationtech.geomesa.accumulo.data.AccumuloDataStoreParams._
import org.locationtech.geomesa.utils.geotools.Conversions._

val params = Map(
        ZookeepersParam.key -> "ZOOKEEPERS",
        InstanceIdParam.key -> "INSTANCE",
        UserParam.key       -> "USER_NAME",
        PasswordParam.key   -> "USER_PASS",
        CatalogParam.key    -> "CATALOG")

val ds = org.geotools.data.DataStoreFinder.getDataStore(params)
val ff = org.geotools.factory.CommonFactoryFinder.getFilterFactory2
val fs = ds.getFeatureSource("twitter")

val filt = ff.and(
    ff.between(ff.property("dtg"), ff.literal("2016-01-01"), ff.literal("2016-05-01")),
    ff.bbox("geom", -80, 37, -75, 40, "EPSG:4326"))
val features = fs.getFeatures(filt).features.take(10).toList

displayer(L.render(Seq(WMSLayer(name="ne_10m_roads",namespace="NAMESPACE"),
                       Circle(-78.0,38.0,1000,  StyleOptions(color="yellow",fillColor="#63A",fillOpacity=0.5)),
                       Circle(-78.0,45.0,100000,StyleOptions(color="#0A5" ,fillColor="#63A",fillOpacity=0.5)),
                       SimpleFeatureLayer(features)
                      )))
../../_images/jupyter-leaflet.png

10.8.5.1. Adding Layers to a Map and Displaying in the Notebook

The following snippet is an example of rendering dataframes in Leaflet in a Jupyter notebook:

implicit val displayer: String => Unit = { s => kernel.display.content("text/html", s) }

val function = """
function(feature) {
  switch (feature.properties.plane_type) {
    case "A388": return {color: "#1c2957"}
    default: return {color: "#cdb87d"}
  }
}
"""

val sftLayer = time { L.DataFrameLayerNonPoint(flights_over_state, "__fid__", L.StyleOptionFunction(function)) }
val apLayer = time { L.DataFrameLayerPoint(flyovers, "origin", L.StyleOptions(color="#1c2957", fillColor="#cdb87d"), 2.5) }
val stLayer = time { L.DataFrameLayerNonPoint(queryOnStates, "ST", L.StyleOptions(color="#1c2957", fillColor="#cdb87d", fillOpacity= 0.45)) }
displayer(L.render(Seq[L.GeoRenderable](sftLayer,stLayer,apLayer),zoom = 1, path = "path/to/files"))
../../_images/jupyter-leaflet-layer.png

10.8.5.2. StyleOptionFunction

This case class allows you to specify a Javascript function to perform styling. The anonymous function that you will pass takes a feature as an argument and returns a Javascript style object. An example of styling based on a specific property value is provided below:

function(feature) {
  switch(feature.properties.someProp) {
    case "someValue": return { color: "#ff0000" }
    default         : return { color: "#0000ff" }
  }
}

The following table provides options that might be of interest:

Option Type Description
color String Stroke color
weight Number Stroke width in pixels
opacity Number Stroke opacity
fillColor String Fill color
fillOpacity Number Fill opacity

Note: Options are comma-separated (i.e. { color: "#ff0000", fillColor: "#0000ff" })

10.8.6. Vegas for Plotting

The Vegas library may be used with GeoMesa, Spark, and Toree in Jupyter to plot quantitative data. The geomesa-jupyter-vegas module builds a shaded JAR containing all of the dependencies needed to run Vegas in Jupyter+Toree. This module must be built from source, using the vegas profile:

$ mvn clean install -Pvegas -pl geomesa-jupyter/geomesa-jupyter-vegas

This will build geomesa-jupyter-vegas_2.11-$VERSION.jar in the target directory of the module, and should be added to the list of JARs in the jupyter toree install command described in Configure Toree and GeoMesa:

jars="$jars,file:///path/to/geomesa-jupyter-vegas_2.11-$VERSION.jar"
# then continue with "jupyter toree install" as before

To use Vegas within Jupyter, load the appropriate libraries and a displayer:

import vegas._
import vegas.render.HTMLRenderer._
import vegas.sparkExt._

implicit val displayer: String => Unit = { s => kernel.display.content("text/html", s) }

Then use the withDataFrame method to plot data in a DataFrame:

Vegas("Simple bar chart").
  withDataFrame(df).
  encodeX("a", Ordinal).
  encodeY("b", Quantitative).
  mark(Bar).
  show(displayer)

10.8.7. Running Jupyter

For public notebooks, you should configure Jupyter to use a password and bind to a public IP address. To run Jupyter with the GeoMesa Spark kernel:

$ jupyter notebook

Note

Long-lived processes should probably be hosted in screen, systemd, or supervisord.

Your notebook server should launch and be accessible at http://localhost:8888/ (or the address and port you bound the server to).