11.7. GeoMesa PySpark

GeoMesa provides integration with the Spark Python API for accessing data in GeoMesa data stores.

11.7.1. Prerequisites

  • Spark 2.4.x, 3.0.x or 3.1.x should be installed.

  • Python 2.7 or 3.x should be installed.

  • pip or pip3 should be installed.

  • conda-pack is optional.

11.7.2. Installation

The geomesa_pyspark package is not available for download. Build the artifact locally with the profile -Ppython. Then install using pip or pip3 as below. You will also need an appropriate geomesa-spark-runtime JAR. We assume the use of Accumulo here, but you may alternatively use any of the providers outlined in Spatial RDD Providers.

mvn clean install -Ppython
pip3 install geomesa-spark/geomesa_pyspark/target/geomesa_pyspark-$VERSION.tar.gz
cp  geomesa-accumulo/geomesa-accumulo-spark-runtime-accumulo2/target/geomesa-accumulo-spark-runtime-accumulo2_${VERSION}.jar /path/to/

Alternatively, you can use conda-pack to bundle the dependencies for your project. This may be more appropriate if you have additional dependencies.

export ENV_NAME=geomesa-pyspark

conda create --name $ENV_NAME -y python=3.7
conda activate $ENV_NAME

pip install geomesa-spark/geomesa_pyspark/target/geomesa_pyspark-$VERSION.tar.gz
# Install additional dependencies using conda or pip here

conda pack -o environment.tar.gz
cp geomesa-accumulo/geomesa-accumulo-spark-runtime-accumulo2/target/geomesa-accumulo-spark-runtime-accumulo2_${VERSION}.jar /path/to/

Warning

conda-pack currently has issues with Python 3.8, and pyspark has issues with Python 3.9, hence the explicit use of Python 3.7

11.7.3. Using GeoMesa PySpark

You may then access Spark using a Yarn master by default. Importantly, because of the way the geomesa_pyspark library interacts with the underlying Java libraries, you must set up the GeoMesa configuration before referencing the pyspark library.

import geomesa_pyspark
conf = geomesa_pyspark.configure(
    jars=['/path/to/geomesa-accumulo-spark-runtime-accumulo2_${VERSION}.jar'],
    packages=['geomesa_pyspark','pytz'],
    spark_home='/path/to/spark/').\
    setAppName('MyTestApp')

conf.get('spark.master')
# u'yarn'

from pyspark.sql import SparkSession

spark = ( SparkSession
    .builder
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)

Alternatively, if you used conda-pack then you do not need to set up the GeoMesa configuration as above, but you must start pyspark or your application as follows, updating paths as required:

PYSPARK_DRIVER_PYTHON=/opt/anaconda3/envs/$ENV_NAME/bin/python PYSPARK_PYTHON=./environment/bin/python pyspark \
--jars /path/to/geomesa-accumulo-spark-runtime_${VERSION}.jar \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn --deploy-mode client --archives environment.tar.gz#environment

At this point you are ready to create a dict of connection parameters to your Accumulo data store and get a spatial data frame.

params = {
    "accumulo.instance.id": "myInstance",
    "accumulo.zookeepers": "zoo1,zoo2,zoo3",
    "accumulo.user": "user",
    "accumulo.password": "password",
    "accumulo.catalog": "myCatalog"
}
feature = "mySchema"
df = ( spark
    .read
    .format("geomesa")
    .options(**params)
    .option("geomesa.feature", feature)
    .load()
)

df.createOrReplaceTempView("tbl")
spark.sql("show tables").show()

# Count features in a bounding box.
spark.sql("""
select count(*)
from tbl
where st_contains(st_makeBBOX(-72.0, 40.0, -71.0, 41.0), geom)
""").show()

GeoMesa PySpark can also be used in the absence of a GeoMesa data store. Registering user-defined types and functions can be done manually by invoking geomesa_pyspark.init_sql() on the Spark session object:

geomesa_pyspark.init_sql(spark)

You can terminate the Spark job on YARN using spark.stop().

11.7.4. Jupyter

To use the geomesa_pyspark package within Jupyter, you only needs a Python2 or Python3 kernel, which is provided by default. Substitute the appropriate Spark home and runtime JAR paths in the above code blocks. Be sure the GeoMesa Accumulo client and server side versions match, as described in Installing GeoMesa Accumulo.