GeoMesa Spark provides capabilities to run geospatial analysis jobs on the distributed, large-scale data processing engine Apache Spark. It provides interfaces for Spark to ingest and analyze geospatial data stored in GeoMesa data stores.
GeoMesa provides Spark integration at several different levels. At the lowest level
geomesa-spark-jts module (see Spark JTS), which contains user-defined spatial types
and functions. This module can easily be included in other projects that want to
work with geometries in Spark, as it only depends on the JTS library.
geomesa-spark-core module (see Spark Core) is an extension for Spark that takes
Query objects as input and produces resilient distributed datasets
RDDs) containing serialized versions of SimpleFeatures. Multiple
backends that target different types of feature stores are available,
including ones for Accumulo, HBase, FileSystem, Kudu, files readable by the GeoMesa Convert library,
and any generic GeoTools
geomesa-spark-sql module (see SparkSQL) builds on top of the core module
to convert between
DataFrames. GeoMesa SparkSQL pushes down
filtering logic from SQL queries and converts them into GeoTools
which are then passed to the
GeoMesaSpark object provided by GeoMesa Spark Core.
Finally, bindings are provided for integration with the Spark Python API. See GeoMesa PySpark for details.
A stack composed of a distributed data store such as Accumulo, GeoMesa, the GeoMesa Spark libraries, Spark, and the Jupyter interactive notebook application (see above) provides a complete large-scale geospatial data analysis platform.
See GeoMesa Spark: Basic Analysis for a tutorial on analyzing data with GeoMesa Spark.