.. _spatial_rdd_providers:

Spatial RDD Providers
---------------------

.. _accumulo_rdd_provider:

Accumulo RDD Provider
^^^^^^^^^^^^^^^^^^^^^

The ``AccumuloSpatialRDDProvider`` is a spatial RDD provider for Accumulo data stores. The core code is in
the ``geomesa-accumulo-spark`` module, and the shaded JARs-with-dependencies are available in the
``geomesa-accumulo-spark-runtime-accumulo1`` and ``geomesa-accumulo-spark-runtime-accumulo2`` modules.

.. note::

    The GeoMesa Spark runtime JARs are convenient bundles of all the required dependencies for each data
    store. There are two Accumulo Spark runtime JARs, one for Accumulo 1.x
    (``geomesa-accumulo-spark-runtime-accumulo1``) and one for Accumulo 2.x
    (``geomesa-accumulo-spark-runtime-accumulo2``). Make sure that you use the JAR corresponding to your
    Accumulo version.

This provider can read from and write to a GeoMesa ``AccumuloDataStore``. The configuration parameters are
the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`accumulo_parameters` for details.

The feature type to access in GeoMesa is specified as the type name of the ``Query`` passed to the
``rdd()`` method. For example, to load an ``RDD`` of features of type ``gdelt`` from the ``geomesa``
Accumulo table:

.. code-block:: scala

    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query
    import org.locationtech.geomesa.spark.GeoMesaSpark

    val params = Map(
      "accumulo.instance.id" -> "mycloud",
      "accumulo.user"        -> "user",
      "accumulo.password"    -> "password",
      "accumulo.zookeepers"  -> "zoo1,zoo2,zoo3",
      "accumulo.catalog"     -> "geomesa")

    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
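Since the provider can also write to an ``AccumuloDataStore``, an ``RDD`` of ``SimpleFeature``\ s can be
written back with the ``save()`` method. A minimal sketch, assuming the ``gdelt`` feature type already
exists in the target catalog:

.. code-block:: scala

    // write the RDD back out; this assumes the "gdelt" schema has already
    // been created in the data store identified by "params"
    GeoMesaSpark(params).save(rdd, params, "gdelt")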
.. _hbase_rdd_provider:

HBase RDD Provider
^^^^^^^^^^^^^^^^^^

The ``HBaseSpatialRDDProvider`` is a spatial RDD provider for HBase data stores. The core code is in the
``geomesa-hbase-spark`` module, and the shaded JARs-with-dependencies (which contain all the required
dependencies for execution) are available in the ``geomesa-hbase-spark-runtime-hbase1`` and
``geomesa-hbase-spark-runtime-hbase2`` modules.

.. note::

    The GeoMesa Spark runtime JARs are convenient bundles of all the required dependencies for each data
    store. There are two HBase Spark runtime JARs, one for HBase 1.x
    (``geomesa-hbase-spark-runtime-hbase1``) and one for HBase 2.x (``geomesa-hbase-spark-runtime-hbase2``).
    Make sure that you use the JAR corresponding to your HBase version.

This provider can read from and write to a GeoMesa ``HBaseDataStore``. The configuration parameters are the
same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`hbase_parameters` for details.

.. note::

    Connecting to HBase generally requires the ``hbase-site.xml`` file to be available on the Spark
    classpath. This may be accomplished by specifying it with ``--jars``. For example:

    .. code-block:: bash

        $ spark-shell --jars file:///opt/geomesa/dist/spark/geomesa-hbase-spark-runtime-hbase1_2.11-${VERSION}.jar,file:///usr/lib/hbase/conf/hbase-site.xml

    Alternatively, you may specify the zookeepers in the data store parameter map. However, this may not
    work for every HBase setup.

The feature type to access in GeoMesa is specified as the type name of the ``Query`` passed to the
``rdd()`` method. For example, to load an ``RDD`` of features of type ``gdelt`` from the ``geomesa``
HBase table:

.. code-block:: scala

    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query
    import org.locationtech.geomesa.spark.GeoMesaSpark

    val params = Map("hbase.zookeepers" -> "zoo1,zoo2,zoo3", "hbase.catalog" -> "geomesa")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

.. _fsds_rdd_provider:

FileSystem RDD Provider
^^^^^^^^^^^^^^^^^^^^^^^

The ``FileSystemRDDProvider`` is a spatial RDD provider for GeoMesa file system data stores. The core code
is in the ``geomesa-fs-spark`` module, and the shaded JAR-with-dependencies (which contains all the
required dependencies for execution) is available in the ``geomesa-fs-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``FileSystemDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`fsds_parameters` for details.

The feature type to access in GeoMesa is specified as the type name of the ``Query`` passed to the
``rdd()`` method. For example, to load an ``RDD`` of features of type ``gdelt`` from an S3 bucket:

.. code-block:: scala

    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query
    import org.locationtech.geomesa.spark.GeoMesaSpark

    val params = Map("fs.path" -> "s3a://mybucket/geomesa/datastore")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

See :ref:`fsds_sparksql_example` for an example of using SparkSQL with the FileSystem data store.

.. _converter_rdd_provider:

Converter RDD Provider
^^^^^^^^^^^^^^^^^^^^^^

The ``ConverterSpatialRDDProvider`` is provided by the ``geomesa-spark-converter`` module.

``ConverterSpatialRDDProvider`` reads features from one or more data files in formats readable by the
:ref:`converters` library, including delimited and fixed-width text, Avro, JSON, and XML files. It takes
the following configuration parameters:

* ``geomesa.converter`` - the converter definition as a Typesafe Config string
* ``geomesa.converter.inputs`` - input file paths, comma-delimited
* ``geomesa.sft`` - the ``SimpleFeatureType``, as a spec string, configuration string, or environment lookup name
* ``geomesa.sft.name`` - (optional) the name of the ``SimpleFeatureType``

Consider the example data described in the :ref:`convert_example_usage` section of the :ref:`converters`
documentation. If the file ``example.csv`` contains the example data, and ``example.conf`` contains the
Typesafe configuration file for the converter, the following Scala code can be used to load this data
into an ``RDD``:

.. code-block:: scala

    import com.typesafe.config.ConfigFactory
    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query
    import org.locationtech.geomesa.spark.GeoMesaSpark

    val exampleConf = ConfigFactory.load("example.conf").root().render()
    val params = Map(
      "geomesa.converter"        -> exampleConf,
      "geomesa.converter.inputs" -> "example.csv",
      "geomesa.sft"              -> "phrase:String,dtg:Date,geom:Point:srid=4326",
      "geomesa.sft.name"         -> "example")

    val query = new Query("example")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

It is also possible to load the prepackaged converters for public data sources (GDELT, GeoNames, etc.) via
Maven or SBT. See :ref:`prepackaged_converters` for more details.

.. warning::

    ``ConverterSpatialRDDProvider`` is read-only, and does not support writing features to data files.
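Because the provider is read-only, the loaded ``RDD`` is typically used as input to further Spark
transformations. The ``RDD`` contains GeoTools ``SimpleFeature``\ s, so standard Spark operations apply;
a minimal sketch, where the ``phrase`` attribute name comes from the ``SimpleFeatureType`` spec above:

.. code-block:: scala

    // the RDD elements are GeoTools SimpleFeatures, so ordinary Spark
    // transformations and actions can be applied directly
    val count = rdd.count()
    val phrases = rdd.map(_.getAttribute("phrase").toString).distinct().collect()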
.. _geotools_rdd_provider:

GeoTools RDD Provider
^^^^^^^^^^^^^^^^^^^^^

``GeoToolsSpatialRDDProvider`` is provided by the ``geomesa-gt-spark`` module.

``GeoToolsSpatialRDDProvider`` generates and saves ``RDD``\ s of features stored in a generic GeoTools
``DataStore``. The configuration parameters are the same as those passed to
``DataStoreFinder.getDataStore()`` to create the data store of interest, plus a required boolean parameter
named ``geotools`` that tells the SPI to load ``GeoToolsSpatialRDDProvider``. For example, to use the
`Postgis DataStore`_ with GeoMesa Spark, do the following:

.. code-block:: scala

    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query
    import org.locationtech.geomesa.spark.GeoMesaSpark

    val params = Map(
      "geotools" -> "true",
      "dbtype"   -> "postgis",
      "host"     -> "localhost",
      "user"     -> "postgres",
      "passwd"   -> "postgres",
      "port"     -> "5432",
      "database" -> "example")

    val query = new Query("locations")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

.. _Postgis DataStore: http://docs.geotools.org/stable/userguide/library/jdbc/postgis.html

The name of the feature type to access in the data store is passed as the type name of the ``Query``
passed to the ``rdd()`` method. In the example above, this is "locations".

.. warning::

    Do not use the GeoTools RDD provider with a GeoMesa data store that has its own provider
    implementation. The providers described above include additional optimizations to improve read and
    write performance.

If your data store supports it, use the ``save()`` method to save features:

.. code-block:: scala

    GeoMesaSpark(params).save(rdd, params, "locations")
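With any of the providers above, the ``Query`` passed to ``rdd()`` may also carry a filter to restrict the
features read. A minimal sketch against the Postgis example, assuming the ``locations`` schema has a
geometry attribute named ``geom``:

.. code-block:: scala

    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query
    import org.geotools.filter.text.ecql.ECQL

    // restrict the read to a bounding box; "geom" is assumed to be the name
    // of the geometry attribute in the "locations" schema
    val filter = ECQL.toFilter("BBOX(geom, -80, 35, -75, 40)")
    val filtered = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("locations", filter))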