.. _spatial_rdd_providers:

Spatial RDD Providers
---------------------

.. _accumulo_rdd_provider:

Accumulo RDD Provider
^^^^^^^^^^^^^^^^^^^^^

The ``AccumuloSpatialRDDProvider`` is a spatial RDD provider for Accumulo data stores. The core code is in
the ``geomesa-accumulo-spark`` module, and the shaded JAR-with-dependencies (which contains all the required
dependencies for execution) is available in the ``geomesa-accumulo-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``AccumuloDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`accumulo_parameters` for details.

The feature type to access is given by the type name of the ``Query`` passed
to the ``rdd()`` method. For example, to load an ``RDD`` of features of type ``gdelt``
from the ``geomesa`` Accumulo catalog table:

.. code-block:: scala

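    import org.apache.hadoop.conf.Configuration
    import org.geotools.data.Query // org.geotools.api.data.Query in GeoTools 30 and later
    import org.locationtech.geomesa.spark.GeoMesaSpark

    // `sc` is an existing SparkContext, e.g. the one provided by spark-shell
    val params = Map(
      "accumulo.instance.id" -> "mycloud",
      "accumulo.user"        -> "user",
      "accumulo.password"    -> "password",
      "accumulo.zookeepers"  -> "zoo1,zoo2,zoo3",
      "accumulo.catalog"     -> "geomesa")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)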
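
Because the parameter map is plain string data, it can be checked before a Spark job is
launched. The following is a minimal sketch, not part of GeoMesa: the ``AccumuloParams``
object and its ``missing`` helper are hypothetical names, and the key list simply mirrors
the keys used in the example above rather than an authoritative list of required parameters.

.. code-block:: scala

    // Hypothetical helper, not part of GeoMesa
    object AccumuloParams {

      // Keys taken from the example above (an assumption, not an official list)
      val ExampleKeys: Seq[String] = Seq(
        "accumulo.instance.id",
        "accumulo.user",
        "accumulo.password",
        "accumulo.zookeepers",
        "accumulo.catalog")

      // Returns the example keys that are absent from the map or have a blank value
      def missing(params: Map[String, String]): Seq[String] =
        ExampleKeys.filterNot(key => params.get(key).exists(_.trim.nonEmpty))
    }

An empty result from ``AccumuloParams.missing(params)`` means every key from the example is
present with a non-blank value; it says nothing about whether the values are correct.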

.. _hbase_rdd_provider:

HBase RDD Provider
^^^^^^^^^^^^^^^^^^

The ``HBaseSpatialRDDProvider`` is a spatial RDD provider for HBase data stores. The core code is in
the ``geomesa-hbase-spark`` module, and the shaded JAR-with-dependencies (which contains all the required
dependencies for execution) is available in the ``geomesa-hbase-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``HBaseDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`hbase_parameters` for details.

.. note::

    Connecting to HBase generally requires the ``hbase-site.xml`` file to be available on the Spark classpath.
    This may be accomplished by specifying it with ``--jars``. For example:

    .. code-block:: bash

        $ spark-shell --jars file:///opt/geomesa/dist/spark/geomesa-hbase-spark-runtime_2.11-${VERSION}.jar,file:///usr/lib/hbase/conf/hbase-site.xml

    Alternatively, you may specify the zookeepers in the data store parameter map. However, this may not work
    for every HBase setup.


The feature type to access is given by the type name of the ``Query`` passed
to the ``rdd()`` method. For example, to load an ``RDD`` of features of type ``gdelt``
from the ``geomesa`` HBase catalog table:

.. code-block:: scala

    val params = Map("hbase.zookeepers" -> "zoo1,zoo2,zoo3", "hbase.catalog" -> "geomesa")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
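
The two connection options described in the note above (``hbase-site.xml`` on the classpath,
or zookeepers in the parameter map) differ only in whether ``hbase.zookeepers`` is set. A
minimal sketch of building the map either way follows; ``HBaseParams`` and ``build`` are
hypothetical names introduced for illustration, not part of GeoMesa.

.. code-block:: scala

    // Hypothetical helper, not part of GeoMesa
    object HBaseParams {

      // Builds the data store parameter map. When `zookeepers` is empty, the
      // "hbase.zookeepers" key is omitted, on the assumption that the connection
      // details come from an hbase-site.xml on the classpath.
      def build(catalog: String, zookeepers: Seq[String] = Nil): Map[String, String] = {
        val base = Map("hbase.catalog" -> catalog)
        if (zookeepers.isEmpty) base
        else base + ("hbase.zookeepers" -> zookeepers.mkString(","))
      }
    }

For example, ``HBaseParams.build("geomesa", Seq("zoo1", "zoo2", "zoo3"))`` produces the same
map as the example above.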

.. _fsds_rdd_provider:

FileSystem RDD Provider
^^^^^^^^^^^^^^^^^^^^^^^

The ``FileSystemRDDProvider`` is a spatial RDD provider for GeoMesa file system data stores. The core code is in
the ``geomesa-fs-spark`` module, and the shaded JAR-with-dependencies (which contains all the required
dependencies for execution) is available in the ``geomesa-fs-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``FileSystemDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`fsds_parameters` for details.

The feature type to access is given by the type name of the ``Query`` passed
to the ``rdd()`` method. For example, to load an ``RDD`` of features of type ``gdelt``
from an S3 bucket:

.. code-block:: scala

    val params = Map("fs.path" -> "s3a://mybucket/geomesa/datastore")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

See :ref:`fsds_sparksql_example` for an example of using SparkSQL with the FileSystem data store.

.. _converter_rdd_provider:

Converter RDD Provider
^^^^^^^^^^^^^^^^^^^^^^

The ``ConverterSpatialRDDProvider`` is provided by the ``geomesa-spark-converter`` module.

``ConverterSpatialRDDProvider`` reads features from one or more data files in formats
readable by the :ref:`converters` library, including delimited and fixed-width text,
Avro, JSON, and XML files. It takes the following configuration parameters:

 * ``geomesa.converter`` - the converter definition as a Typesafe Config string
 * ``geomesa.converter.inputs`` - input file paths, comma-delimited
 * ``geomesa.sft`` - the ``SimpleFeatureType``, as a spec string, configuration string, or environment lookup name
 * ``geomesa.sft.name`` - (optional) the name of the ``SimpleFeatureType``

Consider the example data described in the :ref:`convert_example_usage` section of the
:ref:`converters` documentation. If the file ``example.csv`` contains the
example data, and ``example.conf`` contains the Typesafe configuration file for the
converter, the following Scala code can be used to load this data into an ``RDD``:

.. code-block:: scala

    val exampleConf = ConfigFactory.load("example.conf").root().render()
    val params = Map(
      "geomesa.converter"        -> exampleConf,
      "geomesa.converter.inputs" -> "example.csv",
      "geomesa.sft"              -> "phrase:String,dtg:Date,geom:Point:srid=4326",
      "geomesa.sft.name"         -> "example")
    val query = new Query("example")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
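
Since ``geomesa.converter.inputs`` is a comma-delimited string, several input files can be
passed by joining their paths. The sketch below shows one way to assemble the parameter map
from a list of paths; ``ConverterParams`` and ``build`` are hypothetical names, not part of
GeoMesa, and the check that rejects paths containing commas is an assumption of this sketch.

.. code-block:: scala

    // Hypothetical helper, not part of GeoMesa
    object ConverterParams {

      // Assembles the converter provider parameters described above.
      // `sftName` is optional and is only added to the map when defined.
      def build(
          converterConf: String,
          inputs: Seq[String],
          sftSpec: String,
          sftName: Option[String] = None): Map[String, String] = {
        require(inputs.nonEmpty, "at least one input path is required")
        // A comma inside a path would be indistinguishable from the delimiter
        require(!inputs.exists(_.contains(",")), "input paths must not contain commas")
        val base = Map(
          "geomesa.converter"        -> converterConf,
          "geomesa.converter.inputs" -> inputs.mkString(","),
          "geomesa.sft"              -> sftSpec)
        sftName.fold(base)(name => base + ("geomesa.sft.name" -> name))
      }
    }

The resulting map can be passed to ``GeoMesaSpark(params)`` in the same way as the
hand-written map in the example above.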

It is also possible to load the prepackaged converters for public data sources
(GDELT, GeoNames, etc.) via Maven or SBT. See :ref:`prepackaged_converters` for more
details.

.. warning::

    ``ConverterSpatialRDDProvider`` is read-only, and does not support writing features
    to data files.

.. _geotools_rdd_provider:

GeoTools RDD Provider
^^^^^^^^^^^^^^^^^^^^^

``GeoToolsSpatialRDDProvider`` is provided by the ``geomesa-spark-geotools`` module.

``GeoToolsSpatialRDDProvider`` generates and saves ``RDD``\ s of features stored in
a generic GeoTools ``DataStore``. The configuration parameters are the same as
those passed to ``DataStoreFinder.getDataStore()`` to create the data store of interest,
plus a required boolean parameter named ``geotools``, which tells the SPI to load
``GeoToolsSpatialRDDProvider``. For example, the `CSVDataStore`_ described in the
`GeoTools ContentDataStore tutorial`_ takes a single parameter named ``file``. To use
this data store with GeoMesa Spark, do the following:

.. code-block:: scala

    val params = Map(
      "geotools" -> "true",
      "file"     -> "locations.csv")
    val query = new Query("locations")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

.. _GeoTools ContentDataStore tutorial: http://docs.geotools.org/latest/userguide/tutorial/datastore/index.html

.. _CSVDataStore: http://docs.geotools.org/latest/userguide/tutorial/datastore/read.html

The name of the feature type to access in the data store is given by the type name of the
``Query`` passed to the ``rdd()`` method. In the example of the `CSVDataStore`_, this is the
base name of the file passed in the ``file`` parameter (``locations`` for ``locations.csv``).
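
The two conventions above (adding the ``geotools`` flag to the data store parameters, and
using the file's base name as the type name for the CSV example) can be sketched as follows.
``GeoToolsParams``, ``withGeoToolsFlag``, and ``csvTypeName`` are hypothetical names
introduced for illustration; they are not part of GeoMesa or GeoTools.

.. code-block:: scala

    // Hypothetical helper, not part of GeoMesa or GeoTools
    object GeoToolsParams {

      // Adds the flag that selects the GeoTools RDD provider to an existing
      // set of DataStoreFinder parameters
      def withGeoToolsFlag(dsParams: Map[String, String]): Map[String, String] =
        dsParams + ("geotools" -> "true")

      // Derives the type name used in the CSV example: the file name with its
      // directory and extension removed (assumes '/' as the path separator)
      def csvTypeName(file: String): String = {
        val name = file.substring(file.lastIndexOf('/') + 1)
        val dot = name.lastIndexOf('.')
        if (dot > 0) name.substring(0, dot) else name
      }
    }

With these helpers, ``GeoToolsParams.withGeoToolsFlag(Map("file" -> "locations.csv"))``
yields the parameter map from the example, and ``GeoToolsParams.csvTypeName("locations.csv")``
yields the ``locations`` type name used in the query.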

.. warning::

    Do not use the GeoTools RDD provider with a GeoMesa data store that has a provider implementation.
    The providers described above provide additional optimizations to improve read and write performance.

If your data store supports it, use the ``save()`` method to save features:

.. code-block:: scala

    GeoMesaSpark(params).save(rdd, params, "locations")