.. _spatial_rdd_providers:

Spatial RDD Providers
---------------------

.. _accumulo_rdd_provider:

Accumulo RDD Provider
^^^^^^^^^^^^^^^^^^^^^

The ``AccumuloSpatialRDDProvider`` is a spatial RDD provider for Accumulo data stores. The core code is in
the ``geomesa-accumulo-spark`` module, and the shaded JAR-with-dependencies (which contains all the
required dependencies for execution) is available in the ``geomesa-accumulo-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``AccumuloDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`accumulo_parameters` for details.

The feature type to access in GeoMesa is given by the type name of the query passed to the ``rdd()`` method.
For example, to load an ``RDD`` of features of type ``gdelt`` from the ``geomesa`` Accumulo table:

.. code-block:: scala

    val params = Map(
      "accumulo.instance.id" -> "mycloud",
      "accumulo.user"        -> "user",
      "accumulo.password"    -> "password",
      "accumulo.zookeepers"  -> "zoo1,zoo2,zoo3",
      "accumulo.catalog"     -> "geomesa")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

.. _hbase_rdd_provider:

HBase RDD Provider
^^^^^^^^^^^^^^^^^^

The ``HBaseSpatialRDDProvider`` is a spatial RDD provider for HBase data stores. The core code is in
the ``geomesa-hbase-spark`` module, and the shaded JAR-with-dependencies (which contains all the
required dependencies for execution) is available in the ``geomesa-hbase-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``HBaseDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`hbase_parameters` for details.

.. note::

    Connecting to HBase generally requires the ``hbase-site.xml`` file to be available on the Spark
    classpath. One way to do this is to pass it with ``--jars``. For example:

    .. code-block:: bash

        $ spark-shell --jars file:///opt/geomesa/dist/spark/geomesa-hbase-spark-runtime_2.11-${VERSION}.jar,file:///usr/lib/hbase/conf/hbase-site.xml

    Alternatively, you may specify the zookeepers in the data store parameter map. However, this may not
    work for every HBase setup.

The feature type to access in GeoMesa is given by the type name of the query passed to the ``rdd()`` method.
For example, to load an ``RDD`` of features of type ``gdelt`` from the ``geomesa`` HBase table:

.. code-block:: scala

    val params = Map("hbase.zookeepers" -> "zoo1,zoo2,zoo3", "hbase.catalog" -> "geomesa")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

.. _fsds_rdd_provider:

FileSystem RDD Provider
^^^^^^^^^^^^^^^^^^^^^^^

The ``FileSystemRDDProvider`` is a spatial RDD provider for GeoMesa file system data stores. The core code is in
the ``geomesa-fs-spark`` module, and the shaded JAR-with-dependencies (which contains all the
required dependencies for execution) is available in the ``geomesa-fs-spark-runtime`` module.

This provider can read from and write to a GeoMesa ``FileSystemDataStore``. The configuration parameters
are the same as those passed to ``DataStoreFinder.getDataStore()``. See :ref:`fsds_parameters` for details.

The feature type to access in GeoMesa is given by the type name of the query passed to the ``rdd()`` method.
For example, to load an ``RDD`` of features of type ``gdelt`` from an S3 bucket:

.. code-block:: scala

    val params = Map("fs.path" -> "s3a://mybucket/geomesa/datastore")
    val query = new Query("gdelt")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

See :ref:`fsds_sparksql_example` for an example of using SparkSQL with the FileSystem data store.

.. _converter_rdd_provider:

Converter RDD Provider
^^^^^^^^^^^^^^^^^^^^^^

The ``ConverterSpatialRDDProvider`` is provided by the ``geomesa-spark-converter`` module.
``ConverterSpatialRDDProvider`` reads features from one or more data files in formats readable by
the :ref:`converters` library, including delimited and fixed-width text, Avro, JSON, and XML files.
It takes the following configuration parameters:

* ``geomesa.converter`` - the converter definition as a Typesafe Config string
* ``geomesa.converter.inputs`` - input file paths, comma-delimited
* ``geomesa.sft`` - the ``SimpleFeatureType``, as a spec string, configuration string, or environment lookup name
* ``geomesa.sft.name`` - (optional) the name of the ``SimpleFeatureType``

Consider the example data described in the :ref:`convert_example_usage` section of the
:ref:`converters` documentation. If the file ``example.csv`` contains the example data, and
``example.conf`` contains the Typesafe configuration file for the converter, the following Scala
code can be used to load this data into an ``RDD``:

.. code-block:: scala

    val exampleConf = ConfigFactory.load("example.conf").root().render()
    val params = Map(
      "geomesa.converter"        -> exampleConf,
      "geomesa.converter.inputs" -> "example.csv",
      "geomesa.sft"              -> "phrase:String,dtg:Date,geom:Point:srid=4326",
      "geomesa.sft.name"         -> "example")
    val query = new Query("example")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

It is also possible to load the prepackaged converters for public data sources (GDELT,
GeoNames, etc.) via Maven or SBT. See :ref:`prepackaged_converters` for more details.

.. warning::

    ``ConverterSpatialRDDProvider`` is read-only, and does not support writing features
    to data files.

.. _geotools_rdd_provider:

GeoTools RDD Provider
^^^^^^^^^^^^^^^^^^^^^

``GeoToolsSpatialRDDProvider`` is provided by the ``geomesa-spark-geotools`` module.

``GeoToolsSpatialRDDProvider`` generates and saves ``RDD``\ s of features stored in
a generic GeoTools ``DataStore``.
The configuration parameters passed are the same as those passed to
``DataStoreFinder.getDataStore()`` to create the data store of interest, plus a required boolean
parameter called ``geotools`` to indicate to the SPI to load ``GeoToolsSpatialRDDProvider``.
For example, the `CSVDataStore`_ described in the `GeoTools ContentDataStore tutorial`_ takes a
single parameter called ``file``. To use this data store with GeoMesa Spark, do the following:

.. code-block:: scala

    val params = Map(
      "geotools" -> "true",
      "file"     -> "locations.csv")
    val query = new Query("locations")
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

.. _GeoTools ContentDataStore tutorial: http://docs.geotools.org/latest/userguide/tutorial/datastore/index.html

.. _CSVDataStore: http://docs.geotools.org/latest/userguide/tutorial/datastore/read.html

The name of the feature type to access in the data store is passed as the type name of the
query passed to the ``rdd()`` method. In the example of the `CSVDataStore`_, this is the
basename of the filename passed as an argument.

.. warning::

    Do not use the GeoTools RDD provider with a GeoMesa data store that has a provider
    implementation. The providers described above provide additional optimizations to
    improve read and write performance.

If your data store supports it, use the ``save()`` method to save features:

.. code-block:: scala

    GeoMesaSpark(params).save(rdd, params, "locations")
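
The read and write calls can also be chained. The following is a minimal sketch rather than a
definitive recipe. It assumes a ``SparkContext`` named ``sc`` (as in ``spark-shell``), a
hypothetical ``writeParams`` map describing a writable target data store, and that the
``locations`` feature type already exists in that target. The null-geometry check is only an
arbitrary illustration of an intermediate Spark transformation.

.. code-block:: scala

    import org.apache.hadoop.conf.Configuration
    // assumption: older GeoTools package name; newer releases use org.geotools.api.data.Query
    import org.geotools.data.Query
    import org.locationtech.geomesa.spark.GeoMesaSpark

    // parameters for the source data store, as in the example above
    val params = Map(
      "geotools" -> "true",
      "file"     -> "locations.csv")

    // hypothetical: replace with the parameters of a writable target data store
    val writeParams: Map[String, String] = params

    // read all features of type "locations"
    val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, new Query("locations"))

    // illustrative transformation: drop features that have no default geometry
    val withGeometry = rdd.filter(sf => sf.getDefaultGeometry != null)

    // write the remaining features to the target store under the same type name
    GeoMesaSpark(writeParams).save(withGeometry, writeParams, "locations")

Keeping the read and write parameter maps separate makes it explicit which provider handles each
side, which matters when the source is read-only (as with the converter provider).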