.. _parquet_converter: Parquet Converter ================= The Parquet converter handles data written by `Apache Parque `__. To use the Parquet converter, specify ``type = "parquet"`` in your converter definition. Configuration ------------- The Parquet converter supports parsing whole Parquet files. Due to the Parquet random-access API, the file path must be specified in the ``EvaluationContext``. Further, pure streaming conversion is not possible (i.e. using bash pipe redirection into the ``ingest`` or ``convert`` command). As Parquet does not define any object model, standard practice is to parse a Parquet file into Avro GenericRecords. The Avro GenericRecord being parsed is available to field transforms as ``$0``. Avro Paths ---------- Because Parquet files are converted into Avro records, it is possible to use Avro paths to select elements. See :ref:`avro_converter` for details on Avro paths. Note that the result of an Avro path expression will be typed appropriately according to the Parquet column type (e.g. String, Double, List, etc). .. _parquet_converter_functions: Parquet Transform Functions --------------------------- GeoMesa defines several Parquet-specific transform functions, in addition to the ones defined under :ref:`avro_converter_functions`. parquetPoint ^^^^^^^^^^^^ Description: Parses a nested Point structure from a Parquet record Usage: ``parquetPoint($ref, $pathString)`` * ``$ref`` - a reference object (Avro root record or extracted object) * ``pathString`` - forward-slash delimited path string. See `Avro Paths`, above The point function can parse GeoMesa-encoded Point columns, which consist of a Parquet group of two double-type columns named ``x`` and ``y``. parquetLineString ^^^^^^^^^^^^^^^^^ Description: Parses a nested LineString structure from a Parquet record Usage: ``parquetLineString($ref, $pathString)`` * ``$ref`` - a reference object (Avro root record or extracted object) * ``pathString`` - forward-slash delimited path string. See `Avro Paths`, above The linestring function can parse GeoMesa-encoded LineString columns, which consist of a Parquet group of two repeated double-type columns named ``x`` and ``y``. parquetPolygon ^^^^^^^^^^^^^^ Description: Parses a nested Polygon structure from a Parquet record Usage: ``parquetPolygon($ref, $pathString)`` * ``$ref`` - a reference object (Avro root record or extracted object) * ``pathString`` - forward-slash delimited path string. See `Avro Paths`, above The polygon function can parse GeoMesa-encoded Polygon columns, which consist of a Parquet group of two list-type columns named ``x`` and ``y``. The list elements are repeated double-type columns. parquetMultiPoint ^^^^^^^^^^^^^^^^^ Description: Parses a nested MultiPoint structure from a Parquet record Usage: ``parquetMultiPoint($ref, $pathString)`` * ``$ref`` - a reference object (Avro root record or extracted object) * ``pathString`` - forward-slash delimited path string. See `Avro Paths`, above The multi-point function can parse GeoMesa-encoded MultiPoint columns, which consist of a Parquet group of two repeated double-type columns named ``x`` and ``y``. parquetMultiLineString ^^^^^^^^^^^^^^^^^^^^^^ Description: Parses a nested MultiLineString structure from a Parquet record Usage: ``parquetMultiLineString($ref, $pathString)`` * ``$ref`` - a reference object (Avro root record or extracted object) * ``pathString`` - forward-slash delimited path string. See `Avro Paths`, above The multi-linestring function can parse GeoMesa-encoded MultiLineString columns, which consist of a Parquet group of two list-type columns named ``x`` and ``y``. The list elements are repeated double-type columns. parquetMultiPolygon ^^^^^^^^^^^^^^^^^^^ Description: Parses a nested MultiPolygon structure from a Parquet record Usage: ``parquetMultiPolygon($ref, $pathString)`` * ``$ref`` - a reference object (Avro root record or extracted object) * ``pathString`` - forward-slash delimited path string. See `Avro Paths`, above The multi-polygon function can parse GeoMesa-encoded MultiPolygon columns, which consist of a Parquet group of two list-type columns named ``x`` and ``y``. The list elements are also lists, and the nested list elements are repeated double-type columns. Example Usage ------------- For this example we'll consider the following JSON file: .. code-block:: json { "id": 1, "number": 123, "color": "red", "physical": { "weight": 127.5, "height": "5'11" }, "lat": 0, "lon": 0 } { "id": 2, "number": 456, "color": "blue", "physical": { "weight": 150, "height": "5'11" }, "lat": 1, "lon": 1 } { "id": 3, "number": 789, "color": "green", "physical": { "weight": 200.4, "height": "6'2" }, "lat": 4.4, "lon": 3.3 } This file can be converted to Parquet using Spark: .. code-block:: scala import org.apache.spark.sql.SparkSession val session = SparkSession.builder().appName("testSpark").master("local[*]").getOrCreate() val df = session.read.json("/tmp/example.json") df.write.option("compression","gzip").parquet("/tmp/example.parquet") The following SimpleFeatureType and converter would be sufficient to parse the resulting Parquet file: .. code-block:: json { "geomesa" : { "sfts" : { "example" : { "fields" : [ { "name" : "color", "type" : "String" } { "name" : "number", "type" : "Long" } { "name" : "height", "type" : "String" } { "name" : "weight", "type" : "Double" } { "name" : "geom", "type" : "Point", "srid" : 4326 } ] } }, "converters" : { "example" : { "type" : "parquet", "id-field" : "avroPath($0, '/id')", "fields" : [ { "name" : "color", "transform" : "avroPath($0,'/color')" }, { "name" : "number", "transform" : "avroPath($0,'/number')" }, { "name" : "height", "transform" : "avroPath($0,'/physical/height')" }, { "name" : "weight", "transform" : "avroPath($0,'/physical/weight')" }, { "name" : "geom", "transform" : "point(avroPath($0,'/lon'),avroPath($0,'/lat'))" } ], "options" : { "encoding" : "UTF-8", "error-mode" : "skip-bad-records", "parse-mode" : "incremental", "validators" : [ "index" ] } } } } }