9.10. XML Converter

The XML converter processes XML files.

9.10.1. Configuration

The XML converter supports the following configuration keys:

Key

Required

Type

Description

type

yes

String

Must be the string xml.

feature-path

no

String

An XPath expression used to select multiple features from a single document.

xpath-factory

no

String

An alternate javax.xml.xpath.XPathFactory to use for parsing.

xml-namespaces

no

Object

XML namespace prefix mappings. See Handling Namespaces with Saxon for details.

xsd

no

String

Name of an XSD file on the classpath used to validate input documents.

options.line-mode

no

String

The line format of the input being parsed - either multi (default) or single.

9.10.1.1. feature-path

Since a single XML document may contain multiple features, the XML converter supports an XPath expression pointing to each feature element.

9.10.1.2. xpath-factory

The XML converter will attempt to use the Saxon XML factory by default, if available. Saxon is generally much faster than the default Java implementation. If desired, a different parser may be specified by setting xpath-factory to the fully-qualified class name of a javax.xml.xpath.XPathFactory implementation.

9.10.1.3. xml-namespaces

If the XML source documents contain namespaces, you may need to define them through the xml-namespaces element. See Handling Namespaces with Saxon, below.

9.10.1.4. xsd

The XML converter supports schema validation through XSD documents. To enable validation, specify an XSD file using the xsd element. This file must be available on the classpath. Input documents which don’t validate against the XSD will raise an exception.

9.10.1.5. options.line-mode

Files can be parsed in either single-line mode or in multi-line mode. In single-line mode, each line of an input file should be a valid XML document; in multi-line mode, the entire input file should be a single valid XML document. In order to support XPath expressions, each XML document is fully parsed into memory. For large documents, this may take considerable time and memory. Thus, it is usually better to use single-line mode when possible. If nothing is specified, multi-line mode is used.

9.10.2. Field Configuration

The fields element in an XML converter supports additional keys:

Key

Description

path

An XPath expression used to extract the field value.

9.10.2.1. path

path should be a XPath expression, which may be relative to the feature-path (if defined, above) or absolute to the document root. The path expression will be evaluated to a string, and be available in the transform as $0.

9.10.3. Transform Functions

The transform element supports referencing the result of the path expression through $0. Each value will be a string.

In addition to the standard Transformation Functions, the XML converter provides the following XML-specific functions:

9.10.3.1. xpath

The xpath function allows you to evaluate an XPath expression outside the context of a field. Generally, the path expression is preferred when using XML fields, but the xpath function is useful for composite converter predicates, as described below.

The xpath function takes between 2 and 4 arguments. The first argument is the XPath expression, as a string. The second argument is the XML document on which to apply the XPath expression. Generally the document being considered is available as $0. The third argument is the fully-qualified class name of a javax.xml.xpath.XPathFactory implementation. If omitted, the default Saxon factory will be used. The fourth argument is a java.util.Map containing any XML namespaces that may be required for XPath evaluation, and may be omitted if namespaces are not used. Generally, the parseMap function (see List and Map Functions) will be required to create an appropriate namespace map. See Handling Namespaces with Saxon, below.

9.10.3.2. xmlToString

This will convert an XML element to a string. It can be useful for quickly representing a complex object, for example in order to create a feature ID based on the hash of a row.

9.10.4. Handling Namespaces with Saxon

Using the default Java XPath factory, XML namespaces can generally be ignored. However, the Saxon factory requires namespaces to be declared. You can accomplish this through the xml-namespaces configuration.

Example XML:

<?xml version="1.0"?>
<foo:doc xmlns:foo="http://example.com/foo" xmlns:bar="http://example.com/bar">
    <foo:DataSource>
        <foo:name>myxml</foo:name>
    </foo:DataSource>
    <foo:Feature>
        <foo:number>123</foo:number>
        <bar:geom>
            <bar:lat>12.23</bar:lat>
            <bar:lon>44.3</bar:lon>
        </bar:geom>
        <foo:color>red</foo:color>
        <foo:physical height="5'11" weight="127.5"/>
    </foo:Feature>
</foo:doc>

Config:

{
  type          = "xml"
  id-field      = "uuid()"
  feature-path  = "foo:Feature" // optional path to feature elements
  xsd           = "example.xsd" // optional xsd file to validate input
  xpath-factory = "net.sf.saxon.xpath.XPathFactoryImpl"
  options = {
    line-mode = "multi" // or "single"
  }
  xml-namespaces = {
    foo = "http://example.com/foo"
    bar = "http://example.com/bar"
  }
  fields = [
    { name = "number", path = "foo:number",           transform = "$0::integer"       }
    { name = "color",  path = "foo:color",            transform = "trim($0)"          }
    { name = "weight", path = "foo:physical/@weight", transform = "$0::double"        }
    { name = "source", path = "/foo:doc/foo:DataSource/foo:name/text()"               }
    { name = "lat",    path = "bar:geom/bar:lat",     transform = "$0::double"        }
    { name = "lon",    path = "bar:geom/bar:lon",     transform = "$0::double"        }
    { name = "geom",                                  transform = "point($lon, $lat)" }
  ]
}

9.10.5. XML Composite Converters

Composite converters can handle processing different XML formats in a single stream. To use a composite converter, specify type = "composite-xml" in your converter definition.

Composite converters can define top-level options, fields, etc, the same as a normal XML converter. These values will be inherited by each of the child converters. If each child is unique, then it is valid to not define anything at the top level.

Composite converters must define a converters element, which is an array of nested XML converter definitions. In addition to the standard configuration, each nested converter must have a predicate element that determines which converter to use for each XML document. The value passed into the predicate will be the parsed XML document (available as $0), so generally the predicate will make use of the xpath function (above). See Predicates for more details on predicates.

9.10.6. Example Usage

Assume the following SimpleFeatureType:

geomesa.sfts.example = {
  attributes = [
    { name = "number", type = "Integer" }
    { name = "color",  type = "String"  }
    { name = "weight", type = "Double"  }
    { name = "source", type = "String"  }
    { name = "geom",   type = "Point"   }
  ]
}

And the following XML document:

<?xml version="1.0"?>
<doc>
    <DataSource>
        <name>myxml</name>
    </DataSource>
    <Feature>
        <number>123</number>
        <geom>
            <lat>12.23</lat>
            <lon>44.3</lon>
        </geom>
        <color>red</color>
        <physical height="5'11" weight="127.5"/>
    </Feature>
    <Feature>
        <number>456</number>
        <geom>
            <lat>20.3</lat>
            <lon>33.2</lon>
        </geom>
        <color>blue</color>
        <physical height="h2" weight="150"/>
    </Feature>
</doc>

You could ingest with the following converter:

geomesa.converters.myxml = {
  type          = "xml"
  id-field      = "uuid()"
  feature-path  = "Feature" // optional path to feature elements
  xsd           = "example.xsd" // optional xsd file to validate input
  xpath-factory = "net.sf.saxon.xpath.XPathFactoryImpl"
  options = {
    line-mode = "multi" // or "single"
  }
  fields = [
    { name = "number", path = "number",           transform = "$0::integer"       }
    { name = "color",  path = "color",            transform = "trim($0)"          }
    { name = "weight", path = "physical/@weight", transform = "$0::double"        }
    { name = "source", path = "/doc/DataSource/name/text()"                       }
    { name = "lat",    path = "geom/lat",         transform = "$0::double"        }
    { name = "lon",    path = "geom/lon",         transform = "$0::double"        }
    { name = "geom",                              transform = "point($lon, $lat)" }
  ]
}