16.8. Parsing XML

The XML converter defines each field using an XPath expression. For XML documents that contain multiple features, the feature-path element can be used to select the feature elements. In this case, the attribute paths are evaluated relative to each feature element. The optional xsd element can be used to validate input files against an XML schema.

By default, the XML converter treats each line of input as a separate XML document. Setting the line-mode option to "multi" parses the entire input as a single document instead of line-by-line. Note that multi-line parsing reads the entire input into memory, so it should not be used with large files.
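The difference between the two line modes can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not GeoMesa's implementation: the parse_documents helper is hypothetical, Python's standard xml.etree.ElementTree stands in for the converter's parser, and skipping blank lines is an assumption.

```python
import xml.etree.ElementTree as ET


def parse_documents(text, line_mode="single"):
    """Return the root element of every XML document found in text.

    Hypothetical helper that mimics the two line modes described above.
    """
    if line_mode == "single":
        # Each non-blank line is parsed as its own, complete XML document.
        return [ET.fromstring(line) for line in text.splitlines() if line.strip()]
    if line_mode == "multi":
        # The whole input is one document, so all of it is held in memory.
        return [ET.fromstring(text)]
    raise ValueError(f"unknown line mode: {line_mode!r}")


single_input = (
    "<Feature><number>123</number></Feature>\n"
    "<Feature><number>456</number></Feature>\n"
)
multi_input = (
    "<doc>\n"
    "  <Feature><number>123</number></Feature>\n"
    "  <Feature><number>456</number></Feature>\n"
    "</doc>\n"
)

single_docs = parse_documents(single_input, "single")
multi_docs = parse_documents(multi_input, "multi")
print(len(single_docs), len(multi_docs))
```

In this sketch, feeding the pretty-printed multi_input through "single" mode fails with a parse error, because the first line on its own is not a complete document.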

The XML converter will attempt to use the Saxon XPath factory if it is available. In GeoMesa tools, a script is provided to download Saxon: bin/install-saxon.sh. To specify an alternate XPath factory, use the xpath-factory option. If the factory cannot be loaded, the default Java factory will be used. Note that the default factory can be significantly slower.

Example XML:

<?xml version="1.0"?>
<doc>
    <DataSource>
        <name>myxml</name>
    </DataSource>
    <Feature>
        <number>123</number>
        <geom>
            <lat>12.23</lat>
            <lon>44.3</lon>
        </geom>
        <color>red</color>
        <physical height="5'11" weight="127.5"/>
    </Feature>
    <Feature>
        <number>456</number>
        <geom>
            <lat>20.3</lat>
            <lon>33.2</lon>
        </geom>
        <color>blue</color>
        <physical height="h2" weight="150"/>
    </Feature>
</doc>

Config:

{
  type          = "xml"
  id-field      = "uuid()"
  feature-path  = "Feature" // optional path to feature elements
  xsd           = "example.xsd" // optional xsd file to validate input
  xpath-factory = "net.sf.saxon.xpath.XPathFactoryImpl"
  options = {
    line-mode = "multi" // or "single"
  }
  fields = [
    { name = "number", path = "number",           transform = "$0::integer"       }
    { name = "color",  path = "color",            transform = "trim($0)"          }
    { name = "weight", path = "physical/@weight", transform = "$0::double"        }
    { name = "source", path = "/doc/DataSource/name/text()"                       }
    { name = "lat",    path = "geom/lat",         transform = "$0::double"        }
    { name = "lon",    path = "geom/lon",         transform = "$0::double"        }
    { name = "geom",                              transform = "point($lon, $lat)" }
  ]
}
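To make the mapping between the config and the example XML concrete, the following Python sketch mimics what the field definitions above extract. It is an approximation under stated assumptions, not the converter itself: the convert function is hypothetical, xml.etree.ElementTree supports only a subset of XPath (so the absolute source path is emulated by searching from the document root), and the point transform is represented as a plain (lon, lat) tuple.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = """<?xml version="1.0"?>
<doc>
    <DataSource>
        <name>myxml</name>
    </DataSource>
    <Feature>
        <number>123</number>
        <geom>
            <lat>12.23</lat>
            <lon>44.3</lon>
        </geom>
        <color>red</color>
        <physical height="5'11" weight="127.5"/>
    </Feature>
    <Feature>
        <number>456</number>
        <geom>
            <lat>20.3</lat>
            <lon>33.2</lon>
        </geom>
        <color>blue</color>
        <physical height="h2" weight="150"/>
    </Feature>
</doc>
"""


def convert(xml_text):
    """Hypothetical stand-in for the converter: one dict per Feature element."""
    root = ET.fromstring(xml_text)
    # Stands in for the absolute path /doc/DataSource/name/text():
    # it is resolved from the document root, not from a feature.
    source = root.findtext("DataSource/name")
    records = []
    # feature-path = "Feature": every other path is relative to this element.
    for feature in root.findall("Feature"):
        lat = float(feature.findtext("geom/lat"))  # $0::double
        lon = float(feature.findtext("geom/lon"))  # $0::double
        records.append({
            "number": int(feature.findtext("number")),  # $0::integer
            "color": feature.findtext("color").strip(),  # trim($0)
            "weight": float(feature.find("physical").get("weight")),  # @weight
            "source": source,
            "lat": lat,
            "lon": lon,
            "geom": (lon, lat),  # point($lon, $lat): x is lon, y is lat
        })
    return records


records = convert(EXAMPLE_XML)
for record in records:
    print(record)
```

Both records share the same source value because that path is absolute, while every other field changes with the feature element it is evaluated against.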

16.8.1. Handling Namespaces with Saxon

Using the default XPath factory, XML namespaces can generally be ignored. However, the Saxon factory requires namespaces to be declared. You can accomplish this through the xml-namespaces configuration, which maps each prefix used in the paths to its namespace URI.

Example XML:

<?xml version="1.0"?>
<foo:doc xmlns:foo="http://example.com/foo" xmlns:bar="http://example.com/bar">
    <foo:DataSource>
        <foo:name>myxml</foo:name>
    </foo:DataSource>
    <foo:Feature>
        <foo:number>123</foo:number>
        <bar:geom>
            <bar:lat>12.23</bar:lat>
            <bar:lon>44.3</bar:lon>
        </bar:geom>
        <foo:color>red</foo:color>
        <foo:physical height="5'11" weight="127.5"/>
    </foo:Feature>
</foo:doc>

Config:

{
  type          = "xml"
  id-field      = "uuid()"
  feature-path  = "foo:Feature" // optional path to feature elements
  xsd           = "example.xsd" // optional xsd file to validate input
  xpath-factory = "net.sf.saxon.xpath.XPathFactoryImpl"
  options = {
    line-mode = "multi" // or "single"
  }
  xml-namespaces = {
    foo = "http://example.com/foo"
    bar = "http://example.com/bar"
  }
  fields = [
    { name = "number", path = "foo:number",           transform = "$0::integer"       }
    { name = "color",  path = "foo:color",            transform = "trim($0)"          }
    { name = "weight", path = "foo:physical/@weight", transform = "$0::double"        }
    { name = "source", path = "/foo:doc/foo:DataSource/foo:name/text()"               }
    { name = "lat",    path = "bar:geom/bar:lat",     transform = "$0::double"        }
    { name = "lon",    path = "bar:geom/bar:lon",     transform = "$0::double"        }
    { name = "geom",                                  transform = "point($lon, $lat)" }
  ]
}
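The role of the prefix-to-URI map can be illustrated with a small Python sketch. Python's xml.etree.ElementTree is used here only as a stand-in for a namespace-aware XPath evaluator; it is not Saxon, and its behavior is an analogy rather than a description of how GeoMesa resolves paths. The NAMESPACES dictionary mirrors the xml-namespaces block above, and the XML is a trimmed copy of the example document.

```python
import xml.etree.ElementTree as ET

# Mirrors the xml-namespaces block: prefix -> namespace URI.
NAMESPACES = {
    "foo": "http://example.com/foo",
    "bar": "http://example.com/bar",
}

NAMESPACED_XML = """<?xml version="1.0"?>
<foo:doc xmlns:foo="http://example.com/foo" xmlns:bar="http://example.com/bar">
    <foo:Feature>
        <foo:number>123</foo:number>
        <bar:geom>
            <bar:lat>12.23</bar:lat>
            <bar:lon>44.3</bar:lon>
        </bar:geom>
    </foo:Feature>
</foo:doc>
"""

root = ET.fromstring(NAMESPACED_XML)

# An unprefixed path does not match elements that live in a namespace.
unqualified = root.findall("Feature")

# With the prefix map, the prefixed paths resolve to the namespaced elements.
qualified = root.findall("foo:Feature", NAMESPACES)
number = qualified[0].findtext("foo:number", namespaces=NAMESPACES)
lat = qualified[0].findtext("bar:geom/bar:lat", namespaces=NAMESPACES)

print(len(unqualified), len(qualified), number, lat)
```

The prefixes in the paths only need to match the keys of the map. It is the URIs that must agree with the namespaces declared in the input document.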