Darin McBeath edited this page Feb 10, 2015 · 8 revisions

XQuery

The XQueryProcessor class defined in spark-xml-utils provides a collection of static methods for evaluating XQuery expressions against a record. The record is assumed to be a string of XML.

Initialization

If any of the XQuery expressions contain namespace prefixes, the XQueryProcessor must be initialized first. Initialization consists of passing a HashMap of prefix to namespace URI mappings to XQueryProcessor.init(). Below is a simple example.

	HashMap<String,String> pfxUriMap = new HashMap<String,String>();
	pfxUriMap.put("xocs", "http://www.elsevier.com/xml/xocs/dtd");
	pfxUriMap.put("ja", "http://www.elsevier.com/xml/ja/dtd");
	pfxUriMap.put("si", "http://www.elsevier.com/xml/si/dtd");
	pfxUriMap.put("ehs", "http://www.elsevier.com/xml/ehs-book/dtd");
	pfxUriMap.put("bk", "http://www.elsevier.com/xml/bk/dtd");
	pfxUriMap.put("ce", "http://www.elsevier.com/xml/common/dtd");
	pfxUriMap.put("sb", "http://www.elsevier.com/xml/common/struct-bib/dtd");
	pfxUriMap.put("tb", "http://www.elsevier.com/xml/common/table/dtd");
	pfxUriMap.put("xlink", "http://www.w3.org/1999/xlink");
	pfxUriMap.put("mml", "http://www.w3.org/1998/Math/MathML");
	pfxUriMap.put("cals", "http://www.elsevier.com/xml/common/cals/dtd");
	XQueryProcessor.init(pfxUriMap);	
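
To illustrate what a prefix to namespace URI map is used for, the sketch below resolves prefixes through such a map using the JDK's built-in XPath API. This is a stand-in for illustration only: it is not XQueryProcessor and says nothing about how spark-xml-utils is implemented. The class name PrefixMapSketch and its evaluate method are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: JDK XPath stand-in, not part of spark-xml-utils.
class PrefixMapSketch {

    /** Evaluates an XPath expression against an XML string, resolving prefixes via pfxUriMap. */
    static String evaluate(String xml, String expression, final Map<String, String> pfxUriMap) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed names to match
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    // Look up the prefix used in the expression in the caller's map.
                    String uri = pfxUriMap.get(prefix);
                    return uri != null ? uri : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return null; // reverse lookup is not needed for evaluation
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            // Returns the string value of the first matching node.
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> pfxUriMap = new HashMap<String, String>();
        pfxUriMap.put("xocs", "http://www.elsevier.com/xml/xocs/dtd");
        // The document binds the same URI to a different prefix ("x").
        String xml = "<x:doc xmlns:x=\"http://www.elsevier.com/xml/xocs/dtd\"><x:meta>m1</x:meta></x:doc>";
        System.out.println(evaluate(xml, "/xocs:doc/xocs:meta", pfxUriMap));
    }
}
```

The point of the sketch is that a prefix in an expression is only a local alias: matching is done on the namespace URI, so the prefix used in the expression does not have to be the one used in the document.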

Evaluation

The evaluation operation applies an XQuery expression to an XML string and returns the result of the expression, serialized as a string. In the example below, the string being evaluated is "<name>john</name>" and the XQuery expression applied is "for $i in /name[.='john'] return $i". In this example, the result of evaluateString will be the matching name element, "<name>john</name>".

XQueryProcessor.evaluateString("<name>john</name>", "for $i in /name[.='john'] return $i")

If there is an error encountered during the operation, the error will be logged but an exception will not be raised. Instead, a value of "<error/>" will be returned.
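
Because failures are reported through a return value and not an exception, callers need to check for the "<error/>" string themselves. The following is a minimal sketch of one way to do that. The class ErrorSentinelSketch and its methods are hypothetical helpers, not part of spark-xml-utils, and treating null or whitespace-padded values as errors is an assumption made here for safety.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of spark-xml-utils.
class ErrorSentinelSketch {

    // The value this page says is returned when evaluation fails.
    static final String ERROR_RESULT = "<error/>";

    /** Returns true if the result is the error sentinel (null is also treated as an error: an assumption). */
    static boolean isError(String result) {
        return result == null || ERROR_RESULT.equals(result.trim());
    }

    /** Keeps only the results that are not the error sentinel, preserving order. */
    static List<String> keepSuccessful(List<String> results) {
        List<String> ok = new ArrayList<String>();
        for (String r : results) {
            if (!isError(r)) {
                ok.add(r);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("<name>john</name>", "<error/>");
        System.out.println(keepSuccessful(results));
    }
}
```

One caveat with any sentinel check like this: a query that legitimately returns an empty element named error would be indistinguishable from a failure, so the logs remain the authoritative record of what went wrong.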

Clear

This is a helper method that clears any cached namespace prefix/URI mappings. Typically, it is followed by a call to XQueryProcessor.init() to re-initialize the mappings.

XQueryProcessor.clear()

Example Usage in Java Spark Application

A more complete XQuery example is provided in the code samples. The XQuery code sample consists of an XQuery driver class (executed on the master) and an XQuery worker class (executed on the worker). Using these classes as a template, it should be straightforward to adapt them to your needs.

Example Usage in Spark-Shell

Copy the spark-xml-utils.jar to the master node. We are assuming you are in the installation directory for Spark on the master and that you have copied the .jar file to the 'lib' folder under this location. Once this is done, execute the following command.

cd spark-install-dir
./bin/spark-shell --jars lib/spark-xml-utils.jar

Let's assume we have created a PairRDD by loading a Hadoop sequence file with a command like the following.

scala> val xmlKeyPair = sc.sequenceFile[String,String]("s3n://els-ats/darin/sd-xml/part*").cache()

Since our XQuery expression will contain namespaces, we now need to initialize the Spark partitions associated with the xmlKeyPair RDD. This is done by executing the following commands. In the example below, the HashMap pfxUriMap contains the mappings of namespace prefixes to namespace URIs.

scala> import com.elsevier.xml.XQueryProcessor
scala> import java.util.HashMap
scala> var pfxUriMap : HashMap[String,String] = new HashMap[String,String]()
scala> pfxUriMap.put("xocs", "http://www.elsevier.com/xml/xocs/dtd")
scala> XQueryProcessor.init(pfxUriMap)
scala> xmlKeyPair.foreachPartition(i => {XQueryProcessor.init(pfxUriMap)})

To create a new PairRDD from xmlKeyPair where we only get the xocs:meta section for these documents, we could use evaluateString and execute the following command.

scala> val metaXmlKeyPair = xmlKeyPair.mapValues(v => XQueryProcessor.evaluateString(v,"for $i in /xocs:doc/xocs:meta return $i"))

Keep in mind that the above statement (since it is a transformation) will not execute until an action occurs. For example, simply adding a count will force an action.

scala> metaXmlKeyPair.count