Darin McBeath edited this page Feb 10, 2015 · 8 revisions

XSLT

The XSLTProcessor class defined in spark-xml-utils provides a collection of static methods that make it relatively easy to transform a record by applying a stylesheet. The record and the stylesheet are each assumed to be a string of XML.

Initialization

Initialization simply consists of providing a name for a stylesheet along with the stylesheet itself. For our purposes, we will assume our stylesheet is the simple one listed below and that it will be named 'srctitle'.

private static final String srctitleStylesheet = 
"<?xml version='1.0' encoding='UTF-8'?>" +
"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='2.0' xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'>" +
	"<xsl:output method='text' encoding='utf-8' indent='yes'/>" +
	"<xsl:template match='/xocs:doc/xocs:meta'>" +
		"<xsl:text>{ </xsl:text>" +
		"<xsl:text>'srctitle':'</xsl:text>" +
		"<xsl:value-of select='./xocs:srctitle/text()'/>" +
		"<xsl:text>'</xsl:text>" +
		"<xsl:text> }</xsl:text>" +
	"</xsl:template>" +
	"<xsl:template match='text()'/>" +
"</xsl:stylesheet>";

XSLTProcessor.init("srctitle",srctitleStylesheet);
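To illustrate why a stylesheet is registered under a name before any records are transformed, here is a minimal sketch of a name-to-compiled-stylesheet cache built on the standard JAXP API. It is not the spark-xml-utils implementation: the class StylesheetCacheSketch and its methods are hypothetical, and the tiny stylesheet uses version='1.0' because that is what the JDK's built-in processor targets.

```java
import java.io.StringReader;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration only; this is NOT the spark-xml-utils source.
public class StylesheetCacheSketch {

    // A trivial XSLT 1.0 stylesheet used only for this demonstration.
    public static final String DEMO_XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>ok</xsl:template>"
        + "</xsl:stylesheet>";

    // Compiled stylesheets keyed by the name passed to init().
    private static final ConcurrentHashMap<String, Templates> CACHE =
        new ConcurrentHashMap<>();

    // Compile the stylesheet once and remember it under the given name.
    public static void init(String name, String stylesheet) {
        try {
            Templates compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
            CACHE.put(name, compiled);
        } catch (TransformerConfigurationException e) {
            // Surface a broken stylesheet as an unchecked exception in this sketch.
            throw new IllegalArgumentException("Cannot compile stylesheet " + name, e);
        }
    }

    // Report whether a stylesheet has been registered under this name.
    public static boolean isInitialized(String name) {
        return CACHE.containsKey(name);
    }

    public static void main(String[] args) {
        init("demo", DEMO_XSLT);
        System.out.println("demo initialized: " + isInitialized("demo"));
    }
}
```

Compiling to a Templates object once and reusing it means the cost of parsing the stylesheet is not paid again for each record, which is the usual motivation for an initialization step of this kind.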

Transform

The result of a transform operation is the output of applying a stylesheet (previously registered during initialization) to the content (a string of XML). In the example below, the stylesheet named 'srctitle' is applied to a small xocs document passed as a string.

XSLTProcessor.transform("srctitle", "<xocs:doc xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'><xocs:meta><xocs:srctitle>Learnings in Spark</xocs:srctitle></xocs:meta></xocs:doc>");

If everything works as expected, the response should be:

{ 'srctitle':'Learnings in Spark' }

If an error is encountered during the operation, the error will be logged but no exception will be raised. Instead, an empty string ("") will be returned.
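Because a failed transform yields "" rather than an exception, callers need to check for the empty string themselves. The sketch below reproduces that "log and return empty" behaviour with plain JAXP so it can be run without spark-xml-utils. The class EmptyOnErrorSketch and the method transformOrEmpty are hypothetical names, and the stylesheet is a version='1.0' variant of the 'srctitle' example so that the JDK's built-in processor can run it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration only; this is NOT the spark-xml-utils source.
public class EmptyOnErrorSketch {

    // XSLT 1.0 variant of the 'srctitle' stylesheet (assumption: 1.0 for the JDK processor).
    public static final String SRCTITLE_XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'"
        + " xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'>"
        + "<xsl:output method='text' encoding='utf-8'/>"
        + "<xsl:template match='/xocs:doc/xocs:meta'>"
        + "<xsl:text>{ 'srctitle':'</xsl:text>"
        + "<xsl:value-of select='xocs:srctitle'/>"
        + "<xsl:text>' }</xsl:text>"
        + "</xsl:template>"
        + "<xsl:template match='text()'/>"
        + "</xsl:stylesheet>";

    public static final String GOOD_XML =
        "<xocs:doc xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'><xocs:meta>"
        + "<xocs:srctitle>Learnings in Spark</xocs:srctitle></xocs:meta></xocs:doc>";

    // Apply the stylesheet; on any failure, log to stderr and return "".
    public static String transformOrEmpty(String stylesheet, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException | RuntimeException e) {
            System.err.println("transform failed: " + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        String ok = transformOrEmpty(SRCTITLE_XSLT, GOOD_XML);
        String bad = transformOrEmpty(SRCTITLE_XSLT, "<xocs:doc>"); // malformed input
        System.out.println("good input -> " + ok);
        System.out.println("bad input returned empty: " + bad.isEmpty());
    }
}
```

In a Spark job, a practical consequence is that records whose transformed value is empty may need to be filtered out or counted separately, since the job itself will not fail on them.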

Clear

For performance reasons, stylesheets are cached as part of initialization, and the clear operation provides an easy way to empty that cache. This is useful if one of the stylesheets stored in S3 might have changed. Typically, a clear would be followed by an XSLTProcessor.init() to re-initialize the stylesheets that are needed.

XSLTProcessor.clear()

Example Usage in Java Spark Application

A more complete XSLT example is provided in the code samples. The XSLT code sample will consist of an XSLT driver class (that will be executed on the master) and an XSLT worker class (that will be executed on the worker). By using these classes as a template, it should be straightforward to apply modifications to meet your needs.

Example Usage in Spark-Shell

Copy the spark-xml-utils.jar to the master node. We are assuming you are in the installation directory for Spark on the master and that you have copied the .jar file to the 'lib' folder under this location. Once this is done, execute the following command.

cd spark-install-dir
./bin/spark-shell --jars lib/spark-xml-utils.jar

Let's assume we have created a PairRDD by loading a Hadoop sequence file with a command like the following.

scala> val xmlKeyPair = sc.sequenceFile[String,String]("s3n://els-ats/darin/sd-xml/part*").cache()

We now need to initialize the Spark partitions associated with the xmlKeyPair RDD. This is done by executing the following commands.

scala> import com.elsevier.xml.XSLTProcessor

scala> xmlKeyPair.foreachPartition(i => {XSLTProcessor.init("srctitle")})

Now we are ready to use transform. To create a new PairRDD from xmlKeyPair by applying the stylesheet named 'srctitle' (initialized above) to every value in the PairRDD, we could use the following command.

scala> val transformedXmlKeyPair = xmlKeyPair.mapValues(v => XSLTProcessor.transform("srctitle",v))

Keep in mind that the mapValues statement above is a transformation, so it will not execute until an action occurs (foreachPartition, by contrast, is itself an action and runs immediately). For example, simply adding a count will force an action.

scala> transformedXmlKeyPair.count 