
XSLT

The XSLTProcessor class defined in spark-xml-utils provides a collection of static methods that make it relatively easy to transform a record by applying a stylesheet. The record can be a string or an object stored in an S3 bucket.

Initialization

Since the stylesheets are stored in an S3 bucket, the AWS access keys must be set in the environment.

export AWS_ACCESS_KEY_ID="put-your-value-here"
export AWS_SECRET_ACCESS_KEY="put-your-value-here"

Keep in mind that the AWS access keys must have 'read' access to the S3 bucket containing any referenced stylesheets.

XSLTProcessor.init()
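
On a Spark cluster, init() must run in each worker JVM, and (as in the Spark-Shell walkthrough below) the keys are passed to the workers as Java system properties. A minimal Scala sketch of that pattern, assuming an existing RDD named rdd:

import com.elsevier.xml.XSLTProcessor

// Capture the keys on the driver, then set them and initialize once per partition.
val awsid = System.getenv("AWS_ACCESS_KEY_ID")
val awskey = System.getenv("AWS_SECRET_ACCESS_KEY")
rdd.foreachPartition { _ =>
  System.setProperty("AWS_ACCESS_KEY_ID", awsid)
  System.setProperty("AWS_SECRET_ACCESS_KEY", awskey)
  XSLTProcessor.init()
}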

Transform

A transform operation applies a stylesheet to the content and returns the result serialized as a string. In the example below, the stylesheet named 'xmlMeta2json.xsl' from the S3 bucket 'spark-stylesheets' is applied to the string "<xml>This is a bunch of xml</xml>".

XSLTProcessor.transform("spark-stylesheets", "xmlMeta2json.xsl", "<xml>This is a bunch of xml</xml>")

If an error is encountered during the operation, it will be logged but an exception will not be raised. Instead, an empty string ("") will be returned.
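
Because failures surface as an empty string rather than an exception, callers may want to check the result explicitly. A minimal sketch (xmlRecord is a hypothetical input string):

val result = XSLTProcessor.transform("spark-stylesheets", "xmlMeta2json.xsl", xmlRecord)
if (result.isEmpty) {
  // The transform failed; details will be in the log.
} else {
  // Use the transformed content.
}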

Clear

For performance reasons, stylesheets are cached after the first retrieval from S3. The clear operation provides an easy way to clear this cache, which is useful if one of the stylesheets stored in S3 has changed.

XSLTProcessor.clear()
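
Note that the cache lives in the JVM where init() was called, so on a cluster a changed stylesheet presumably needs to be cleared on every executor, not just the driver. A hedged sketch, assuming an RDD named rdd whose partitions span the executors:

rdd.foreachPartition(_ => XSLTProcessor.clear())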

Other Operations

There is another transform method that retrieves the content from an S3 bucket before applying the XSLT transformation. Since it will likely be used only rarely, it is not covered here; feel free to look at the code for the usage.

Example Usage in Java Spark Application

A more complete XSLT example is provided in the code samples. The XSLT code sample consists of an XSLT driver class (executed on the master) and an XSLT worker class (executed on the workers). Using these classes as a template, it should be straightforward to apply modifications to meet your needs (a compact Scala sketch of the pattern follows).
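
The sketch below mirrors that driver/worker split; the shipped sample classes are Java, and the input path here is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import com.elsevier.xml.XSLTProcessor

object XSLTExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xslt-example"))
    // Driver side: capture the AWS keys and load the key/value records.
    val awsid = System.getenv("AWS_ACCESS_KEY_ID")
    val awskey = System.getenv("AWS_SECRET_ACCESS_KEY")
    val xmlKeyPair = sc.sequenceFile[String, String]("s3n://your-bucket/your-path/part*")
    // Worker side: initialize once per partition, then transform each value.
    val transformed = xmlKeyPair.mapPartitions({ iter =>
      System.setProperty("AWS_ACCESS_KEY_ID", awsid)
      System.setProperty("AWS_SECRET_ACCESS_KEY", awskey)
      XSLTProcessor.init()
      iter.map { case (k, v) =>
        (k, XSLTProcessor.transform("spark-stylesheets", "xmlMeta2json.xsl", v))
      }
    }, preservesPartitioning = true)
    println(transformed.count())
  }
}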

Example Usage in Spark-Shell

Prior to starting the shell, we will want to export the AWS access keys (on the master node). This is needed because the stylesheets are stored in an S3 bucket.

export AWS_ACCESS_KEY_ID="put-your-value-here"
export AWS_SECRET_ACCESS_KEY="put-your-value-here"

Next, copy spark-xml-utils.jar to the master node. The commands below assume you are in the Spark installation directory on the master and have copied the jar to the 'lib' folder under that location. Once this is done, execute the following command.

cd spark-install-dir
./bin/spark-shell --jars lib/spark-xml-utils.jar

Let's assume we have created a PairRDD by loading a Hadoop sequence file with a command like the following.

scala> val xmlKeyPair = sc.sequenceFile[String,String]("s3n://els-ats/darin/sd-xml/part*").cache()

We now need to initialize the Spark partitions associated with the xmlKeyPair RDD. This is done by executing the following commands.

scala> import com.elsevier.xml.XSLTProcessor
scala> val awsid = System.getenv("AWS_ACCESS_KEY_ID")
scala> val awskey = System.getenv("AWS_SECRET_ACCESS_KEY")
scala> xmlKeyPair.foreachPartition(i => {System.setProperty("AWS_ACCESS_KEY_ID",awsid); System.setProperty("AWS_SECRET_ACCESS_KEY",awskey); XSLTProcessor.init(); })

Now we are ready to use transform. To create a new PairRDD from xmlKeyPair by applying the stylesheet 'xmlMeta2json.xsl' from the S3 bucket 'spark-stylesheets' to every value, we could use the following command.

scala> val transformedXmlKeyPair = xmlKeyPair.mapValues(v => XSLTProcessor.transform("spark-stylesheets","xmlMeta2json.xsl",v))

Keep in mind that the above statements (since they are transformations) will not execute until an action occurs. For example, simply adding a count will force an action.

scala> transformedXmlKeyPair.count 
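
The transformed pairs can then be used like any other RDD. For example, something like the following (with a hypothetical output path) would persist the results as a sequence file.

scala> transformedXmlKeyPair.saveAsSequenceFile("s3n://your-bucket/transformed-output")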