xslt
The XSLTProcessor class defined in spark-xml-utils provides a collection of static methods that make it relatively easy to transform a record by applying a stylesheet. The record (and the stylesheet) are assumed to be strings of XML.
Initialization simply requires providing a name for a stylesheet and the stylesheet itself. For our purposes, we will assume our stylesheet is the simple one listed below and that it will be named 'srctitle'.
private static final String srctitleStylesheet =
"<?xml version='1.0' encoding='UTF-8'?>" +
"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='2.0' xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'>" +
"<xsl:output method='text' encoding='utf-8' indent='yes'/>" +
"<xsl:template match='/xocs:doc/xocs:meta'>" +
"<xsl:text>{ </xsl:text>" +
"<xsl:text>'srctitle':'</xsl:text>" +
"<xsl:value-of select='./xocs:srctitle/text()'/>" +
"<xsl:text>'</xsl:text>" +
"<xsl:text> }</xsl:text>" +
"</xsl:template>" +
"<xsl:template match='text()'/>" +
"</xsl:stylesheet>";
XSLTProcessor.init("srctitle",srctitleStylesheet);
The result of a transform operation is the output of applying a stylesheet (previously registered during initialization) to the content (a string of XML). In the example below, the stylesheet named 'srctitle' is applied to a small xocs document.
XSLTProcessor.transform("srctitle", "<xocs:doc xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'><xocs:meta><xocs:srctitle>Learnings in Spark</xocs:srctitle></xocs:meta></xocs:doc>");
If everything works as expected, the response should be:
{ 'srctitle':'Learnings in Spark' }
If an error is encountered during the operation, the error will be logged but no exception will be raised. Instead, an empty string ("") will be returned.
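Because a failure surfaces as an empty string rather than an exception, calling code may want to check for it explicitly. The following is a minimal sketch of one way to do that, not part of spark-xml-utils: the class `EmptyResultCheck`, the method `failedInputs`, and the stand-in transform used in `main` are all hypothetical names introduced for illustration. In real use, the function passed in would wrap a call such as XSLTProcessor.transform("srctitle", v).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; not part of spark-xml-utils.
class EmptyResultCheck {

    // Returns the inputs for which the transform produced "",
    // which the text above describes as the error signal.
    static List<String> failedInputs(List<String> inputs, UnaryOperator<String> transform) {
        List<String> failed = new ArrayList<>();
        for (String input : inputs) {
            if (transform.apply(input).isEmpty()) {
                failed.add(input);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Stand-in transform (an assumption): anything not starting with '<' "fails".
        UnaryOperator<String> fakeTransform = x -> x.startsWith("<") ? "ok" : "";
        List<String> failed = failedInputs(Arrays.asList("<a/>", "not xml"), fakeTransform);
        System.out.println("Failed inputs: " + failed);
    }
}
```

Note that with this convention a stylesheet that legitimately produces no output cannot be distinguished from a failed transform, so the log remains the place to confirm what actually went wrong.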
For performance reasons, stylesheets are cached as part of initialization, and the clear operation provides an easy way to empty that cache. This is useful if one of the stylesheets stored in S3 might have changed. Typically, clear would be followed by a call to XSLTProcessor.init() to re-initialize the stylesheets.
XSLTProcessor.clear();
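To make the init, transform, and clear life cycle concrete, here is a minimal sketch of a name-keyed stylesheet cache built only on the JDK's javax.xml.transform API. It is not the spark-xml-utils implementation: the class `StylesheetCacheSketch` and its constants are hypothetical, and the stylesheet is declared as version='1.0' on the assumption that the JDK's built-in XSLT processor is being used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-in for illustration; not the spark-xml-utils source.
class StylesheetCacheSketch {

    // Same stylesheet as in the text, declared as XSLT 1.0 (an assumption).
    static final String SRCTITLE_XSLT =
        "<?xml version='1.0' encoding='UTF-8'?>" +
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'>" +
        "<xsl:output method='text' encoding='utf-8'/>" +
        "<xsl:template match='/xocs:doc/xocs:meta'>" +
        "<xsl:text>{ </xsl:text>" +
        "<xsl:text>'srctitle':'</xsl:text>" +
        "<xsl:value-of select='./xocs:srctitle/text()'/>" +
        "<xsl:text>'</xsl:text>" +
        "<xsl:text> }</xsl:text>" +
        "</xsl:template>" +
        "<xsl:template match='text()'/>" +
        "</xsl:stylesheet>";

    static final String SAMPLE_XML =
        "<xocs:doc xmlns:xocs='http://www.elsevier.com/xml/xocs/dtd'><xocs:meta>" +
        "<xocs:srctitle>Learnings in Spark</xocs:srctitle></xocs:meta></xocs:doc>";

    // Compiled stylesheets keyed by name; Templates objects are safe to share across threads.
    private static final ConcurrentHashMap<String, Templates> CACHE = new ConcurrentHashMap<>();

    // Compile the stylesheet once and cache it under the given name.
    static void init(String name, String stylesheet) {
        try {
            Templates compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
            CACHE.put(name, compiled);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not compile stylesheet " + name, e);
        }
    }

    // Apply a cached stylesheet; log and return "" on any failure.
    static String transform(String name, String xml) {
        Templates compiled = CACHE.get(name);
        if (compiled == null) {
            System.err.println("No stylesheet cached under name: " + name);
            return "";
        }
        try {
            StringWriter out = new StringWriter();
            compiled.newTransformer().transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            System.err.println("Transform failed: " + e.getMessage());
            return "";
        }
    }

    // Drop every cached stylesheet; callers must init again before transforming.
    static void clear() {
        CACHE.clear();
    }

    public static void main(String[] args) {
        init("srctitle", SRCTITLE_XSLT);
        System.out.println(transform("srctitle", SAMPLE_XML));
        clear();
        // After clear, the name is unknown, so the empty string comes back.
        System.out.println("[" + transform("srctitle", SAMPLE_XML) + "]");
    }
}
```

Caching the compiled `Templates` object rather than the stylesheet text is the design choice that makes the cache worthwhile, since compilation is typically the expensive step and a new lightweight `Transformer` can be created from it for each record.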
A more complete XSLT example is provided in the code samples. The XSLT code sample will consist of an XSLT driver class (that will be executed on the master) and an XSLT worker class (that will be executed on the worker). By using these classes as a template, it should be straightforward to apply modifications to meet your needs.
Copy the spark-xml-utils.jar to the master node. We are assuming you are in the installation directory for Spark on the master and that you have copied the .jar file to the 'lib' folder under this location. Once this is done, execute the following command.
cd spark-install-dir
./bin/spark-shell --jars lib/spark-xml-utils.jar
Let's assume we have created a PairRDD by loading a Hadoop sequence file with a command like the following.
scala> val xmlKeyPair = sc.sequenceFile[String,String]("s3n://els-ats/darin/sd-xml/part*").cache()
We now need to initialize XSLTProcessor on each of the Spark partitions associated with the xmlKeyPair RDD. This is done by executing the following commands.
scala> import com.elsevier.xml.XSLTProcessor
scala> xmlKeyPair.foreachPartition(i => {XSLTProcessor.init("srctitle")})
Now we are ready to use transform. To create a new PairRDD from xmlKeyPair by applying the stylesheet named 'srctitle' (initialized above) to every value in the PairRDD, we could use the following command.
scala> val transformedXmlKeyPair = xmlKeyPair.mapValues(v => XSLTProcessor.transform("srctitle",v))
Keep in mind that mapValues is a transformation, so the statement above will not execute until an action occurs. For example, simply adding a count will force the evaluation.
scala> transformedXmlKeyPair.count