XSLT Stylesheet Performance on Big Ass Documents
Like it or not, when it comes time to start transforming XML data, folks turn to stylesheets. Sure, it'd be nice if we could write XmlReader/XmlWriter transforms or if one of these Streaming XML Transformation languages would really take off. But for now, you know it, and I know it - folks love their XSLT.
Anyway, we had a large XML document on the order of 250 megs, sometimes larger. It was running in a batch process using MSXSL.exe, a command-line tool that invokes the "newest" version of MSXML that's on your system, starting with MSXML4, then moving backwards to MSXML3, then 2.6. It was running out of memory, sometimes using as much as a gig. It was also taking 15 minutes or more. The stylesheet was written three years ago, in a very procedural way. XSLT is meant to be written in a more declarative way, with templates that match on the input elements as they find them.
- Original XSLT with MSXSL using MSXML4 – crashes with an out-of-memory exception
- Original XSLT with NXSLT 1.6 (.NET 1.1) – Private bytes level out around 1G
Source document load time: 16059.870 milliseconds
Stylesheet load/compile time: 204.672 milliseconds
Stylesheet execution time: 683552.000 milliseconds
This stylesheet wasn't very optimized and was kinda:
<xsl:with-param name="Row" select="."/>
...which is sub-optimal. Not only that, but the variable Something is holding the results of the template rather than allowing it to "flow" out as data is transformed. This transform actually had two input files: the main one, and another small one that contained configuration and some other details and was selected into a variable:
<xsl:variable name="Foo" select="document('foo.xml')"/>
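For illustration only, the procedural shape described above might look roughly like the sketch below. Only Row, Something, Foo and foo.xml come from the post. The Output, Item and Setting element names, the ProcessRow template name, the SomeID attribute on the config entries, and the assumption that Row elements sit directly under the root element are all made up for the example.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Small second input file, loaded once into a global variable. -->
  <xsl:variable name="Foo" select="document('foo.xml')"/>

  <xsl:template match="/">
    <Output>
      <!-- Procedural style: one explicit loop drives everything.
           Assumes Row elements are children of the root element. -->
      <xsl:for-each select="/*/Row">
        <!-- The named template's output is captured in a variable
             (a result tree fragment) and only then copied out,
             instead of being written straight to the result. -->
        <xsl:variable name="Something">
          <xsl:call-template name="ProcessRow">
            <xsl:with-param name="Row" select="."/>
          </xsl:call-template>
        </xsl:variable>
        <xsl:copy-of select="$Something"/>
      </xsl:for-each>
    </Output>
  </xsl:template>

  <!-- Hypothetical named template that looks up a config entry. -->
  <xsl:template name="ProcessRow">
    <xsl:param name="Row"/>
    <Item id="{$Row/@SomeID}">
      <xsl:value-of select="$Foo//Setting[@SomeID = $Row/@SomeID]"/>
    </Item>
  </xsl:template>

</xsl:stylesheet>
```

With a few hundred megs of rows, every pass through the loop builds an intermediate fragment before anything is emitted, which is the "holding" behavior complained about above.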
The stylesheet was rewritten to be more template-focused ala:
<xsl:template match="Row" >
<xsl:apply-templates select="$x[@SomeID = $someID]"/>
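A fuller, hedged sketch of what a template-focused stylesheet along these lines could look like follows. It is not the actual production stylesheet. The two lines above are taken as-is, while the Output, Item and Setting names, the contents of $x, and the assumption that Row elements sit directly under the root element are invented for the example.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: $x is the set of config entries from the small file. -->
  <xsl:variable name="x" select="document('foo.xml')//Setting"/>

  <xsl:template match="/">
    <Output>
      <!-- Let the processor dispatch each Row to its matching template.
           Assumes Row elements are children of the root element. -->
      <xsl:apply-templates select="/*/Row"/>
    </Output>
  </xsl:template>

  <xsl:template match="Row">
    <!-- Remember this row's ID, then hand the matching config
         entries to whatever template matches them. -->
    <xsl:variable name="someID" select="@SomeID"/>
    <xsl:apply-templates select="$x[@SomeID = $someID]"/>
  </xsl:template>

  <!-- Output is written directly here; nothing is buffered in a
       variable on the way out. -->
  <xsl:template match="Setting">
    <Item id="{@SomeID}">
      <xsl:value-of select="."/>
    </Item>
  </xsl:template>

</xsl:stylesheet>
```

The point of this shape is that each template writes its result as it goes, so the processor never has to hold an intermediate copy of the output for a row.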
After this change/re-write, the stylesheet was sped up by about 66% and didn't run out of memory. However, it was still using MSXSL and we wanted to try a few other processors. I did try Saxon and a few Java/C++ parsers but they ran out of memory, so don't pick on me for not including their numbers, as this post is primarily a test of the various Microsoft XSL/T options. All these timings are generated with the -t option that all these utilities support.
- Improved XSLT with MSXSL using MSXML4 – private bytes level out around 300M
Source document load time: 41920 milliseconds
Stylesheet document load time: 18.37 milliseconds
Stylesheet compile time: 3.692 milliseconds
Stylesheet execution time: 174327 milliseconds
- Improved XSLT with NXSLT 1.6 (.NET 1.1) – private bytes level out around 550M
Source document load time: 17893.370 milliseconds
Stylesheet load/compile time: 462.974 milliseconds
Stylesheet execution time: 629697.700 milliseconds
Interestingly, but not unexpectedly, the .NET 1.1 XSLT transformations used by NXSLT are slower than the original unmanaged code in MSXML. A lot of XSLT wonks have apparently said, after the release of .NET 1.1, that when you have to do some hard-core (large) XSLT you should still use MSXSL.
We had two questions at this point: What if we used MSXML6? What if we used .NET 2.0 (whose XSLT engine was greatly improved)?
However, MSXSL.exe hasn't been updated to support MSXML6 yet (the site says coming soon), and while I could go to a VBScript or whatever, I figured why not just add the support to the source of MSXSL (which is available here). I couldn't find the updated SDK header files for MSXML.H so I just hacked it together from the registry. The general gist is at the bottom of this post.
Anyway, I made a version of MSXSL that tries for MSXML6, and falls back to 4, etc. Then I got Oleg's NXSLT2 friendly command-line 2.0 stuff.
You may ask why I'm using this command-line stuff. Well, Oleg has kindly seen fit to maintain "command-line compatibility" with MSXSL.exe which makes swapping out command-line processors within our batch process very easy.
- Improved XSLT with NXSLT2 (.NET 2.0) - private bytes level out around 500M
Stylesheet load/compile time: 4596.000 milliseconds
Transformation time: 53248.000 milliseconds
Total execution time: 59064.000 milliseconds
- Improved XSLT with (custom) MSXSL using MSXML6 - private bytes level out around 300M
Source document load time: 33677 milliseconds
Stylesheet document load time: 4.685 milliseconds
Stylesheet compile time: 3.774 milliseconds
Stylesheet execution time: 200952 milliseconds
Nutshell: .NET 2.0 was more than 10x faster than .NET 1.1 (about 59 seconds total versus roughly 648). MSXML6 was 15% slower than MSXML4 on stylesheet execution time (200,952 ms versus 174,327 ms). This, of course, was with one specific funky stylesheet and one rather big ass file. Either way, we are sticking with the MSXML4 stuff for now, but looking forward to .NET 2.0's support for this particular style (pun intended) of madness.
Updating MSXSL to choose MSXML6: I cracked open the source for MSXSL. I couldn't find the new MSXML6.H so I added this to msxmlinf.hxx:
typedef class XSLTemplate60 XSLTemplate60;
typedef class DOMDocument60 DOMDocument60;
typedef class FreeThreadedDOMDocument60 FreeThreadedDOMDocument60;
Then I updated the static array and factory in msxmlinf.cxx to check for the version specific ProgID:
const MSXMLInfo::StaticInfo MSXMLInfo::s_staticInfo60 =
...along with a few other things. Email me if you want the source; I don't think I'm allowed to redist this. Anyway, when I ran it the first time I got an "Access Denied 0x80004005" and stared at it for a while. Andy Phenix said, "Didn't they tighten security and break some stuff in MSXML6?" He was on to something: the fix involved using IXMLDOMDocument2 and explicitly allowing the document() function to load our 'foo.xml':
FreakingTrue.vt = VT_BOOL;
FreakingTrue.boolVal = VARIANT_TRUE;
Once we turned on the document() feature, everything worked great. However, I wasn't sure if MSXML4 or MSXML6 was doing the work. (I ran filemon.exe and regmon.exe as well as procexp.exe, and it WAS in fact loading msxml6.dll.) I noticed some cleverness, again from Oleg, that allows the XSLT stylesheet to actually detect which vendor and (if MSFT) which version of the XSLT engine is being used. I'd reprint it, but you should go visit his site anyway.
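This is not Oleg's code, but the general idea can be sketched with the standard XSLT 1.0 system-property() function. xsl:vendor and xsl:version are defined by the XSLT 1.0 spec. msxsl:version is, as far as I know, an MSXML-specific property, so treat it as an assumption; per the spec, a processor that doesn't know a property should hand back an empty string.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Standard properties every XSLT 1.0 processor must support. -->
    <xsl:text>Vendor: </xsl:text>
    <xsl:value-of select="system-property('xsl:vendor')"/>
    <xsl:text>&#10;XSLT version: </xsl:text>
    <xsl:value-of select="system-property('xsl:version')"/>

    <!-- Assumption: MSXML reports its own version through this
         extension property; other engines should return ''. -->
    <xsl:text>&#10;MSXML version: </xsl:text>
    <xsl:value-of select="system-property('msxsl:version')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Running a stylesheet like this against any small XML file through each command-line processor is a quick way to confirm which engine actually did the work, without reaching for filemon.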
Thanks to Krishnan and Andy for their hard work on the new stylesheet and performance testing.