Postprocessing AutoClosed SGML Tags with the SGMLReader
Chris Lovett's SgmlReader is an interesting and complex piece of work. It's more complex than my brain can hold, which is good, since he wrote it and not I. It's able to parse SGML documents like HTML. However, it derives from XmlReader, so it tries (and succeeds) to look like an XmlReader. As such, it auto-closes tags. Remember that SGML doesn't have to have closing tags. Specifically, it doesn't need closing tags on primitive/simple types.
Sometimes I need to parse an OFX 1.x document, a financial format that is SGML, like this:
<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20060128101000
      <USERID>654321
      <USERPASS>123456
      <LANGUAGE>ENG
      <FI>
        <ORG>Corillian
        <FID>1001
      </FI>
      <APPID>MyApp
      <APPVER>0500
    </SONRQ>
...etc...
Notice that ORG, DTCLIENT, and all the other simple types have no end tags, but complex types like FI and SONRQ do have end tags. The SgmlReader class attempts to automatically insert end tags (to close the element) as I use the XmlReader.Read() method to move through the document. However, he can't figure out where the right place for an end tag is until he sees an end element go by. Then he says, "Oh, crap! There's </FI>! I need to empty my stack of start elements in reverse order." This is lovely for him, but gives me a document that looks (in memory) like this:
<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20060128101000
        <USERID>654321
          <USERPASS>123456
            <LANGUAGE>ENG
              <FI>
                <ORG>Corillian
                  <FID>1001</FID>
                </ORG>
              </FI>
            </LANGUAGE>
          </USERPASS>
        </USERID>
      </DTCLIENT>
...etc...
...which totally isn't the structure I'm looking for. I could write my own SgmlReader that knows more about OFX, but really, who has the time? So, my buddy Paul Gomes and I did this.
NOTE: There's one special tag in OFX called MSGBODY that is a simple type but always has an end tag, so we special-cased that one. Notice also that we did all this WITHOUT changing the SgmlReader. It's just passed into the method as "reader."
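using System;
using System.Collections;   // non-generic Stack
using System.Diagnostics;   // Debug.Assert
using System.Xml;
using Sgml;                 // SgmlReader

protected internal static void AutoCloseElementsInternal(SgmlReader reader, XmlWriter writer)
{
    // MSGBODY is a simple type that always carries its own end tag, so we never close it ourselves.
    object msgBody = reader.NameTable.Add("MSGBODY");
    // Name of the most recently written start tag.
    object previousElement = null;
    // Names of elements we closed ourselves; the reader's synthetic end tags for these get swallowed.
    Stack elementsWeAlreadyEnded = new Stack();
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                previousElement = reader.LocalName;
                writer.WriteStartElement(reader.LocalName);
                break;
            case XmlNodeType.Text:
                if (!string.IsNullOrEmpty(reader.Value))
                {
                    writer.WriteString(reader.Value.Trim());
                    // Text means a simple type: close it right away and remember that we did.
                    if (previousElement != null && !previousElement.Equals(msgBody))
                    {
                        writer.WriteEndElement();
                        elementsWeAlreadyEnded.Push(previousElement);
                    }
                }
                else
                {
                    // Assert(false, ...) so this actually fires on an empty text node.
                    Debug.Assert(false, "big problems?");
                }
                break;
            case XmlNodeType.EndElement:
                if (elementsWeAlreadyEnded.Count > 0
                    && Object.ReferenceEquals(elementsWeAlreadyEnded.Peek(), reader.LocalName))
                {
                    // Synthetic end tag for an element we already closed: drop it.
                    elementsWeAlreadyEnded.Pop();
                }
                else
                {
                    // A real end tag from the document.
                    writer.WriteEndElement();
                }
                break;
            default:
                // Copy any other node type through as-is.
                writer.WriteNode(reader, false);
                break;
        }
    }
}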
We store the name of the most recently written start tag. When we hit a node of type XmlNodeType.Text, we write the text, immediately write out our own EndElement, and push the start tag's name on a stack. Then, when we notice the SgmlReader starting to auto-close and send us synthetic EndElements, we ignore (and pop) each one that matches the top of our own stack. Any other EndElement is a real one from the document, so we write it out.
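To watch that bookkeeping run without needing SgmlReader at all, here's a minimal sketch. It is not the code above. It assumes a plain XmlReader fed the mis-nested (but well-formed) XML that the SgmlReader hands back, it compares names with ordinary string equality on a generic Stack<string> instead of reference equality, and it assumes no attributes or empty elements. The names AutoCloseSketch and Flatten are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class AutoCloseSketch
{
    // Hypothetical helper: closes each simple element right after its text
    // and swallows the matching end tags that show up later.
    public static string Flatten(string misNestedXml)
    {
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };
        var writerSettings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var output = new StringBuilder();
        var alreadyEnded = new Stack<string>();
        string previousElement = null;

        using (XmlReader reader = XmlReader.Create(new StringReader(misNestedXml), readerSettings))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        previousElement = reader.LocalName;
                        writer.WriteStartElement(reader.LocalName);
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value.Trim());
                        if (previousElement != null && previousElement != "MSGBODY")
                        {
                            // Simple type: close it now, remember that we did.
                            writer.WriteEndElement();
                            alreadyEnded.Push(previousElement);
                        }
                        break;
                    case XmlNodeType.EndElement:
                        if (alreadyEnded.Count > 0
                            && string.Equals(alreadyEnded.Peek(), reader.LocalName, StringComparison.Ordinal))
                        {
                            // We already closed this one ourselves.
                            alreadyEnded.Pop();
                        }
                        else
                        {
                            writer.WriteEndElement();
                        }
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

Calling AutoCloseSketch.Flatten("<FI><ORG>Corillian<FID>1001</FID></ORG></FI>") should hand back <FI><ORG>Corillian</ORG><FID>1001</FID></FI>: ORG and FID each get closed right after their text, and the two late end tags are popped off the stack instead of being written.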
The resulting OFX document now looks like this:
<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20060128101000</DTCLIENT>
      <USERID>654321</USERID>
      <USERPASS>123456</USERPASS>
      <LANGUAGE>ENG</LANGUAGE>
      <FI>
        <ORG>Corillian</ORG>
        <FID>1001</FID>
      </FI>
      <APPID>MyApp</APPID>
      <APPVER>0500</APPVER>
    </SONRQ>
...etc...
...and we can deal with it just like any other XML fragment. In our case, that means just allowing it to continue along its way in the XmlReader/XmlWriter pipeline.
Thanks to Craig Andera for the reminder about Object.ReferenceEquals(). It's nicer than elementsWeAlreadyEnded.Peek() == (object)reader.LocalName.
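If you're wondering why a reference comparison is safe here at all: a NameTable keeps a single shared ("atomized") string instance per distinct name, and the method above relies on the reader handing out its element names from that table. Here's a small sketch of the difference between the comparisons. The class name AtomizedNameSketch is made up, and it uses a standalone NameTable rather than one that belongs to a reader.

```csharp
using System;
using System.Xml;

public static class AtomizedNameSketch
{
    public static bool[] Compare()
    {
        var names = new NameTable();
        string atomized = names.Add("FID");
        // Build an equal string at run time so it is a separate object.
        string copy = new string(new[] { 'F', 'I', 'D' });

        return new[]
        {
            atomized == copy,                                  // true: string == compares characters
            (object)atomized == (object)copy,                  // false: object == compares references
            Object.ReferenceEquals(atomized, copy),            // false: same check, clearer intent
            Object.ReferenceEquals(atomized, names.Add(copy))  // true: Add returns the stored instance
        };
    }
}
```

The cast to object and ReferenceEquals ask the same question. ReferenceEquals just says so out loud, which is the whole reason it reads nicer.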
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.


