JS Ext

Wednesday, June 19, 2013

XML Parsers: Part 1

One of the junior developers on my team was trying to find an easier way of parsing xml.  His complaints were based on some code he had seen that used DOM to traverse the xml structure.  The xml structure of that particular file was a little complicated, so the DOM traversal got complicated.  He brought up using some XPath-based parsers.  This is when I chimed in.  I gave him an overview of the 3 general types of xml parsers and some of the history within our team.

Document Object Model (DOM)

The first parser I will talk about is the DOM-based parser.  This parser is easy to use (especially for client-side web developers).  First, the entire xml file is parsed into an intermediate representation that is held in memory.  This representation contains Nodes, Attributes, Comments and other objects that represent xml components.  Once the entire xml file is in the intermediate representation, you can start making calls like getElementById() and getElementsByTagName().  You can also call getChildNodes() and getParentNode().  You traverse the object tree, node by node.  I have found that this method of xml parsing is easy to learn but can be tedious.  It is also the middle of the road when it comes to performance.  This is something I will dive deeper into later.
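
As a rough sketch of what that node-by-node traversal looks like, here is a small Python example using the standard library's xml.dom.minidom.  The sample document and the book_titles helper are made up for illustration.  Note that minidom exposes childNodes and parentNode as attributes instead of the getChildNodes() and getParentNode() methods found in the Java DOM API.

```python
from xml.dom import minidom

# Hypothetical sample document, invented for this illustration.
XML = """<library>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>"""


def book_titles(xml_text):
    """Return (id, title) pairs by walking the DOM tree."""
    # parseString builds the complete in-memory tree before returning.
    doc = minidom.parseString(xml_text)
    titles = []
    for book in doc.getElementsByTagName("book"):
        # Walk the children one node at a time.  Text and comment nodes
        # can appear here too, so element nodes are filtered explicitly.
        for child in book.childNodes:
            if child.nodeType == child.ELEMENT_NODE and child.tagName == "title":
                titles.append((book.getAttribute("id"), child.firstChild.data))
    return titles


print(book_titles(XML))
```

Even for a document this small, the traversal needs a nested loop and a node-type check, which is where the tedium comes from.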

XPath

The next type of parser is the XPath-based parser.  It is usually not a parser by itself.  Usually, XPath refers more to the traversal of the DOM object tree as opposed to actually parsing the xml.  XPath is really user friendly.  Writing an XPath-based parser is really easy.  It is also very supportable, in the sense that the parsing code tends to be very readable.  The main problem with XPath is performance.  Of the three approaches, XPath is usually the slowest, since each query is evaluated on top of a fully built object tree.
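
To show how readable the XPath style can be, here is a minimal Python sketch using xml.etree.ElementTree from the standard library.  ElementTree only supports a limited subset of XPath, and the sample document and helper names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML = """<library>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>"""


def all_titles(xml_text):
    """Return every book title using a single path expression."""
    # fromstring still parses the whole document into a tree first.
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./book/title")]


def title_for(xml_text, book_id):
    """Return the title of the book with the given id, or None."""
    root = ET.fromstring(xml_text)
    # The [@id='...'] predicate selects a book by attribute value.
    node = root.find(f"./book[@id='{book_id}']/title")
    return node.text if node is not None else None


print(all_titles(XML))
print(title_for(XML, "b2"))
```

The path expression replaces the explicit loops of a DOM traversal, but the whole tree is still built in memory before any query runs.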

Stream Parser

The final parser I'm going to talk about is the stream-based parser.  The most famous of these is the blazing fast Expat C library, but the Java SAX parsers are another example.  Stream parsers are the fastest parsers.  As the name implies, your code executes WHILE the xml file is being parsed, through callbacks that fire as each element is read.  The other two parsers will read the entire xml file into memory and store an object tree for the entire file.  For junior developers, stream parsers are very difficult to understand.  When it comes to parsing complicated xml, stream-based parsing code becomes almost unreadable.  They are really fast, though.  In fact, the other two parsers are usually implemented under the hood using a stream parser.
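
Here is a minimal sketch of the callback style, using Python's xml.parsers.expat binding to the Expat library.  The TitleCollector class, the stream_titles helper and the sample document are hypothetical names invented for this example.

```python
import xml.parsers.expat

# Hypothetical sample document, invented for this illustration.
XML = """<library>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>"""


class TitleCollector:
    """Tracks parser state by hand, since no tree is ever built."""

    def __init__(self):
        self.titles = []
        self._in_title = False
        self._buffer = []

    def start(self, name, attrs):
        # Called when an opening tag is read.
        if name == "title":
            self._in_title = True
            self._buffer = []

    def chars(self, data):
        # Text may arrive in several chunks, so it is buffered.
        if self._in_title:
            self._buffer.append(data)

    def end(self, name):
        # Called when a closing tag is read.
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def stream_titles(xml_text):
    """Collect titles while the document is being parsed."""
    collector = TitleCollector()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = collector.start
    parser.EndElementHandler = collector.end
    parser.CharacterDataHandler = collector.chars
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return collector.titles


print(stream_titles(XML))
```

All of the "where am I in the document" bookkeeping lives in flags like _in_title.  With deeply nested xml, those flags multiply, which is why this style gets hard to read.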


Nine out of ten times, I will use a stream parser.  It is rare that I find xml formats that are really complicated.  The only time I don't use a stream parser is when performance is not a consideration.  There is an interesting side effect of constantly using stream parsers.  When I am designing the structure for a new xml format, I tend to make sure the structure supports stream parsers.  These formats tend to be easier to parse in all the parser types above.  They are easier to read and easier to understand.  In my opinion, using stream parsers makes you a better xml designer.
