Processing Large XML Documents Using SAX
2.0
One of the more common questions I've been hearing
lately in conferences and reading on newsgroups is, "How do I process my (insert
very large number here) megabyte XML document using DOM?" Unfortunately, in most
cases the answer is, "You don't." The Document Object Model (and any other XML
parsing scheme that involves loading the entire document into memory) is
seriously constrained when it comes to processing large documents. Even in cases
where the entire document could be loaded into memory, it often isn't practical
or efficient to do so.
Granted, some applications (for instance performing
an XSLT transformation) could require random access to every element in a
document at any time. But what about simpler applications that just want to
perform small, localized transformations? Or a database import application that
just wants to store a certain set of elements in database tables as they're
encountered? Or a statistics-gathering application that just wants to count
features in a document without performing any real processing at all? Is there
an easier solution out there for this class of applications?
The Simple API for XML
The answer is a
resounding, "Yes." The Simple API for XML (better known as SAX) is a
lightweight, event-driven XML parsing API that was developed by the members of
the xml-dev mailing list (which is maintained by the Organization for the
Advancement of Structured Information Standards.) It was developed as an
alternative to the W3C DOM for applications such as those mentioned above.
Unlike DOM, SAX does not completely parse an entire document and create an
in-memory document tree. Instead, it progressively feeds event notifications to
the client application as XML document structures are recognized.
Don't miss the XTech2001 Conference this July at O'Reilly's Open
Source Convention. XTech2001 is the perfect setting for a serious technical
discussion between XML developers. Get down to work in the demonstration-based
tutorials or explore XML innovations in the sessions.
The reference implementation of SAX is written in the Java
programming language and resides on David Megginson's Web site. Although it is
principally a Java technology, versions of SAX have been developed in other
programming languages, such as Perl, Python, and C++. The latest version of
Microsoft's XML parser (MSXML 3.0) includes a full SAX 2.0 implementation that
can be used by any language that supports COM objects.
Note: SAX 1.0 vs. SAX 2.0
The SAX API
changed drastically between version 1.0 and 2.0, and as a result a number of
interfaces and classes that were part of 1.0 are deprecated in 2.0. Also, there
are several classes that exist only to facilitate porting 1.0 applications and
parser implementations to SAX 2.0.
This article only addresses new development for the SAX 2.0
library, and therefore doesn't mention a number of classes and interfaces that
are technically part of SAX 2.0, but should not need to be used by developers of
new applications.
A Quick Overview of
SAX
The reference implementation of SAX isn't actually a fully
functional XML parser in its own right. It is a set of Java interfaces and
helper classes that must be implemented by any parser that wants to be SAX 2.0
compliant. The following tables list the interfaces, classes, and exceptions
that comprise a basic SAX implementation:
Package org.xml.sax |
|
Interfaces |
|
Attributes
|
|
ContentHandler
|
|
DTDHandler
|
|
EntityResolver
|
|
ErrorHandler
|
|
Locator
|
|
XMLFilter
|
|
XMLReader
|
| |
|
Classes |
|
InputSource
|
| |
|
Exceptions |
|
SAXException
|
|
SAXNotRecognizedException
|
|
SAXNotSupportedException
|
|
SAXParseException
| |
Package org.xml.sax.helpers |
|
Classes |
|
AttributesImpl
|
|
DefaultHandler
|
|
LocatorImpl
|
|
ParserAdapter
|
|
ParserFactory
|
|
XMLFilterImpl
|
|
XMLReaderFactory
| |
Of these, the most important and immediately useful are the various
Handler interfaces. The ContentHandler interface specifies all of the callback
methods that will be used to deliver information about:
Start and end of a document
Start and end of an XML
element
Element namespaces and attributes
Namespace prefix
mapping scopes
Processing instructions
Character data and
ignorable whitespace
The ErrorHandler interface is called to notify the
client application about errors in the underlying XML document syntax or
validity. If an application does not register an error handler class, unexpected
parser behavior may result.
The DTDHandler interface is used to receive notifications about
notation and unparsed entity declarations in the document's DTD. It does not
provide information about other DTD constructs such as parameter entities,
element declarations, etc.
Scott Means has also written two other XML articles for
oreilly.com, What's New in the DOM Level 2 Core? and Converting Unstructured
Documents to XML. These hands-on articles cover the official W3C recommendation
for DOM Level 2 and the details of XML conversion using real-life examples.
For an application class to actually parse a document and
receive these various notification messages, it must:
Implement one or more of the Handler interfaces (ContentHandler,
ErrorHandler, or DTDHandler).
Create an XMLReader class instance,
possibly by using the XMLReaderFactory helper class.
Register an
instance of the SAX-aware class with the new XMLReader instance using the
various set*Handler() methods.
Call one of the parse() methods of the
XMLReader instance for each document to be processed.
Now, to illustrate
these principles, let's build a real application that would benefit from the
unique features of SAX. Let's go to the theater.
The Play's the Thing
When I went out
shopping on the Internet for sample documents for an application that would show
off the features of SAX, I found a collection of Shakespeare's plays that had
been encoded as XML documents by Jon Bosak. The link to the ZIP file containing
the plays, as well as links to the full source code for this sample, can be
found at the end of the article.
The plays are all encoded using the following simple
DTD:
<!-- DTD for Shakespeare J.
Bosak 1994.03.01, 1997.01.02 -->
<!-- Revised for
case sensitivity 1997.09.10 -->
<!-- Revised for XML 1.0 conformity
1998.01.27
(thanks to Eve Maler)
-->
<!ENTITY amp "&">
<!ELEMENT
PLAY (TITLE, FM, PERSONAE, SCNDESCR,
PLAYSUBT, INDUCT?, PROLOGUE?, ACT+,
EPILOGUE?)>
<!ELEMENT TITLE
(#PCDATA)>
<!ELEMENT FM
(P+)>
<!ELEMENT P
(#PCDATA)>
<!ELEMENT PERSONAE (TITLE, (PERSONA |
PGROUP)+)>
<!ELEMENT PGROUP (PERSONA+,
GRPDESCR)>
<!ELEMENT PERSONA
(#PCDATA)>
<!ELEMENT GRPDESCR (#PCDATA)>
<!ELEMENT
SCNDESCR (#PCDATA)>
<!ELEMENT PLAYSUBT
(#PCDATA)>
<!ELEMENT INDUCT (TITLE, SUBTITLE*,
(SCENE+|(SPEECH|STAGEDIR|SUBHEAD)+))>
<!ELEMENT
ACT (TITLE, SUBTITLE*, PROLOGUE?,
SCENE+, EPILOGUE?)>
<!ELEMENT
SCENE (TITLE, SUBTITLE*,
(SPEECH |
STAGEDIR | SUBHEAD)+)>
<!ELEMENT PROLOGUE (TITLE, SUBTITLE*,
(STAGEDIR | SPEECH)+)>
<!ELEMENT EPILOGUE
(TITLE, SUBTITLE*,
(STAGEDIR | SPEECH)+)>
<!ELEMENT SPEECH (SPEAKER+,
(LINE | STAGEDIR | SUBHEAD)+)>
<!ELEMENT SPEAKER
(#PCDATA)>
<!ELEMENT LINE (#PCDATA |
STAGEDIR)*>
<!ELEMENT STAGEDIR (#PCDATA)>
<!ELEMENT
SUBTITLE (#PCDATA)>
<!ELEMENT SUBHEAD
(#PCDATA)>
[Editor's note: Throughout this article, whenever a line of
code or data has been broken into two lines, the second line is
indented.]
If you're like me (and if you are, you have my sympathies)
you've probably asked yourself questions like, "Exactly how many lines did the
Earl of Northumberland speak in Act XI of Henry VI?" (28 lines) or, "How many
speeches were made by Julius Caesar in the eponymous play?" (39 speeches). Ok, I
have to admit that until I started this article I really hadn't given the matter
much thought. But since collecting statistical information from large documents
is a great application of SAX, now the world will know how many lines are in the
longest speech of The Merry Wives of Windsor (37 lines).
For each speaking part in the play and for the play as a whole,
I want my application to track:
The total number of lines spoken
The total number of
speeches
The average number of lines per speech
The minimum and
maximum number of lines per speech
The number of lines spoken in each
act of the play
I would also like to include the name of the play in the
output document.
The PlayStats Program
The implementation
of this program can be found in PlayStats.java (for program and examples,
download here). It is completely self-contained and includes a main() entry
point so it can be called from the command line. The basic usage of the program
is:
java
com.moonlightideas.examples.PlayStats.PlayStats
play_url1.xml
[play_url_2.xml ...]
The basic flow of the program is to process each of the play
documents provided on the command line. The statistics are cumulative, and once
the last play has been processed the results are output to System.out as, you
guessed it, an XML document.
Do-It-Yourself Data Structures
Writing a
program to retrieve information from an XML document using SAX requires a good
bit of thought and up-front design. Unlike APIs such as DOM, SAX doesn't build
an in-memory copy of your document that can be perused at your program's
leisure. It will notify you as bits and pieces of the document are recognized,
but it's up to your program to preserve any information that will be important
to it later. It also requires that the program be written to be aware of the
underlying structure of the source document. In many cases, assumptions need to
be made about the relationships between different elements. Also, since
notifications about element open tags, character data, and element close tags
arrive separately, it is usually necessary to preserve things such as attribute
values or character data in a temporary location until everything is available
to complete the processing.
One of the major features of SAX is that document content that
is not important to a particular application can easily be ignored. The elements
that our application will need to "listen" for are:
<PLAY>
<TITLE>
<ACT>
<SPEECH>
<SPEAKER>
<LINE>
The Simple API for XML
(better known as SAX) is a lightweight, event-driven XML parsing API that was
developed by the members of the xml-dev mailing
list.
For most of these elements, element nesting is
not important. However, the <TITLE> element can appear as a child of a
number of different elements within the document. Since we only want to preserve
the name of the play, we must be able to not only recognize a <TITLE> tag,
but the parent of the tag as well. A simple and flexible way to do this is to
maintain a stack with the current levels of element nesting. Every time a
startElement() notification is received, the current tag name is pushed on the
m_stackElements stack. It is then popped off of the stack in the endElement()
method. This arrangement allows for easy recognition of an element's parentage
at any point during parsing.
Tracking Document Features
Since it is
up to the SAX application to store and track document information, program data
structures tend to closely mirror the structure of the XML documents to be
processed. For example, to simplify and isolate the task of collecting
information about each of the speakers' lines, the PlayStats class uses an inner
class called the Persona class. Each individual speaker (as defined by the
<SPEAKER> tag) is mapped to a single Persona object instance. As the DTD
above shows, each <SPEECH> tag will have one or more <SPEAKER> tags.
The handleSpeaker() method, when it is called from the startElement()
notification, sets the m_sbChars class member to point to a new StringBuffer
instance so that the speaker's name will be captured by the characters()
notification method. Then, in the endElement() notification the name can be used
to look up the corresponding persona record (or create a new one if
necessary).
To keep on top of the latest developments in XML, visit the
O'Reilly Network's XML.com.
Once the closing tag of the <SPEECH> element is
recognized, the handleSpeech() method calls the addSpeechLines() method of the
persona object for each speaker in the current speakers list. It is also called
for the special global Persona object instance that represents the statistics
for the entire play (m_personTotal).
Outputting the Results
After all of the
play documents have been processed, the collected statistics are output to an
arbitrary print stream by the PlayStats.outputStats() method. The results
themselves are written as an XML document. The toString() method of the Persona
object returns a standard XML representation of the speaker statistics it
collects. In the case where the Persona object was instantiated without a
speaker's name, the enclosing <PERSONA> tag is omitted.
Wrapping Up
SAX is not a truly universal
solution to every type of XML application. In fact, for an application like a
WYSIWYG XML editor, it would be downright unusable. But in the proper
circumstances, it is definitely the best (and sometimes only) way to efficiently
process large volumes of XML data. Understanding the sequence of notifications
for a given XML document is critical when it comes to implementing the logic of
a SAX application. I hope this article and the accompanying sample code will
help make your first experience with SAX a pleasant one.
Links to Example Files:
PlayStats.java program and
examples
Source code for the sample class and a few sample runs.
The Plays of Shakespeare
XML versions of the plays of
Shakespeare, made available by Jon Bosak.
W. Scott Means has been a professional software developer since
1988 when he joined Microsoft Corporation at the age of 17. He was one of the
original developers of OS/2 1.1 and Windows NT and did some of the early work on
the Microsoft Network for the Advanced Technology and Business Development
group. Most recently, he served as the CEO of Enterprise Web Machines, a South
Carolina-based Internet infrastructure venture. He is currently writing
full-time and consulting on XML and Internet
topics.