Processing Large XML Documents Using SAX 2.0

2002-9-9 13:44:13【作者】 AMTeam.org 【进入论坛】
广告

Processing Large XML Documents Using SAX 2.0 

One of the more common questions I've been hearing lately in conferences and reading on newsgroups is, "How do I process my (insert very large number here) megabyte XML document using DOM?" Unfortunately, in most cases the answer is, "You don't." The Document Object Model (and any other XML parsing scheme that involves loading the entire document into memory) is seriously constrained when it comes to processing large documents. Even in cases where the entire document could be loaded into memory, it often isn't practical or efficient to do so.

Granted, some applications (for instance performing an XSLT transformation) could require random access to every element in a document at any time. But what about simpler applications that just want to perform small, localized transformations? Or a database import application that just wants to store a certain set of elements in database tables as they're encountered? Or a statistics-gathering application that just wants to count features in a document without performing any real processing at all? Is there an easier solution out there for this class of applications?

The Simple API for XML

The answer is a resounding, "Yes." The Simple API for XML (better known as SAX) is a lightweight, event-driven XML parsing API that was developed by the members of the xml-dev mailing list (which is maintained by the Organization for the Advancement of Structured Information Standards.) It was developed as an alternative to the W3C DOM for applications such as those mentioned above. Unlike DOM, SAX does not completely parse an entire document and create an in-memory document tree. Instead, it progressively feeds event notifications to the client application as XML document structures are recognized.

Don't miss the XTech2001 Conference this July at O'Reilly's Open Source Convention. XTech2001 is the perfect setting for a serious technical discussion between XML developers. Get down to work in the demonstration-based tutorials or explore XML innovations in the sessions.

The reference implementation of SAX is written in the Java programming language and resides on David Megginson's Web site. Although it is principally a Java technology, versions of SAX have been developed in other programming languages, such as Perl, Python, and C++. The latest version of Microsoft's XML parser (MSXML 3.0) includes a full SAX 2.0 implementation that can be used by any language that supports COM objects.

Note: SAX 1.0 vs. SAX 2.0

The SAX API changed drastically between version 1.0 and 2.0, and as a result a number of interfaces and classes that were part of 1.0 are deprecated in 2.0. Also, there are several classes that exist only to facilitate porting 1.0 applications and parser implementations to SAX 2.0.

This article only addresses new development for the SAX 2.0 library, and therefore doesn't mention a number of classes and interfaces that are technically part of SAX 2.0, but should not need to be used by developers of new applications.
 
A Quick Overview of SAX

The reference implementation of SAX isn't actually a fully functional XML parser in its own right. It is a set of Java interfaces and helper classes that must be implemented by any parser that wants to be SAX 2.0 compliant. The following tables list the interfaces, classes, and exceptions that comprise a basic SAX implementation:

Package org.xml.sax

Interfaces

Attributes

ContentHandler

DTDHandler

EntityResolver

ErrorHandler

Locator

XMLFilter

XMLReader

 

Classes

InputSource

 

Exceptions

SAXException

SAXNotRecognizedException

SAXNotSupportedException

SAXParseException

Package org.xml.sax.helpers

Classes

AttributesImpl

DefaultHandler

LocatorImpl

ParserAdapter

ParserFactory

XMLFilterImpl

XMLReaderFactory

Of these, the most important and immediately useful are the various Handler interfaces. The ContentHandler interface specifies all of the callback methods that will be used to deliver information about:

Start and end of a document

Start and end of an XML element

Element namespaces and attributes

Namespace prefix mapping scopes

Processing instructions

Character data and ignorable whitespace

The ErrorHandler interface is called to notify the client application about errors in the underlying XML document syntax or validity. If an application does not register an error handler class, unexpected parser behavior may result.

The DTDHandler interface is used to receive notifications about notation and unparsed entity declarations in the document's DTD. It does not provide information about other DTD constructs such as parameter entities, element declarations, etc.

Scott Means has also written two other XML articles for oreilly.com, What's New in the DOM Level 2 Core? and Converting Unstructured Documents to XML. These hands-on articles cover the official W3C recommendation for DOM Level 2 and the details of XML conversion using real-life examples.

For an application class to actually parse a document and receive these various notification messages, it must:

Implement one or more of the Handler interfaces (ContentHandler, ErrorHandler, or DTDHandler).

Create an XMLReader class instance, possibly by using the XMLReaderFactory helper class.

Register an instance of the SAX-aware class with the new XMLReader instance using the various set*Handler() methods.

Call one of the parse() methods of the XMLReader instance for each document to be processed.

Now, to illustrate these principles, let's build a real application that would benefit from the unique features of SAX. Let's go to the theater.

The Play's the Thing

When I went out shopping on the Internet for sample documents for an application that would show off the features of SAX, I found a collection of Shakespeare's plays that had been encoded as XML documents by Jon Bosak. The link to the ZIP file containing the plays, as well as links to the full source code for this sample, can be found at the end of the article.

The plays are all encoded using the following simple DTD:

<!-- DTD for Shakespeare    J. Bosak    1994.03.01, 1997.01.02 -->

<!-- Revised for case sensitivity 1997.09.10 -->

<!-- Revised for XML 1.0 conformity 1998.01.27

     (thanks to Eve Maler) -->

<!ENTITY amp "&#38;">

<!ELEMENT PLAY     (TITLE, FM, PERSONAE, SCNDESCR,

  PLAYSUBT, INDUCT?, PROLOGUE?, ACT+, EPILOGUE?)>

<!ELEMENT TITLE    (#PCDATA)>

<!ELEMENT FM       (P+)>

<!ELEMENT P        (#PCDATA)>

<!ELEMENT PERSONAE (TITLE, (PERSONA | PGROUP)+)>

<!ELEMENT PGROUP   (PERSONA+, GRPDESCR)>

<!ELEMENT PERSONA  (#PCDATA)>

<!ELEMENT GRPDESCR (#PCDATA)>

<!ELEMENT SCNDESCR (#PCDATA)>

<!ELEMENT PLAYSUBT (#PCDATA)>

<!ELEMENT INDUCT   (TITLE, SUBTITLE*,

  (SCENE+|(SPEECH|STAGEDIR|SUBHEAD)+))>

<!ELEMENT ACT      (TITLE, SUBTITLE*, PROLOGUE?,

  SCENE+, EPILOGUE?)>

<!ELEMENT SCENE    (TITLE, SUBTITLE*,

  (SPEECH | STAGEDIR | SUBHEAD)+)>

<!ELEMENT PROLOGUE (TITLE, SUBTITLE*,

  (STAGEDIR | SPEECH)+)>

<!ELEMENT EPILOGUE (TITLE, SUBTITLE*,

                (STAGEDIR | SPEECH)+)>

<!ELEMENT SPEECH   (SPEAKER+, (LINE | STAGEDIR | SUBHEAD)+)>

<!ELEMENT SPEAKER  (#PCDATA)>

<!ELEMENT LINE     (#PCDATA | STAGEDIR)*>

<!ELEMENT STAGEDIR (#PCDATA)>

<!ELEMENT SUBTITLE (#PCDATA)>

<!ELEMENT SUBHEAD  (#PCDATA)>

[Editor's note: Throughout this article, whenever a line of code or data has been broken into two lines, the second line is indented.]

If you're like me (and if you are, you have my sympathies) you've probably asked yourself questions like, "Exactly how many lines did the Earl of Northumberland speak in Act XI of Henry VI?" (28 lines) or, "How many speeches were made by Julius Caesar in the eponymous play?" (39 speeches). Ok, I have to admit that until I started this article I really hadn't given the matter much thought. But since collecting statistical information from large documents is a great application of SAX, now the world will know how many lines are in the longest speech of The Merry Wives of Windsor (37 lines).

For each speaking part in the play and for the play as a whole, I want my application to track:

The total number of lines spoken

The total number of speeches

The average number of lines per speech

The minimum and maximum number of lines per speech

The number of lines spoken in each act of the play

I would also like to include the name of the play in the output document.

The PlayStats Program

The implementation of this program can be found in PlayStats.java (for program and examples, download here). It is completely self-contained and includes a main() entry point so it can be called from the command line. The basic usage of the program is:

   java com.moonlightideas.examples.PlayStats.PlayStats play_url1.xml
                                [play_url_2.xml ...]

The basic flow of the program is to process each of the play documents provided on the command line. The statistics are cumulative, and once the last play has been processed the results are output to System.out as, you guessed it, an XML document.

Do-It-Yourself Data Structures

Writing a program to retrieve information from an XML document using SAX requires a good bit of thought and up-front design. Unlike APIs such as DOM, SAX doesn't build an in-memory copy of your document that can be perused at your program's leisure. It will notify you as bits and pieces of the document are recognized, but it's up to your program to preserve any information that will be important to it later. It also requires that the program be written to be aware of the underlying structure of the source document. In many cases, assumptions need to be made about the relationships between different elements. Also, since notifications about element open tags, character data, and element close tags arrive separately, it is usually necessary to preserve things such as attribute values or character data in a temporary location until everything is available to complete the processing.

One of the major features of SAX is that document content that is not important to a particular application can easily be ignored. The elements that our application will need to "listen" for are:

<PLAY>

<TITLE>

<ACT>

<SPEECH>

<SPEAKER>

<LINE> 
  
 The Simple API for XML (better known as SAX) is a lightweight, event-driven XML parsing API that was developed by the members of the xml-dev mailing list. 
 
 For most of these elements, element nesting is not important. However, the <TITLE> element can appear as a child of a number of different elements within the document. Since we only want to preserve the name of the play, we must be able to not only recognize a <TITLE> tag, but the parent of the tag as well. A simple and flexible way to do this is to maintain a stack with the current levels of element nesting. Every time a startElement() notification is received, the current tag name is pushed on the m_stackElements stack. It is then popped off of the stack in the endElement() method. This arrangement allows for easy recognition of an element's parentage at any point during parsing.

Tracking Document Features

Since it is up to the SAX application to store and track document information, program data structures tend to closely mirror the structure of the XML documents to be processed. For example, to simplify and isolate the task of collecting information about each of the speakers' lines, the PlayStats class uses an inner class called the Persona class. Each individual speaker (as defined by the <SPEAKER> tag) is mapped to a single Persona object instance. As the DTD above shows, each <SPEECH> tag will have one or more <SPEAKER> tags. The handleSpeaker() method, when it is called from the startElement() notification, sets the m_sbChars class member to point to a new StringBuffer instance so that the speaker's name will be captured by the characters() notification method. Then, in the endElement() notification the name can be used to look up the corresponding persona record (or create a new one if necessary).

To keep on top of the latest developments in XML, visit the O'Reilly Network's XML.com.

Once the closing tag of the <SPEECH> element is recognized, the handleSpeech() method calls the addSpeechLines() method of the persona object for each speaker in the current speakers list. It is also called for the special global Persona object instance that represents the statistics for the entire play (m_personTotal).

Outputting the Results

After all of the play documents have been processed, the collected statistics are output to an arbitrary print stream by the PlayStats.outputStats() method. The results themselves are written as an XML document. The toString() method of the Persona object returns a standard XML representation of the speaker statistics it collects. In the case where the Persona object was instantiated without a speaker's name, the enclosing <PERSONA> tag is omitted.

Wrapping Up

SAX is not a truly universal solution to every type of XML application. In fact, for an application like a WYSIWYG XML editor, it would be downright unusable. But in the proper circumstances, it is definitely the best (and sometimes only) way to efficiently process large volumes of XML data. Understanding the sequence of notifications for a given XML document is critical when it comes to implementing the logic of a SAX application. I hope this article and the accompanying sample code will help make your first experience with SAX a pleasant one.

Links to Example Files:

PlayStats.java program and examples

Source code for the sample class and a few sample runs.

The Plays of Shakespeare

XML versions of the plays of Shakespeare, made available by Jon Bosak.

W. Scott Means has been a professional software developer since 1988 when he joined Microsoft Corporation at the age of 17. He was one of the original developers of OS/2 1.1 and Windows NT and did some of the early work on the Microsoft Network for the Advanced Technology and Business Development group. Most recently, he served as the CEO of Enterprise Web Machines, a South Carolina-based Internet infrastructure venture. He is currently writing full-time and consulting on XML and Internet topics.

如果您希望与本文章的作者或其所在机构,进一步交流,请联系:姜小姐
jill.jiang@amt.com.cn | 021-51096826-112 | 在线联系
企业信息化杂谈[原创]信息化依赖症& 信息化恐..

患有依赖症典型症状是:什么事都想找个软件或者上个系统,总觉得离开了信息化就不能很好解决问题,解决办法也没有档次。

吕建伟 专栏职业经理人的矛与盾---走出软件..

员工讨论企业文化和发展战略,老板强调细节和执行力,职业经理人强调股权和控制权,这都不是份内之事。

CIO职场,强者生存?

在2008年,我们将继续看到CIO向商业运营方向发展。与此同时,我们也会看到商业管理人员将与技术管理人员一起竞争CIO岗位。 IT领导者的就职机会虽有不少,但其难度将会大幅提高。2……

防震减灾,IT当关

今天,任何的防震救灾体系,都离不开IT技术。地震观测台是数字化的,震害防御需要对以往的地震信息进行数据分析,应急救援要需要现代多样化的通讯技术。如果说,在许多行业,信息技术还只是一……