14. Processing XML Documents
Consider the following XML document:
<tribe>
<teacher>
<person gender="M">
<last_name>dupont</last_name>
<first_name>jean</first_name>
<age>28</age>
This is a comment
</person>
<section>27</section>
</teacher>
<student>
<person gender="F">
<last_name>martin</last_name>
<first_name>Charline</first_name>
<age>22</age>
</person>
<education>IAIE degree</education>
</student>
</tribe>
We analyze this document to produce the following console output:
tribe
 teacher
  (person,(gender,M) )
   last_name
    [dupont]
   /last_name
   first_name
    [jean]
   /first_name
   age
    [28]
   /age
  /person
  section
   [27]
  /section
 /teacher
 student
  (person,(gender,F) )
   last_name
    [martin]
   /last_name
   first_name
    [Charline]
   /first_name
   age
    [22]
   /age
  /person
  education
   [IAIE degree]
  /education
 /student
/tribe
We need to know how to recognize:
- a start tag such as <education>;
- an end tag such as </teacher>;
- a start tag with attributes such as <person gender="F">;
- the body of a tag, such as "martin" in <last_name>martin</last_name>.
A program that parses XML code is called an XML parser. Two modules provide the functionality needed to parse XML code: xml.sax and xml.sax.handler.
The xml.sax module provides us with an XML parser using the following statement:
xml_parser = xml.sax.make_parser()
This parser reads the XML text sequentially. As it goes, it calls user-defined methods when the following events occur:
- the startElement method on a start tag;
- the endElement method on a closing tag;
- the characters method on the body of a tag.
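To make the order of these events concrete, here is a minimal sketch. The EventRecorder class and its events list are hypothetical names introduced for illustration. The sketch uses xml.sax.parseString, a function of the xml.sax module that parses a string with a given handler, and print() calls so that it also runs under Python 3.

```python
import xml.sax
import xml.sax.handler


class EventRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each event instead of displaying it."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attributes):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, data):
        # whitespace-only text between tags is ignored in this sketch
        if data.strip():
            self.events.append(("text", data.strip()))


recorder = EventRecorder()
xml.sax.parseString(b"<person><age>28</age></person>", recorder)
for event in recorder.events:
    print(event)
```

The events are recorded in document order: the start of person, the start of age, the text 28, the end of age, then the end of person.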
We need to tell the parser which class implements these methods:
xml_parser.setContentHandler(XmlHandler())
We pass the parser's setContentHandler method an instance of a class that implements the startElement, endElement, and characters methods. This class is a subclass of xml.sax.handler.ContentHandler. The three methods receive the following parameters:
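- startElement(self, name, attributes): name is the name of the start tag, and attributes is a dictionary-like object holding the tag's attributes. Thus, for the tag <person gender="M">, we will have name='person' and attributes={'gender':'M'};
- endElement(self, name): name is the name of the end tag. Thus, for the </student> tag, we will have name='student';
- characters(self, data): data is the body of the tag. Thus, if the tag is
<last_name>
 dupont
 </last_name>
we will have data='\n dupont\n ' (the parser normalizes line endings to '\n' and may deliver this text in several successive calls). Generally, we will remove the whitespace preceding and following the data.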
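The sketch below illustrates what each method receives. The ParameterDemo class and its attribute names are hypothetical and only serve the illustration. Because the text may arrive in one call or several, the chunks are joined before the surrounding whitespace is removed.

```python
import xml.sax
import xml.sax.handler


class ParameterDemo(xml.sax.handler.ContentHandler):
    """Hypothetical handler that stores what each method receives."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.attributes_seen = {}  # tag name -> plain dict of its attributes
        self.closed = []           # names received by endElement, in order
        self.chunks = []           # every string received by characters

    def startElement(self, name, attributes):
        # copy the dictionary-like attributes object into a plain dict
        self.attributes_seen[name] = dict(attributes.items())

    def endElement(self, name):
        self.closed.append(name)

    def characters(self, data):
        # the text may arrive in one chunk or several, so keep them all
        self.chunks.append(data)


document = b'<person gender="M"><last_name>\n dupont\n </last_name></person>'
demo = ParameterDemo()
xml.sax.parseString(document, demo)

print(demo.attributes_seen["person"])
print(demo.closed)
print("".join(demo.chunks).strip())
```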
Now that this is explained, we can move on to the script for parsing an XML document:
# -*- coding=utf-8 -*-
import xml.sax, xml.sax.handler, re
# XML handling class
class XmlHandler(xml.sax.handler.ContentHandler):
    # function called when an opening tag is encountered
    def startElement(self, name, attributes):
        global depth
        # a sequence of spaces (indentation)
        print " " * depth,
        # attributes
        details = ""
        for (attrib, value) in attributes.items():
            details += "(%s,%s) " % (attrib, value)
        # display the tag name and any attributes
        if details:
            print "(%s,%s)" % (name, details)
        else:
            print name
        # one more level in the tree
        depth += 1
        # is this a data tag?
        global dataTags, dataTag
        if name.lower() in dataTags:
            dataTag = True
    # function called when an end tag is encountered
    def endElement(self, name):
        # indentation level: one level up in the tree
        global depth
        depth -= 1
        # a sequence of spaces (indentation)
        print " " * depth,
        # tag name
        print "/%s" % (name)
    # function called on the body of a tag
    def characters(self, data):
        global dataTag
        # is the current tag a data tag? if not, the text is ignored
        if not dataTag:
            return
        # indentation level
        global depth
        # a sequence of spaces (indentation)
        print " " * depth,
        # the data is displayed without its leading and trailing whitespace
        match = re.match(r"^\s*(.*?)\s*$", data)
        if match:
            print "[%s]" % (match.group(1))
        # end of data tag
        dataTag = False
# ------------------------------------------- main
# data
file = "data.xml"   # the XML file
depth = 0           # indentation level = depth in the tree structure
# tags whose body is displayed (keys in lowercase, as in the XML file)
dataTags = {"last_name": 1, "first_name": 1, "age": 1, "section": 1, "education": 1}
dataTag = False     # when true, indicates that the current tag is a data tag
# create an XML parser object
xml_parser = xml.sax.make_parser()
# the tag handler
xml_parser.setContentHandler(XmlHandler())
# processing the XML file
xml_parser.parse(file)
Notes:
- The script uses the xml.sax and xml.sax.handler modules, imported at the top of the script;
- file: the XML file to parse;
- xml.sax.make_parser(): creates the XML parser;
- setContentHandler: the handler for the events emitted by the parser is an instance of the XmlHandler class;
- xml_parser.parse(file): starts the parsing of the XML document;
- XmlHandler: the class implementing the startElement, endElement, and characters methods. It is derived from the xml.sax.handler.ContentHandler class, which implements methods used by the parser;
- startElement: displays the tag name and its attributes, then goes one level deeper in the tree;
- endElement: goes back up one level, then displays the name of the end tag;
- characters: displays the body of the tag when the current tag is one of the data tags listed in dataTags, and ignores any other text.
The results are those presented at the beginning of this section.
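As a variation, the same processing can be sketched without global variables by keeping the state on the handler instance. This is only one possible arrangement, not the script above. The TreeHandler class, its emit method, and its attributes are hypothetical names. The sketch collects the lines in a list instead of printing them as they come, and it displays the body of a data tag when the tag closes, so that text delivered in several chunks is shown once.

```python
import xml.sax
import xml.sax.handler


class TreeHandler(xml.sax.handler.ContentHandler):
    """Hypothetical variant of XmlHandler that keeps its state on the instance."""

    def __init__(self, data_tags):
        xml.sax.handler.ContentHandler.__init__(self)
        self.data_tags = set(data_tags)  # names of the tags whose body is shown
        self.depth = 0                   # current depth in the tree
        self.in_data_tag = False         # True while inside a data tag
        self.text = []                   # text chunks of the current data tag
        self.lines = []                  # output, one entry per line

    def emit(self, text):
        # one space of indentation per level of depth
        self.lines.append(" " * self.depth + text)

    def startElement(self, name, attributes):
        details = "".join("(%s,%s) " % item for item in attributes.items())
        self.emit("(%s,%s)" % (name, details) if details else name)
        self.depth += 1
        self.in_data_tag = name.lower() in self.data_tags
        self.text = []

    def characters(self, data):
        # collect the chunks; they are displayed when the tag closes
        if self.in_data_tag:
            self.text.append(data)

    def endElement(self, name):
        if self.in_data_tag:
            self.emit("[%s]" % "".join(self.text).strip())
            self.in_data_tag = False
        self.depth -= 1
        self.emit("/%s" % name)


sample = (b'<student><person gender="F"><last_name>martin</last_name>'
          b'</person><education>IAIE degree</education></student>')
handler = TreeHandler(["last_name", "first_name", "age", "section", "education"])
xml.sax.parseString(sample, handler)
print("\n".join(handler.lines))
```

Unlike the script above, this variant also displays an empty pair of brackets for a data tag that has no body.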
