
14. Processing XML Documents

  

Consider the following XML document:

<tribe>
  <teacher>
    <person gender="M">
      <last_name>dupont</last_name>
      <first_name>jean</first_name>
      <age>28</age>
      This is a comment
    </person>
    <section>27</section>
  </teacher>
  <student>
    <person gender="F">
      <last_name>martin</last_name>
      <first_name>Charline</first_name>
      <age>22</age>
    </person>
    <education>IAIE degree</education>
  </student>
</tribe>

We analyze this document to produce the following console output:

 tribe
  teacher
   (person,(gender,M) )
    last_name
     [dupont]
    /last_name
    first_name
     [jean]
    /first_name
    age
     [28]
    /age
   /person
   section
    [27]
   /section
  /teacher
  student
   (person,(gender,F) )
    last_name
     [martin]
    /last_name
    first_name
     [Charline]
    /first_name
    age
     [22]
    /age
   /person
   education
    [IAIE degree]
   /education
  /student
 /tribe

We need to know how to recognize:

  • a start tag such as <education>;
  • an end tag such as </teacher>;
  • a start tag with attributes such as <person gender="F">;
  • the body of a tag, such as "martin" in <last_name>martin</last_name>.

A program that parses XML code is called an XML parser. Two modules provide the functionality needed to parse XML code: xml.sax and xml.sax.handler.

The [xml.sax] module provides us with an XML parser using the following statement:

xml_parser=xml.sax.make_parser()

This parser parses the XML text sequentially. It calls user-defined methods on events:

  • the startElement method on a start tag;
  • the endElement method on a closing tag;
  • the characters method on the body of a tag.

We need to tell the parser which class implements these methods:

xml_parser.setContentHandler(XmlHandler())
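As a minimal sketch of how these pieces fit together, the handler below records each event in a list instead of printing it. The names EventRecorder, events, and record_events, as well as the small in-memory document, are illustrative assumptions and not part of the script presented later. The sketch passes a byte stream to the parser's parse method rather than a file name, which the parser also accepts.

```python
import io
import xml.sax
import xml.sax.handler


class EventRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records parser events in a list."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attributes):
        # called on every start tag
        self.events.append(("start", name))

    def endElement(self, name):
        # called on every end tag
        self.events.append(("end", name))

    def characters(self, data):
        # called on text; whitespace-only text between tags is skipped here
        if data.strip():
            self.events.append(("text", data.strip()))


def record_events(xml_bytes):
    """Parse xml_bytes and return the list of recorded events."""
    recorder = EventRecorder()
    xml_parser = xml.sax.make_parser()
    xml_parser.setContentHandler(recorder)
    xml_parser.parse(io.BytesIO(xml_bytes))
    return recorder.events


sample = b"<student><age>22</age></student>"
for event in record_events(sample):
    print(event)
```

The events arrive in document order: a start event for each opening tag, a text event for each non-blank body, and an end event for each closing tag.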

We pass the parser's setContentHandler method an instance of a class that implements the startElement, endElement, and characters methods. This class is a subclass of xml.sax.handler.ContentHandler. The methods are called with the following parameters:

def startElement(self, name, attributes):
  • name is the name of the start tag. attributes is a dictionary-like object holding the tag's attributes. Thus, for the tag <person gender="M">, we will have name="person" and attributes.items() will yield the pair ('gender','M')
def endElement(self, name):
  • name is the name of the end tag. Thus, for the </student> tag, we will have name='student'.
def characters(self, data):
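  • data is the body of the tag. Thus, if the tag is
<last_name>
    dupont
</last_name>

data will include the surrounding whitespace and line breaks, for example data='\n    dupont\n'. The parser may also deliver the body in several successive calls. Generally, we will remove the whitespace preceding and following the data.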
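The sketch below illustrates both points: reading the attributes through the dictionary-like object passed to startElement, and trimming the whitespace around the body of a tag. Since a SAX parser is allowed to deliver one body in several characters calls, the sketch accumulates the chunks and strips them only when the end tag arrives. The names TextCollector, attributes_seen, and texts are hypothetical and chosen for this illustration only.

```python
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler collecting attributes and trimmed tag bodies."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.attributes_seen = {}   # tag name -> plain dict of its attributes
        self.texts = {}             # tag name -> body with whitespace removed
        self._chunks = []           # pieces of text received for the current tag

    def startElement(self, name, attributes):
        # attributes behaves like a read-only dictionary
        if attributes.getLength() > 0:
            self.attributes_seen[name] = dict(attributes.items())
        self._chunks = []

    def characters(self, data):
        # may be called several times for a single body
        self._chunks.append(data)

    def endElement(self, name):
        # join the chunks, then drop leading and trailing whitespace
        text = "".join(self._chunks).strip()
        if text:
            self.texts[name] = text
        self._chunks = []


document = b"""<person gender="M">
  <last_name>
      dupont
  </last_name>
</person>"""

collector = TextCollector()
xml.sax.parseString(document, collector)
print(collector.attributes_seen)
print(collector.texts)
```

Stripping at the end tag rather than in characters is a design choice of this sketch: it keeps the result independent of how the parser happens to split the text.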

Now that this is explained, we can move on to the script for parsing an XML document:


The program (xml_sax_01)


# -*- coding=utf-8 -*-

import xml.sax, xml.sax.handler, re

# XML handling class
class XmlHandler(xml.sax.handler.ContentHandler):

    # function called when an opening tag is encountered
    def startElement(self, name, attributes):
        global depth
        # a sequence of spaces (indentation)
        print " " * depth,
        # attributes
        details=""
        for (attrib, value) in attributes.items():
            details+="(%s,%s) " % (attrib,value)
        # Display the tag name and any attributes
        if details:
            print "(%s,%s)" % (name,details)
        else:
            print name
        # one more level in the tree
        depth+=1
        # Is this a data tag?
        global dataTags, dataTag
        if name.lower() in dataTags:
            dataTag=True

    # the function called when an end tag is encountered
    def endElement(self, name):
        # end tag
        # indentation level
        global depth
        depth -= 1
        # a sequence of spaces (indentation)
        print " " * depth,
        # tag name
        print "/%s" % (name)

    # the data display function
    def characters(self, data):
        # data
        global dataTag

        # Is the current tag a data tag?
        if not dataTag:
            return
        # indentation level
        global depth
        # a sequence of spaces (indentation)
        print " " * depth,
        # the data is displayed
        match = re.match(r"^\s*(.*?)\s*$", data)
        if match:
            print "[%s]" % (match.groups()[0])
        # end of data tag
        dataTag=False

# ------------------------------------------- main  
# the program
# data
file="data.xml"       # the XML file
depth=0               # indentation level = depth in the tree structure
dataTags={"last_name":1,"first_name":1,"age":1,"section":1,"education":1}
dataTag=False    # when true, indicates that the current tag is a data tag

# create an XML parser object
xml_parser = xml.sax.make_parser()
# the tag handler
xml_parser.setContentHandler(XmlHandler())
# processing the XML file
xml_parser.parse(file)

Notes:

  • line 3: the script uses the classes and functions of the xml.sax and xml.sax.handler modules;
  • line 62: the parsed XML file;
  • line 68: the XML parser;
  • line 70: the handler for events emitted by the parser will be an instance of the XmlHandler class;
  • line 72: the parsing of the XML document begins;
  • line 6: the class implementing the startElement, endElement, and characters methods. It is derived from the xml.sax.handler.ContentHandler class, which implements methods used by the parser;
  • line 9: the startElement method;
  • line 30: the endElement method;
  • line 41: the characters method.

The results are those presented at the beginning of this section.
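As a possible variation, the same traversal can be written without global variables by keeping the depth and the data-tag state on the handler instance. The sketch below is a rewrite under stated assumptions, not the script above: the names TreeHandler, DATA_TAGS, lines, and render are hypothetical, the output is collected in a list rather than printed directly, and the text of a data tag is accumulated and emitted when its end tag is reached.

```python
import xml.sax
import xml.sax.handler

# tags whose body is displayed (same names as in the sample document)
DATA_TAGS = {"last_name", "first_name", "age", "section", "education"}


class TreeHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that keeps its state on the instance."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.depth = 0      # current depth in the tree
        self.lines = []     # output lines, in order
        self._text = None   # list of text chunks inside a data tag, else None

    def _emit(self, text):
        # one leading space, then one space per level of depth
        self.lines.append(" " * (self.depth + 1) + text)

    def startElement(self, name, attributes):
        details = "".join("(%s,%s) " % (k, v) for k, v in attributes.items())
        self._emit("(%s,%s)" % (name, details) if details else name)
        self.depth += 1
        # start collecting text only for data tags
        self._text = [] if name.lower() in DATA_TAGS else None

    def characters(self, data):
        if self._text is not None:
            self._text.append(data)

    def endElement(self, name):
        if self._text is not None:
            # display the trimmed body before closing the data tag
            self._emit("[%s]" % "".join(self._text).strip())
            self._text = None
        self.depth -= 1
        self._emit("/%s" % name)


def render(xml_bytes):
    """Parse xml_bytes and return the indented lines."""
    handler = TreeHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines


sample = b'<student><person gender="F"><age>22</age></person></student>'
for line in render(sample):
    print(line)
```

One difference from the global-variable version is worth noting: an empty data tag would produce a [] line here, because the body is emitted at the end tag whether or not any text was received.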