14. Processing XML Documents
Consider the following XML document:
<tribu>
  <enseignant>
    <personne sexe="M">
      <nom>dupont</nom>
      <prenom>jean</prenom>
      <age>28</age>
      <!-- ceci est un commentaire -->
    </personne>
    <section>27</section>
  </enseignant>
  <etudiant>
    <personne sexe="F">
      <nom>martin</nom>
      <prenom>charline</prenom>
      <age>22</age>
    </personne>
    <formation>dess IAIE</formation>
  </etudiant>
</tribu>
We parse this document to produce the following console output:
tribu
enseignant
(personne,(sexe,M) )
nom
[dupont]
/nom
prenom
[jean]
/prenom
age
[28]
/age
/personne
section
[27]
/section
/enseignant
etudiant
(personne,(sexe,F) )
nom
[martin]
/nom
prenom
[charline]
/prenom
age
[22]
/age
/personne
formation
[dess IAIE]
/formation
/etudiant
/tribu
We need to know how to recognize:
- a start tag such as <formation>;
- an end tag such as </enseignant>;
- a start tag with attributes such as <personne sexe="F">;
- the body of a tag, such as "martin" in <nom>martin</nom>.
A program that parses XML code is called an XML parser. Two modules provide the functionality needed to parse XML code: xml.sax and xml.sax.handler.
The xml.sax module provides an XML parser, obtained by calling xml.sax.make_parser().
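As a minimal sketch, creating the parser looks like this. The variable name xml_parser is an illustrative choice, not necessarily the one used in the original script.

```python
import xml.sax

# make_parser() returns a parser object implementing the SAX XMLReader interface.
xml_parser = xml.sax.make_parser()
```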
This parser parses the XML text sequentially. It calls user-defined methods on events:
- the startElement method on a start tag;
- the endElement method on a closing tag;
- the characters method on the body of a tag.
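To make the sequential nature of these calls concrete, here is a minimal sketch that records the order in which the three methods are invoked on a tiny document. The class name EventRecorder, its events list, and the sample document are hypothetical and serve only as an illustration. The handler derives from xml.sax.handler.ContentHandler.

```python
import xml.sax
import xml.sax.handler


class EventRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each call, in the order received."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        self.events.append(("characters", content))


recorder = EventRecorder()
# parseString() creates a parser, attaches the handler, and parses the bytes.
xml.sax.parseString(b"<a><b>x</b></a>", recorder)
for event in recorder.events:
    print(event)
```

The events arrive in document order: the start of a, the start of b, the text x, the end of b, then the end of a.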
We need to tell the parser which object implements these methods. To do so, we pass an instance of a class that implements the startElement, endElement, and characters methods to the parser’s setContentHandler method. This class is a subclass of xml.sax.handler.ContentHandler. The three methods are called with the following parameters:
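- startElement(name, attributes): name is the name of the start tag and attributes is a dictionary-like object holding the tag's attributes. Thus, for the tag <personne sexe="M">, we will have name='personne' and attributes containing {'sexe': 'M'}.
- endElement(name): name is the name of the end tag. Thus, for the </etudiant> tag, we will have name='etudiant'.
- characters(data): data is the body of the tag. Thus, if the body of a <nom> tag is dupont on a line of its own, data also carries the surrounding line breaks and indentation, for example '\n dupont\n ' (an XML parser normalizes '\r\n' line endings to '\n', and it may deliver the text in several calls). Generally, we will remove the whitespace preceding and following the data.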
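The parameters described above can be inspected with a short sketch. The class name ParamInspector, its three lists, and the SAMPLE fragment are assumptions made for illustration. The attributes object is converted to a plain dict for display, and each chunk of text is stripped before being kept.

```python
import xml.sax
import xml.sax.handler

# Hypothetical fragment: the body of <nom> sits on its own indented line.
SAMPLE = b"""<personne sexe="M">
  <nom>
    dupont
  </nom>
</personne>"""


class ParamInspector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that stores the parameters it receives."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self.ends = []
        self.texts = []

    def startElement(self, name, attrs):
        # attrs is dictionary-like; copy it into a real dict.
        self.starts.append((name, dict(attrs.items())))

    def endElement(self, name):
        self.ends.append(name)

    def characters(self, content):
        # Drop surrounding whitespace; ignore chunks that were only whitespace.
        text = content.strip()
        if text:
            self.texts.append(text)


inspector = ParamInspector()
xml.sax.parseString(SAMPLE, inspector)
print(inspector.starts)
print(inspector.ends)
print(inspector.texts)
```

Stripping each chunk separately is enough for short bodies like these. For long text that the parser might split mid-sentence, accumulating the chunks and stripping them in endElement would be the safer design.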
Now that this is explained, we can move on to the script for parsing an XML document:
Notes:
- the script imports the xml.sax and xml.sax.handler modules (line 3);
- line 62: the parsed XML file;
- line 68: the XML parser;
- line 70: the handler for events emitted by the parser will be an instance of the XmlHandler class;
- line 72: the parsing of the XML document begins;
- line 6: the class implementing the startElement, endElement, and characters methods. It is derived from the xml.sax.handler.ContentHandler class, which implements methods used by the parser;
- line 9: the startElement method;
- line 30: the endElement method;
- line 41: the characters method.
The results are those presented at the beginning of this section.
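As a rough illustration of the script described in the notes, here is a minimal sketch. It is not the original listing. The function name parse_document, the out parameter, and the inline SAMPLE document (used here instead of a file on disk) are assumptions, and the line numbers cited in the notes do not apply to it. Only the class name XmlHandler and the calls to make_parser, setContentHandler, and parse come from the text.

```python
import io
import sys
import xml.sax
import xml.sax.handler

# Assumption: the document is embedded here rather than read from a file.
SAMPLE = b"""<tribu>
  <enseignant>
    <personne sexe="M">
      <nom>dupont</nom>
      <prenom>jean</prenom>
      <age>28</age>
      <!-- ceci est un commentaire -->
    </personne>
    <section>27</section>
  </enseignant>
  <etudiant>
    <personne sexe="F">
      <nom>martin</nom>
      <prenom>charline</prenom>
      <age>22</age>
    </personne>
    <formation>dess IAIE</formation>
  </etudiant>
</tribu>"""


class XmlHandler(xml.sax.handler.ContentHandler):
    """Prints one line per start tag, non-blank text chunk, and end tag."""

    def __init__(self, out=None):
        super().__init__()
        self.out = sys.stdout if out is None else out

    def startElement(self, name, attrs):
        if attrs.getLength() == 0:
            print(name, file=self.out)
        else:
            # Render each attribute as "(key,value) ", as in the sample output.
            pairs = "".join(f"({key},{value}) " for key, value in attrs.items())
            print(f"({name},{pairs})", file=self.out)

    def endElement(self, name):
        print(f"/{name}", file=self.out)

    def characters(self, content):
        # Whitespace-only chunks (indentation, line breaks) are skipped.
        text = content.strip()
        if text:
            print(f"[{text}]", file=self.out)


def parse_document(source, out=None):
    """Parse source (file name or file object) and print its SAX events."""
    xml_parser = xml.sax.make_parser()
    xml_parser.setContentHandler(XmlHandler(out))
    xml_parser.parse(source)


if __name__ == "__main__":
    parse_document(io.BytesIO(SAMPLE))
```

The comment inside the first personne element produces no output, because ContentHandler defines no callback for comments. To parse a file instead, pass its path to parse_document.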
