2016-06-18 · 4 min read

I’m often asked what the easiest or best way of handling XML files in Python is. This quick guide to xml.etree.ElementTree shows how I read, write and handle the contents of XML files.

Like many things, XML is better explained with a practical example in mind. Let’s suppose that you have an XML file that encodes user membership in various groups. Something along the lines of:

<?xml version="1.0"?>
<membership>
    <users>
        <user name="john"/>
        <user name="charles"/>
        <user name="peter"/>
    </users>
    <groups>
        <group name="users">
            <user name="john"/>
            <user name="charles"/>
        </group>
        <group name="administrators">
            <user name="peter"/>
        </group>
    </groups>
</membership>

Creating an XML

xml.etree.ElementTree is a very nice module because it provides classes that let you describe XML from Python in much the same way as you would when writing raw XML by hand. The file above can be created like this:

from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement

# <membership/>
membership = Element('membership')
# <membership><users/>
users = SubElement(membership, 'users')
# <membership><users><user/>
SubElement(users, 'user', name='john')
SubElement(users, 'user', name='charles')
SubElement(users, 'user', name='peter')
# <membership><groups/>
groups = SubElement(membership, 'groups')
# <membership><groups><group/>
group = SubElement(groups, 'group', name='users')
# <membership><groups><group><user/>
SubElement(group, 'user', name='john')
SubElement(group, 'user', name='charles')
# <membership><groups><group/>
group = SubElement(groups, 'group', name='administrators')
# <membership><groups><group><user/>
SubElement(group, 'user', name='peter')

If Python let you indent freely, the code would look even closer to what one would write directly in XML. In any event, because it resembles the target format so closely, ElementTree can be considered a small domain-specific language. Writing the tree to a file can be done like this:

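# The with block closes the file even if writing fails.
with open('membership.xml', 'w', encoding='utf-8') as output_file:
    output_file.write('<?xml version="1.0"?>')
    # In Python 3, tostring() returns bytes unless encoding='unicode' is given.
    output_file.write(ElementTree.tostring(membership, encoding='unicode'))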

One thing you will notice is that the resulting membership.xml file has no new-lines or indentation; it is all on a single line. This is valid XML but not very human-friendly. If you open it in a browser or an XML editor, it will be displayed with better formatting.
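If a readable file matters, one option is to let ElementTree add the whitespace itself. The sketch below assumes Python 3.9 or later, where ElementTree.indent() is available, and uses a shortened version of the membership tree. The temporary folder is only there to keep the example self-contained.

```python
import os
import tempfile
from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement

# A shortened version of the membership tree built above.
membership = Element('membership')
users = SubElement(membership, 'users')
SubElement(users, 'user', name='john')

# indent() (Python 3.9+) inserts new-lines and spaces in place.
ElementTree.indent(membership, space='    ')

# ElementTree.write() can emit the XML declaration on its own.
tree = ElementTree.ElementTree(membership)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'membership.xml')
    tree.write(path, encoding='utf-8', xml_declaration=True)
    with open(path, encoding='utf-8') as pretty_file:
        pretty = pretty_file.read()

print(pretty)
```

Keep in mind that the added whitespace ends up in the text and tail attributes of the elements when the file is parsed again.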

Reading the XML file

Reading the XML file just created above is a simple task:

from xml.etree import ElementTree
document = ElementTree.parse('membership.xml')

document is an ElementTree object rather than a node in the XML structure, but it provides a handful of methods (such as getroot(), find(), findall() and iter()) for consuming the element hierarchy parsed from the file. Which one you choose is largely a matter of taste and of the task at hand. For example:

users = document.find('users')

is equivalent to:

membership = document.getroot()
users = membership.find('users')

Finding specific elements

XML is a hierarchical structure. Depending on what you do, you may want to enforce a certain hierarchy of elements when consuming the contents of the file. For example, we know that the membership.xml file expects users to be defined as membership -> users -> user. You can quickly get all the user nodes by doing this:

for user in document.findall('users/user'):
    print(user.attrib['name'])

Likewise, you can quickly get all the groups by doing this:

for group in document.findall('groups/group'):
    print(group.attrib['name'])
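The path passed to findall() can also filter on attribute values, which saves a manual check inside the loop. A minimal sketch, using an inline copy of part of membership.xml so that it runs on its own (the names xml_text, admins and admin_names are just for illustration):

```python
from xml.etree import ElementTree

# Inline copy of part of membership.xml so the example runs on its own.
xml_text = """
<membership>
    <groups>
        <group name="users">
            <user name="john"/>
            <user name="charles"/>
        </group>
        <group name="administrators">
            <user name="peter"/>
        </group>
    </groups>
</membership>
"""
membership = ElementTree.fromstring(xml_text)

# [@name='...'] keeps only the elements whose attribute matches.
admins = membership.findall("groups/group[@name='administrators']/user")
admin_names = [user.get('name') for user in admins]
print(admin_names)  # ['peter']
```

ElementTree only supports a limited subset of XPath, so more elaborate queries may need a different library.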

Iterating elements

Even after finding specific elements or entry points in the hierarchy, you will normally need to iterate over the children of a given node. This can be done like this:

for group in document.findall('groups/group'):
    print('Group:', group.attrib['name'])
    print('Users:')
    # Iterating over an element yields its direct children.
    # (getchildren() was removed in Python 3.9.)
    for node in group:
        if node.tag == 'user':
            print('-', node.attrib['name'])
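When the goal is to collect the data rather than print it, the same traversal can feed a dictionary. This is only a sketch: it works on an inline copy of the groups section, and members_by_group is a name made up for the example.

```python
from xml.etree import ElementTree

# Inline copy of the groups section so the example runs on its own.
xml_text = """
<membership>
    <groups>
        <group name="users">
            <user name="john"/>
            <user name="charles"/>
        </group>
        <group name="administrators">
            <user name="peter"/>
        </group>
    </groups>
</membership>
"""
membership = ElementTree.fromstring(xml_text)

# Map each group name to the names of the users listed directly under it.
members_by_group = {
    group.get('name'): [user.get('name') for user in group.findall('user')]
    for group in membership.findall('groups/group')
}
print(members_by_group)
# {'users': ['john', 'charles'], 'administrators': ['peter']}
```

Here group.findall('user') replaces the explicit node.tag check, since it only returns direct children with that tag.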

Other times, you may need to visit every single element in the hierarchy from a given starting point. There are two ways of iterating: one visits the starting element and all of its descendants, the other only its direct children. The difference is subtle but important:

Iterate nodes including starting point:

users = document.find('users')
# iter() replaces getiterator(), which was removed in Python 3.9.
for node in users.iter():
    print(node.tag, node.attrib, node.text, node.tail)

Produces this output:

users {} None None 
user {'name': 'john'} None None 
user {'name': 'charles'} None None 
user {'name': 'peter'} None None

Iterate only the children:

users = document.find('users')
# Iterating over an element yields only its direct children.
for node in users:
    print(node.tag, node.attrib, node.text, node.tail)

Produces this output:

user {'name': 'john'} None None 
user {'name': 'charles'} None None 
user {'name': 'peter'} None None
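The full traversal can also be restricted to a single tag by passing the tag name to iter(). A small sketch on an inline, shortened document (the name all_user_names is just for illustration):

```python
from xml.etree import ElementTree

# Shortened inline document so the example runs on its own.
xml_text = """
<membership>
    <users>
        <user name="john"/>
        <user name="peter"/>
    </users>
    <groups>
        <group name="administrators">
            <user name="peter"/>
        </group>
    </groups>
</membership>
"""
membership = ElementTree.fromstring(xml_text)

# iter('user') walks the whole tree in document order and yields
# only the elements whose tag is 'user', at any depth.
all_user_names = [node.get('name') for node in membership.iter('user')]
print(all_user_names)  # ['john', 'peter', 'peter']
```

Note that peter shows up twice, once under users and once under the administrators group, because iter() does not care where in the hierarchy an element sits.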

Handling namespaces

Some XML files make use of namespaces to disambiguate element tags. XHTML, for example, uses http://www.w3.org/1999/xhtml as its namespace, so the main element in the XML file reads like this:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

When parsing this file with ElementTree, the following instruction would return None:

body = document.find('body')
print(type(body))

prints:

<class 'NoneType'>

which is not what was expected. The reason is that, because of the use of the xmlns attribute in the <html/> element, the tag names of all the elements look like:

{http://www.w3.org/1999/xhtml}body

not simply:

body

The best way to handle this case is by using the QName class instead of a str when searching for tags based on name, e.g.:

from xml.etree.ElementTree import QName

namespace = 'http://www.w3.org/1999/xhtml'
body_tag = str(QName(namespace, 'body'))
body = document.find(body_tag)
print(type(body))

prints, as expected:

<class 'xml.etree.ElementTree.Element'>

Notice that keeping namespace in its own variable makes it easy to construct any other element tag names you may need to search for, e.g.:

div_tag = str(QName(namespace, 'div'))
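As an alternative to building each tag with QName, find() and findall() accept a mapping from prefixes to namespace URIs. The sketch below assumes a tiny inline XHTML document; the prefix x and the variable names are arbitrary choices made for the example.

```python
from xml.etree import ElementTree

# Tiny inline XHTML document so the example runs on its own.
xhtml_text = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Example</title></head>
<body><div>first</div><div>second</div></body>
</html>"""
root = ElementTree.fromstring(xhtml_text)

# The prefix 'x' is arbitrary; only the URI has to match the document.
namespaces = {'x': 'http://www.w3.org/1999/xhtml'}

body = root.find('x:body', namespaces)
divs = [div.text for div in root.findall('x:body/x:div', namespaces)]

print(body.tag)  # {http://www.w3.org/1999/xhtml}body
print(divs)      # ['first', 'second']
```

This keeps longer paths readable, since every step of the path needs the namespace, not just the last one.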

xml.etree.ElementTree is a nice and intuitive way of dealing with XML content. You can find out more at https://docs.python.org/3/library/xml.etree.elementtree.html
