I’m often asked what the easiest or best way of handling XML files in Python is. This is how I read, write and handle the contents of XML files: a quick guide to using xml.etree.ElementTree.
Like many things, XML is best explained with a practical example in mind. Let’s suppose that you have an XML file that encodes user membership in various groups. Something along the lines of:
<?xml version="1.0"?>
<membership>
  <users>
    <user name="john"/>
    <user name="charles"/>
    <user name="peter"/>
  </users>
  <groups>
    <group name="users">
      <user name="john"/>
      <user name="charles"/>
    </group>
    <group name="administrators">
      <user name="peter"/>
    </group>
  </groups>
</membership>
Creating an XML file
xml.etree.ElementTree is a very nice module because it provides classes that let you describe XML from Python in much the same way as you would when writing raw XML by hand. The file above can be created like this:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement
# <membership/>
membership = Element('membership')
# <membership><users/>
users = SubElement(membership, 'users')
# <membership><users><user/>
SubElement(users, 'user', name='john')
SubElement(users, 'user', name='charles')
SubElement(users, 'user', name='peter')
# <membership><groups/>
groups = SubElement(membership, 'groups')
# <membership><groups><group/>
group = SubElement(groups, 'group', name='users')
# <membership><groups><group><user/>
SubElement(group, 'user', name='john')
SubElement(group, 'user', name='charles')
# <membership><groups><group/>
group = SubElement(groups, 'group', name='administrators')
# <membership><groups><group><user/>
SubElement(group, 'user', name='peter')
If Python let you indent freely, the syntax would be even closer to what one would write directly in XML. In any event, because of how closely it resembles the target format, ElementTree can be considered a small domain-specific language. Writing this to a file can be done like this:
with open('membership.xml', 'w', encoding='utf-8') as output_file:
    output_file.write('<?xml version="1.0"?>')
    # encoding='unicode' makes tostring() return str rather than bytes
    output_file.write(ElementTree.tostring(membership, encoding='unicode'))
One thing you will notice is that the resulting membership.xml file has no newlines or indentation. It is all on a single line. This is valid XML but not very human-friendly. If you open it in a browser or an XML editor, it will be displayed with better formatting.
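If the file itself should be readable, the tree can be indented before it is serialized. A minimal sketch, assuming Python 3.9 or later, where ElementTree.indent() is available; the small tree here is a cut-down stand-in for the membership document:

```python
from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement

# Cut-down stand-in for the membership tree built earlier
membership = Element('membership')
users = SubElement(membership, 'users')
SubElement(users, 'user', name='john')

# indent() rewrites the whitespace between elements in place
# (requires Python 3.9 or later)
ElementTree.indent(membership, space='  ')

pretty = ElementTree.tostring(membership, encoding='unicode')
print(pretty)
```

Note that indent() modifies the tree, so the text and tail of the affected elements will afterwards contain whitespace rather than None.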
Reading the XML file
Reading the XML file just created above is a simple task:
from xml.etree import ElementTree
document = ElementTree.parse('membership.xml')
document holds an ElementTree object. It is not itself a node in the XML structure, but it provides a handful of methods for consuming the element hierarchy parsed from the file. Which one you choose is largely a matter of taste and of the task at hand. For example:
users = document.find('users')
is equivalent to:
membership = document.getroot()
users = membership.find('users')
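To make the distinction between the tree object and its root element concrete, here is a small sketch. It parses from an in-memory string through io.StringIO rather than from membership.xml so that it runs on its own; the shortened XML is an assumption made for illustration.

```python
import io
from xml.etree import ElementTree

# Shortened stand-in for membership.xml, held in memory
xml_text = '<membership><users><user name="john"/></users></membership>'
document = ElementTree.parse(io.StringIO(xml_text))

# parse() returns an ElementTree wrapper, not an Element
root = document.getroot()
print(type(document).__name__)
print(root.tag)

# Both lookups start at the root and return the same element object
users_from_tree = document.find('users')
users_from_root = root.find('users')
print(users_from_tree is users_from_root)
```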
Finding specific elements
XML is a hierarchical structure. Depending on what you do, you may want to enforce a certain hierarchy of elements when consuming the contents of the file. For example, we know that membership.xml defines users along the path membership -> users -> user. You can quickly get all the user nodes by doing this:
for user in document.findall('users/user'):
    print(user.attrib['name'])
Likewise, you can quickly get all the groups by doing this:
for group in document.findall('groups/group'):
    print(group.attrib['name'])
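The path syntax accepted by find() and findall() is a limited XPath subset that also supports attribute predicates, which helps when only one group is of interest. A short sketch, parsing a trimmed copy of the example from a string with ElementTree.fromstring() so that it is self-contained:

```python
from xml.etree import ElementTree

xml_text = """<membership>
  <groups>
    <group name="users">
      <user name="john"/>
      <user name="charles"/>
    </group>
    <group name="administrators">
      <user name="peter"/>
    </group>
  </groups>
</membership>"""
# fromstring() returns the root Element directly
membership = ElementTree.fromstring(xml_text)

# [@name='...'] keeps only elements whose attribute matches
path = "groups/group[@name='administrators']/user"
admins = [user.get('name') for user in membership.findall(path)]
print(admins)
```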
Iterating elements
Even after finding specific elements or entry points in the hierarchy, you will normally need to iterate the children of a given node. This can be done like this:
for group in document.findall('groups/group'):
    print('Group:', group.attrib['name'])
    print('Users:')
    # Iterating an element yields its direct children
    for node in group:
        if node.tag == 'user':
            print('-', node.attrib['name'])
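The same kind of loop can collect the data instead of printing it. A sketch that builds a dictionary mapping each group name to its member names; the name members_by_group is only illustrative, and the XML is parsed from a string so the example runs on its own:

```python
from xml.etree import ElementTree

xml_text = """<membership>
  <groups>
    <group name="users">
      <user name="john"/>
      <user name="charles"/>
    </group>
    <group name="administrators">
      <user name="peter"/>
    </group>
  </groups>
</membership>"""
membership = ElementTree.fromstring(xml_text)

members_by_group = {}
for group in membership.findall('groups/group'):
    # Iterating an element yields its direct children in document order
    members_by_group[group.get('name')] = [
        node.get('name') for node in group if node.tag == 'user'
    ]

print(members_by_group)
```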
Other times, you may need to visit the elements below a given starting point. There are two ways of doing it: one walks the starting element together with all of its descendants, the other yields only its direct children. The difference is subtle but important.
Iterate nodes including the starting point:
users = document.find('users')
# iter() replaces getiterator(), which was removed in Python 3.9
for node in users.iter():
    print(node.tag, node.attrib, node.text, node.tail)
Produces this output:
users {} None None
user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None
Iterate only the direct children:
users = document.find('users')
# Iterating the element replaces getchildren(), which was removed in Python 3.9
for node in users:
    print(node.tag, node.attrib, node.text, node.tail)
Produces this output:
user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None
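The difference is easier to see on an element with nested content. In this sketch, which parses a trimmed copy of the groups section from a string, iter() reaches the user elements inside each group while direct iteration stops at the group elements. iter() also accepts a tag name to filter on:

```python
from xml.etree import ElementTree

xml_text = """<groups>
  <group name="users"><user name="john"/></group>
  <group name="administrators"><user name="peter"/></group>
</groups>"""
groups = ElementTree.fromstring(xml_text)

# iter() walks the element itself and every descendant, depth first
all_tags = [node.tag for node in groups.iter()]

# Plain iteration yields only the direct children
child_tags = [node.tag for node in groups]

# Passing a tag name restricts iter() to matching elements
user_names = [node.get('name') for node in groups.iter('user')]

print(all_tags)
print(child_tags)
print(user_names)
```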
Handling namespaces
Some XML files make use of namespaces to disambiguate element tags. XHTML, for example, uses http://www.w3.org/1999/xhtml as its namespace, so the root element of the file reads like this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
When parsing this file with ElementTree, the following call returns None:
body = document.find('body')
print(type(body))
prints (on Python 3):
<class 'NoneType'>
which is not what was expected. The reason is that, because of the use of the xmlns attribute in the <html/> element, the tag names of all the elements look like:
{http://www.w3.org/1999/xhtml}body
not simply:
body
A convenient way to handle this case is to build the tag name with the QName class instead of writing a plain str when searching for tags by name, e.g.:
from xml.etree.ElementTree import QName
namespace = 'http://www.w3.org/1999/xhtml'
body_tag = str(QName(namespace, 'body'))
body = document.find(body_tag)
print(type(body))
prints, as expected (on Python 3):
<class 'xml.etree.ElementTree.Element'>
Notice that keeping namespace in its own variable makes it easy to construct other element tag names you may need to search for, e.g.:
div_tag = str(QName(namespace, 'div'))
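find() and findall() also accept a mapping from prefixes to namespace URIs, which avoids building the qualified names by hand. A sketch on a tiny XHTML-like document made up for illustration; the xhtml prefix is an arbitrary label chosen here and does not need to appear in the file:

```python
from xml.etree import ElementTree

xml_text = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div>first</div><div>second</div></body>'
    '</html>'
)
html = ElementTree.fromstring(xml_text)

# The prefix is local to this mapping; only the URI has to match
namespaces = {'xhtml': 'http://www.w3.org/1999/xhtml'}

body = html.find('xhtml:body', namespaces)
divs = html.findall('xhtml:body/xhtml:div', namespaces)

print(body.tag)
print([div.text for div in divs])
```

The tags of the returned elements still carry the full namespace in braces; the prefix only shortens the search path.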
xml.etree.ElementTree is a nice and intuitive way of dealing with XML content. You can find out more at https://docs.python.org/2/library/xml.etree.elementtree.html