lxml—the good parts

This is currently a work in progress—stay tuned for more content.

lxml is an improved version of the core ElementTree API. It is incredibly useful for processing XML in Python, but it can sometimes be confusing to use. This is a primer on the ways to achieve some common tasks.

My standard when processing XML is that the output should always be:

  • UTF-8 encoded
  • With an XML declaration
  • Pretty-printed for readability

Reading XML from a file

Reading XML from a file can be accomplished with the etree.parse() function. This returns an _ElementTree containing the parsed contents of the file:

from lxml import etree
etree.parse('myfile.xml')

As per the documentation, the source argument can be:

  • a file name/path
  • a file object
  • a file-like object
  • a URL using the HTTP or FTP protocol

Parsing will generally make sensible decisions regarding encoding, and if you have a correct XML declaration in your source files then you shouldn't have any issues. If you want to specify how the file is parsed, you can call parse() with the optional parser argument, passing an instance of etree.XMLParser():

from lxml import etree
parser = etree.XMLParser(encoding='latin1')
etree.parse('myfile.xml', parser=parser)
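To illustrate the file-like object case together with a custom parser, here is a minimal sketch. The sample document and the choice of remove_blank_text=True are assumptions made for the example rather than part of the original text; an in-memory BytesIO stands in for a real file.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample data standing in for the contents of a file on disk.
source = BytesIO(
    b"<?xml version='1.0' encoding='UTF-8'?>\n"
    b"<root>\n  <elem>foo</elem>\n</root>\n"
)

# remove_blank_text asks the parser to drop whitespace-only text between
# elements, which helps pretty_print re-indent the document later on.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(source, parser=parser)

root = tree.getroot()
print(root.tag)
print(root[0].text)
```

Any object with a read() method that returns bytes should work in place of the BytesIO here.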

Reading XML from a string

If you already have a string containing your XML, you can parse it into an _Element using etree.fromstring(text, parser=None). As with parse(), you can optionally pass in an XMLParser:

from lxml import etree
etree.fromstring('<root><elem>foo</elem></root>')

text can be either a byte string or a Unicode string. As with parse(), this will generally make sensible decisions regarding encoding.
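One wrinkle worth knowing about: a byte string may carry an XML declaration with an encoding, but lxml rejects a Unicode string that contains one. The sketch below shows both cases; the sample document is made up for illustration.

```python
from lxml import etree

# Bytes input may include an XML declaration; the parser reads the
# encoding from it. The sample content is hypothetical.
xml_bytes = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    "<root><elem>caf\u00e9</elem></root>"
).encode("utf-8")
root = etree.fromstring(xml_bytes)
text_from_bytes = root[0].text

# The same document as a str is rejected, because the declared encoding
# no longer means anything once the text has been decoded.
try:
    etree.fromstring(xml_bytes.decode("utf-8"))
    str_rejected = False
except ValueError:
    str_rejected = True

print(text_from_bytes, str_rejected)
```

If you only have a str with a declaration, encoding it back to bytes before parsing is usually the simplest fix.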

If you'd prefer to have an _ElementTree instead of an _Element, see Converting an _Element into an _ElementTree.

Writing XML to a file

If you have an ElementTree, you can write the XML to a file using the ElementTree.write() method. Generally you will want to specify the encoding, xml_declaration, and pretty_print arguments:

from lxml import etree
tree = etree.parse('file.xml')
# Perform operations here
tree.write('file_out.xml', encoding='utf-8', xml_declaration=True, pretty_print=True)

lxml's write() does not support the short_empty_elements argument that the standard library's xml.etree.ElementTree offers, and will always output self-closing tags for empty elements. If you need to write full tags, you can instead consider the write_c14n() method.
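The difference between the two output styles can be seen with a small sketch. It writes to in-memory buffers rather than files, and it uses write(method="c14n"), which as far as I know is the currently recommended spelling of write_c14n(); both the sample document and that substitution are assumptions of this example.

```python
from io import BytesIO
from lxml import etree

# Hypothetical input; in-memory buffers stand in for files on disk.
tree = etree.ElementTree(etree.fromstring(b"<root><empty/></root>"))

# The three arguments from the text: encoding, declaration, pretty-printing.
standard = BytesIO()
tree.write(standard, encoding="utf-8", xml_declaration=True, pretty_print=True)
standard_output = standard.getvalue()

# Canonical XML (C14N) expands empty elements into start/end tag pairs.
# No encoding is passed here: C14N output is always UTF-8.
canonical = BytesIO()
tree.write(canonical, method="c14n")
canonical_output = canonical.getvalue()

print(standard_output.decode("utf-8"))
print(canonical_output.decode("utf-8"))
```

Note that canonical output drops the XML declaration and any pretty-printing, so it trades two of the three standards above for full tags.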

Writing XML as a string

Unsurprisingly, etree.fromstring() has a counterpart function, etree.tostring(). If you have an _ElementTree or _Element, you can convert it to a byte string by passing it to this function:

from lxml import etree
tree = etree.parse('file.xml')
etree.tostring(tree, encoding='utf-8')
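As a rough sketch of the options, tostring() accepts the same xml_declaration and pretty_print arguments as write(), and passing encoding="unicode" returns a str instead of bytes. The sample element below is made up for the example.

```python
from lxml import etree

# Hypothetical element to serialize.
root = etree.fromstring(b"<root><elem>foo</elem></root>")

# Bytes output following the three standards from the introduction.
xml_bytes = etree.tostring(
    root, encoding="utf-8", xml_declaration=True, pretty_print=True
)

# encoding="unicode" yields a str; no XML declaration is written.
xml_text = etree.tostring(root, encoding="unicode")

print(xml_bytes.decode("utf-8"))
print(xml_text)
```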

Constructing an XML tree

If you don't have a file or string representation of your XML, you might want to construct it from scratch. You can achieve this with etree.Element():

from lxml import etree
el = etree.Element('foo', name='bar')
el.text = 'Hello'
subel = etree.SubElement(el, 'qux')
subel.text = 'World'

This will result in the following XML fragment (shown here with whitespace added for readability):

<foo name="bar">
  Hello<qux>World</qux>
</foo>

See the Element documentation for other useful methods for manipulating the tree, including append(), insert() and remove(). Note that once again we are working with an Element rather than an ElementTree.
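A minimal sketch of append(), insert() and remove() in action follows. The element names are invented for the example.

```python
from lxml import etree

root = etree.Element("foo", name="bar")

# SubElement creates a child and attaches it in one step.
first = etree.SubElement(root, "first")

# append() adds an existing element as the last child.
last = etree.Element("last")
root.append(last)

# insert() places an element at a given index among the children.
middle = etree.Element("middle")
root.insert(1, middle)

# remove() detaches a direct child from its parent.
root.remove(first)

tags = [child.tag for child in root]
print(tags)
```

Elements behave much like lists of their children, so indexing, len() and iteration all work as you would expect.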

Converting an _ElementTree into an _Element

Calling the getroot() method of an _ElementTree will return the root _Element:

from lxml import etree
root_element_tree = etree.parse('file.xml')
root_element = root_element_tree.getroot()

Converting an _Element into an _ElementTree

Calling the getroottree() method of an _Element will return the root _ElementTree:

from lxml import etree
root_element = etree.Element('foo')
root_element_tree = root_element.getroottree()
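The two conversions are inverses of each other, which the following sketch tries to show. It also calls getroottree() on a child element, which returns the tree for the whole document rather than a tree rooted at the child. The element names are hypothetical.

```python
from lxml import etree

element = etree.Element("foo")
child = etree.SubElement(element, "bar")

# _Element -> _ElementTree -> _Element round trip.
tree = element.getroottree()
round_trip_root = tree.getroot()

# From a child, getroottree() still gives the tree of the whole document.
child_tree_root = child.getroottree().getroot()

print(round_trip_root.tag, child_tree_root.tag)
```

If you want a tree rooted at a specific element instead, etree.ElementTree(child) is the usual route.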

Other references