This is currently a work in progress—stay tuned for more content.
lxml is an improved version of the core ElementTree API. It is incredibly useful for processing XML in Python, but it can sometimes be confusing to use. This is a primer on the ways to achieve some common tasks.
The standard I operate to with XML processing is that it should always be:
- UTF-8 encoded
- With an XML declaration
- Pretty-printed for readability
Contents
Reading XML from a file
Reading XML from a file can be accomplished with the etree.parse()
function. This returns a parsed ElementTree containing the contents of the parsed file:
from lxml import etree
etree.parse('myfile.xml')
As per the documentation, the source
argument can be:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
Parsing will generally make sensible decisions regarding encoding, and if you have a correct XML declaration in your source files then you shouldn't have any issues. If you want to specify how the file is parsed, you can call parse()
with the optional parser
argument, passing an instance of etree.XMLParser()
:
from lxml import etree
parser = etree.XMLParser(encoding='latin1')
etree.parse('myfile.xml', parser=parser)
Reading XML from a string
If you already have a string containing your XML, you can parse it into an _Element using etree.fromstring(text, parser=None)
. As with parse()
, you can optionally pass in an XMLParser
:
from lxml import etree
etree.fromstring('<root><elem>foo</elem></root>')
text
can either be a byte string or unicode string. As with parse()
this will generally make sensible decisions regarding encoding.
If you'd prefer to have an ElementTree instead of an Element, see Converting _Element into an _ElementTree.
Writing XML to a file
If you have an ElementTree, you can write the XML to a file using the ElementTree.write()
method. Generally you will want to specify the encoding
, xml_declaration
, and pretty_print
arguments:
from lxml import etree
root = etree.parse('file.xml')
<!-- Perform operations here -->
root.write('file_out.xml', encoding='utf-8', xml_declaration=True, pretty_print=True)
This does not support the short_empty_elements
argument from xml.etree.ElementTree.write()
, and will always output self-closing tags for empty elements. If you need to write full tags, you can instead consider the write_c14n()
method.
Writing XML as a string
Unsurprisingly, etree.fromstring()
has a corresponding method etree.tostring()
. If you have an _ElementTree or _Element, you can convert these to a byte string by passing them to this function:
from lxml import etree
root = etree.parse('file.xml')
etree.tostring(root, encoding='utf8')
Constructing an XML tree
If you don't have a file or string representation of your XML, you might want to construct it from scratch. You can achieve this with etree.Element()
:
from lxml import etree
el = etree.Element('foo', name='bar')
el.text = 'Hello'
subel = el.SubElement('qux')
subel.text = 'World'
This will result in the following XML fragment:
<foo name="bar">
Hello<qux>World</qux>
</foo>
See the Element documentation for other useful methods for manipulating the tree, including append()
, insert()
and remove()
. Note that once again we are working with an Element rather than an ElementTree.
Converting an ElementTree
into an Element
Calling the getroot()
method of an _ElementTree
will return the root _Element
:
from lxml import etree
root_element_tree = etree.parse('file.xml')
root_element = root_element_tree.getroot()
Converting an _Element
into an _ElementTree
Calling the getroottree()
method of an _Element
will return the root _ElementTree
:
from lxml import etree
root_element = etree.Element('foo')
root_element_tree = root_element_tree.getroottree()