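There are two well-known ways to parse XML: building the whole document tree in memory (DOM), and handling a sequential stream of events (SAX).
The trade-offs are familiar: a DOM tree is convenient to navigate but has to be loaded entirely, while SAX uses little memory but is awkward to program against. There is another way to parse XML, though: mix them.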
The basic idea of the pulling DOM parser that this module implements is that the parser consumes the stream just in time, only when you actually reach a child node. This relies on one assumption: the parsed XML has a schema. If the document is schema-free, this heuristic approach loses most of its efficiency.
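To make the just-in-time idea concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree.iterparse() as a stand-in. It is not this module's parser, and the helper name first_child_text is made up for illustration. The loop stops handling parser events as soon as the requested child element is complete:

```python
import io
import xml.etree.ElementTree as ET

XML = b'''<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <dob>1988-08-04</dob>
</person>'''


def first_child_text(stream, tag):
    """Return (text, handled): the text of the first complete <tag>
    element and the number of 'end' events handled to reach it.
    """
    handled = 0
    for _event, elem in ET.iterparse(stream, events=('end',)):
        handled += 1
        if elem.tag == tag:
            # Stop here; the loop never looks at later elements.
            return elem.text, handled
    return None, handled


name, handled = first_child_text(io.BytesIO(XML), 'name')
print(name, handled)
```

Asking for a later element such as dob makes the loop handle more events first, which is the cost a schema-aware parser tries to pay only on demand.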
So the parser needs to know the schema of the XML document it parses, and we can declare that schema by defining classes. It’s something like an ORM for XML. For example, suppose there is a small XML document:
<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <url>https://github.com/dahlia</url>
  <url>https://bitbucket.org/dahlia</url>
  <dob>1988-08-04</dob>
</person>
You can declare the schema for this like the following class definition:
class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Child('dob', Date)
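How class attributes can act as a schema may be easier to see with a toy. The sketch below is not this module's implementation: ToyElement, ToyAttribute and ToyText are hypothetical stand-ins built on xml.etree.ElementTree, and they load the whole document eagerly instead of pulling it lazily:

```python
import xml.etree.ElementTree as ET


class ToyAttribute:
    """Hypothetical descriptor that maps an XML attribute."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._node.get(self.name)


class ToyText:
    """Hypothetical descriptor that maps the text of child elements."""

    def __init__(self, tag, multiple=False):
        self.tag = tag
        self.multiple = multiple

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        texts = [child.text for child in obj._node.findall(self.tag)]
        if self.multiple:
            return texts
        return texts[0] if texts else None


class ToyElement:
    """Hypothetical base class; parses eagerly, unlike a pulling parser."""

    def __init__(self, xml_string):
        self._node = ET.fromstring(xml_string)


class ToyPerson(ToyElement):
    format_version = ToyAttribute('version')
    name = ToyText('name')
    urls = ToyText('url', multiple=True)


person = ToyPerson(
    '<person version="1.0"><name>Hong Minhee</name>'
    '<url>http://dahlia.kr/</url><url>https://github.com/dahlia</url>'
    '</person>'
)
print(person.format_version, person.name, person.urls)
```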
(collections.Sequence) The list of xml.sax parser implementations to try to import.
(str) The XML namespace name used for schema metadata.
Declare possible element attributes as a descriptor.
Changed in version 0.2.0: The default option now accepts only callable objects. Before 0.2.0, default was not a function but a value that was simply used as is.
(collections.Callable) The function that returns the default value when the attribute is not present. The function takes one argument, an Element instance.
Changed in version 0.2.0: It now accepts only callable objects. Before 0.2.0, the default attribute was not a function but a value that was simply used as is.
(bool) Whether it is required for the element.
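The callable default described above can be pictured with a small sketch. DefaultingAttribute and Entry are hypothetical names, not this module's classes; the only point is that the default is a function called with the element instance rather than a fixed value:

```python
class DefaultingAttribute:
    """Hypothetical attribute descriptor with a callable default."""

    def __init__(self, name, default=None):
        if default is not None and not callable(default):
            raise TypeError('default must be callable')
        self.name = name
        self.default = default

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if self.name in obj.raw_attributes:
            return obj.raw_attributes[self.name]
        # The attribute is absent: ask the default function, passing
        # the element instance so the value can depend on it.
        return self.default(obj) if self.default else None


class Entry:
    label = DefaultingAttribute('label', default=lambda e: e.title.upper())

    def __init__(self, title, **raw_attributes):
        self.title = title
        self.raw_attributes = raw_attributes


print(Entry('hello').label)
print(Entry('hello', label='custom').label)
```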
Declare a possible child element as a descriptor.
In order to have a Child of an element type which is not defined yet (or which is self-referential), pass the class name of the element type to contain. The name will be lazily evaluated, e.g.:
class Person(Element):
    '''Everyone can have their children, that also are a Person.'''

    children = Child('child', 'Person', multiple=True)
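One way such lazy evaluation could work is to keep the name as a string and look it up only on first use. The sketch below is an assumption about the general technique, not this module's code; LazyChild, Node and REGISTRY are hypothetical:

```python
REGISTRY = {}  # hypothetical table: class name -> class


class Node:
    """Hypothetical base class that records every subclass by name."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls


class LazyChild:
    """Hypothetical holder of an element type given as class or name."""

    def __init__(self, tag, element_type):
        self.tag = tag
        self._element_type = element_type  # a class or a class name

    @property
    def element_type(self):
        # Resolve a string name only when it is first needed, by
        # which time the class it names has been defined.
        if isinstance(self._element_type, str):
            self._element_type = REGISTRY[self._element_type]
        return self._element_type


class Person(Node):
    # 'Person' does not exist yet while this class body runs.
    children = LazyChild('child', 'Person')


print(Person.children.element_type is Person)
```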
Abstract base class for codecs to serialize Python values to be stored in XML and deserialize XML texts to Python values.
In most cases encoding and decoding are implementation details of a well-defined format, so these two functions can be paired. The interface relies on that idea.
To implement a codec, you have to subclass Codec and override a pair of methods: encode() and decode().
Codec objects are accepted by Attribute, Text, and Content (all of which subclass CodecDescriptor).
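As an illustration of pairing the two directions, the following sketch defines a stand-in base class and one codec. ToyCodec and IsoDateCodec are hypothetical; only the method names encode() and decode() come from the description above, and their exact signatures here are assumed:

```python
import abc
import datetime


class ToyCodec(abc.ABC):
    """Hypothetical stand-in for an abstract codec base class."""

    @abc.abstractmethod
    def encode(self, value):
        """Serialize a Python value to an XML string."""

    @abc.abstractmethod
    def decode(self, text):
        """Deserialize an XML string to a Python value."""


class IsoDateCodec(ToyCodec):
    """Codec between datetime.date and YYYY-MM-DD text."""

    def encode(self, value):
        if not isinstance(value, datetime.date):
            raise TypeError('expected datetime.date')
        return value.isoformat()

    def decode(self, text):
        return datetime.datetime.strptime(text.strip(), '%Y-%m-%d').date()


codec = IsoDateCodec()
dob = codec.decode('1988-08-04')
print(dob, codec.encode(dob))
```

Keeping both directions in one object makes it hard for the encoder and decoder to drift apart, which is the motivation given above for pairing them.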
Mixin class for descriptors that provide decoder() and encoder().
Attribute, Content and Text can take encoder and decoder functions. They are used for encoding Python values to XML strings and decoding raw values from XML to natural Python representations.
A descriptor can take a codec, or an encoder and a decoder separately. (Of course they can all be present at the same time.) In most cases you’ll need only the codec parameter, in which the encoder and decoder are coupled:
Text('dob', Rfc3339(prefer_utc=True))
Encoders can be specified using the encoder parameter of the descriptor’s constructor, or the encoder() decorator.
Decoders can be specified using the decoder parameter of the descriptor’s constructor, or the decoder() decorator:
import datetime

class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    # datetime.date has no strptime(); parse with datetime.datetime
    # and take the date part.
    dob = Text('dob',
               encoder=datetime.date.isoformat,
               decoder=lambda s: datetime.datetime.strptime(s, '%Y-%m-%d').date())

    @format_version.encoder
    def format_version(self, value):
        return '.'.join(map(str, value))

    @format_version.decoder
    def format_version(self, value):
        return tuple(map(int, value.split('.')))
Decode the given text as it’s programmed.
Returns: | decoded value |
---|---|
Note
Internal method.
Decorator which sets the decoder to the decorated function:
import datetime

class Person(DocumentElement):
    '''Person.dob will be a datetime.date instance.'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.decoder
    def dob(self, dob_text):
        # datetime.date has no strptime(); go through datetime.datetime.
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

>>> p = Person('<person><dob>1987-07-26</dob></person>')
>>> p.dob
datetime.date(1987, 7, 26)
If it’s applied multiple times, all decorated functions are piped in the order they are applied:
class Person(Element):
    '''Person.age will be an integer.'''

    age = Text('dob', decoder=lambda text: text.strip())

    @age.decoder
    def age(self, dob_text):
        # datetime.date has no strptime(); go through datetime.datetime.
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

    @age.decoder
    def age(self, dob):
        now = datetime.date.today()
        # d is True (counts as 1) if this year's birthday is still ahead.
        d = now.month < dob.month or (now.month == dob.month and
                                      now.day < dob.day)
        return now.year - dob.year - d

>>> p = Person('<person>\n\t<dob>\n\t\t1987-07-26\n\t</dob>\n</person>')
>>> p.age
26
>>> datetime.date.today()
datetime.date(2013, 7, 30)
Note
This creates a copy of the descriptor instance rather than modifying it in place.
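Both behaviours (piping in order, and returning a copy instead of mutating) can be sketched in a few lines. PipedDecoder is a hypothetical class unrelated to this module's internals, and unlike the decorated methods above its functions take no self argument:

```python
import copy
import datetime


class PipedDecoder:
    """Hypothetical holder of an ordered chain of decoder functions."""

    def __init__(self, decoder=None):
        self.decoders = [decoder] if decoder else []

    def decoder(self, function):
        # Return a modified copy; the original chain stays untouched.
        clone = copy.copy(self)
        clone.decoders = self.decoders + [function]
        return clone

    def decode(self, text):
        value = text
        for function in self.decoders:  # applied in registration order
            value = function(value)
        return value


base = PipedDecoder(decoder=lambda text: text.strip())
dated = base.decoder(
    lambda text: datetime.datetime.strptime(text, '%Y-%m-%d').date()
)
year = dated.decoder(lambda date: date.year)

print(base.decode(' 1987-07-26 '))
print(year.decode(' 1987-07-26 '))
```

Because each call to decoder() returns a new object, base still only strips whitespace after dated and year have been derived from it.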
Decorator which sets the encoder to the decorated function:
import datetime

class Person(DocumentElement):
    '''Person.dob will be written to ISO 8601 format'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.encoder
    def dob(self, dob):
        if not isinstance(dob, datetime.date):
            raise TypeError('expected datetime.date')
        return dob.strftime('%Y-%m-%d')

>>> isinstance(p, Person)
True
>>> p.dob
datetime.date(1987, 7, 26)
>>> ''.join(write(p, indent='', newline=''))
'<person><dob>1987-07-26</dob></person>'
If it’s applied multiple times, all decorated functions are piped in the order they are applied:
class Person(Element):
    '''Person.email will have mailto: prefix when it's written
    to XML.
    '''

    email = Text('email', encoder=lambda email: 'mailto:' + email)

    @email.encoder
    def email(self, email):
        return email.strip()

    @email.encoder
    def email(self, email):
        login, host = email.split('@', 1)
        return login + '@' + host.lower()

>>> isinstance(p, Person)
True
>>> p.email
' earthreader@librelist.com '
>>> ''.join(write(p, indent='', newline=''))
'<person><email>mailto:earthreader@librelist.com</email></person>'
Note
This creates a copy of the descriptor instance rather than modifying it in place.
Raised when encoding/decoding between Python values and XML data goes wrong.
Declare possible text nodes as a descriptor.
Read raw value from XML, decode it, and then set the attribute for content of the given element to the decoded value.
Note
Internal method.
Event handler implementation for SAX parser.
It maintains a stack of parsing contexts: which element was opened last, which descriptor is associated with that element, and the buffer of content character chunks the element has. Every context is represented as the namedtuple ParserContext.
Each time one of its events (startElement(), characters(), and endElement()) is called, it forwards the data to the associated descriptor. Descriptor subtypes implement the start_element() and end_element() methods.
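The stack-of-contexts idea can be tried out with the standard library's xml.sax. This is a simplified sketch, not this module's handler: Context and StackHandler are hypothetical, and instead of forwarding to descriptors it just records the text of each element together with its path:

```python
import collections
import xml.sax

# Hypothetical, simplified context: the open tag and its text chunks.
Context = collections.namedtuple('Context', 'tag buffer')


class StackHandler(xml.sax.ContentHandler):
    """Keep one Context per currently open element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []

    def startElement(self, name, attrs):
        self.stack.append(Context(name, []))

    def characters(self, content):
        # Text may arrive in several chunks; buffer them per element.
        if self.stack:
            self.stack[-1].buffer.append(content)

    def endElement(self, name):
        context = self.stack.pop()
        text = ''.join(context.buffer).strip()
        if text:
            path = '/'.join([c.tag for c in self.stack] + [context.tag])
            self.texts.append((path, text))


handler = StackHandler()
xml.sax.parseString(
    b'<person><name>Hong Minhee</name><dob>1988-08-04</dob></person>',
    handler,
)
print(handler.texts)
```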
Raised when decoding XML data to Python values goes wrong.
Abstract base class for Child and Text.
Abstract method that is invoked when the parser meets an end of an element related to the descriptor. It will be called by ContentHandler.
(bool) Whether the element can occur zero or more times. If it’s True, required has to be False.
(bool) Whether it is required for the element. If it’s True, multiple has to be False.
(collections.Callable) An optional function to be used for sorting multiple elements. It has to take an element and return a value to use as the sort key. It is the same as the key option of the sorted() built-in function.
It’s available only when multiple is True.
Use sort_reverse for descending order.
Note
It doesn’t guarantee that the elements are kept sorted at runtime, but all elements are sorted when the document is written using the write() function.
(bool) Whether to reverse elements when they are sorted. It is the same as the reverse option of the sorted() built-in function.
It’s available only when sort_key is present.
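Since both options mirror sorted(), their effect on a list of child elements can be previewed with plain Python. Link below is a hypothetical stand-in for an element type, and by_priority is an example key function:

```python
import collections

# Hypothetical stand-in for an element type with two fields.
Link = collections.namedtuple('Link', 'uri priority')

links = [
    Link('http://example.com/b', 2),
    Link('http://example.com/a', 3),
    Link('http://example.com/c', 1),
]


def by_priority(link):
    """Sort key: takes an element and returns the value to sort by."""
    return link.priority


ascending = sorted(links, key=by_priority)
descending = sorted(links, key=by_priority, reverse=True)
print([link.priority for link in ascending])
print([link.priority for link in descending])
```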
Abstract method that is invoked when the parser meets a start of an element related to the descriptor. It will be called by ContentHandler.
Returns: | a value to reserve. It will be passed to the reserved_value parameter of end_element() |
---|---|
Error raised when a schema has more than one descriptor for the same attribute, the same child element, or the text node.
The root element of the document.
(str) Every DocumentElement subtype has to set this attribute to the root tag name.
(str) A DocumentElement subtype may set this attribute to the XML namespace of the document element.
Represent an element in XML document.
It provides a default constructor which takes keyword arguments and initializes the attributes from them. For example, the following code that uses the default constructor:
assert issubclass(Person, Element)
author = Person(
    name='Hong Minhee',
    url='http://dahlia.kr/'
)
is equivalent to the following code:
author = Person()
author.name = 'Hong Minhee'
author.url = 'http://dahlia.kr/'
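A constructor of this kind is typically a loop over the keyword arguments. The following is a guess at the general shape, using a hypothetical KeywordInit base class rather than this module's Element; rejecting undeclared names is this sketch's own choice, not documented behaviour:

```python
class KeywordInit:
    """Hypothetical base class with a keyword-based default constructor."""

    def __init__(self, **attributes):
        for name, value in attributes.items():
            # Reject names the class does not declare, so that typos
            # fail loudly instead of creating stray attributes.
            if not hasattr(type(self), name):
                raise TypeError('unexpected keyword: ' + name)
            setattr(self, name, value)


class Author(KeywordInit):
    name = None
    url = None


a = Author(name='Hong Minhee', url='http://dahlia.kr/')

b = Author()
b.name = 'Hong Minhee'
b.url = 'http://dahlia.kr/'

print(vars(a) == vars(b))
```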
Cast a value which isn’t an instance of the element type to the element type. It’s useful when a boxed element type could be more naturally represented using a built-in type.
For example, Mark could be represented as a boolean value, and Text also could be represented as a string.
The following example shows how the element type can be automatically cast from a string by implementing the __coerce_from__() class method:
@classmethod
def __coerce_from__(cls, value):
    if isinstance(value, str):
        return Text(value=value)
    raise TypeError('expected a string or Text')
Identify the entity object. It returns the entity object itself by default, but should be overridden.
Returns: | any value to identify the entity object |
---|
Merge two entities (self and other). It can return one of the two, or even a new entity object. This method is used by Session objects to merge conflicts between concurrent updates.
Parameters: | other (Element) – the other entity to merge. It is guaranteed to belong to the older session (note that this doesn’t mean the entity is older than self, only that the session’s last update is) |
---|---|
Returns: | one of the two, or even a new entity object that merges the two entities |
Return type: | Element |
Note
The default implementation simply returns self. That means the entity of the newer session will always win unless the method is overridden.
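What an override might look like, in a self-contained toy: Subscription, entity_id() and merge() are hypothetical names chosen for this sketch, not this module's method names. It merges two versions of the same entity by keeping the union of their tags instead of letting one side win:

```python
class Subscription:
    """Hypothetical entity with an identifier and a set of tags."""

    def __init__(self, feed_uri, tags=()):
        self.feed_uri = feed_uri
        self.tags = set(tags)

    def entity_id(self):
        # Two objects with the same feed URI are the same entity.
        return self.feed_uri

    def merge(self, other):
        # Build a new entity instead of returning self or other.
        if self.entity_id() != other.entity_id():
            raise ValueError('cannot merge different entities')
        return Subscription(self.feed_uri, self.tags | other.tags)


newer = Subscription('http://example.com/feed', ['news'])
older = Subscription('http://example.com/feed', ['tech'])
merged = newer.merge(older)
print(sorted(merged.tags))
```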
List-like object to represent multiple children. It makes the parser lazily consume the buffer when an element at a particular offset is requested.
You can extend methods or properties for a particular element type using the element_list_for() class decorator, e.g.:
@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]
Extended methods/properties can be used for element lists for the type:
assert isinstance(feed.links, LinkList)
assert isinstance(feed.links, ElementList)
feed.links.filter_by_mimetype('text/html')
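The lazy behaviour described for the list-like object can be sketched with a sequence that pulls from an iterator only as far as the requested index. LazyList and parse_links are hypothetical names; the real class works against the parser's buffer rather than a plain generator:

```python
import collections.abc


class LazyList(collections.abc.Sequence):
    """Hypothetical list-like view over an iterator of elements."""

    def __init__(self, source):
        self._source = iter(source)
        self._loaded = []

    def _consume(self, upto=None):
        # Pull items until index `upto` is loaded (or everything,
        # when `upto` is None).
        while upto is None or len(self._loaded) <= upto:
            try:
                self._loaded.append(next(self._source))
            except StopIteration:
                break

    def __getitem__(self, index):
        if isinstance(index, int) and index >= 0:
            self._consume(index)
        else:
            self._consume()  # negative indices and slices need it all
        return self._loaded[index]

    def __len__(self):
        self._consume()
        return len(self._loaded)


pulled = []


def parse_links():
    """Stand-in for a parser that yields one element at a time."""
    for uri in ['http://a/', 'http://b/', 'http://c/']:
        pulled.append(uri)
        yield uri


links = LazyList(parse_links())
second = links[1]
print(second, pulled)
```

Asking for the length, a negative index, or a slice forces the rest of the source to be read, so in this sketch only non-negative integer indexing stays lazy.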
Consume the buffer for the parser. It returns a generator, so the caller can stop it with a break statement.
Note
Internal method.
Register specialized collections.Sequence type for a particular value_type.
An imperative version of the element_list_for() class decorator.
(collections.MutableMapping) The internal table for specialized subtypes used by register_specialized_type() method and element_list_for() class decorator.
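The relationship between the table, the imperative registration, and the class decorator can be sketched as follows. SPECIALIZED_TYPES, register_list_type() and list_type_for() are hypothetical names for this illustration, and the toy LinkList subclasses list for brevity:

```python
import collections.abc

SPECIALIZED_TYPES = {}  # hypothetical table: value type -> list type


def register_list_type(value_type, specialized_type):
    """Imperative registration (hypothetical helper)."""
    if not issubclass(specialized_type, collections.abc.Sequence):
        raise TypeError('expected a Sequence subtype')
    SPECIALIZED_TYPES[value_type] = specialized_type


def list_type_for(value_type):
    """Class decorator built on top of the imperative function."""
    def decorate(specialized_type):
        register_list_type(value_type, specialized_type)
        return specialized_type
    return decorate


class Link:
    def __init__(self, mimetype):
        self.mimetype = mimetype


@list_type_for(Link)
class LinkList(list):
    def filter_by_mimetype(self, mimetype):
        return [link for link in self if link.mimetype == mimetype]


links = SPECIALIZED_TYPES[Link]([Link('text/html'), Link('text/plain')])
print(len(links.filter_by_mimetype('text/html')))
```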
Raised when encoding Python values into XML data goes wrong.
Raised when an element is invalid according to the schema.
Error raised when a schema definition has logical errors.
Descriptor that declares a possible child element that only consists of character data. All other attributes and child nodes are ignored.
Completely load the given element.
Parameters: | element (Element) – an element loaded by read() |
---|
Class decorator which registers specialized ElementList subclass for a particular value_type e.g.:
@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]
Parameters: | value_type (type) – a particular element type that specialized_type would be used for instead of default ElementList class. it has to be a subtype of Element |
---|
Index the descriptors of the given element_type to make them easy to look up by their identifiers (pairs of XML namespace URI and tag name).
Parameters: | element_type (type) – a subtype of Element to index its descriptors |
---|
Note
Internal function.
Get the dictionary of Attribute descriptors of the given element_type.
Parameters: | element_type (type) – a subtype of Element to inspect |
---|---|
Returns: | a dictionary of attribute identifiers (pairs of XML namespace URI and XML attribute name) to pairs of instance attribute name and associated Attribute descriptor |
Return type: | collections.Mapping |
Note
Internal function.
Get the dictionary of Descriptor objects of the given element_type.
Parameters: | element_type (type) – a subtype of Element to inspect |
---|---|
Returns: | a dictionary of child node identifiers (pairs of XML namespace URI and tag name) to pairs of instance attribute name and associated Descriptor |
Return type: | collections.Mapping |
Note
Internal function.
Gets the Content descriptor of the given element_type.
Parameters: | element_type (type) – a subtype of Element to inspect |
---|---|
Returns: | a pair of instance attribute name and associated Content descriptor |
Return type: | tuple |
Note
Internal function.
Get the set of XML namespaces used in the given element_type, recursively including all child elements.
Parameters: | element_type (type) – a subtype of Element to inspect |
---|---|
Returns: | a set of URI strings of all used XML namespaces |
Return type: | collections.Set |
Note
Internal function.
Return whether the given element is not completely loaded by read() yet.
Parameters: | element (Element) – an element |
---|---|
Returns: | True if the given element is partially loaded |
Return type: | bool |
Initialize a document in read mode by opening an iterable of XML strings.
with open('doc.xml', 'rb') as f:
    read(Person, f)
The returned document element is not fully read but only partially loaded into memory; the rest is loaded lazily (and eventually) when it is actually needed.
Returns: | initialized document element in read mode |
---|---|
Validate the given element according to the schema.
from libearth.schema import IntegrityError, validate

try:
    validate(element)
except IntegrityError:
    print('the element {0!r} is invalid!'.format(element))
Returns: | True if the element is valid. False if the element is invalid and the raise_error option is False |
---|---|
Raises IntegrityError: | when the element is invalid and the raise_error option is True |
Write the given document to an XML string. The return value is an iterator that yields chunks of the XML string.
with open('doc.xml', 'w') as f:
    for chunk in write(document):
        f.write(chunk)
Returns: | chunks of an XML string |
---|---|
Return type: | collections.Iterable |