`libearth.schema` — Declarative schema for pulling DOM parser of XML¶

There are two well-known ways to parse XML:

Document Object Model: It reads the whole XML and then builds a tree in memory. You can easily traverse the document as a tree, but the parsing can’t be streamed. Moreover, it uses memory for data you don’t use.
Simple API for XML: It’s an event-based sequential access parser. You listen for events from it and then make use of its still unstructured data yourself. In other words, you don’t spend memory on data you never use, as long as you simply ignore the events for it.

The pros and cons of these two approaches are well known, but there is another way to parse XML: mix them.

The basic idea of this pulling DOM parser (which this module implements) is that the parser consumes the stream just in time, when you actually reach a child node. This rests on one assumption: the parsed XML has a schema. If the document is schema-free, this heuristic approach loses most of its efficiency.
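The pull idea can be illustrated with the standard library alone. The following is a minimal sketch using `xml.dom.pulldom`, not this module's implementation: it pulls events from the stream and builds a DOM subtree only for the node that is actually wanted. The helper name `first_text` is hypothetical.

```python
import io
from xml.dom import pulldom

XML = b'''<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <url>https://github.com/dahlia</url>
  <dob>1988-08-04</dob>
</person>'''


def first_text(stream, tag):
    """Return the text of the first <tag> element (hypothetical helper)."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build a DOM subtree for this one node only.
            events.expandNode(node)
            return ''.join(
                child.data for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
    return None


name = first_text(io.BytesIO(XML), 'name')
dob = first_text(io.BytesIO(XML), 'dob')
```

The function returns as soon as the requested element is found, so no further events are pulled after that point.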

So the parser needs information about the schema of the XML document it parses, and we can declare the schema by defining classes. It’s something like an ORM for XML. For example, suppose there is a small XML document:

<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <url>https://github.com/dahlia</url>
  <url>https://bitbucket.org/dahlia</url>
  <dob>1988-08-04</dob>
</person>

You can declare the schema for this like the following class definition:

class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Child('dob', Date)
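To show what "declaring a schema by defining classes" can look like mechanically, here is a minimal sketch built on Python's descriptor protocol and `xml.etree.ElementTree`. All `Sketch*` names are hypothetical stand-ins, and unlike this module the sketch parses the whole document eagerly; it only illustrates how class attributes can map onto XML attributes and child elements.

```python
import xml.etree.ElementTree as ET


class SketchAttribute:
    """Hypothetical descriptor that reads an XML attribute."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.node.get(self.name)


class SketchText:
    """Hypothetical descriptor that reads the text of child elements."""

    def __init__(self, tag, multiple=False):
        self.tag = tag
        self.multiple = multiple

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        if self.multiple:
            return [child.text for child in instance.node.findall(self.tag)]
        return instance.node.findtext(self.tag)


class SketchPerson:
    format_version = SketchAttribute('version')
    name = SketchText('name')
    url = SketchText('url', multiple=True)

    def __init__(self, xml_text):
        # Eager parse; the real module loads lazily instead.
        self.node = ET.fromstring(xml_text)


person = SketchPerson(
    '<person version="1.0"><name>Hong Minhee</name>'
    '<url>http://dahlia.kr/</url>'
    '<url>https://github.com/dahlia</url></person>'
)
```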

libearth.schema.PARSER_LIST = []¶: (collections.Sequence) The list of xml.sax parser implementations to try to import.

libearth.schema.SCHEMA_XMLNS = 'http://earthreader.org/schema/'¶: (str) The XML namespace name used for schema metadata.

class libearth.schema.Attribute(name, codec=None, xmlns=None, required=False, default=None, encoder=None, decoder=None)¶

Declare possible element attributes as a descriptor.

Parameters:

name (str) – the XML attribute name
codec (Codec, collections.Callable) – an optional codec object to use. if it’s callable and not an instance of Codec, its return value is used instead. this means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
xmlns (str) – an optional XML namespace URI
required (bool) – whether the attribute is required or not. False by default
default (collections.Callable) – an optional function that returns default value when the attribute is not present. the function takes an argument which is an Element instance
encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument

Changed in version 0.2.0: The default option now accepts only callable objects. Before 0.2.0, default was not a function but a value that was simply used as is.

default = None¶: (collections.Callable) The function that returns default value when the attribute is not present. The function takes an argument which is an Element instance.

Changed in version 0.2.0: It now accepts only callable objects. Before 0.2.0, the default attribute was not a function but a value that was simply used as is.

key_pair = None¶: (tuple) The pair of (xmlns, name).

name = None¶: (str) The XML attribute name.

required = None¶: (bool) Whether it is required for the element.

xmlns = None¶: (str) The optional XML namespace URI.

class libearth.schema.Child(tag, element_type, xmlns=None, required=False, multiple=False, sort_key=None, sort_reverse=None)¶

Declare a possible child element as a descriptor.

To declare a Child of an element type that is not defined yet (or that is self-referential), pass the class name of the element type as a string. The name will be lazily evaluated, e.g.:

class Person(Element):
    '''Everyone can have their children, that also are a Person.'''

    children = Child('child', 'Person', multiple=True)
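One way such lazy name evaluation can work is sketched below. This is an assumption for illustration, not how this module necessarily resolves names: a hypothetical base class records subclasses by name, and a hypothetical `SketchChild` declaration resolves the string only when `element_type` is first accessed.

```python
class SketchElement:
    """Hypothetical base class that records its subclasses by name."""

    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        SketchElement.registry[cls.__name__] = cls


class SketchChild:
    """Hypothetical declaration holding an element type or its name."""

    def __init__(self, tag, element_type):
        self.tag = tag
        self._element_type = element_type

    @property
    def element_type(self):
        # Resolve a string to a class only when it is first needed.
        if isinstance(self._element_type, str):
            self._element_type = SketchElement.registry[self._element_type]
        return self._element_type


class Person(SketchElement):
    # The name 'Person' is not bound yet while this class body runs,
    # so the declaration refers to it by string.
    children = SketchChild('child', 'Person')


resolved = Person.children.element_type
```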

Parameters:

tag (str) – the tag name
xmlns (str) – an optional XML namespace URI
element_type (type, str) – the type of child element(s). it has to be a subtype of Element. if it’s a string, it refers to the class name, which is lazily evaluated
required (bool) – whether the child is required or not. it’s mutually exclusive with multiple. False by default
multiple (bool) – whether the child can be multiple. it’s mutually exclusive with required. False by default
sort_key (collections.Callable) – an optional function to be used for sorting multiple child elements. it has to take a child as Element and return a value for the sort key. it is the same as the key option of the sorted() built-in function. note that elements are not guaranteed to be sorted at runtime, but they are sorted when written using the write() function. it’s available only when multiple is True. use sort_reverse for descending order.
sort_reverse (bool) – whether to reverse elements when they become sorted. it is the same as the reverse option of the sorted() built-in function. it’s available only when sort_key is present

element_type¶: (type) The class of element this child can contain. It must be a subtype of Element.

class libearth.schema.Codec¶

Abstract base class for codecs to serialize Python values to be stored in XML and deserialize XML texts to Python values.

In most cases encoding and decoding are implementation details of a well-defined format, so these two functions can be paired. The interface relies on that idea.

To implement a codec, you have to subclass Codec and override a pair of methods: encode() and decode().

Codec objects are accepted by Attribute, Text, and Content (all of which subclass CodecDescriptor).

decode(text)¶

Decode the given XML text to Python value.

Parameters:	text (`str`) – XML text to decode
Returns:	the decoded Python value
Raises DecodeError:	when decoding the given XML `text` goes wrong

Note

Every Codec subtype has to override this method.

encode(value)¶

Encode the given Python value into XML text.

Parameters:	value – Python value to encode
Returns:	the encoded XML text
Return type:	`str`
Raises EncodeError:	when encoding the given `value` goes wrong

Note

Every Codec subtype has to override this method.
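The encode()/decode() pairing described for Codec can be sketched without the library. The classes below are hypothetical stand-ins (`SketchCodec`, `SketchEncodeError`, `SketchDecodeError`), written only to show the shape of a paired codec that reports failures through dedicated exceptions; a real codec would subclass Codec instead.

```python
import abc


class SketchCodecError(ValueError):
    """Hypothetical base error for this sketch."""


class SketchEncodeError(SketchCodecError):
    """Raised by the sketch when a value cannot be encoded."""


class SketchDecodeError(SketchCodecError):
    """Raised by the sketch when a text cannot be decoded."""


class SketchCodec(abc.ABC):
    """Hypothetical stand-in for an abstract codec base class."""

    @abc.abstractmethod
    def encode(self, value):
        """Turn a Python value into XML text."""

    @abc.abstractmethod
    def decode(self, text):
        """Turn XML text into a Python value."""


class SketchIntegerCodec(SketchCodec):
    def encode(self, value):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise SketchEncodeError('expected an integer')
        return str(value)

    def decode(self, text):
        try:
            return int(text.strip())
        except ValueError as error:
            raise SketchDecodeError('invalid integer: %r' % text) from error


codec = SketchIntegerCodec()
encoded = codec.encode(42)
decoded = codec.decode(' 42 ')
```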

class libearth.schema.CodecDescriptor(codec=None, encoder=None, decoder=None)¶

Mixin class for descriptors that provide decoder() and encoder().

Attribute, Content and Text can take encoder and decoder functions. They are used for encoding Python values to XML strings and decoding raw values from XML to natural Python representations.

A descriptor can take a codec, or an encoder and a decoder separately. (Of course, they can all be present at the same time.) In most cases you’ll need only the codec parameter, in which the encoder and decoder are coupled:

Text('dob', Rfc3339(prefer_utc=True))

Encoders can be specified using encoder parameter of descriptor’s constructor, or encoder() decorator.

Decoders can be specified using decoder parameter of descriptor’s constructor, or decoder() decorator:

import datetime

class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Text('dob',
               encoder=datetime.date.isoformat,
               decoder=lambda s: datetime.datetime.strptime(
                   s, '%Y-%m-%d'
               ).date())

    @format_version.encoder
    def format_version(self, value):
        return '.'.join(map(str, value))

    @format_version.decoder
    def format_version(self, value):
        return tuple(map(int, value.split('.')))

Parameters:

codec (Codec, collections.Callable) – an optional codec object to use. if it’s callable and not an instance of Codec, its return value is used instead. this means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument

decode(text, instance)¶

Decode the given text as it’s programmed.

Parameters:	text (`str`) – the raw text to decode. XML attribute value or text node value in most cases
	instance (`Element`) – the instance that is associated with the descriptor
Returns:	decoded value

Note

Internal method.

decoder(function)¶

Decorator which sets the decoder to the decorated function:

import datetime

class Person(DocumentElement):
    '''Person.dob will be a datetime.date instance.'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.decoder
    def dob(self, dob_text):
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

>>> p = read(Person, ['<person><dob>1987-07-26</dob></person>'])
>>> p.dob
datetime.date(1987, 7, 26)

If it’s applied multiple times, all decorated functions are piped in the order they were applied:

class Person(Element):
    '''Person.age will be an integer.'''

    age = Text('dob', decoder=lambda text: text.strip())

    @age.decoder
    def age(self, dob_text):
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

    @age.decoder
    def age(self, dob):
        now = datetime.date.today()
        d = now.month < dob.month or (now.month == dob.month and
                                      now.day < dob.day)
        return now.year - dob.year - d

>>> p = read(Person, ['<person>\n\t<dob>\n\t\t1987-07-26\n\t</dob>\n</person>'])
>>> p.age
26
>>> datetime.date.today()
datetime.date(2013, 7, 30)

Note

This creates a copy of the descriptor instance rather than modifying it in place.
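The two behaviors described for the decorator (piping in order, and returning a copy instead of mutating) can be sketched as follows. `SketchText` is a hypothetical stand-in, and for brevity its decoder functions take only the value, whereas the decorated functions in the examples above also receive `self`.

```python
import copy
import datetime


class SketchText:
    """Hypothetical descriptor declaration with a decoder pipeline."""

    def __init__(self, tag, decoder=None):
        self.tag = tag
        self.decoders = [] if decoder is None else [decoder]

    def decoder(self, function):
        # Return a modified copy; the original declaration is unchanged.
        clone = copy.copy(self)
        clone.decoders = self.decoders + [function]
        return clone

    def decode(self, text):
        value = text
        # Pipe the value through every decoder in registration order.
        for function in self.decoders:
            value = function(value)
        return value


raw = SketchText('dob', decoder=lambda text: text.strip())
parsed = raw.decoder(
    lambda text: datetime.datetime.strptime(text, '%Y-%m-%d').date()
)
result = parsed.decode('\n\t1987-07-26\n')
```

Because `decoder()` returns a new object, rebinding the same class attribute name (as the decorated methods above do) replaces the old declaration with the extended one.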

encoder(function)¶

Decorator which sets the encoder to the decorated function:

import datetime

class Person(DocumentElement):
    '''Person.dob will be written to ISO 8601 format'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.encoder
    def dob(self, dob):
        if not isinstance(dob, datetime.date):
            raise TypeError('expected datetime.date')
        return dob.strftime('%Y-%m-%d')

>>> isinstance(p, Person)
True
>>> p.dob
datetime.date(1987, 7, 26)
>>> ''.join(write(p, indent='', newline=''))
'<person><dob>1987-07-26</dob></person>'

If it’s applied multiple times, all decorated functions are piped in the order they were applied:

class Person(Element):
    '''Person.email will have mailto: prefix when it's written
    to XML.

    '''

    email = Text('email', encoder=lambda email: 'mailto:' + email)

    @email.encoder
    def email(self, email):
        return email.strip()

    @email.encoder
    def email(self, email):
        login, host = email.split('@', 1)
        return login + '@' + host.lower()

>>> isinstance(p, Person)
True
>>> p.email
'  earthreader@librelist.com  '
>>> ''.join(write(p, indent='', newline=''))
'<person><email>mailto:earthreader@librelist.com</email></person>'

Note

This creates a copy of the descriptor instance rather than modifying it in place.

exception libearth.schema.CodecError¶: Raised when encoding/decoding between Python values and XML data goes wrong.

class libearth.schema.Content(codec=None, encoder=None, decoder=None)¶

Declare possible text nodes as a descriptor.

Parameters:

codec (Codec, collections.Callable) – an optional codec object to use. if it’s callable and not an instance of Codec, its return value is used instead. this means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument

read(element, value)¶: Read raw value from XML, decode it, and then set the attribute for content of the given element to the decoded value.

Note

Internal method.

class libearth.schema.ContentHandler(document)¶

Event handler implementation for SAX parser.

It maintains a stack of parsing contexts: which element was opened last, which descriptor is associated with that element, and the buffer for chunks of character content the element has. Every context is represented as the namedtuple ParserContext.

Each time its events (startElement(), characters(), and endElement()) are called, it forwards the data to the associated descriptor. Descriptor subtypes implement the start_element() and end_element() methods.
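The stack-and-buffer bookkeeping described above can be sketched with the standard `xml.sax` API. This is a simplified illustration, not this module's handler: the hypothetical `SketchContext` holds only a tag and a buffer, and instead of forwarding to descriptors the handler just records each element's text when it closes.

```python
import collections
import xml.sax

# Hypothetical, reduced context: the real one also tracks a descriptor.
SketchContext = collections.namedtuple('SketchContext', 'tag buffer')


class SketchHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []   # contexts of the elements currently open
        self.texts = []   # (tag, text) pairs collected on close

    def startElement(self, name, attrs):
        self.stack.append(SketchContext(name, []))

    def characters(self, content):
        # Character data may arrive in several chunks; buffer them.
        if self.stack:
            self.stack[-1].buffer.append(content)

    def endElement(self, name):
        context = self.stack.pop()
        self.texts.append((context.tag, ''.join(context.buffer).strip()))


handler = SketchHandler()
xml.sax.parseString(
    b'<person><name>Hong Minhee</name><dob>1988-08-04</dob></person>',
    handler,
)
```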

exception libearth.schema.DecodeError¶: Raised when decoding XML data to Python values goes wrong.

class libearth.schema.Descriptor(tag, xmlns=None, required=False, multiple=False, sort_key=None, sort_reverse=None)¶

Abstract base class for Child and Text.

end_element(reserved_value, content)¶

Abstract method that is invoked when the parser meets an end of an element related to the descriptor. It will be called by ContentHandler.

Parameters:	reserved_value – the value `start_element()` method returned
	content (`str`) – the content text of the read element

key_pair = None¶: (tuple) The pair of (xmlns, tag).

multiple = None¶: (bool) Whether the element can occur zero or more times. If it’s True, required has to be False.

required = None¶: (bool) Whether it is required for the element. If it’s True, multiple has to be False.

sort_key = None¶

(collections.Callable) An optional function to be used for sorting multiple elements. It has to take an element and return a value for the sort key. It is the same as the key option of the sorted() built-in function.

It’s available only when multiple is True.

Use sort_reverse for descending order.

Note

Elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function.

sort_reverse = None¶

(bool) Whether to reverse elements when they become sorted. It is the same as the reverse option of the sorted() built-in function.

It’s available only when sort_key is present.
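Since sort_key and sort_reverse are described as equivalent to the key and reverse options of sorted(), their effect at write time can be sketched directly. The helper `sorted_for_writing` and the `SketchEntry` type are hypothetical names used only for this illustration.

```python
import collections

SketchEntry = collections.namedtuple('SketchEntry', 'title updated')


def sorted_for_writing(children, sort_key=None, sort_reverse=None):
    """Order children the way a writer might (hypothetical helper)."""
    if sort_key is None:
        # Without a sort key the original order is kept.
        return list(children)
    return sorted(children, key=sort_key, reverse=bool(sort_reverse))


entries = [
    SketchEntry('b', 2),
    SketchEntry('a', 3),
    SketchEntry('c', 1),
]
newest_first = sorted_for_writing(
    entries,
    sort_key=lambda entry: entry.updated,
    sort_reverse=True,
)
titles = [entry.title for entry in newest_first]
```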

start_element(element, attribute)¶

Abstract method that is invoked when the parser meets a start of an element related to the descriptor. It will be called by ContentHandler.

Parameters:	element (`Element`) – the parent element of the read element attribute (`str`) – the attribute name of the descriptor
Returns:	a value to reserve. it will be passed to `reserved_value` parameter of `end_element()`

tag = None¶: (str) The tag name.

xmlns = None¶: (str) The optional XML namespace URI.

exception libearth.schema.DescriptorConflictError¶: Error raised when a schema has more than one descriptor for the same attribute, the same child element, or the text node.

class libearth.schema.DocumentElement(_parent=None, **kwargs)¶

The root element of the document.

__tag__¶: (str) Every DocumentElement subtype has to define this attribute to the root tag name.

__xmlns__¶: (str) A DocumentElement subtype may define this attribute to the XML namespace of the document element.

class libearth.schema.Element(_parent=None, **attributes)¶

Represent an element in XML document.

It provides the default constructor which takes keywords and initializes the attributes by given keyword arguments. For example, the following code that uses the default constructor:

assert issubclass(Person, Element)

author = Person(
    name='Hong Minhee',
    url='http://dahlia.kr/'
)

is equivalent to the following code:

author = Person()
author.name = 'Hong Minhee'
author.url = 'http://dahlia.kr/'

classmethod __coerce_from__(value)¶

Cast a value which isn’t an instance of the element type to the element type. It’s useful when a boxed element type could be more naturally represented using a built-in type.

For example, Mark could be represented as a boolean value, and Text also could be represented as a string.

The following example shows how the element type can be automatically cast from a string by implementing the __coerce_from__() class method:

@classmethod
def __coerce_from__(cls, value):
    if isinstance(value, str):
        return Text(value=value)
    raise TypeError('expected a string or Text')
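How such a hook might be invoked when a value is assigned can be sketched as follows. `SketchText` and the `coerce` helper are hypothetical; the sketch only assumes that a caller checks the value's type first and falls back to `__coerce_from__()` otherwise.

```python
class SketchText:
    """Hypothetical boxed element type wrapping a string."""

    def __init__(self, value):
        self.value = value

    @classmethod
    def __coerce_from__(cls, value):
        if isinstance(value, str):
            return cls(value=value)
        raise TypeError('expected a string or SketchText')


def coerce(element_type, value):
    """Return value as element_type, casting it if needed (sketch)."""
    if isinstance(value, element_type):
        return value
    return element_type.__coerce_from__(value)


boxed = coerce(SketchText, 'hello')
same = coerce(SketchText, boxed)
```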

__entity_id__()¶

Identify the entity object. It returns the entity object itself by default, but should be overridden.

Returns:	any value to identify the entity object

__merge_entities__(other)¶

Merge two entities (self and other). It can return one of the two, or even a new entity object. This method is used by Session objects to merge conflicts between concurrent updates.

Parameters:	other (`Element`) – the other entity to merge. it’s guaranteed to come from the older session (note that this doesn’t mean the entity itself is older than `self`, only that its session’s last update is)
Returns:	one of the two, or even a new entity object that merges the two entities
Return type:	`Element`

Note

The default implementation simply returns self. That means the entity of the newer session will always win unless the method is overridden.
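The interplay of `__entity_id__()` and `__merge_entities__()` can be sketched with a small stand-in. `SketchEntity` and `merge_lists` are hypothetical, and the merge policy shown (an entity counts as read if either side read it) is only an example of an override; entities from the newer session play the role of `self`, as described above.

```python
class SketchEntity:
    """Hypothetical entity identified by an id, with a read flag."""

    def __init__(self, id, read=False):
        self.id = id
        self.read = read

    def __entity_id__(self):
        return self.id

    def __merge_entities__(self, other):
        # Example policy: keep the entity read if either side read it.
        return SketchEntity(self.id, read=self.read or other.read)


def merge_lists(newer, older):
    """Merge two entity lists by identity (hypothetical helper)."""
    merged = {}
    for entity in newer:
        merged[entity.__entity_id__()] = entity
    for entity in older:
        key = entity.__entity_id__()
        if key in merged:
            # The newer session's entity is self; the older is other.
            merged[key] = merged[key].__merge_entities__(entity)
        else:
            merged[key] = entity
    return list(merged.values())


result = merge_lists(
    newer=[SketchEntity('a'), SketchEntity('b', read=True)],
    older=[SketchEntity('a', read=True), SketchEntity('c')],
)
```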

class libearth.schema.ElementList(element, descriptor, value_type=None)¶

List-like object to represent multiple children. It makes the parser lazily consume the buffer when an element at a particular offset is requested.

You can extend methods or properties for a particular element type using element_list_for() class decorator e.g.:

@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]

Extended methods/properties can be used for element lists for the type:

assert isinstance(feed.links, LinkList)
assert isinstance(feed.links, ElementList)
feed.links.filter_by_mimetype('text/html')

consume_buffer()¶: Consume the buffer for the parser. It returns a generator, so the caller can stop it using a break statement.

Note

Internal method.

classmethod register_specialized_type(value_type, specialized_type)¶

Register a specialized collections.Sequence type for a particular value_type.

An imperative version of the element_list_for() class decorator.

Parameters:	value_type (`type`) – a particular element type that `specialized_type` would be used for instead of default `ElementList` class. it has to be a subtype of `Element`
	specialized_type (`type`) – a `collections.Sequence` type which extends methods and properties for `value_type`

specialized_types = {<class 'libearth.feed.Link'>: (<class 'libearth.feed.LinkList'>, None)}¶: (collections.MutableMapping) The internal table for specialized subtypes used by register_specialized_type() method and element_list_for() class decorator.
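A registry of this kind, with both an imperative method and a class decorator on top of it, can be sketched as below. Everything prefixed `Sketch`/`sketch_` is hypothetical, the table layout is simplified compared to the one shown above, and combining the mixin with the list class via `type()` is one possible design, not necessarily the one this module uses.

```python
import collections.abc


class SketchElementList(collections.abc.Sequence):
    """Hypothetical list type with a table of specialized subtypes."""

    specialized_types = {}

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)

    @classmethod
    def register_specialized_type(cls, value_type, specialized_type):
        # Combine the user's mixin with the list implementation.
        combined = type(specialized_type.__name__,
                        (specialized_type, cls), {})
        cls.specialized_types[value_type] = combined

    @classmethod
    def for_type(cls, value_type, items):
        return cls.specialized_types.get(value_type, cls)(items)


def sketch_element_list_for(value_type):
    """Decorator form of register_specialized_type() (sketch)."""
    def decorate(specialized_type):
        SketchElementList.register_specialized_type(value_type,
                                                    specialized_type)
        return specialized_type
    return decorate


class Link:
    def __init__(self, mimetype):
        self.mimetype = mimetype


@sketch_element_list_for(Link)
class LinkList(collections.abc.Sequence):
    def filter_by_mimetype(self, mimetype):
        return [link for link in self if link.mimetype == mimetype]


links = SketchElementList.for_type(
    Link, [Link('text/html'), Link('application/atom+xml')]
)
html_links = links.filter_by_mimetype('text/html')
```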

exception libearth.schema.EncodeError¶: Raised when encoding Python values into XML data goes wrong.

exception libearth.schema.IntegrityError¶: Raised when an element is invalid according to the schema.

exception libearth.schema.SchemaError¶: Error raised when a schema definition has logical errors.

class libearth.schema.Text(tag, codec=None, xmlns=None, required=False, multiple=False, encoder=None, decoder=None, sort_key=None, sort_reverse=None)¶

Descriptor that declares a possible child element that only consists of character data. All other attributes and child nodes are ignored.

Parameters:

tag (str) – the XML tag name
codec (Codec, collections.Callable) – an optional codec object to use. if it’s callable and not an instance of Codec, its return value is used instead. this means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
xmlns (str) – an optional XML namespace URI
required (bool) – whether the child is required or not. it’s mutually exclusive with multiple. False by default
multiple (bool) – whether the child can be multiple. it’s mutually exclusive with required. False by default
encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument
sort_key (collections.Callable) – an optional function to be used for sorting multiple child elements. it has to take a child as Element and return a value for the sort key. it is the same as the key option of the sorted() built-in function. note that elements are not guaranteed to be sorted at runtime, but they are sorted when written using the write() function. it’s available only when multiple is True. use sort_reverse for descending order.
sort_reverse (bool) – whether to reverse elements when they become sorted. it is the same as the reverse option of the sorted() built-in function. it’s available only when sort_key is present

libearth.schema.complete(element)¶

Completely load the given element.

Parameters:	element (`Element`) – an element loaded by `read()`

class libearth.schema.element_list_for(value_type)¶

Class decorator which registers a specialized ElementList subclass for a particular value_type, e.g.:

@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]

Parameters:	value_type (`type`) – a particular element type that `specialized_type` would be used for instead of default `ElementList` class. it has to be a subtype of `Element`

libearth.schema.index_descriptors(element_type)¶

Index descriptors of the given element_type so that they can be looked up easily by their identifiers (pairs of XML namespace URI and tag name).

Parameters:	element_type (`type`) – a subtype of `Element` to index its descriptors

Note

Internal function.

libearth.schema.inspect_attributes(element_type)¶

Get the dictionary of Attribute descriptors of the given element_type.

Parameters:	element_type (`type`) – a subtype of `Element` to inspect
Returns:	a dictionary of attribute identifiers (pairs of XML namespace URI and XML attribute name) to pairs of instance attribute name and associated `Attribute` descriptor
Return type:	`collections.Mapping`

Note

Internal function.

libearth.schema.inspect_child_tags(element_type)¶

Get the dictionary of Descriptor objects of the given element_type.

Parameters:	element_type (`type`) – a subtype of `Element` to inspect
Returns:	a dictionary of child node identifiers (pairs of XML namespace URI and tag name) to pairs of instance attribute name and associated `Descriptor`
Return type:	`collections.Mapping`

Note

Internal function.

libearth.schema.inspect_content_tag(element_type)¶

Gets the Content descriptor of the given element_type.

Parameters:	element_type (`type`) – a subtype of `Element` to inspect
Returns:	a pair of instance attribute name and associated `Content` descriptor
Return type:	`tuple`

Note

Internal function.

libearth.schema.inspect_xmlns_set(element_type)¶

Get the set of XML namespaces used in the given element_type, recursively including all child elements.

Parameters:	element_type (`type`) – a subtype of `Element` to inspect
Returns:	a set of URI strings of all XML namespaces used
Return type:	`collections.Set`

Note

Internal function.

libearth.schema.is_partially_loaded(element)¶

Return whether the given element is not completely loaded by read() yet.

Parameters:	element (`Element`) – an element
Returns:	`True` if the given `element` is partially loaded
Return type:	`bool`

libearth.schema.read(cls, iterable)¶

Initialize a document in read mode by opening an iterable of XML string chunks.

with open('doc.xml', 'rb') as f:
    read(Person, f)

The returned document element is not fully read but only partially loaded into memory. The rest is loaded lazily (and eventually) when it is actually needed.

Parameters:	cls (`type`) – a subtype of `DocumentElement`
	iterable (`collections.Iterable`) – chunks of XML string to read
Returns:	initialized document element in read mode
Return type:	`DocumentElement`
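The benefit of reading from an iterable of chunks can be sketched with the standard library's `xml.etree.ElementTree.XMLPullParser`. This is not how read() is implemented; `iter_chunks` and `find_first` are hypothetical helpers that show a consumer stopping as soon as it has what it needs, leaving the remaining chunks unread.

```python
import xml.etree.ElementTree as ET

XML = (
    b'<person><name>Hong Minhee</name>'
    + b'<url>http://dahlia.kr/</url>' * 100
    + b'</person>'
)


def iter_chunks(data, size):
    """Yield data in fixed-size chunks (hypothetical helper)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def find_first(chunks, tag):
    """Feed chunks until the first <tag> closes (hypothetical helper).

    Returns the element text and the number of chunks consumed.
    """
    parser = ET.XMLPullParser(events=('end',))
    consumed = 0
    for chunk in chunks:
        parser.feed(chunk)
        consumed += 1
        for event, element in parser.read_events():
            if element.tag == tag:
                return element.text, consumed
    return None, consumed


total_chunks = len(list(iter_chunks(XML, 16)))
text, consumed = find_first(iter_chunks(XML, 16), 'name')
```

Only a prefix of the chunks is consumed here, which is the same property that makes a partially loaded document cheap when only its first few children are accessed.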

libearth.schema.validate(element, recurse=True, raise_error=True)¶

Validate the given element according to the schema.

from libearth.schema import IntegrityError, validate

try:
    validate(element)
except IntegrityError:
    print('the element {0!r} is invalid!'.format(element))

Parameters:	element (`Element`) – the element object to validate
	recurse (`bool`) – recursively validate the whole tree (child nodes). `True` by default
	raise_error (`bool`) – raise an exception when the `element` is invalid. if it’s `False`, return `False` instead of raising an exception. `True` by default
Returns:	`True` if the `element` is valid. `False` if the `element` is invalid and the `raise_error` option is `False`
Raises IntegrityError:	when the `element` is invalid and the `raise_error` option is `True`

class libearth.schema.write(document, validate=True, indent='    ', newline='\n', canonical_order=False, hints=True, as_bytes=None)¶

Write the given document to XML string. The return value is an iterator that yields chunks of an XML string.

with open('doc.xml', 'w') as f:
    for chunk in write(document):
        f.write(chunk)

Parameters:	document (`DocumentElement`) – the document element to serialize
	validate (`bool`) – whether to validate the `document` or not. `True` by default
	indent (`str`) – an optional string to be used for indent. default is four spaces (`'    '`)
	newline (`str`) – an optional character to be used for newline. default is `'\n'`
	canonical_order (`bool`) – make the order of attributes and child nodes consistent across Python versions and implementations. useful for testing. `False` by default
	hints (`bool`) – export hint values as well. hints improve the efficiency of `read()`. `True` by default
	as_bytes – return chunks as `bytes` (`str` in Python 2) if `True`. return chunks as `str` (`unicode` in Python 2) if `False`. return chunks as the default string type (`str`) by default
Returns:	chunks of an XML string
Return type:	`collections.Iterable`
