libearth.schema — Declarative schema for pulling DOM parser of XML

There are two well-known ways to parse XML:

Document Object Model
It reads the whole XML and then builds a tree in memory. You can easily traverse the document as a tree, but the parsing can’t be streamed. Moreover, it spends memory on data you never use.
Simple API for XML
It’s an event-based sequential access parser. You listen for events from it and then make use of its still-unstructured data yourself. In other words, you pay no memory for data you never use, as long as you simply ignore the events for it.
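
To make the contrast concrete, here is a minimal sketch that extracts the url values from a shortened copy of the sample document both ways, using only the standard library (xml.dom.minidom and xml.sax). The UrlCollector name is made up for this illustration and is not part of libearth.

```python
import xml.sax
from xml.dom import minidom

XML = b"""<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <url>https://github.com/dahlia</url>
  <dob>1988-08-04</dob>
</person>"""

# DOM: the whole tree is built in memory first, then traversed.
dom = minidom.parseString(XML)
dom_urls = [node.firstChild.data for node in dom.getElementsByTagName('url')]


class UrlCollector(xml.sax.ContentHandler):
    """SAX: keep only the text of <url> elements and ignore the rest."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._buffer = None  # None means "not inside a <url> element"

    def startElement(self, name, attrs):
        if name == 'url':
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'url':
            self.urls.append(''.join(self._buffer))
            self._buffer = None


collector = UrlCollector()
xml.sax.parseString(XML, collector)
sax_urls = collector.urls
```

Both approaches yield the same list of URLs. The DOM version is shorter but holds every node in memory, while the SAX version keeps only the strings it was asked to collect.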

The pros and cons of these two ways are obvious, but there could be another way to parse XML: mix them.

The basic idea of this pulling DOM parser (which this module implements) is that the parser consumes the stream just in time, when you actually reach a child node. This rests on one assumption: the parsed XML has a schema. If the document is schema-free, this heuristic approach loses most of its efficiency.
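
The "just in time" idea can be sketched with the standard library's xml.etree.ElementTree.XMLPullParser. This is only an illustration of the principle and not how this module is implemented; read_first and CHUNKS are names invented for the sketch.

```python
from xml.etree.ElementTree import XMLPullParser

CHUNKS = [
    '<person version="1.0">',
    '<name>Hong Minhee</name>',
    '<url>http://dahlia.kr/</url>',
    '<dob>1988-08-04</dob>',
    '</person>',
]


def read_first(tag, chunks):
    """Feed chunks one by one and stop once the first <tag> has ended.

    Returns a pair of (element text, number of chunks consumed).
    """
    parser = XMLPullParser(events=('end',))
    consumed = 0
    for chunk in chunks:
        consumed += 1
        parser.feed(chunk)
        for _event, element in parser.read_events():
            if element.tag == tag:
                return element.text, consumed
    return None, consumed


source = iter(CHUNKS)
name, consumed = read_first('name', source)
leftover = list(source)  # chunks the parser never had to look at
```

Because the caller only asked for name, the loop returns before the rest of the stream is fed to the parser, so leftover still holds the unread chunks.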

So the parser needs information about the schema of the XML document it parses, and we can declare the schema by defining classes. It’s something like an ORM for XML. For example, suppose there is a small XML document:

<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <url>https://github.com/dahlia</url>
  <url>https://bitbucket.org/dahlia</url>
  <dob>1988-08-04</dob>
</person>

You can declare the schema for this like the following class definition:

class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Child('dob', Date)

libearth.schema.PARSER_LIST = []

(collections.Sequence) The list of xml.sax parser implementations to try to import.

class libearth.schema.Attribute(name, codec=None, xmlns=None, required=False, default=None, encoder=None, decoder=None)

Declare possible element attributes as a descriptor.

Parameters:
  • name (str) – the XML attribute name
  • codec (Codec, collections.Callable) – an optional codec object to use. If it’s callable and not an instance of Codec, its return value is used instead, so you can pass a Codec subclass that is not instantiated yet, provided its constructor requires no arguments
  • xmlns (str) – an optional XML namespace URI
  • required (bool) – whether the attribute is required or not. False by default
  • encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
  • decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument
key_pair = None

(tuple) The pair of (xmlns, name).

name = None

(str) The XML attribute name.

required = None

(bool) Whether it is required for the element.

xmlns = None

(str) The optional XML namespace URI.

class libearth.schema.Child(tag, element_type, xmlns=None, required=False, multiple=False, sort_key=None, sort_reverse=None)

Declare a possible child element as a descriptor.

To declare a Child whose element type is not defined yet (or is self-referential), pass the class name of the element type as a string. The name will be lazily evaluated, e.g.:

class Person(Element):
    '''Everyone can have their children, that also are a Person.'''

    children = Child('child', 'Person', multiple=True)

Parameters:
  • tag (str) – the tag name
  • xmlns (str) – an optional XML namespace URI
  • element_type (type, str) – the type of child element(s). It has to be a subtype of Element. If it’s a string, it refers to a class name which is lazily evaluated
  • required (bool) – whether the child is required or not. Mutually exclusive with multiple. False by default
  • multiple (bool) – whether the child can occur multiple times. Mutually exclusive with required. False by default
  • sort_key (collections.Callable) – an optional function used for sorting multiple child elements. It has to take a child as an Element and return a sort key value, the same as the key option of the sorted() built-in function. Elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function. Available only when multiple is True. Use sort_reverse for descending order.
  • sort_reverse (bool) – whether to reverse elements when they become sorted. It is the same as the reverse option of the sorted() built-in function. Available only when sort_key is present
element_type

(type) The class this child can contain. It must be a subtype of Element.

class libearth.schema.Codec

Abstract base class for codecs to serialize Python values to be stored in XML and deserialize XML texts to Python values.

In most cases encoding and decoding are implementation details of a well-defined format, so these two functions can be paired. The interface relies on that idea.

To implement a codec, you have to subclass Codec and override a pair of methods: encode() and decode().

Codec objects are accepted by Attribute, Text, and Content (all of which subclass CodecDescriptor).

__weakref__

list of weak references to the object (if defined)

decode(text)

Decode the given XML text to Python value.

Parameters:text (str) – XML text to decode
Returns:the decoded Python value
Raises DecodeError:
 when decoding the given XML text goes wrong

Note

Every Codec subtype has to override this method.

encode(value)

Encode the given Python value into XML text.

Parameters:value – Python value to encode
Returns:the encoded XML text
Return type:str
Raises EncodeError:
 when encoding the given value goes wrong

Note

Every Codec subtype has to override this method.
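
As a rough sketch of the encode()/decode() pairing, the following stand-alone class handles ISO 8601 dates. It is written so that it runs without libearth installed: IsoDate is a hypothetical name, and the two exception classes are local stand-ins for DecodeError and EncodeError. A real codec would subclass Codec and raise the library's own exceptions.

```python
import datetime


class DecodeError(ValueError):
    """Local stand-in for libearth.schema.DecodeError (sketch only)."""


class EncodeError(ValueError):
    """Local stand-in for libearth.schema.EncodeError (sketch only)."""


class IsoDate(object):
    """Hypothetical codec that pairs encode() and decode() for dates."""

    def encode(self, value):
        # Python value -> XML text
        if not isinstance(value, datetime.date):
            raise EncodeError('expected datetime.date, not ' + repr(value))
        return value.isoformat()

    def decode(self, text):
        # XML text -> Python value
        try:
            parsed = datetime.datetime.strptime(text.strip(), '%Y-%m-%d')
        except ValueError as error:
            raise DecodeError(str(error))
        return parsed.date()


codec = IsoDate()
encoded = codec.encode(datetime.date(1988, 8, 4))
decoded = codec.decode(encoded)
```

Keeping both directions in one object makes it easy to check that a value survives a round trip, which is the pairing idea the interface relies on.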

class libearth.schema.CodecDescriptor(codec=None, encoder=None, decoder=None)

Mixin class for descriptors that provide decoder() and encoder().

Attribute, Content and Text can take encoder and decoder functions for them. It’s used for encoding from Python values to XML string and decoding raw values from XML to natural Python representations.

It can take a codec, or an encoder and a decoder separately. (All three can be given at once.) In most cases you’ll need only the codec parameter, which couples an encoder and a decoder:

Text('dob', Rfc3339(prefer_utc=True))

Encoders can be specified using encoder parameter of descriptor’s constructor, or encoder() decorator.

Decoders can be specified using decoder parameter of descriptor’s constructor, or decoder() decorator:

import datetime

class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Text('dob',
               encoder=datetime.date.isoformat,
               decoder=lambda s: datetime.datetime.strptime(
                   s, '%Y-%m-%d'
               ).date())

    @format_version.encoder
    def format_version(self, value):
        return '.'.join(map(str, value))

    @format_version.decoder
    def format_version(self, value):
        return tuple(map(int, value.split('.')))

Parameters:
  • codec (Codec, collections.Callable) – an optional codec object to use. If it’s callable and not an instance of Codec, its return value is used instead, so you can pass a Codec subclass that is not instantiated yet, provided its constructor requires no arguments
  • encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
  • decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument
__weakref__

list of weak references to the object (if defined)

decode(text, instance)

Decode the given text as it’s programmed.

Parameters:
  • text (str) – the raw text to decode. xml attribute value or text node value in most cases
  • instance (Element) – the instance that is associated with the descriptor
Returns:

decoded value

Note

Internal method.

decoder(function)

Decorator which sets the decoder to the decorated function:

import datetime

class Person(DocumentElement):
    '''Person.dob will be a datetime.date instance.'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.decoder
    def dob(self, dob_text):
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

>>> p = Person('<person><dob>1987-07-26</dob></person>')
>>> p.dob
datetime.date(1987, 7, 26)

If it’s applied multiple times, all decorated functions are piped in the order:

class Person(Element):
    '''Person.age will be an integer.'''

    age = Text('dob', decoder=lambda text: text.strip())

    @age.decoder
    def age(self, dob_text):
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

    @age.decoder
    def age(self, dob):
        now = datetime.date.today()
        d = now.month < dob.month or (now.month == dob.month and
                                      now.day < dob.day)
        return now.year - dob.year - d

>>> p = Person('<person>\n\t<dob>\n\t\t1987-07-26\n\t</dob>\n</person>')
>>> p.age
26
>>> datetime.date.today()
datetime.date(2013, 7, 30)

Note

This creates a copy of the descriptor instance rather than modifying it in place.
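
The two behaviours described here, piping in order and copying instead of mutating, can be sketched with a tiny stand-in class. PipedText is hypothetical and far simpler than the real descriptor; it only shows the mechanics under those assumptions.

```python
import datetime


class PipedText(object):
    """Hypothetical stand-in for a descriptor with piped decoders."""

    def __init__(self, decoders=()):
        self.decoders = tuple(decoders)

    def decoder(self, function):
        # Return a new object with the function appended to the pipeline;
        # the object the method was called on is left untouched.
        return PipedText(self.decoders + (function,))

    def decode(self, text):
        value = text
        for function in self.decoders:  # applied in registration order
            value = function(value)
        return value


base = PipedText([lambda text: text.strip()])
dob = base.decoder(
    lambda text: datetime.datetime.strptime(text, '%Y-%m-%d').date()
)
year = dob.decoder(lambda date: date.year)
```

Each call to decoder() returns a longer pipeline, so base still only strips whitespace while year strips, parses, and then extracts the year.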

encoder(function)

Decorator which sets the encoder to the decorated function:

import datetime

class Person(DocumentElement):
    '''Person.dob will be written to ISO 8601 format'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.encoder
    def dob(self, dob):
        if not isinstance(dob, datetime.date):
            raise TypeError('expected datetime.date')
        return dob.strftime('%Y-%m-%d')

>>> isinstance(p, Person)
True
>>> p.dob
datetime.date(1987, 7, 26)
>>> ''.join(write(p, indent='', newline=''))
'<person><dob>1987-07-26</dob></person>'

If it’s applied multiple times, all decorated functions are piped in the order:

class Person(Element):
    '''Person.email will have mailto: prefix when it's written
    to XML.

    '''

    email = Text('email', encoder=lambda email: 'mailto:' + email)

    @email.encoder
    def email(self, email):
        return email.strip()

    @email.encoder
    def email(self, email):
        login, host = email.split('@', 1)
        return login + '@' + host.lower()

>>> isinstance(p, Person)
True
>>> p.email
'  earthreader@librelist.com  '
>>> ''.join(write(p, indent='', newline=''))
'<person><email>mailto:earthreader@librelist.com</email></person>'

Note

This creates a copy of the descriptor instance rather than modifying it in place.

exception libearth.schema.CodecError

Raised when encoding/decoding between Python values and XML data goes wrong.

class libearth.schema.Content(codec=None, encoder=None, decoder=None)

Declare possible text nodes as a descriptor.

Parameters:
  • codec (Codec, collections.Callable) – an optional codec object to use. If it’s callable and not an instance of Codec, its return value is used instead, so you can pass a Codec subclass that is not instantiated yet, provided its constructor requires no arguments
  • encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
  • decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument
read(element, value)

Read raw value from XML, decode it, and then set the attribute for content of the given element to the decoded value.

Note

Internal method.

class libearth.schema.ContentHandler(document)

Event handler implementation for SAX parser.

It maintains a stack of parsing contexts: which element was most recently opened, which descriptor is associated with that element, and the buffer for the chunks of character content the element has. Every context is represented as the namedtuple ParserContext.

Each time its events (startElement(), characters(), and endElement()) are called, it forwards the data to the associated descriptor. Descriptor subtypes implement start_element() method and end_element().
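
The stack-of-contexts technique can be sketched with a plain xml.sax handler. The Context tuple and StackHandler class below are invented for the illustration; the fields of the real ParserContext and the forwarding to descriptors are not reproduced here.

```python
import collections
import xml.sax

# Hypothetical context record; the real ParserContext fields may differ.
Context = collections.namedtuple('Context', 'tag buffer')


class StackHandler(xml.sax.ContentHandler):
    """Track open elements on a stack, each with its own text buffer."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []  # (path, text) pairs recorded when elements close

    def startElement(self, name, attrs):
        self.stack.append(Context(tag=name, buffer=[]))

    def characters(self, content):
        # Character data may arrive in several chunks, so buffer it.
        self.stack[-1].buffer.append(content)

    def endElement(self, name):
        path = '/'.join(context.tag for context in self.stack)
        context = self.stack.pop()
        text = ''.join(context.buffer).strip()
        if text:
            self.texts.append((path, text))


handler = StackHandler()
xml.sax.parseString(
    b'<person><name>Hong Minhee</name><dob>1988-08-04</dob></person>',
    handler,
)
```

The top of the stack always describes the element currently being read, which is the information a handler needs in order to decide where a start, characters, or end event should be delivered.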

exception libearth.schema.DecodeError

Raised when decoding XML data to Python values goes wrong.

class libearth.schema.Descriptor(tag, xmlns=None, required=False, multiple=False, sort_key=None, sort_reverse=None)

Abstract base class for Child and Text.

__weakref__

list of weak references to the object (if defined)

end_element(reserved_value, content)

Abstract method that is invoked when the parser meets an end of an element related to the descriptor. It will be called by ContentHandler.

Parameters:
  • reserved_value – the value start_element() method returned
  • content (str) – the content text of the read element
key_pair = None

(tuple) The pair of (xmlns, tag).

multiple = None

(bool) Whether it can be zero or more for the element. If it’s True required has to be False.

required = None

(bool) Whether it is required for the element. If it’s True multiple has to be False.

sort_key = None

(collections.Callable) An optional function used for sorting multiple elements. It has to take an element and return a sort key value. It is the same as the key option of the sorted() built-in function.

It’s available only when multiple is True.

Use sort_reverse for descending order.

Note

Elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function.

sort_reverse = None

(bool) Whether to reverse elements when they become sorted. It is the same as the reverse option of the sorted() built-in function.

It’s available only when sort_key is present.
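
Since sort_key and sort_reverse mirror the key and reverse options of sorted(), their effect can be previewed with sorted() alone. The Link tuple and by_priority function are hypothetical stand-ins; real children would be Element instances.

```python
import collections

# Hypothetical element stand-in with a field to sort on.
Link = collections.namedtuple('Link', 'uri priority')

links = [
    Link('http://dahlia.kr/', 2),
    Link('https://github.com/dahlia', 3),
    Link('https://bitbucket.org/dahlia', 1),
]


def by_priority(link):
    """Sort key: takes one child and returns the value to order by."""
    return link.priority


ascending = sorted(links, key=by_priority)
descending = sorted(links, key=by_priority, reverse=True)
```

A function like by_priority is what would be passed as sort_key, with sort_reverse=True giving the descending order.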

start_element(element, attribute)

Abstract method that is invoked when the parser meets a start of an element related to the descriptor. It will be called by ContentHandler.

Parameters:
  • element (Element) – the parent element of the read element
  • attribute (str) – the attribute name of the descriptor
Returns:

a value to reserve. it will be passed to reserved_value parameter of end_element()

tag = None

(str) The tag name.

xmlns = None

(str) The optional XML namespace URI.

exception libearth.schema.DescriptorConflictError

Error raised when a schema has more than one descriptor for the same attribute, the same child element, or the text node.

class libearth.schema.DocumentElement(_parent=None, **kwargs)

The root element of the document.

__tag__

(str) Every DocumentElement subtype has to define this attribute to the root tag name.

__xmlns__

(str) A DocumentElement subtype may define this attribute to the XML namespace of the document element.

class libearth.schema.Element(_parent=None, **attributes)

Represent an element in XML document.

It provides the default constructor which takes keywords and initializes the attributes by given keyword arguments. For example, the following code that uses the default constructor:

assert issubclass(Person, Element)

author = Person(
    name='Hong Minhee',
    url='http://dahlia.kr/'
)

is equivalent to the following code:

author = Person()
author.name = 'Hong Minhee'
author.url = 'http://dahlia.kr/'

classmethod __coerce_from__(value)

Cast a value which isn’t an instance of the element type to the element type. It’s useful when a boxed element type could be more naturally represented using builtin type.

For example, Mark could be represented as a boolean value, and Text also could be represented as a string.

The following example shows how the element type can be automatically cast from a string by implementing the __coerce_from__() class method:

@classmethod
def __coerce_from__(cls, value):
    if isinstance(value, str):
        return Text(value=value)
    raise TypeError('expected a string or Text')

__entity_id__()

Identify the entity object. It returns the entity object itself by default, but should be overridden.

Returns:any value to identify the entity object
__merge_entities__(other)

Merge two entities (self and other). It can return one of the two, or even a new entity object. This method is used by Session objects to merge conflicts between concurrent updates.

Parameters:other (Element) – the other entity to merge. It is guaranteed to come from the older session (note that this doesn’t mean the entity is older than self, only that the session’s last update is)
Returns:one of the two, or even a new entity object that merges the two entities
Return type:Element

Note

The default implementation simply returns self. That means the entity of the newer session will always win unless the method is overridden.
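
As a sketch of how the two hooks might be overridden, the class below identifies entities by an id field and keeps whichever copy was edited last. Subscription and its fields are hypothetical, and it is a plain class so that the example runs on its own; a real entity would subclass Element.

```python
class Subscription(object):
    """Hypothetical entity; a real one would subclass Element."""

    def __init__(self, feed_id, title, updated_at):
        self.feed_id = feed_id
        self.title = title
        self.updated_at = updated_at  # any comparable timestamp

    def __entity_id__(self):
        # Two objects with the same feed_id describe the same entity.
        return self.feed_id

    def __merge_entities__(self, other):
        # Keep the copy that was edited most recently, regardless of
        # which session it came from.
        return self if self.updated_at >= other.updated_at else other


newer_session = Subscription('feed-1', 'Old title', updated_at=10)
older_session = Subscription('feed-1', 'New title', updated_at=20)
merged = newer_session.__merge_entities__(older_session)
```

Here the entity from the older session wins because its own timestamp is later, which is exactly the case the default implementation would get wrong.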

class libearth.schema.ElementList(element, descriptor, value_type=None)

List-like object that represents multiple children. It makes the parser lazily consume the buffer when an element at a particular offset is requested.

You can extend methods or properties for a particular element type using element_list_for() class decorator e.g.:

@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]

Extended methods/properties can be used for element lists for the type:

assert isinstance(feed.links, LinkList)
assert isinstance(feed.links, ElementList)
feed.links.filter_by_mimetype('text/html')

consume_buffer()

Consume the buffer for the parser. It returns a generator, so the caller can stop it early with a break statement.

Note

Internal method.

classmethod register_specialized_type(value_type, specialized_type)

Register specialized collections.Sequence type for a particular value_type.

An imperative version of the element_list_for() class decorator.

Parameters:
  • value_type (type) – a particular element type that specialized_type would be used for instead of default ElementList class. it has to be a subtype of Element
  • specialized_type (type) – a collections.Sequence type which extends methods and properties for value_type
specialized_types = {<class 'libearth.feed.Link'>: (<class 'libearth.feed.LinkList'>, None)}

(collections.MutableMapping) The internal table for specialized subtypes used by register_specialized_type() method and element_list_for() class decorator.
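
The relationship between a registry table and a class decorator can be sketched as follows. This is a simplified illustration, not the library's implementation: the table here maps straight to the class, whereas the real specialized_types stores tuples, and the Link tuple is a hypothetical stand-in.

```python
import collections
import collections.abc

# Simplified registry: element type -> specialized sequence type.
specialized_types = {}


def element_list_for(value_type):
    """Sketch of a class decorator that records the decorated class."""
    def decorate(specialized_type):
        specialized_types[value_type] = specialized_type
        return specialized_type
    return decorate


# Hypothetical element stand-in.
Link = collections.namedtuple('Link', 'uri mimetype')


@element_list_for(Link)
class LinkList(collections.abc.Sequence):
    """Sequence of links with one extra convenience method."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)

    def filter_by_mimetype(self, mimetype):
        return [link for link in self if link.mimetype == mimetype]


links = LinkList([
    Link('http://dahlia.kr/', 'text/html'),
    Link('http://dahlia.kr/feed', 'application/atom+xml'),
])
html_links = links.filter_by_mimetype('text/html')
```

The decorator form and the imperative form do the same thing; the decorator simply performs the registration at class definition time.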

exception libearth.schema.EncodeError

Raised when encoding Python values into XML data goes wrong.

exception libearth.schema.IntegrityError

Raised when an element is invalid according to the schema.

exception libearth.schema.SchemaError

Error raised when a schema definition has logical errors.

__weakref__

list of weak references to the object (if defined)

class libearth.schema.Text(tag, codec=None, xmlns=None, required=False, multiple=False, encoder=None, decoder=None, sort_key=None, sort_reverse=None)

Descriptor that declares a possible child element that consists only of character data. All other attributes and child nodes are ignored.

Parameters:
  • tag (str) – the XML tag name
  • codec (Codec, collections.Callable) – an optional codec object to use. If it’s callable and not an instance of Codec, its return value is used instead, so you can pass a Codec subclass that is not instantiated yet, provided its constructor requires no arguments
  • xmlns (str) – an optional XML namespace URI
  • required (bool) – whether the child is required or not. Mutually exclusive with multiple. False by default
  • multiple (bool) – whether the child can occur multiple times. Mutually exclusive with required. False by default
  • encoder (collections.Callable) – an optional function that encodes Python value into XML text value e.g. str(). the encoder function has to take an argument
  • decoder (collections.Callable) – an optional function that decodes XML text value into Python value e.g. int(). the decoder function has to take a string argument
  • sort_key (collections.Callable) – an optional function used for sorting multiple child elements. It has to take a child as an Element and return a sort key value, the same as the key option of the sorted() built-in function. Elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function. Available only when multiple is True. Use sort_reverse for descending order.
  • sort_reverse (bool) – whether to reverse elements when they become sorted. It is the same as the reverse option of the sorted() built-in function. Available only when sort_key is present
libearth.schema.complete(element)

Completely load the given element.

Parameters:element (Element) – an element loaded by read()
class libearth.schema.element_list_for(value_type)

Class decorator which registers specialized ElementList subclass for a particular value_type e.g.:

@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]
Parameters:value_type (type) – a particular element type that specialized_type would be used for instead of default ElementList class. it has to be a subtype of Element
__weakref__

list of weak references to the object (if defined)

libearth.schema.index_descriptors(element_type)

Index descriptors of the given element_type to make them easy to be looked up by their identifiers (pairs of XML namespace URI and tag name).

Parameters:element_type (type) – a subtype of Element to index its descriptors

Note

Internal function.

libearth.schema.inspect_attributes(element_type)

Get the dictionary of Attribute descriptors of the given element_type.

Parameters:element_type (type) – a subtype of Element to inspect
Returns:a dictionary that maps attribute identifiers (pairs of XML namespace URI and XML attribute name) to pairs of instance attribute name and associated Attribute descriptor
Return type:collections.Mapping

Note

Internal function.

libearth.schema.inspect_child_tags(element_type)

Get the dictionary of Descriptor objects of the given element_type.

Parameters:element_type (type) – a subtype of Element to inspect
Returns:a dictionary that maps child node identifiers (pairs of XML namespace URI and tag name) to pairs of instance attribute name and associated Descriptor
Return type:collections.Mapping

Note

Internal function.

libearth.schema.inspect_content_tag(element_type)

Gets the Content descriptor of the given element_type.

Parameters:element_type (type) – a subtype of Element to inspect
Returns:a pair of instance attribute name and associated Content descriptor
Return type:tuple

Note

Internal function.

libearth.schema.inspect_xmlns_set(element_type)

Get the set of XML namespaces used in the given element_type, recursively including all child elements.

Parameters:element_type (type) – a subtype of Element to inspect
Returns:a set of URI strings of all XML namespaces used
Return type:collections.Set

Note

Internal function.

libearth.schema.is_partially_loaded(element)

Return whether the given element is not completely loaded by read() yet.

Parameters:element (Element) – an element
Returns:True if the given element is partially loaded
Return type:bool
libearth.schema.read(cls, iterable)

Initialize a document in read mode from an iterable of XML string chunks.

with open('doc.xml', 'rb') as f:
    read(Person, f)

The returned document element is not fully read. It is only partially loaded into memory, and the rest is loaded lazily when it is actually needed.

Parameters:
  • cls (type) – a subtype of DocumentElement
  • iterable (collections.Iterable) – chunks of XML string to read
Returns:

initialized document element in read mode

Return type:

DocumentElement

libearth.schema.validate(element, recurse=True, raise_error=True)

Validate the given element according to the schema.

from libearth.schema import IntegrityError, validate

try:
    validate(element)
except IntegrityError:
    print('the element {0!r} is invalid!'.format(element))
Parameters:
  • element (Element) – the element object to validate
  • recurse (bool) – recursively validate the whole tree (child nodes). True by default
  • raise_error (bool) – raise exception when the element is invalid. if it’s False it returns False instead of raising an exception. True by default
Returns:

True if the element is valid. False if the element is invalid and raise_error option is False

Raises IntegrityError:
 when the element is invalid and raise_error option is True

class libearth.schema.write(document, validate=True, indent='    ', newline='\n', canonical_order=False, as_bytes=None)

Write the given document to XML string. The return value is an iterator that yields chunks of an XML string.

with open('doc.xml', 'w') as f:
    for chunk in write(document):
        f.write(chunk)
Parameters:
  • document (DocumentElement) – the document element to serialize
  • validate (bool) – whether to validate the document or not. True by default
  • indent (str) – an optional string to be used for indent. default is four spaces ('    ')
  • newline (str) – an optional character to be used for newline. default is '\n'
  • canonical_order (bool) – make the order of attributes and child nodes consistent to any python versions and implementations. useful for testing. False by default
  • as_bytes – return chunks as bytes (str in Python 2) if True. return chunks as str (unicode in Python 2) if False. return chunks as the default string type (str) by default
Returns:

chunks of an XML string

Return type:

collections.Iterable
