libearth.schema
— Declarative schema for pulling DOM parser of XML
There are two well-known ways to parse XML:
- Document Object Model
- It reads the whole XML document and then builds a tree in memory. You can easily traverse the document as a tree, but the parsing can't be streamed. Moreover, it spends memory on data you never use.
- Simple API for XML
- It's an event-based sequential access parser. You have to listen to its events and then make use of the still unstructured data yourself. In return, you spend no memory on data you never use, as long as you simply ignore the events for it.
The pros and cons of these two approaches are obvious, but there is another way to parse XML: mix them.
The basic idea of this pulling DOM parser (which this module implements) is that the parser consumes the stream just in time, when you actually reach a child node. This rests on one assumption: the parsed XML has a schema. If the document is schema-free, this heuristic approach loses most of its efficiency.
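Python's standard library ships a module built on a similar idea, xml.dom.pulldom. The sketch below uses it (not libearth) to show what "build the tree only where you need it" can look like. The sample document and the helper name first_text are assumptions made for illustration.

```python
from xml.dom import pulldom

# Sample document, assumed for illustration only.
XML = (
    '<person version="1.0">'
    '<name>Hong Minhee</name>'
    '<url>http://dahlia.kr/</url>'
    '<dob>1988-08-04</dob>'
    '</person>'
)


def first_text(xml, tag):
    """Return the text of the first <tag> element, or None if absent.

    Hypothetical helper: it stops iterating over parser events as soon
    as the wanted element has been expanded into a small DOM subtree.
    """
    events = pulldom.parseString(xml)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)  # builds the subtree for this node only
            return ''.join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
    return None


name = first_text(XML, 'name')
missing = first_text(XML, 'nickname')
```

Only the requested element is turned into DOM nodes; every other element passes by as a plain event.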
So the parser needs information about the schema of the XML document it parses, and we can declare the schema by defining classes. It's something like an ORM for XML. For example, suppose there is a small XML document:
<?xml version="1.0"?>
<person version="1.0">
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <url>https://github.com/dahlia</url>
  <url>https://bitbucket.org/dahlia</url>
  <dob>1988-08-04</dob>
</person>
You can declare the schema for this like the following class definition:
class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Child('dob', Date)
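To make the "ORM for XML" analogy concrete, here is a minimal sketch of how class-level descriptors can map attributes and child tags onto Python attributes. It is not libearth's implementation: TextField, AttributeField and the node attribute are hypothetical names, and the sketch parses the whole document eagerly with xml.etree.ElementTree instead of pulling it lazily.

```python
import xml.etree.ElementTree as ET


class TextField:
    """Hypothetical stand-in for a text descriptor (not libearth's Text)."""

    def __init__(self, tag, multiple=False):
        self.tag = tag
        self.multiple = multiple

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        texts = [node.text for node in obj.node.findall(self.tag)]
        if self.multiple:
            return texts
        return texts[0] if texts else None


class AttributeField:
    """Hypothetical stand-in for an attribute descriptor."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.node.get(self.name)


class Person:
    format_version = AttributeField('version')
    name = TextField('name')
    url = TextField('url', multiple=True)

    def __init__(self, xml):
        # Unlike libearth, this sketch parses the whole document up front.
        self.node = ET.fromstring(xml)


person = Person(
    '<person version="1.0"><name>Hong Minhee</name>'
    '<url>http://dahlia.kr/</url><url>https://github.com/dahlia</url></person>'
)
```

The class body reads like a schema, while each attribute access is resolved against the underlying XML node.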
-
libearth.schema.PARSER_LIST = []
(collections.Sequence) The list of xml.sax parser implementations to try to import.
-
libearth.schema.SCHEMA_XMLNS = 'http://earthreader.org/schema/'
(str) The XML namespace name used for schema metadata.
-
class libearth.schema.Attribute(name, codec=None, xmlns=None, required=False, default=None, encoder=None, decoder=None)
Declare possible element attributes as a descriptor.
Parameters:
- name (str) – the XML attribute name
- codec (Codec, collections.Callable) – an optional codec object to use. If it's callable and not an instance of Codec, its return value will be used instead. This means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
- xmlns (str) – an optional XML namespace URI
- required (bool) – whether the attribute is required or not. False by default
- default (collections.Callable) – an optional function that returns the default value when the attribute is not present. The function takes one argument, an Element instance
- encoder (collections.Callable) – an optional function that encodes a Python value into an XML text value, e.g. str(). The encoder function has to take one argument
- decoder (collections.Callable) – an optional function that decodes an XML text value into a Python value, e.g. int(). The decoder function has to take a string argument
Changed in version 0.2.0: The default option now accepts only callable objects. Before 0.2.0, default was not a function but a value that was simply used as is.
-
default = None
(collections.Callable) The function that returns the default value when the attribute is not present. The function takes one argument, an Element instance.
Changed in version 0.2.0: It now accepts only callable objects. Before 0.2.0, the default attribute was not a function but a value that was simply used as is.
-
class libearth.schema.Child(tag, element_type, xmlns=None, required=False, multiple=False, sort_key=None, sort_reverse=None)
Declare a possible child element as a descriptor.
To declare a Child whose element type is not defined yet (or is self-referential), pass the class name of the element type as a string. The name will be lazily evaluated, e.g.:
class Person(Element):
    '''Everyone can have their children, that also are a Person.'''
    children = Child('child', 'Person', multiple=True)
Parameters:
- tag (str) – the tag name
- xmlns (str) – an optional XML namespace URI
- element_type (type, str) – the type of the child element(s). It has to be a subtype of Element. If it's a string, it refers to a class name that will be lazily evaluated
- required (bool) – whether the child is required or not. It's mutually exclusive with multiple. False by default
- multiple (bool) – whether the child can occur multiple times. It's mutually exclusive with required. False by default
- sort_key (collections.Callable) – an optional function used for sorting multiple child elements. It has to take a child Element and return a value to use as the sort key. It is the same as the key option of the sorted() built-in function. Note that elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function. It's available only when multiple is True. Use sort_reverse for descending order.
- sort_reverse (bool) – whether to reverse elements when they are sorted. It is the same as the reverse option of the sorted() built-in function. It's available only when sort_key is present
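One common way to support a class name given as a string is to keep the string until the type is first needed and only then look it up. The sketch below assumes a simple name registry; SketchElement, LazyChild and the registry are hypothetical and do not describe how libearth resolves names internally.

```python
class SketchElement:
    """Hypothetical base class that records every subclass by name."""

    registry = {}  # class name -> class, filled as subclasses are defined

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        SketchElement.registry[cls.__name__] = cls


class LazyChild:
    """Hypothetical field that accepts a class or a class name."""

    def __init__(self, tag, element_type):
        self.tag = tag
        self._element_type = element_type

    @property
    def element_type(self):
        # Resolve the string on first use, when the class already exists.
        if isinstance(self._element_type, str):
            self._element_type = SketchElement.registry[self._element_type]
        return self._element_type


class Person(SketchElement):
    # 'Person' does not exist yet while this class body is executed.
    children = LazyChild('child', 'Person')


resolved = Person.children.element_type
```

Because the lookup is deferred, the self-referential declaration works even though the name is unbound while the class body runs.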
-
class libearth.schema.Codec
Abstract base class for codecs that serialize Python values to be stored in XML and deserialize XML texts to Python values.
In most cases encoding and decoding are implementation details of a well-defined format, so these two functions can be paired. The interface relies on that idea.
To implement a codec, you have to subclass Codec and override a pair of methods: encode() and decode().
Codec objects are accepted by Attribute, Text, and Content (they all subclass CodecDescriptor).
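The pairing of the two methods can be sketched without libearth. SketchCodec below is a hypothetical stand-in for the Codec base class, and the one-argument signatures of encode() and decode() are assumptions, not the library's confirmed interface.

```python
import abc
import datetime


class SketchCodec(abc.ABC):
    """Hypothetical stand-in for the Codec base class."""

    @abc.abstractmethod
    def encode(self, value):
        """Serialize a Python value to XML text."""

    @abc.abstractmethod
    def decode(self, text):
        """Deserialize XML text to a Python value."""


class IsoDate(SketchCodec):
    """Pairs ISO 8601 date formatting with the matching parser."""

    def encode(self, value):
        if not isinstance(value, datetime.date):
            raise TypeError('expected datetime.date')
        return value.isoformat()

    def decode(self, text):
        return datetime.datetime.strptime(text.strip(), '%Y-%m-%d').date()


codec = IsoDate()
text = codec.encode(datetime.date(1988, 8, 4))
value = codec.decode(text)
```

Keeping both directions in one object makes it hard for the written format and the parsed format to drift apart.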
-
class libearth.schema.CodecDescriptor(codec=None, encoder=None, decoder=None)
Mixin class for descriptors that provide decoder() and encoder().
Attribute, Content and Text can take encoder and decoder functions. They are used for encoding Python values to XML strings and decoding raw values from XML to natural Python representations.
It can take a codec, or an encoder and a decoder separately. (Of course they can all be present at the same time.) In most cases you'll need only the codec parameter, in which the encoder and decoder are coupled:
Text('dob', Rfc3339(prefer_utc=True))
Encoders can be specified using the encoder parameter of the descriptor's constructor, or the encoder() decorator.
Decoders can be specified using the decoder parameter of the descriptor's constructor, or the decoder() decorator:
class Person(DocumentElement):
    __tag__ = 'person'
    format_version = Attribute('version')
    name = Text('name')
    url = Child('url', URL, multiple=True)
    dob = Text('dob',
               encoder=datetime.date.isoformat,
               decoder=lambda s: datetime.datetime.strptime(s, '%Y-%m-%d').date())

    @format_version.encoder
    def format_version(self, value):
        return '.'.join(map(str, value))

    @format_version.decoder
    def format_version(self, value):
        return tuple(map(int, value.split('.')))
Parameters:
- codec (Codec, collections.Callable) – an optional codec object to use. If it's callable and not an instance of Codec, its return value will be used instead. This means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
- encoder (collections.Callable) – an optional function that encodes a Python value into an XML text value, e.g. str(). The encoder function has to take one argument
- decoder (collections.Callable) – an optional function that decodes an XML text value into a Python value, e.g. int(). The decoder function has to take a string argument
-
decode(text, instance)
Decode the given text as it's programmed.
Returns: decoded value
Note
Internal method.
-
decoder(function)
Decorator which sets the decoder to the decorated function:
import datetime

class Person(DocumentElement):
    '''Person.dob will be a datetime.date instance.'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.decoder
    def dob(self, dob_text):
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

>>> p = Person('<person><dob>1987-07-26</dob></person>')
>>> p.dob
datetime.date(1987, 7, 26)
If it's applied multiple times, all decorated functions are piped in order:
class Person(Element):
    '''Person.age will be an integer.'''

    age = Text('dob', decoder=lambda text: text.strip())

    @age.decoder
    def age(self, dob_text):
        return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()

    @age.decoder
    def age(self, dob):
        now = datetime.date.today()
        d = now.month < dob.month or (now.month == dob.month and now.day < dob.day)
        return now.year - dob.year - d

>>> p = Person('<person>\n\t<dob>\n\t\t1987-07-26\n\t</dob>\n</person>')
>>> p.age
26
>>> datetime.date.today()
datetime.date(2013, 7, 30)
Note
This creates a copy of the descriptor instance rather than modifying it in place.
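The two behaviours described for decoder(), piping in registration order and returning a copy, can be sketched in isolation. PipedField is a hypothetical class, and its decoders take a single argument for brevity (the decorated methods above also receive self).

```python
import datetime


class PipedField:
    """Hypothetical field holding an ordered pipeline of decoders."""

    def __init__(self, decoders=()):
        self.decoders = tuple(decoders)

    def decoder(self, function):
        # Return a new field instead of mutating this one in place.
        return PipedField(self.decoders + (function,))

    def decode(self, text):
        value = text
        for function in self.decoders:  # applied in registration order
            value = function(value)
        return value


stripped = PipedField([str.strip])
dated = stripped.decoder(
    lambda text: datetime.datetime.strptime(text, '%Y-%m-%d').date()
)
year = dated.decoder(lambda date: date.year)
```

Since every call to decoder() returns a new object, stripped still has a single decoder after dated and year are derived from it.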
-
encoder(function)
Decorator which sets the encoder to the decorated function:
import datetime

class Person(DocumentElement):
    '''Person.dob will be written to ISO 8601 format'''

    __tag__ = 'person'
    dob = Text('dob')

    @dob.encoder
    def dob(self, dob):
        if not isinstance(dob, datetime.date):
            raise TypeError('expected datetime.date')
        return dob.strftime('%Y-%m-%d')

>>> isinstance(p, Person)
True
>>> p.dob
datetime.date(1987, 7, 26)
>>> ''.join(write(p, indent='', newline=''))
'<person><dob>1987-07-26</dob></person>'
If it's applied multiple times, all decorated functions are piped in order:
class Person(Element):
    '''Person.email will have mailto: prefix when it's written to XML.'''

    email = Text('email', encoder=lambda email: 'mailto:' + email)

    @email.encoder
    def email(self, email):
        return email.strip()

    @email.encoder
    def email(self, email):
        login, host = email.split('@', 1)
        return login + '@' + host.lower()

>>> isinstance(p, Person)
True
>>> p.email
' earthreader@librelist.com '
>>> ''.join(write(p, indent='', newline=''))
'<person><email>mailto:earthreader@librelist.com</email></person>'
Note
This creates a copy of the descriptor instance rather than modifying it in place.
-
exception libearth.schema.CodecError
Raised when encoding/decoding between Python values and XML data goes wrong.
-
class libearth.schema.Content(codec=None, encoder=None, decoder=None)
Declare possible text nodes as a descriptor.
Parameters:
- codec (Codec, collections.Callable) – an optional codec object to use. If it's callable and not an instance of Codec, its return value will be used instead. This means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
- encoder (collections.Callable) – an optional function that encodes a Python value into an XML text value, e.g. str(). The encoder function has to take one argument
- decoder (collections.Callable) – an optional function that decodes an XML text value into a Python value, e.g. int(). The decoder function has to take a string argument
-
read(element, value)
Read the raw value from XML, decode it, and then set the content attribute of the given element to the decoded value.
Note
Internal method.
-
class libearth.schema.ContentHandler(document)
Event handler implementation for the SAX parser.
It maintains a stack of parsing contexts: which element was opened last, which descriptor is associated with that element, and the buffer for chunks of character content the element has. Every context is represented as the namedtuple ParserContext.
Each time one of its events (startElement(), characters(), and endElement()) is called, it forwards the data to the associated descriptor. Descriptor subtypes implement the start_element() and end_element() methods.
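The stack-of-contexts technique can be shown with the standard xml.sax module alone. In this sketch the Context namedtuple is a simplified, hypothetical version of ParserContext (it has no descriptor field), and StackHandler just records what it sees instead of forwarding to descriptors.

```python
import collections
import xml.sax

# Hypothetical, simplified context: just the tag and its text buffer.
Context = collections.namedtuple('Context', 'tag buffer')


class StackHandler(xml.sax.ContentHandler):
    """Collects (depth, tag, text) for every element that closes."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.closed = []

    def startElement(self, name, attrs):
        # A newly opened element becomes the top of the stack.
        self.stack.append(Context(name, []))

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it per element.
        if self.stack:
            self.stack[-1].buffer.append(content)

    def endElement(self, name):
        context = self.stack.pop()
        text = ''.join(context.buffer).strip()
        self.closed.append((len(self.stack), context.tag, text))


handler = StackHandler()
xml.sax.parseString(
    b'<person><name>Hong Minhee</name><dob>1988-08-04</dob></person>',
    handler,
)
```

The top of the stack always tells the handler which element incoming characters belong to, which is what makes forwarding to the right descriptor possible.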
-
exception libearth.schema.DecodeError
Raised when decoding XML data to Python values goes wrong.
-
class libearth.schema.Descriptor(tag, xmlns=None, required=False, multiple=False, sort_key=None, sort_reverse=None)
Abstract base class for Child and Text.
-
end_element(reserved_value, content)
Abstract method that is invoked when the parser meets the end of an element related to the descriptor. It will be called by ContentHandler.
Parameters:
- reserved_value – the value the start_element() method returned
- content (str) – the content text of the read element
-
multiple = None
(bool) Whether the element can occur zero or more times. If it's True, required has to be False.
-
required = None
(bool) Whether it is required for the element. If it's True, multiple has to be False.
-
sort_key = None
(collections.Callable) An optional function used for sorting multiple elements. It has to take an element and return a value to use as the sort key. It is the same as the key option of the sorted() built-in function.
It's available only when multiple is True.
Use sort_reverse for descending order.
Note
Elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function.
-
sort_reverse = None
(bool) Whether to reverse elements when they are sorted. It is the same as the reverse option of the sorted() built-in function.
It's available only when sort_key is present.
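Since both options mirror the arguments of sorted(), their effect can be previewed with the built-in itself. The helper ordered_for_write below is hypothetical; it only illustrates the ordering a writer could apply and is not part of libearth.

```python
def ordered_for_write(items, sort_key=None, sort_reverse=None):
    """Hypothetical helper: order items the way a writer might."""
    if sort_key is None:
        # Without a sort key the original order is kept.
        return list(items)
    return sorted(items, key=sort_key, reverse=bool(sort_reverse))


names = ['carol', 'alice', 'bob']
ascending = ordered_for_write(names, sort_key=str.lower)
descending = ordered_for_write(names, sort_key=str.lower, sort_reverse=True)
untouched = ordered_for_write(names)
```

The in-memory list is left as it is, which matches the note that ordering is only guaranteed in the written output.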
-
start_element(element, attribute)
Abstract method that is invoked when the parser meets the start of an element related to the descriptor. It will be called by ContentHandler.
Returns: a value to reserve. It will be passed to the reserved_value parameter of end_element()
-
exception libearth.schema.DescriptorConflictError
Error raised when a schema has more than one descriptor for the same attribute, the same child element, or the text node.
-
class libearth.schema.DocumentElement(_parent=None, **kwargs)
The root element of the document.
-
__tag__
(str) Every DocumentElement subtype has to define this attribute as the root tag name.
-
__xmlns__
(str) A DocumentElement subtype may define this attribute as the XML namespace of the document element.
-
class libearth.schema.Element(_parent=None, **attributes)
Represents an element in an XML document.
It provides a default constructor which takes keywords and initializes the attributes from the given keyword arguments. For example, the following code, which uses the default constructor:
assert issubclass(Person, Element)
author = Person(
    name='Hong Minhee',
    url='http://dahlia.kr/'
)
is equivalent to the following code:
author = Person()
author.name = 'Hong Minhee'
author.url = 'http://dahlia.kr/'
-
classmethod __coerce_from__(value)
Cast a value which isn't an instance of the element type to the element type. It's useful when a boxed element type can be represented more naturally using a built-in type.
For example, Mark could be represented as a boolean value, and Text could also be represented as a string.
The following example shows how the element type can be automatically cast from a string by implementing the __coerce_from__() class method:
@classmethod
def __coerce_from__(cls, value):
    if isinstance(value, str):
        return Text(value=value)
    raise TypeError('expected a string or Text')
-
__entity_id__()
Identify the entity object. It returns the entity object itself by default, but should be overridden.
Returns: any value to identify the entity object
-
__merge_entities__(other)
Merge two entities (self and other). It can return one of the two, or even a new entity object. This method is used by Session objects to merge conflicts between concurrent updates.
Parameters: other (Element) – the other entity to merge. It's guaranteed to come from the older session (note that this doesn't mean the entity itself is older than self, only that the session's last update is)
Returns: one of the two, or even a new entity object that merges the two entities
Return type: Element
Note
The default implementation simply returns self. That means the entity of the newer session will always win unless the method is overridden.
-
class libearth.schema.ElementList(element, descriptor, value_type=None)
List-like object that represents multiple children. It makes the parser lazily consume the buffer when an element at a particular offset is requested.
You can add methods or properties for a particular element type using the element_list_for() class decorator, e.g.:
@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]
Extended methods/properties can be used on element lists of that type:
assert isinstance(feed.links, LinkList)
assert isinstance(feed.links, ElementList)
feed.links.filter_by_mimetype('text/html')
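The "consume only up to the requested offset" behaviour can be sketched with a plain iterator standing in for the parser. LazyList and numbered_urls are hypothetical names, and the sketch supports only non-negative integer indexes.

```python
import collections.abc


class LazyList(collections.abc.Sequence):
    """Hypothetical list that pulls from an iterator only as far as needed."""

    def __init__(self, iterable):
        self._source = iter(iterable)
        self._loaded = []

    def _load_until(self, index):
        # Consume the source until `index` is available or it runs out.
        while len(self._loaded) <= index:
            try:
                self._loaded.append(next(self._source))
            except StopIteration:
                break

    def __getitem__(self, index):
        if not isinstance(index, int) or index < 0:
            raise TypeError('this sketch supports non-negative integers only')
        self._load_until(index)
        return self._loaded[index]  # raises IndexError past the end

    def __len__(self):
        # The length is unknown until everything has been consumed.
        self._loaded.extend(self._source)
        return len(self._loaded)


consumed = []  # records how far the source has been read


def numbered_urls():
    for n in range(3):
        consumed.append(n)
        yield 'http://example.com/{0}'.format(n)


urls = LazyList(numbered_urls())
second = urls[1]
consumed_after_second = list(consumed)
```

Asking for index 1 reads two items from the source and leaves the third untouched until something needs it, such as len().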
-
consume_buffer()
Consume the buffer for the parser. It returns a generator, so the caller can stop it with a break statement.
Note
Internal method.
-
classmethod register_specialized_type(value_type, specialized_type)
Register a specialized collections.Sequence type for a particular value_type.
An imperative version of the element_list_for() class decorator.
Parameters:
- value_type (type) – a particular element type that specialized_type would be used for instead of the default ElementList class. It has to be a subtype of Element
- specialized_type (type) – a collections.Sequence type which extends methods and properties for value_type
-
specialized_types = {<class 'libearth.feed.Link'>: (<class 'libearth.feed.LinkList'>, None)}
(collections.MutableMapping) The internal table for specialized subtypes used by the register_specialized_type() method and the element_list_for() class decorator.
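The relationship between an imperative registration function and its class decorator form is a general pattern, sketched below with a module-level dictionary. The names list_type_for, Link, Category and the list subclasses are hypothetical, and the table here maps a type directly to a type rather than to the pair shown above.

```python
specialized_types = {}  # value type -> specialized sequence type


def register_specialized_type(value_type, specialized_type):
    """Imperative form: record the mapping directly."""
    specialized_types[value_type] = specialized_type


def list_type_for(value_type):
    """Class decorator form, built on the imperative one."""
    def decorate(specialized_type):
        register_specialized_type(value_type, specialized_type)
        return specialized_type  # the class itself is left unchanged
    return decorate


class Link:
    pass


class Category:
    pass


@list_type_for(Link)
class LinkList(list):
    pass


class CategoryList(list):
    pass


register_specialized_type(Category, CategoryList)
```

Both routes end in the same table entry; the decorator is only a more declarative spelling.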
-
exception libearth.schema.EncodeError
Raised when encoding Python values into XML data goes wrong.
-
exception libearth.schema.IntegrityError
Raised when an element is invalid according to the schema.
-
exception libearth.schema.SchemaError
Error raised when a schema definition has logical errors.
-
class libearth.schema.Text(tag, codec=None, xmlns=None, required=False, multiple=False, encoder=None, decoder=None, sort_key=None, sort_reverse=None)
Descriptor that declares a possible child element that only consists of character data. All other attributes and child nodes are ignored.
Parameters:
- tag (str) – the XML tag name
- codec (Codec, collections.Callable) – an optional codec object to use. If it's callable and not an instance of Codec, its return value will be used instead. This means it can take a Codec subclass that is not instantiated yet, as long as the constructor requires no arguments
- xmlns (str) – an optional XML namespace URI
- required (bool) – whether the child is required or not. It's mutually exclusive with multiple. False by default
- multiple (bool) – whether the child can occur multiple times. It's mutually exclusive with required. False by default
- encoder (collections.Callable) – an optional function that encodes a Python value into an XML text value, e.g. str(). The encoder function has to take one argument
- decoder (collections.Callable) – an optional function that decodes an XML text value into a Python value, e.g. int(). The decoder function has to take a string argument
- sort_key (collections.Callable) – an optional function used for sorting multiple child elements. It has to take a child Element and return a value to use as the sort key. It is the same as the key option of the sorted() built-in function. Note that elements are not guaranteed to be sorted at runtime, but they are all sorted when written using the write() function. It's available only when multiple is True. Use sort_reverse for descending order.
- sort_reverse (bool) – whether to reverse elements when they are sorted. It is the same as the reverse option of the sorted() built-in function. It's available only when sort_key is present
-
libearth.schema.complete(element)
Completely load the given element.
Parameters: element (Element) – an element loaded by read()
-
class libearth.schema.element_list_for(value_type)
Class decorator which registers a specialized ElementList subclass for a particular value_type, e.g.:
@element_list_for(Link)
class LinkList(collections.Sequence):
    '''Specialized ElementList for Link elements.'''

    def filter_by_mimetype(self, mimetype):
        '''Filter links by their mimetype.'''
        return [link for link in self if link.mimetype == mimetype]
Parameters: value_type (type) – a particular element type that specialized_type would be used for instead of the default ElementList class. It has to be a subtype of Element
-
libearth.schema.index_descriptors(element_type)
Index the descriptors of the given element_type so that they can easily be looked up by their identifiers (pairs of XML namespace URI and tag name).
Parameters: element_type (type) – a subtype of Element whose descriptors are indexed
Note
Internal function.
-
libearth.schema.inspect_attributes(element_type)
Get the dictionary of Attribute descriptors of the given element_type.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a dictionary of attribute identifiers (pairs of XML namespace URI and XML attribute name) to pairs of instance attribute name and associated Attribute descriptor
Return type: collections.Mapping
Note
Internal function.
Get the dictionary of Descriptor objects of the given element_type.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a dictionary of child node identifiers (pairs of XML namespace URI and tag name) to pairs of instance attribute name and associated Descriptor
Return type: collections.Mapping
Note
Internal function.
-
libearth.schema.inspect_content_tag(element_type)
Get the Content descriptor of the given element_type.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a pair of instance attribute name and associated Content descriptor
Return type: tuple
Note
Internal function.
-
libearth.schema.inspect_xmlns_set(element_type)
Get the set of XML namespaces used in the given element_type, recursively including all child elements.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a set of URI strings of all XML namespaces used
Return type: collections.Set
Note
Internal function.
-
libearth.schema.is_partially_loaded(element)
Return whether the given element is not completely loaded by read() yet.
Parameters: element (Element) – an element
Returns: True if the given element is partially loaded
Return type: bool
-
libearth.schema.read(cls, iterable)
Initialize a document in read mode by opening the iterable of XML string.
with open('doc.xml', 'rb') as f:
    read(Person, f)
The returned document element is not fully read; it is only partially loaded into memory, and the rest is loaded lazily (and eventually) when it is actually needed.
Parameters:
- cls (type) – a subtype of DocumentElement
- iterable (collections.Iterable) – chunks of XML string to read
Returns: initialized document element in read mode
Return type: DocumentElement
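Reading from an iterable of chunks, and stopping early, is possible with the incremental interface of the standard xml.sax parser. The sketch below is an assumption about the general technique, not a description of what read() does internally; TagRecorder and feed_until are hypothetical names.

```python
import xml.sax


class TagRecorder(xml.sax.ContentHandler):
    """Remembers the name of every element that has been opened."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


def feed_until(chunks, handler, wanted):
    """Feed chunks one at a time and stop once `wanted` has been opened.

    Returns the number of chunks that had to be consumed.
    """
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    used = 0
    for chunk in chunks:
        parser.feed(chunk)  # incremental parsing: no need for the whole file
        used += 1
        if wanted in handler.seen:
            break
    return used


chunks = [
    b'<person><name>Hong',
    b' Minhee</name>',
    b'<dob>1988-08-04</dob>',
    b'</person>',
]
handler = TagRecorder()
used = feed_until(iter(chunks), handler, 'name')
```

The remaining chunks stay unread in the iterator, which is the property that lets a partially loaded document defer the rest of the input.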
-
libearth.schema.validate(element, recurse=True, raise_error=True)
Validate the given element according to the schema.
from libearth.schema import IntegrityError, validate

try:
    validate(element)
except IntegrityError:
    print('the element {0!r} is invalid!'.format(element))
Returns: True if the element is valid. False if the element is invalid and the raise_error option is False
Raises IntegrityError: when the element is invalid and the raise_error option is True
-
class libearth.schema.write(document, validate=True, indent='    ', newline='\n', canonical_order=False, hints=True, as_bytes=None)
Write the given document to an XML string. The return value is an iterator that yields chunks of the XML string.
with open('doc.xml', 'w') as f:
    for chunk in write(document):
        f.write(chunk)
Parameters:
- document (DocumentElement) – the document element to serialize
- validate (bool) – whether to validate the document or not. True by default
- indent (str) – an optional string to be used for indentation. The default is four spaces ('    ')
- newline (str) – an optional character to be used for newlines. The default is '\n'
- canonical_order (bool) – make the order of attributes and child nodes consistent across Python versions and implementations. Useful for testing. False by default
- hints (bool) – export hint values as well. Hints improve the efficiency of read(). True by default
- as_bytes – return chunks as bytes (str in Python 2) if True. Return chunks as str (unicode in Python 2) if False. Return chunks as the default string type (str) by default
Returns: chunks of an XML string
Return type: collections.Iterable
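Yielding the output in chunks means the whole XML string never has to exist in memory at once. The generator below sketches that idea for a flat list of text children; write_chunks is a hypothetical function, and only its indent and newline parameters are modelled on the options above.

```python
from xml.sax.saxutils import escape


def write_chunks(tag, children, indent='    ', newline='\n'):
    """Yield an XML document piece by piece.

    `children` is an iterable of (tag, text) pairs; text is escaped.
    """
    yield '<{0}>'.format(tag)
    yield newline
    for child_tag, text in children:
        yield indent
        yield '<{0}>{1}</{0}>'.format(child_tag, escape(text))
        yield newline
    yield '</{0}>'.format(tag)
    yield newline


compact = ''.join(write_chunks(
    'person',
    [('name', 'Hong Minhee'), ('dob', '1988-08-04')],
    indent='',
    newline='',
))
pretty = ''.join(write_chunks('person', [('name', 'A & B')]))
```

Joining the chunks is only done here to inspect the result; a caller writing to a file would pass each chunk to f.write() as in the example above.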