Libearth is the shared common library for various Earth Reader apps. Earth Reader tries to support as many platforms as possible (e.g. web, mobile apps, desktop apps), so there is a large set of common concepts and implementations they share, like subscription lists, synchronization between several devices through cloud storage, and the crawler; libearth actually implements these.
Libearth officially supports the following Python implementations:
For environments where setuptools is not available, it has no required dependencies.
See also tox.ini file and CI builds.
This module provides commonly used codecs to parse RSS-related standard formats.
Codec to interpret boolean representation in strings e.g. 'true', 'no', and encode bool values back to string.
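To illustrate the idea, here is a minimal sketch of how such a boolean codec might behave. The class name matches the codec described above, but the constructor arguments shown (`true`, `false`) are assumptions for illustration, not necessarily the exact libearth API:

```python
class Boolean(object):
    """Sketch of a boolean codec: maps string representations
    (e.g. 'true'/'false', 'yes'/'no') to bool and back.
    The true/false parameters here are illustrative assumptions."""

    def __init__(self, true='true', false='false'):
        self.true = true
        self.false = false

    def encode(self, value):
        # Serialize a bool back to its string representation.
        return self.true if value else self.false

    def decode(self, text):
        # Interpret the string representation as a bool.
        if text == self.true:
            return True
        elif text == self.false:
            return False
        raise ValueError('invalid boolean representation: %r' % text)

codec = Boolean(true='yes', false='no')
assert codec.decode('yes') is True
assert codec.encode(False) == 'no'
```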
Codec that accepts only predefined fixed types of values:
gender = Enum(['male', 'female'])
Actually it doesn’t do any encoding or decoding; it simply validates all values from both XML and Python.
Note that values have to consist of only strings.
Parameters: values (collections.Iterable) – any iterable that yields all possible values
Codec to encode and decode integer numbers.
Codec to store datetime.datetime values to RFC 3339 format.
Parameters: prefer_utc (bool) – normalize all timezones to UTC. False by default
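For reference, encoding a tz-aware datetime into RFC 3339 form can be sketched with the standard library alone. This illustrates the format and the prefer_utc behavior described above; it is not libearth's implementation:

```python
from datetime import datetime, timedelta, timezone

def encode_rfc3339(dt, prefer_utc=False):
    """Render a tz-aware datetime in RFC 3339 form,
    e.g. 2013-08-04T12:30:00+09:00 or ...T03:30:00Z."""
    if prefer_utc:
        # Normalize to UTC first when prefer_utc is set.
        dt = dt.astimezone(timezone.utc)
    text = dt.strftime('%Y-%m-%dT%H:%M:%S')
    offset = dt.utcoffset()
    if offset == timedelta(0):
        return text + 'Z'
    sign = '+' if offset >= timedelta(0) else '-'
    hours, rest = divmod(abs(offset), timedelta(hours=1))
    minutes = rest // timedelta(minutes=1)
    return text + '%s%02d:%02d' % (sign, hours, minutes)

kst = timezone(timedelta(hours=9))
dt = datetime(2013, 8, 4, 12, 30, 0, tzinfo=kst)
print(encode_rfc3339(dt))                   # 2013-08-04T12:30:00+09:00
print(encode_rfc3339(dt, prefer_utc=True))  # 2013-08-04T03:30:00Z
```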
Codec to encode/decode datetime.datetime values to/from RFC 822 format.
This module provides several subtle things to support multiple Python versions (2.6, 2.7, 3.2, 3.3) and VM implementations (CPython, PyPy).
(bool) Whether it is IronPython or not.
(bool) Whether it is Python 3.x or not.
(bool) Whether the Python VM uses Unicode strings by default. It must be True if PY3 or IronPython.
Converts string to str in Python 2, and to bytes in Python 3 or IronPython.
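A compatibility helper like this is typically a small coercion function. The following is a sketch of the Python 3 side of the behavior (the function name follows the doc above; the exact error message is an assumption):

```python
def binary(string):
    """Sketch of a compat helper: coerce a string to the binary
    type (str on Python 2, bytes on Python 3/IronPython)."""
    if isinstance(string, bytes):
        return string                   # already binary
    if isinstance(string, str):
        return string.encode('utf-8')   # text: encode to bytes
    raise TypeError('expected a string, not ' + repr(string))

assert binary('hello') == b'hello'
assert binary(b'hello') == b'hello'
```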
(type) Type for representing binary data. str in Python 2 and bytes in Python 3.
alias of str
If filename is a text_type, encode it to binary_type according to filesystem’s default encoding.
(type, tuple) Types for file objects that have fileno().
(type) Type for text data. basestring in Python 2 and str in Python 3.
alias of basestring
Converts string to str in Python 3 or IronPython. Does nothing in Python 2.
Parameters: string (bytes, str, unicode) – a string to cast to text_type
This proxy module offers a compatibility layer between several ElementTree implementations.
It provides the following two functions:
Parse the given XML string.
Parameters: string (str, bytes, basestring) – xml string to parse
Returns: the element tree object
Parse the given chunks of XML string.
Parameters: iterable (collections.Iterable) – chunks of xml string to parse
Returns: the element tree object
Generate an XML string from the given element tree.
Parameters: tree – an element tree object to serialize
Returns: an xml string
Return type: str, bytes
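With the standard xml.etree.ElementTree backend, the round trip through these two proxy functions looks like the following sketch:

```python
import xml.etree.ElementTree as etree

def fromstring(string):
    # Parse an XML string into an element tree object.
    return etree.fromstring(string)

def tostring(tree):
    # Serialize an element tree back into an XML string (bytes).
    return etree.tostring(tree)

root = fromstring('<feed><title>Earth Reader</title></feed>')
print(root.find('title').text)  # Earth Reader
print(tostring(root))           # b'<feed><title>Earth Reader</title></feed>'
```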
Parses an XML document or fragment from a string. Returns the root node (or the result returned by a parser target).
To override the default parser with a different parser you can pass it to the parser keyword argument.
The base_url keyword argument allows setting the original base URL of the document to support relative paths when looking up external entities (DTD, XInclude, ...).
Parses an XML document from a sequence of strings. Returns the root node (or the result returned by a parser target).
To override the default parser with a different parser you can pass it to the parser keyword argument.
Python xml.sax parser implementation and ElementTree builder using CLR System.Xml.XmlReader.
(str) The reserved namespace URI for XML namespace.
System.IO.Stream implementation that takes a Python iterable and then transforms it into CLR stream.
Parameters: iterable (collections.Iterable) – a Python iterable to transform
ElementTree builder using System.Xml.XmlReader.
SAX PullReader implementation using CLR System.Xml.XmlReader.
Create a new XmlReader() parser instance.
Returns: a new parser instance
Return type: XmlReader
Get the number of CPU cores.
Returns: the number of cpu cores
Return type: numbers.Integral
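Such a helper is typically a thin wrapper over the standard library. A sketch (the fallback value is an assumption of this illustration):

```python
import multiprocessing

def cpu_count(default=1):
    """Return the number of CPU cores, falling back to default
    when the platform cannot report it."""
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return default

cores = cpu_count()
assert cores >= 1
```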
Parallel version of the builtin map(), with some differences:
Returns: a promise iterable to future results
Return type: collections.Iterable
Changed in version 0.1.1: Errored values are raised last.
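A parallel map can be sketched with concurrent.futures; libearth had to run on Python 2.6 where that module is unavailable, so this is only an illustration of the idea, not its implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(pool_size, function, iterable):
    """Sketch of a parallel map(): evaluate function over iterable
    using pool_size worker threads, yielding results lazily.
    Exceptions are raised when the result is consumed, not when
    the task is submitted."""
    with ThreadPoolExecutor(max_workers=pool_size) as executor:
        for result in executor.map(function, iterable):
            yield result

squares = list(parallel_map(4, lambda x: x * x, range(5)))
print(squares)  # [0, 1, 4, 9, 16]
```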
SAX parser interface which provides a similar but slightly less powerful interface than IncrementalParser.
IncrementalParser can be fed bytes of arbitrary length, while this interface cannot control how many bytes are fed at a time.
This method is called when the entire XML document has been passed to the parser through the feed method, to notify the parser that there are no more data. This allows the parser to do the final checks on the document and empty the internal data buffer.
The parser will not be ready to parse another document until the reset method has been called.
close() may raise SAXException.
Raises xml.sax.SAXException: when something goes wrong
This method makes the parser parse the next node, emitting the corresponding events.
feed() may raise SAXException.
Returns: whether the stream buffer is not empty yet
Return type: bool
Raises xml.sax.SAXException: when something goes wrong
This method is called by the parse implementation to allow the SAX 2.0 driver to prepare itself for parsing.
Parameters: iterable (collections.Iterable) – iterable of bytes
This method is called after close has been called to reset the parser so that it is ready to parse new documents. The results of calling parse or feed after close without calling reset are undefined.
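The standard expat driver already implements the IncrementalParser interface, so the feed/close life cycle described above can be demonstrated with it directly:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class NameCollector(ContentHandler):
    # Record element names in document order as they are opened.
    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

handler = NameCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Feed the document chunk by chunk instead of all at once.
for chunk in ['<feed><entry>', '</entry>', '</feed>']:
    parser.feed(chunk)
parser.close()  # final checks; reset() is needed before reuse

print(handler.names)  # ['feed', 'entry']
```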
Crawl feeds.
Error raised when crawling the given url fails.
Crawl feeds in the feed list using threads.
Parameters: feeds – feeds
Returns: set of pairs of (libearth.feed.Feed, crawler hint)
Return type: collections.Iterable
libearth internally stores archive data in Atom format. It is not exactly a complete implementation of RFC 4287, but a subset covering most of it. Since it is intended for internal representation rather than crawling, it does not follow the robustness principle; it simply assumes that all stored data are valid and well-formed.
(str) The XML namespace name used for Earth Reader Mark metadata.
Category element defined in RFC 4287 (section 4.2.2).
(str) The optional human-readable label for display in end-user applications. It corresponds to label attribute of RFC 4287 (section 4.2.2.3).
(str) The URI that identifies a categorization scheme. It corresponds to scheme attribute of RFC 4287 (section 4.2.2.2).
See also
Content construct defined in RFC 4287 (section 4.1.3).
(re.RegexObject) The regular expression pattern that matches with valid MIME type strings.
(collections.Mapping) The mapping of type string (e.g. 'text') to the corresponding MIME type (e.g. text/plain).
Represent an individual entry, acting as a container for metadata and data associated with the entry. It corresponds to atom:entry element of RFC 4287 (section 4.1.2).
(Content) It either contains or links to the content of the entry.
It corresponds to atom:content element of RFC 4287 (section 4.1.3).
(datetime.datetime) The tz-aware datetime indicating an instant in time associated with an event early in the life cycle of the entry. Typically, published_at will be associated with the initial creation or first availability of the resource. It corresponds to atom:published element of RFC 4287 (section 4.2.9).
(Source) If an entry is copied from one feed into another feed, then the source feed’s metadata may be preserved within the copied entry by adding source if it is not already present in the entry, and including some or all of the source feed’s metadata as the source‘s data.
It is designed to allow the aggregation of entries from different feeds while retaining information about an entry’s source feed.
It corresponds to atom:source element of RFC 4287 (section 4.2.11).
Atom feed document, acting as a container for metadata and data associated with the feed.
It corresponds to atom:feed element of RFC 4287 (section 4.1.1).
Identify the agent used to generate a feed, for debugging and other purposes. It corresponds to atom:generator element of RFC 4287 (section 4.2.4).
Link element defined in RFC 4287 (section 4.2.7).
(numbers.Integral) The optional hint for the length of the linked content in octets. It corresponds to length attribute of RFC 4287 (section 4.2.7.6).
(str) The language of the linked content. It corresponds to hreflang attribute of RFC 4287 (section 4.2.7.4).
(str) The optional hint for the MIME media type of the linked content. It corresponds to type attribute of RFC 4287 (section 4.2.7.3).
(str) The relation type of the link. It corresponds to rel attribute of RFC 4287 (section 4.2.7.2).
Element list mixin specialized for Link.
Filter links by their mimetype e.g.:
links.filter_by_mimetype('text/html')
pattern can include wildcards (*) as well e.g.:
links.filter_by_mimetype('application/xml+*')
Parameters: pattern (str) – the mimetype pattern to filter
Returns: the filtered links
Return type: LinkList
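The wildcard semantics of the pattern can be sketched with the standard fnmatch module. This is an illustration of the matching behavior, not libearth's code, and the link dictionaries here are stand-ins for Link objects:

```python
from fnmatch import fnmatch

links = [
    {'mimetype': 'text/html', 'uri': 'http://example.com/'},
    {'mimetype': 'application/atom+xml', 'uri': 'http://example.com/atom'},
    {'mimetype': 'application/rss+xml', 'uri': 'http://example.com/rss'},
]

def filter_by_mimetype(links, pattern):
    # Keep only links whose mimetype matches the wildcard pattern.
    return [link for link in links if fnmatch(link['mimetype'], pattern)]

feeds = filter_by_mimetype(links, 'application/*+xml')
print([link['uri'] for link in feeds])
# ['http://example.com/atom', 'http://example.com/rss']
```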
Represent whether the entry is read, starred, or tagged by user. It’s not a part of the RFC 4287 Atom standard, but an extension for Earth Reader.
(bool) Whether it’s marked or not.
(datetime.datetime) Updated time.
Common metadata shared by Source, Entry, and Feed.
(collections.MutableSequence) The list of Person objects which indicates the author of the entry or feed. It corresponds to atom:author element of RFC 4287 (section 4.2.1).
(collections.MutableSequence) The list of Category objects that conveys information about categories associated with an entry or feed. It corresponds to atom:category element of RFC 4287 (section 4.2.2).
(collections.MutableSequence) The list of Person objects which indicates a person or other entity who contributed to the entry or feed. It corresponds to atom:contributor element of RFC 4287 (section 4.2.3).
(str) The URI that conveys a permanent, universally unique identifier for an entry or feed. It corresponds to atom:id element of RFC 4287 (section 4.2.6).
(LinkList) The list of Link objects that define a reference from an entry or feed to a web resource. It corresponds to atom:link element of RFC 4287 (section 4.2.7).
(Text) The text field that conveys information about rights held in and of an entry or feed. It corresponds to atom:rights element of RFC 4287 (section 4.2.10).
(Text) The human-readable title for an entry or feed. It corresponds to atom:title element of RFC 4287 (section 4.2.14).
(datetime.datetime) The tz-aware datetime indicating the most recent instant in time when the entry was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed updated_at value. It corresponds to atom:updated element of RFC 4287 (section 4.2.15).
Person construct defined in RFC 4287 (section 3.2).
(str) The optional email address associated with the person. It corresponds to atom:email element of RFC 4287 (section 3.2.3).
All metadata for Feed excepting Feed.entries. It corresponds to atom:source element of RFC 4287 (section 4.2.11).
(Generator) Identify the agent used to generate a feed, for debugging and other purposes. It corresponds to atom:generator element of RFC 4287 (section 4.2.4).
(str) URI that identifies an image that provides iconic visual identification for a feed. It corresponds to atom:icon element of RFC 4287 (section 4.2.5).
Text construct defined in RFC 4287 (section 3.1).
(str) The secure HTML string of the text. If it’s a plain text, this becomes an entity-escaped HTML string (for example, '<Hello>' becomes '&lt;Hello&gt;'), and if it’s an HTML text, the value is sanitized (for example, '<script>alert(1);</script><p>Hello</p>' becomes '<p>Hello</p>').
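The plain-text case corresponds to standard HTML entity escaping, which the standard library provides directly:

```python
# html.escape replaces <, >, and & with entity references,
# which is exactly the transformation plain text needs before
# it can be embedded in HTML safely.
from html import escape

plain = '<Hello>'
print(escape(plain))  # &lt;Hello&gt;
```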
Parsing Atom feeds. The Atom specification is RFC 4287.
(str) The XML namespace for Atom format.
(str) The XML namespace for the predefined xml: prefix.
Atom parser. It parses the Atom XML and returns the feed data as internal representation.
Returns: a pair of (Feed, crawler hint)
This module provides functions to autodiscover feed urls in a document.
(str) The MIME type of Atom format.
(str) The MIME type of RSS 2.0 format.
(collections.Mapping) The mapping table of feed types
Parse the given HTML and try finding the actual feed urls from it.
Namedtuple which is a pair of type and url.
Alias for field number 0
Alias for field number 1
Exception raised when feed url cannot be found in html.
If the given url refers to an actual feed, it returns the given url without any change.
If the given url is a url of an ordinary web page (i.e. text/html), it finds the urls of the corresponding feed. It returns feed urls in feed types’ lexicographical order.
If autodiscovery fails, it raises FeedUrlNotFoundError.
Returns: list of FeedLink objects
Return type: collections.MutableSequence
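Finding candidate feed urls in an HTML page boils down to scanning <link rel="alternate"> tags for the feed MIME types. A stdlib-only sketch (the class and tuple layout are illustrative, not libearth's internals):

```python
from html.parser import HTMLParser

FEED_TYPES = ('application/atom+xml', 'application/rss+xml')

class FeedLinkExtractor(HTMLParser):
    # Collect (type, url) pairs from <link rel="alternate"> tags.
    def __init__(self):
        HTMLParser.__init__(self)
        self.feed_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'link' and attrs.get('rel') == 'alternate' and
                attrs.get('type') in FEED_TYPES):
            self.feed_links.append((attrs['type'], attrs['href']))

html = '''<html><head>
<link rel="alternate" type="application/atom+xml" href="/feed.xml">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>'''
extractor = FeedLinkExtractor()
extractor.feed(html)
print(extractor.feed_links)  # [('application/atom+xml', '/feed.xml')]
```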
Guess the syndication format of an arbitrary document.
Parameters: document (str, bytes) – document string to guess
Returns: a function capable of parsing the given document
Return type: collections.Callable
Changed in version 0.2.0: The function was in libearth.parser.heuristic module (which is removed now) before 0.2.0, but now it’s moved to libearth.parser.autodiscovery.
Parsing RSS 2.0 feed.
Parse RSS 2.0 XML and translate it into Atom.
To make the feed data valid in Atom format, id and link[rel=self] fields would become the url of the feed.
If pubDate is not present, updated field will be from the latest entry’s updated time, or the time it’s crawled instead.
Returns: a pair of (Feed, crawler hint)
Repository abstracts the storage backend, e.g. the filesystem. There are platforms that have no way to directly access the file system, e.g. iOS, and in such cases the concept of a repository lets you store data directly in Dropbox or Google Drive instead of the filesystem. However, in most cases we will simply use FileSystemRepository, even if data are synchronized using Dropbox or rsync.
In order to make the repository highly configurable, it provides a way to look up and instantiate the repository from a url. For example, the following url will load FileSystemRepository with its path set to /home/dahlia/.earthreader/:
file:///home/dahlia/.earthreader/
For extensibility every repository class has to implement from_url() and to_url() methods, and register it as an entry point of libearth.repositories group e.g.:
[libearth.repositories]
file = libearth.repository:FileSystemRepository
Note that the entry point name (file in the above example) becomes the url scheme to lookup the corresponding repository class (libearth.repository.FileSystemRepository in the above example).
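The lookup itself amounts to mapping the url scheme to a repository class. A simplified sketch without setuptools entry points (the registry dict stands in for the entry point group; the FileSystemRepository shown here is a stub, not libearth's class):

```python
from urllib.parse import urlparse

class FileSystemRepository(object):
    """Stub repository that remembers its root path."""
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_url(cls, url):
        # url is a parsed url tuple; use its path component.
        return cls(url.path)

# A static table standing in for the libearth.repositories
# entry point group: scheme name -> repository class.
REPOSITORY_SCHEMES = {'file': FileSystemRepository}

def from_url(url):
    parsed = urlparse(url)
    repository_class = REPOSITORY_SCHEMES[parsed.scheme]
    return repository_class.from_url(parsed)

repo = from_url('file:///home/dahlia/.earthreader/')
print(repo.path)  # /home/dahlia/.earthreader/
```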
Read a file through Iterator protocol, with automatic closing of the file when it ends.
Raised when a given path does not exist.
Builtin implementation of Repository interface which uses the ordinary file system.
Raised when a given path is not a directory.
Repository interface agnostic to its underlying storage implementation. Stage objects can deal with documents to be stored using the interface.
Every content in repositories is accessible using keys. It actually abstracts out “filenames” in “file systems”, hence keys share the common concepts with filenames. Keys are hierarchical, like file paths, so consists of multiple sequential strings e.g. ['dir', 'subdir', 'key']. You can list() all subkeys in the upper key as well e.g.:
repository.list(['dir', 'subdir'])
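An in-memory toy repository makes the key semantics concrete. This is a sketch of the interface described above, not a libearth class:

```python
class MemoryRepository(object):
    """Toy Repository keeping contents in a flat dict keyed by
    tuples, to illustrate hierarchical keys, exists(), and list()."""

    def __init__(self):
        self.files = {}

    def write(self, key, iterable):
        # Store the concatenated byte chunks under the key.
        self.files[tuple(key)] = b''.join(iterable)

    def exists(self, key):
        key = tuple(key)
        # A key exists if it is a file or a prefix of one (a "directory").
        return (key in self.files or
                any(k[:len(key)] == key for k in self.files))

    def list(self, key):
        key = tuple(key)
        # Subkeys are the next path component under the given key.
        return set(k[len(key)] for k in self.files
                   if len(k) > len(key) and k[:len(key)] == key)

repo = MemoryRepository()
repo.write(['dir', 'subdir', 'key'], [b'data'])
repo.write(['dir', 'other'], [b'more'])
print(repo.list(['dir']))           # {'subdir', 'other'} (in some order)
print(repo.list(['dir', 'subdir']))  # {'key'}
```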
Return whether the key exists or not. It returns False if it doesn’t exist instead of raising RepositoryKeyError.
Parameters: key (collections.Sequence) – the key to find whether it exists
Returns: True only if the given key exists, or False if it does not
Return type: bool
Note
Every subclass of Repository has to override exists() method to implement details.
Create a new instance of the repository from the given url. It’s used for configuring the repository in plain text e.g. *.ini.
Note
Every subclass of Repository has to override from_url() static/class method to implement details.
Parameters: url (urllib.parse.ParseResult) – the parsed url tuple
Returns: a new repository instance
Return type: Repository
Raises ValueError: when the given url is invalid
List all subkeys in the key.
Parameters: key (collections.Sequence) – the incomplete key that might have subkeys
Returns: the set of subkeys (set of strings, not set of string lists)
Return type: collections.Set
Raises RepositoryKeyError: the key cannot be found in the repository, or it’s not a directory
Note
Every subclass of Repository has to override list() method to implement details.
Read the content from the key.
Parameters: key (collections.Sequence) – the key which stores the content to read
Returns: byte string chunks
Return type: collections.Iterable
Raises RepositoryKeyError: the key cannot be found in the repository, or it’s not a file
Note
Every subclass of Repository has to override read() method to implement details.
Generate a url that from_url() can accept. It’s used for configuring the repository in plain text e.g. *.ini. The URL scheme is determined by the caller and passed as an argument.
Note
Every subclass of Repository has to override to_url() method to implement details.
Parameters: scheme – a determined url scheme
Returns: a url that from_url() can accept
Return type: str
Write the iterable into the key.
Note
Every subclass of Repository has to override write() method to implement details.
Exception raised when the requested key cannot be found in the repository.
(collections.Sequence) The requested key.
Load the repository instance from the given configuration url.
Note
If setuptools is not installed it will only support file:// scheme and FileSystemRepository.
Parameters: url (str, urllib.parse.ParseResult) – a repository configuration url
Returns: the loaded repository instance
HTML parser that is internally used by sanitize_html() function.
(collections.Set) The set of disallowed URI schemes e.g. javascript:.
(re.RegexObject) The regular expression pattern that matches disallowed CSS properties.
HTML parser that is internally used by clean_html() function.
Strip all markup tags from the html string. That is, it simply turns the given html document into plain text.
Parameters: html (str) – html string to clean
Returns: cleaned plain text
Return type: str
Sanitize the given html string. It removes the following tags and attributes that are neither secure nor useful for RSS reader layout:
Parameters: html (str) – html string to sanitize
Returns: sanitized html string
Return type: str
There are two well-known ways to parse XML:
Pros and cons between these two ways are obvious, but there could be another way to parse XML: mix them.
The basic idea of this pulling DOM parser (which this module implements) is that the parser consumes the stream just in time, when you actually reach the child node. This relies on one assumption: the parsed XML has a schema. If the document is schema-free, this heuristic approach loses most of its efficiency.
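The standard iterparse() demonstrates this middle ground: events are pulled from the stream only as the caller consumes them, so stopping early leaves the rest of the document unparsed:

```python
import io
import xml.etree.ElementTree as etree

xml_doc = b'''<person>
  <name>Hong Minhee</name>
  <url>http://dahlia.kr/</url>
  <dob>1988-08-04</dob>
</person>'''

# iterparse() reads the stream lazily; breaking out of the loop
# means the remaining nodes are never parsed.
for event, element in etree.iterparse(io.BytesIO(xml_doc),
                                      events=('end',)):
    if element.tag == 'name':
        print(element.text)  # Hong Minhee
        break                # <url> and <dob> stay unread
```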
So the parser has to know the schema of the XML document it will parse, and we can declare the schema by defining classes. It’s something like an ORM for XML. For example, suppose there is a small XML document:
<?xml version="1.0"?>
<person version="1.0">
<name>Hong Minhee</name>
<url>http://dahlia.kr/</url>
<url>https://github.com/dahlia</url>
<url>https://bitbucket.org/dahlia</url>
<dob>1988-08-04</dob>
</person>
You can declare the schema for this like the following class definition:
class Person(DocumentElement):
__tag__ = 'person'
format_version = Attribute('version')
name = Text('name')
url = Child('url', URL, multiple=True)
dob = Child('dob', Date)
(collections.Sequence) The list of xml.sax parser implementations to try to import.
(str) The XML namespace name used for schema metadata.
Declare possible element attributes as a descriptor.
Changed in version 0.2.0: The default option now accepts only callable objects. Before 0.2.0, default was not a function but a value which was simply used as it is.
(collections.Callable) The function that returns the default value when the attribute is not present. The function takes an argument which is an Element instance.
Changed in version 0.2.0: It now accepts only callable objects. Before 0.2.0, the default attribute was not a function but a value which was simply used as it is.
(bool) Whether it is required for the element.
Declare a possible child element as a descriptor.
In order to have Child of the element type which is not defined yet (or self-referential) pass the class name of the element type to contain. The name will be lazily evaluated e.g.:
class Person(Element):
'''Everyone can have their children, that also are a Person.'''
children = Child('child', 'Person', multiple=True)
Abstract base class for codecs to serialize Python values to be stored in XML and deserialize XML texts to Python values.
In most cases encoding and decoding are implementation details of a well-defined format, so these two functions can be paired. The interface relies on that idea.
To implement a codec, you have to subclass Codec and override a pair of methods: encode() and decode().
Codec objects are acceptable by Attribute, Text, and Content (all they subclass CodecDescriptor).
Mixin class for descriptors that provide decoder() and encoder().
Attribute, Content and Text can take encoder and decoder functions for them. It’s used for encoding from Python values to XML string and decoding raw values from XML to natural Python representations.
It can take a codec, or separate encode and decode functions. (Of course, all of them can be present at the same time.) In most cases you’ll need only the codec parameter, in which encoder and decoder are coupled:
Text('dob', Rfc3339(prefer_utc=True))
Encoders can be specified using encoder parameter of descriptor’s constructor, or encoder() decorator.
Decoders can be specified using decoder parameter of descriptor’s constructor, or decoder() decorator:
class Person(DocumentElement):
__tag__ = 'person'
format_version = Attribute('version')
name = Text('name')
url = Child('url', URL, multiple=True)
dob = Text('dob',
           encoder=datetime.date.isoformat,
           decoder=lambda s: datetime.datetime.strptime(s, '%Y-%m-%d').date())
@format_version.encoder
def format_version(self, value):
return '.'.join(map(str, value))
@format_version.decoder
def format_version(self, value):
return tuple(map(int, value.split('.')))
Decode the given text as it’s programmed.
Returns: decoded value
Note
Internal method.
Decorator which sets the decoder to the decorated function:
import datetime
class Person(DocumentElement):
'''Person.dob will be a datetime.date instance.'''
__tag__ = 'person'
dob = Text('dob')
@dob.decoder
def dob(self, dob_text):
return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()
>>> p = Person('<person><dob>1987-07-26</dob></person>')
>>> p.dob
datetime.date(1987, 7, 26)
If it’s applied multiple times, all decorated functions are piped in order:
class Person(Element):
'''Person.age will be an integer.'''
age = Text('dob', decoder=lambda text: text.strip())
@age.decoder
def age(self, dob_text):
return datetime.datetime.strptime(dob_text, '%Y-%m-%d').date()
@age.decoder
def age(self, dob):
now = datetime.date.today()
d = now.month < dob.month or (now.month == dob.month and
now.day < dob.day)
return now.year - dob.year - d
>>> p = Person('<person>\n\t<dob>\n\t\t1987-07-26\n\t</dob>\n</person>')
>>> p.age
26
>>> datetime.date.today()
datetime.date(2013, 7, 30)
Note
This creates a copy of the descriptor instance rather than manipulate itself in-place.
Decorator which sets the encoder to the decorated function:
import datetime
class Person(DocumentElement):
'''Person.dob will be written to ISO 8601 format'''
__tag__ = 'person'
dob = Text('dob')
@dob.encoder
def dob(self, dob):
if not isinstance(dob, datetime.date):
raise TypeError('expected datetime.date')
return dob.strftime('%Y-%m-%d')
>>> isinstance(p, Person)
True
>>> p.dob
datetime.date(1987, 7, 26)
>>> ''.join(write(p, indent='', newline=''))
'<person><dob>1987-07-26</dob></person>'
If it’s applied multiple times, all decorated functions are piped in order:
class Person(Element):
'''Person.email will have mailto: prefix when it's written
to XML.
'''
email = Text('email', encoder=lambda email: 'mailto:' + email)
@email.encoder
def email(self, email):
return email.strip()
@email.encoder
def email(self, email):
login, host = email.split('@', 1)
return login + '@' + host.lower()
>>> isinstance(p, Person)
True
>>> p.email
' earthreader@librelist.com '
>>> ''.join(write(p, indent='', newline=''))
'<person><email>mailto:earthreader@librelist.com</email></person>'
Note
This creates a copy of the descriptor instance rather than manipulate itself in-place.
Raised when encoding/decoding between Python values and XML data goes wrong.
Declare possible text nodes as a descriptor.
Read raw value from XML, decode it, and then set the attribute for content of the given element to the decoded value.
Note
Internal method.
Event handler implementation for SAX parser.
It maintains a stack that contains parsing contexts: which element was opened last, which descriptor is associated with the element, and the buffer for chunks of content characters the element has. Every context is represented as the namedtuple ParserContext.
Each time its events (startElement(), characters(), and endElement()) are called, it forwards the data to the associated descriptor. Descriptor subtypes implement start_element() method and end_element().
Raised when decoding XML data to Python values goes wrong.
Abstract base class for Child and Text.
Abstract method that is invoked when the parser meets an end of an element related to the descriptor. It will be called by ContentHandler.
(bool) Whether it can be zero or more for the element. If it’s True, required has to be False.
(bool) Whether it is required for the element. If it’s True, multiple has to be False.
(collections.Callable) An optional function to be used for sorting multiple elements. It has to take an element and return a value for the sort key. It is the same as the key option of the sorted() built-in function.
It’s available only when multiple is True.
Use sort_reverse for descending order.
Note
It doesn’t guarantee that all elements are sorted at runtime, but all elements become sorted when they are written using the write() function.
(bool) Whether to reverse elements when they become sorted. It is the same as the reverse option of the sorted() built-in function.
It’s available only when sort_key is present.
Abstract method that is invoked when the parser meets a start of an element related to the descriptor. It will be called by ContentHandler.
Returns: a value to reserve. It will be passed to the reserved_value parameter of end_element().
Error raised when a schema has more than one descriptor for the same attribute, the same child element, or the text node.
The root element of the document.
(str) Every DocumentElement subtype has to define this attribute to the root tag name.
(str) A DocumentElement subtype may define this attribute to the XML namespace of the document element.
Represent an element in XML document.
It provides the default constructor which takes keywords and initializes the attributes by given keyword arguments. For example, the following code that uses the default constructor:
assert issubclass(Person, Element)
author = Person(
name='Hong Minhee',
url='http://dahlia.kr/'
)
is equivalent to the following code:
author = Person()
author.name = 'Hong Minhee'
author.url = 'http://dahlia.kr/'
Cast a value which isn’t an instance of the element type to the element type. It’s useful when a boxed element type could be more naturally represented using a builtin type.
For example, Mark could be represented as a boolean value, and Text also could be represented as a string.
The following example shows how the element type can be automatically casted from string by implementing __coerce_from__() class method:
@classmethod
def __coerce_from__(cls, value):
if isinstance(value, str):
return Text(value=value)
raise TypeError('expected a string or Text')
Identify the entity object. It returns the entity object itself by default, but should be overridden.
Returns: any value to identify the entity object
Merge two entities (self and other). It can return one of the two, or even a new entity object. This method is used by Session objects to merge conflicts between concurrent updates.
Parameters: other (Element) – the other entity to merge. It is guaranteed to come from the older session (note that this doesn’t mean the entity is older than self, but that its session’s last update is older)
Returns: one of the two, or even a new entity object that merges the two entities
Return type: Element
Note
The default implementation simply returns self. That means the entity of the newer session will always win unless the method is overridden.
List-like object to represent multiple children. It makes the parser lazily consume the buffer when an element at a particular offset is requested.
You can extend methods or properties for a particular element type using element_list_for() class decorator e.g.:
@element_list_for(Link)
class LinkList(collections.Sequence):
'''Specialized ElementList for Link elements.'''
def filter_by_mimetype(self, mimetype):
'''Filter links by their mimetype.'''
return [link for link in self if link.mimetype == mimetype]
Extended methods/properties can be used for element lists for the type:
assert isinstance(feed.links, LinkList)
assert isinstance(feed.links, ElementList)
feed.links.filter_by_mimetype('text/html')
Consume the buffer for the parser. It returns a generator, so the caller can stop it using a break statement.
Note
Internal method.
Register specialized collections.Sequence type for a particular value_type.
An imperative version of the element_list_for() class decorator.
(collections.MutableMapping) The internal table for specialized subtypes used by register_specialized_type() method and element_list_for() class decorator.
Raised when encoding Python values into XML data goes wrong.
Raised when an element is invalid according to the schema.
Error raised when a schema definition has logical errors.
Descriptor that declares a possible child element that only consists of character data. All other attributes and child nodes are ignored.
Completely load the given element.
Parameters: element (Element) – an element loaded by read()
Class decorator which registers specialized ElementList subclass for a particular value_type e.g.:
@element_list_for(Link)
class LinkList(collections.Sequence):
'''Specialized ElementList for Link elements.'''
def filter_by_mimetype(self, mimetype):
'''Filter links by their mimetype.'''
return [link for link in self if link.mimetype == mimetype]
Parameters: value_type (type) – a particular element type that specialized_type would be used for instead of the default ElementList class. It has to be a subtype of Element.
Index descriptors of the given element_type to make them easy to look up by their identifiers (pairs of XML namespace URI and tag name).
Parameters: element_type (type) – a subtype of Element to index its descriptors
Note
Internal function.
Get the dictionary of Attribute descriptors of the given element_type.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a dictionary of attribute identifiers (pairs of XML namespace URI and XML attribute name) to pairs of instance attribute name and associated Attribute descriptor
Return type: collections.Mapping
Note
Internal function.
Get the dictionary of Descriptor objects of the given element_type.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a dictionary of child node identifiers (pairs of XML namespace URI and tag name) to pairs of instance attribute name and associated Descriptor
Return type: collections.Mapping
Note
Internal function.
Get the Content descriptor of the given element_type.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a pair of instance attribute name and associated Content descriptor
Return type: tuple
Note
Internal function.
Get the set of XML namespaces used in the given element_type, recursively including all child elements.
Parameters: element_type (type) – a subtype of Element to inspect
Returns: a set of URI strings of all used XML namespaces
Return type: collections.Set
Note
Internal function.
Return whether the given element is not completely loaded by read() yet.
Parameters: element (Element) – an element
Returns: True if the given element is partially loaded
Return type: bool
Initialize a document in read mode from the given iterable of XML strings.
with open('doc.xml', 'rb') as f:
read(Person, f)
The returned document element is not fully read but partially loaded into memory; the rest is lazily (and eventually) loaded when actually needed.
Returns: initialized document element in read mode
Validate the given element according to the schema.
from libearth.schema import IntegrityError, validate
try:
validate(element)
except IntegrityError:
print('the element {0!r} is invalid!'.format(element))
Returns: True if the element is valid; False if the element is invalid and the raise_error option is False
Raises IntegrityError: when the element is invalid and the raise_error option is True
Write the given document to XML string. The return value is an iterator that yields chunks of an XML string.
with open('doc.xml', 'w') as f:
for chunk in write(document):
f.write(chunk)
Returns: chunks of an XML string
Return type: collections.Iterable
This module provides merging facilities to avoid conflict between concurrent updates of the same document/entity from different devices (installations). There are several concepts here.
Session abstracts installations on devices. For example, if you have a laptop, a tablet, and a mobile phone, and two apps are installed on the laptop, then there have to be four sessions: laptop-1, laptop-2, tablet-1, and phone-1. You can think of it as a branch if you are familiar with DVCS.
Revision abstracts timestamps of updated time. An important thing is that it preserves its session as well.
Base revisions (MergeableDocumentElement.__base_revisions__) show what revisions the current revision is built on top of. In other words, what revisions were merged into the current revision. RevisionSet is a dictionary-like data structure to represent them.
(str) The XML namespace name used for session metadata.
Document element which is mergeable using Session.
The named tuple type of (Session, datetime.datetime) pair.
Alias for field number 0
Alias for field number 1
Codec to encode/decode Revision pairs.
>>> from libearth.tz import utc
>>> session = Session('test-identifier')
>>> updated_at = datetime.datetime(2013, 9, 22, 3, 43, 40, tzinfo=utc)
>>> rev = Revision(session, updated_at)
>>> RevisionCodec().encode(rev)
'test-identifier 2013-09-22T03:43:40Z'
(Rfc3339) The internally used codec to encode Revision.updated_at time to RFC 3339 format.
SAX content handler that picks session metadata (__revision__ and __base_revisions__) from the given document element.
Parsed results go to revision and base_revisions.
Used by parse_revision().
(bool) Represents whether the parsing is complete.
Set of Revision pairs. It provides dictionary-like mapping protocol.
Parameters: revisions (collections.Iterable) – the iterable of (Session, datetime.datetime) pairs
Find whether the given revision is already merged to the revision set. In other words, return True if the revision doesn’t have to be merged to the revision set anymore.
Parameters: revision (Revision) – the revision to find whether it has to be merged or not
Returns: True if the revision is included in the revision set, or False
Return type: bool
Make a copy of the set.
Returns: a new equivalent set
Return type: RevisionSet
The list of (Session, datetime.datetime) pairs.
Returns: the list of Revision instances
Return type: collections.ItemsView
Merge two or more RevisionSets. The latest time remains for the same session.
Parameters: *sets – one or more RevisionSet objects to merge
Returns: the merged set
Return type: RevisionSet
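The latest-time-wins merge rule described above can be modeled in a few lines of plain Python. This is a simplified sketch using plain dicts, not the actual RevisionSet implementation:

```python
from datetime import datetime


def merge_revision_sets(*sets):
    """Keep the latest updated_at per session (simplified model of merge())."""
    merged = {}
    for revision_set in sets:
        for session, updated_at in revision_set.items():
            # a newer timestamp for the same session wins
            if session not in merged or merged[session] < updated_at:
                merged[session] = updated_at
    return merged


a = {'laptop-1': datetime(2013, 9, 22, 16, 58)}
b = {'laptop-1': datetime(2013, 9, 22, 17, 30),
     'phone-1': datetime(2013, 9, 22, 16, 59)}
merged = merge_revision_sets(a, b)
```

Here laptop-1 keeps its later 17:30 timestamp while phone-1 is carried over unchanged.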
Codec to encode/decode multiple Revision pairs.
>>> from datetime import datetime
>>> from libearth.tz import utc
>>> revs = RevisionSet([
... (Session('a'), datetime(2013, 9, 22, 16, 58, 57, tzinfo=utc)),
... (Session('b'), datetime(2013, 9, 22, 16, 59, 30, tzinfo=utc)),
... (Session('c'), datetime(2013, 9, 22, 17, 0, 30, tzinfo=utc))
... ])
>>> encoded = RevisionSetCodec().encode(revs)
>>> encoded
'c 2013-09-22T17:00:30Z,\nb 2013-09-22T16:59:30Z,\na 2013-09-22T16:58:57Z'
>>> RevisionSetCodec().decode(encoded)
libearth.session.RevisionSet([
Revision(session=libearth.session.Session('b'),
updated_at=datetime.datetime(2013, 9, 22, 16, 59, 30,
tzinfo=libearth.tz.Utc())),
Revision(session=libearth.session.Session('c'),
updated_at=datetime.datetime(2013, 9, 22, 17, 0, 30,
tzinfo=libearth.tz.Utc())),
Revision(session=libearth.session.Session('a'),
updated_at=datetime.datetime(2013, 9, 22, 16, 58, 57,
tzinfo=libearth.tz.Utc()))
])
(re.RegexObject) The regular expression pattern that matches to separator substrings between revision pairs.
The unit of device (more abstractly, installation) that updates the same document (e.g. Feed). Every session must have its own unique identifier to avoid conflict between concurrent updates from different sessions.
Parameters: identifier (str) – the unique identifier. Automatically generated using uuid if not present
(re.RegexObject) The regular expression pattern that matches to allowed identifiers.
(str) The session identifier. It has to be distinguishable from other devices/apps, but consistent for the same device/app.
(collections.MutableMapping) The pool of interned sessions. It’s for maintaining single sessions for the same identifiers.
Merge the given two documents and return the new merged document. The given documents are not manipulated in place. The two documents must be of the same type.
Pull the document (of possibly other session) to the current session.
Parameters: document (MergeableDocumentElement) – the document to pull from the possibly other session to the current session
Returns: the clone of the given document with the replaced __revision__. Note that the Revision.updated_at value won’t be revised. It could be the same object as the given document if the session is the same
Return type: MergeableDocumentElement
Mark the given document as the latest revision of the current session.
Parameters: document (MergeableDocumentElement) – mergeable document to mark
Check the type of the given pair and raise an error unless it’s a valid revision pair (Session, datetime.datetime).
Returns: the revision pair
Return type: Revision, collections.Sequence
Efficiently parse only __revision__ and __base_revisions__ from the given iterable which contains chunks of XML. It reads only the head of the given document, and the iterable will not be completely consumed in most cases.
Note that it doesn’t validate the document.
Parameters: iterable (collections.Iterable) – chunks of bytes which contain a MergeableDocumentElement element
Returns: a pair of (__revision__, __base_revisions__). It might be None if the document is not stamped
Return type: collections.Sequence
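The head-only reading strategy can be illustrated with a small SAX sketch. The handler name and the s:revision attribute are hypothetical; the point is only to demonstrate aborting the parse as soon as the metadata at the head of the document has been seen, leaving the rest of the iterable unconsumed:

```python
import xml.sax
import xml.sax.handler


class RootAttrsHandler(xml.sax.handler.ContentHandler):
    """Grab the root element's attributes, then abort parsing."""
    def __init__(self):
        self.attrs = None

    def startElement(self, name, attrs):
        self.attrs = dict(attrs)
        # the head is enough; abort instead of parsing the whole document
        raise xml.sax.SAXException('done')


def read_head(chunks):
    handler = RootAttrsHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    try:
        for chunk in chunks:
            parser.feed(chunk)  # incremental: feed one chunk at a time
    except xml.sax.SAXException:
        pass  # we aborted on purpose once the root element was seen
    return handler.attrs


attrs = read_head([b'<feed s:revision="a 2013-09-22T03:43:40Z">',
                   b'<entry>this part is never parsed</entry></feed>'])
```

Only the first chunk is actually parsed; the second never reaches the parser's handlers.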
Stage is a concept similar to Git’s. It’s a unit of updates, so every change to the repository should be done through a stage.
It also does more than Git’s stage: Route. The routing system hides how documents are stored in the repository, and provides a natural object-mapping interface instead.
Stage also provides transactions. All operations on staged documents should be done within a transaction. You can open and close a transaction using with statement e.g.:
with stage:
subs = stage.subscriptions
stage.subscriptions = some_operation(subs)
A transaction will merge all simultaneous updates when it’s committed, so you can easily achieve thread safety using transactions.
Note, however, that it doesn’t guarantee data integrity between multiple processes, so you have to use different session identifiers when there are multiple processes.
Base stage class that routes nothing yet. It should be inherited to route document types. See also Route class.
It’s a context manager, so it can be used with the with statement. The context maintains a transaction, which is required for all operations related to the stage:
with stage:
v = stage.some_value
stage.some_value = operate(v)
If no ongoing transaction is present while an operation requires one, it will raise TransactionError.
(collections.Sequence) The repository key of the directory where the session list is stored.
Get the current ongoing transaction. If no transaction has begun yet, it raises TransactionError.
Returns: the dirty buffer that should be written when the transaction is committed
Return type: DirtyBuffer
Raises TransactionError: if no transaction has begun yet
Read a document of document_type by the given key in the staged repository.
Returns: found document instance
Raises libearth.repository.RepositoryKeyError: when the key cannot be found
Note
This method is intended to be internal. Use routed properties rather than this. See also Route.
(Repository) The staged repository.
(collections.Set) List all sessions associated with the repository. It includes the session of the current stage.
Touch the latest staged time of the current session into the repository.
Note
This method is intended to be internal.
(collections.MutableMapping) Ongoing transactions. Keys are the context identifier (that get_current_context_id() returns), and values are pairs of the DirtyBuffer that should be written when the transaction is committed, and stack information.
Save the document to the key in the staged repository.
Returns: actually written document
Return type: MergeableDocumentElement
Note
This method is intended to be internal. Use routed properties rather than this. See also Route.
Mapping object which represents the hierarchy of a routed key path.
Note
The constructor is intended to be internal, so don’t instantiate it directly. Use Route instead.
Memory-buffered proxy for the repository. It’s used as the transaction buffer, which maintains updates to be written until the ongoing transaction is committed.
Note
This class is intended to be internal.
Flush all buffered updates to the repository.
(Repository) The bare repository where the buffer will flush() to.
Descriptor that routes a document_type to a particular key path pattern in the repository.
key_spec could contain some format strings. Format strings can take a keyword (session) and zero or more positional arguments.
For example, if you route a document type without any positional arguments in key_spec format:
class Stage(BaseStage):
'''Stage example.'''
metadata = Route(
Metadata,
['metadata', '{session.identifier}.xml']
)
The Stage instance will have a metadata attribute that simply holds a Metadata document instance (in the example):
>>> stage.metadata # ['metadata', 'session-id.xml']
<Metadata ...>
If you route something with one or more positional arguments in the key_spec format, then it works in a different way:
class Stage(BaseStage):
'''Stage example.'''
seating_chart = Route(
Student,
['students', 'col-{0}', 'row-{1}', '{session.identifier}.xml']
)
In the above routing, two positional arguments were used. It means that the seating_chart property will return a two-dimensional mapping object (Directory):
>>> stage.seating_chart # ['students', ...]
<libearth.directory.Directory ['students']>
>>> list(stage.seating_chart)
['A', 'B', 'C', 'D']
>>> b = stage.seating_chart['B'] # ['students', 'col-B', ...]
<libearth.directory.Directory ['students', 'col-B']>
>>> list(stage.seating_chart['B'])
['1', '2', '3', '4', '5', '6']
>>> stage.seating_chart['B']['6'] \
... # ['students', 'col-B', 'row-6', 'session-id.xml']
<Student B6>
(type) The type of the routed document. It is a subtype of MergeableDocumentElement.
(collections.Sequence) The repository key pattern that might contain some format strings.
Staged documents of Earth Reader.
(SubscriptionList) The set of subscriptions.
The error raised if there’s no ongoing transaction while one is needed to update the stage, or if there’s already an ongoing transaction when a new transaction is begun.
Compile a format_string to regular expression pattern. For example, 'string{0}like{1}this{{2}}' will be compiled to /^string(.*?)like(.*?)this\{2\}$/.
Parameters: format_string (str) – format string to compile
Returns: compiled pattern object
Return type: re.RegexObject
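A minimal sketch of such a compiler might look like the following. This illustrates the described behavior ({0}, {1}, … become non-greedy capture groups; {{ and }} become literal braces); it is not the actual libearth implementation:

```python
import re


def format_to_pattern(format_string):
    """Sketch: compile a str.format-style string to a regex pattern."""
    pattern = ['^']
    i = 0
    while i < len(format_string):
        c = format_string[i]
        if format_string[i:i + 2] in ('{{', '}}'):
            pattern.append(re.escape(c))  # escaped literal brace
            i += 2
        elif c == '{':
            end = format_string.index('}', i)  # skip over the field name
            pattern.append('(.*?)')            # non-greedy capture group
            i = end + 1
        else:
            pattern.append(re.escape(c))
            i += 1
    pattern.append('$')
    return re.compile(''.join(pattern))


p = format_to_pattern('string{0}like{1}this{{2}}')
m = p.match('stringAlikeBthis{2}')
```

The compiled pattern is ^string(.*?)like(.*?)this\{2\}$, matching the example in the text, and the groups recover the values that filled the format fields.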
Identify the current context (greenlet, stackless, or thread).
Returns: the identifier of the current context
Maintain the subscription list using the OPML format, which is the de facto standard for the purpose.
Represent body element of OPML document.
Category which groups Subscription objects or other Category objects. It implements collections.MutableSet protocol.
Encode strings e.g. ['a', 'b', 'c'] into a comma-separated list e.g. 'a,b,c', and decode it back to a Python list. Whitespace around commas is ignored.
>>> codec = CommaSeparatedList()
>>> codec.encode(['technology', 'business'])
'technology,business'
>>> codec.decode('technology, business')
['technology', 'business']
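The codec’s round-trip behavior can be modeled with a short sketch (a simplified stand-in for illustration, not the actual CommaSeparatedList class):

```python
class CommaSeparatedListSketch(object):
    """Simplified model of the comma-separated-list codec."""

    def encode(self, value):
        # join list items with commas; an empty/None value encodes to ''
        return ','.join(value) if value else ''

    def decode(self, text):
        # split on commas and strip surrounding whitespace from each item
        if not text:
            return []
        return [item.strip() for item in text.split(',')]


codec = CommaSeparatedListSketch()
```

Decoding 'technology, business' strips the space after the comma, matching the doctest above.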
Represent head element of OPML document.
Represent outline element of OPML document.
(datetime.datetime) The created time.
Subscription which holds referring feed_uri.
The set (exactly, tree) of subscriptions. It consists of Subscriptions and Category objects for grouping. It implements collections.MutableSet protocol.
(distutils.version.StrictVersion) The OPML version number.
Mixin for SubscriptionList and Category, both of which can group Subscription objects and other Category objects, to implement the collections.MutableSet protocol.
Note
Every subclass of SubscriptionSet has to override the children property to implement details.
Determine whether the set contains the given outline. If recursively is False (the default), it works in the same way as the in operator.
Returns: True if the set (or tree) contains the given outline, or False
Return type: bool
New in version 0.2.0.
Add a subscription from a Feed instance. Prefer this method over the add() method.
Parameters: feed (Feed) – feed to subscribe
(collections.Set) The subset which consists of only Subscription instances.
Most of this module is adapted from the official documentation of the datetime module in the Python standard library.
(Utc, datetime.timezone) The tzinfo instance that represents UTC. It’s an instance of Utc in Python 2 (which provides no built-in fixed-offset tzinfo implementation), and an instance of timezone with zero offset in Python 3.
Fixed offset in minutes east from UTC.
>>> kst = FixedOffset(9 * 60, name='Asia/Seoul') # KST +09:00
>>> current = now()
>>> current
datetime.datetime(2013, 8, 15, 3, 18, 37, 404562, tzinfo=libearth.tz.Utc())
>>> current.astimezone(kst)
datetime.datetime(2013, 8, 15, 12, 18, 37, 404562,
tzinfo=<libearth.tz.FixedOffset Asia/Seoul>)
Earth Reader aims to decentralize the feed reader ecosystem, which had been highly centralized around Google Reader. Google Reader changed the world of news readers, from desktop apps to web-based services.
However, Google Reader shut down on July 1, 2013. Everyone panicked, several new feed reader services were born, and users had to migrate their data; most of the alternative services were able to import only the subscription list through OPML, not starred and read data.
Feed readers were actually desktop apps at first. A few years later some people started to lose their data, because desktop apps simply stored data on the local disk. In those days there were already some web-based feed readers, e.g. Bloglines and Google Reader, but they provided a worse experience than desktop apps (there was no Chrome, and JavaScript engines were way slower back then). Nevertheless people gradually moved from desktop apps to web-based services, because they never (until then, at least) lost data, and were easily synchronized between multiple computers.
These feed reader services are convenient enough, but always carry the risk that you can’t control your own data. If the service you use suddenly shuts down without giving you a chance to back up your data, you would have to start everything from scratch. Your starred articles would be gone.
The goal of Earth Reader is to achieve the following subgoals at the same time:
To achieve the goal of Earth Reader, its design needs to resolve the following subproblems:
All data libearth deals with are based on (de facto) standard formats. For example, it stores the subscription list and its category hierarchy in an OPML file. OPML has been a de facto standard format for exchanging subscription lists between feed readers. It also stores all feed data in the Atom format (RFC 4287).
Actually, most technologies related to RSS/syndication formats are from the early ’00s, which means they used XML instead of the JSON we use for the same purpose today. OPML is an (though poorly structured) XML format, and Atom is an XML format as well.
Since we need to deal with several kinds of XML data and no other formats, we decided to build first-class model objects for XML, like an ORM for relational databases. You can find how it can be used for designing model objects in libearth/feed.py and libearth/subscribe.py. It looks similar to Django ORM and SQLAlchemy, and lets you deal with XML documents in the same way you use plain Python objects.
Under the hood it does incremental parsing using SAX instead of DOM to reduce memory usage when the document is larger than a hundred megabytes.
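The memory benefit of incremental parsing can be illustrated with a short sketch. Libearth uses SAX; for brevity this example uses xml.etree.ElementTree.iterparse, which follows the same streaming idea of visiting elements as they are parsed and discarding them immediately, instead of building a whole DOM tree:

```python
import io
import xml.etree.ElementTree as etree


def iter_titles(xml_file):
    """Stream <title> texts without keeping the whole tree in memory."""
    for event, elem in etree.iterparse(xml_file, events=('end',)):
        if elem.tag.endswith('title'):
            yield elem.text
        elem.clear()  # free the subtree we just visited


doc = io.BytesIO(b'<feed><entry><title>a</title></entry>'
                 b'<entry><title>b</title></entry></feed>')
titles = list(iter_titles(doc))
```

For a hundred-megabyte feed document, this keeps only the element currently being processed in memory rather than the entire tree.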
See also
Earth Reader data can be shared by multiple installations, e.g. desktop apps, mobile apps, web apps. So there must be simultaneous updates between them that could conflict. An important constraint we have is that synchronization isn’t done by Earth Reader itself. We can’t lock files nor perform atomic operations on them.
Our solution to this is read-time merge. Data are not shared between installations, at least at the filesystem level. They have isolated files for the same entities, and libearth merges all of them when they are loaded into memory. The merged result doesn’t affect all replicas but only the replica that corresponds to the installation. You can understand the approach as similar to a DVCS (although there are actually many differences): installations are branches, and updates from others can be pulled into mine. If there are simultaneous changes, they are merged and then committed to mine. If there’s no change for me, changes are simply pulled from others without a merge. A big difference is that there’s no push. You can only pull from others, or wait for others to pull yours. That’s because most existing synchronization utilities like Dropbox work passively in the background. Moreover, devices could be offline.
Repository abstracts the storage backend, e.g. the filesystem. There might be platforms that have no way to directly access the filesystem, e.g. iOS, and in that case the concept of a repository lets you store data directly in Dropbox or Google Drive instead. However, in most cases we will simply use FileSystemRepository even if data are synchronized using Dropbox or rsync.
See also
Session abstracts installations. Every installation has its own session identifier. To be more exact, its purpose is to distinguish processes, hence every process has its unique identifier even if they are child processes of the same installation, e.g. prefork workers.
Every session makes its own file for a document. For example, if there are two sessions identified as a and b, two files will be made for a document doc.xml: doc.a.xml and doc.b.xml respectively.
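The naming scheme can be sketched as a tiny helper. The session_key function is hypothetical, written here only to make the doc.xml → doc.a.xml convention concrete:

```python
def session_key(key, session_id):
    """Hypothetical helper: derive a per-session filename from a document key."""
    # doc.xml with session 'a' becomes doc.a.xml
    base, ext = key.rsplit('.', 1)
    return '{0}.{1}.{2}'.format(base, session_id, ext)
```

So two sessions a and b writing the same document never touch each other’s file; merging happens later at read time.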
A session merges all changes from other sessions when a document is being loaded (read-time merge).
See also
Stage is a unit of changes, i.e. an atomic set of changes to be merged. It provides transactions for multi-threaded environments. If there are simultaneous changes from other sessions or other transactions, they are automatically merged when the currently ongoing transaction is committed.
Stage also provides Route, a convenient interface to access documents. For example, you can read the subscription list by stage.subscriptions, and write it by stage.subscriptions = new_subscriptions. In a similar way, you can read a feed by stage.feeds[feed_id], and write it by stage.feeds[feed_id] = new_feed.
See also
Released on July 12, 2014.
Released on April 22, 2014.
Released on January 19, 2014.
Released on January 2, 2014.
Released on December 13, 2013. Initial alpha version.
Libearth is an open source software written by Hong Minhee and the Earth Reader team. See also the complete list of contributors as well. Libearth is free software licensed under the terms of the GNU General Public License Version 2 or any later version, and you can find the code at its GitHub repository:
$ git clone git://github.com/earthreader/libearth.git
If you find any bugs, please report them to our issue tracker. Pull requests are always welcome!
We discuss libearth’s development on IRC. Come to the #earthreader channel on the Ozinger network. (We will make one on freenode as well soon!)