libearth.crawler — Crawler

Crawl feeds.

libearth.crawler.DEFAULT_TIMEOUT = 10

(numbers.Integral) The default timeout for connection attempts, in seconds (10).

New in version 0.3.0.

exception libearth.crawler.CrawlError(feed_uri, *args, **kwargs)

Error raised when crawling the given URL fails.

New in version 0.3.0: Added feed_uri parameter and corresponding feed_uri attribute.

feed_uri = None

(str) The URI of the feed that failed to crawl.
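As a rough illustration of how the feed_uri attribute can be used when handling failures, the sketch below defines a stand-in exception with the same constructor shape as documented above. It does not import libearth: the class body, its base class (Exception), and the helpers fetch and collect_failures are assumptions made for this example only.

```python
class CrawlError(Exception):
    """Stand-in for libearth.crawler.CrawlError (illustrative only)."""

    feed_uri = None

    def __init__(self, feed_uri, *args):
        super().__init__(*args)
        # Remember which feed failed so callers can report or retry it.
        self.feed_uri = feed_uri


def fetch(feed_uri):
    # Hypothetical fetcher that always fails, to exercise the handler.
    raise CrawlError(feed_uri, 'connection timed out')


def collect_failures(feed_uris):
    """Try every feed and return the URIs that raised CrawlError."""
    failed = []
    for uri in feed_uris:
        try:
            fetch(uri)
        except CrawlError as e:
            failed.append(e.feed_uri)
    return failed


failed = collect_failures(['http://example.com/a.xml',
                           'http://example.com/b.xml'])
```

Reading feed_uri from the caught exception avoids having to track separately which URL was being processed when the error occurred.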

class libearth.crawler.CrawlResult(url, feed, hints, icon_url=None)

The result of each crawl of a feed.

It mimics a triple of (url, feed, hints) for backward compatibility with versions below 0.3.0, so you can still get these values by tuple unpacking, though this is no longer the recommended way to access them.

New in version 0.3.0.
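The difference between the two access styles can be sketched with a small stand-in class. This is not libearth's implementation: CrawlResultSketch is a hypothetical name, and subclassing tuple is just one simple way to get an object that unpacks as (url, feed, hints) while also exposing named attributes.

```python
class CrawlResultSketch(tuple):
    """Illustrative stand-in that unpacks like a (url, feed, hints) triple."""

    def __new__(cls, url, feed, hints, icon_url=None):
        obj = super().__new__(cls, (url, feed, hints))
        # icon_url is kept outside the triple so old unpacking code still works.
        obj.icon_url = icon_url
        return obj

    @property
    def url(self):
        return self[0]

    @property
    def feed(self):
        return self[1]

    @property
    def hints(self):
        return self[2]


result = CrawlResultSketch('http://example.com/feed.xml',
                           '<feed object>',  # placeholder for a Feed
                           None,
                           icon_url='http://example.com/favicon.ico')

# Old style (pre-0.3.0 code): tuple unpacking.
url, feed, hints = result

# Preferred style: attribute access, which also reaches icon_url.
same_url = result.url
icon = result.icon_url
```

Attribute access is the only way to reach icon_url here, which is one reason to prefer it over unpacking.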

add_as_subscription(subscription_set)

Add it as a subscription to the given subscription_set.

Parameters: subscription_set (SubscriptionSet) – a subscription list or category to add the new subscription to
Returns: the created subscription object
Return type: Subscription

feed = None

(Feed) The crawled feed.

hints = None

(collections.Mapping) Extra hints for the crawler, e.g. skipHours, skipMinutes, skipDays. It might be None.

icon_url = None

(str) The favicon URL of the feed, if one exists. It might be None.

url = None

(str) The URL of the crawled feed.

libearth.crawler.crawl(feed_urls, pool_size, timeout=10)

Crawl the feeds in the given list of feed URLs using threads.

Parameters:
Returns: a set of CrawlResult objects
Return type: collections.Iterable

Changed in version 0.3.0: It now returns a set of CrawlResult objects instead of tuples.

Changed in version 0.3.0: The parameter feeds was renamed to feed_urls.

New in version 0.3.0: Added optional timeout parameter.
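The general pattern that this signature suggests (a pool of worker threads, each fetching one feed with a timeout) can be sketched with the standard library alone. The code below is a minimal illustration and not libearth's implementation: crawl_sketch and fetch_feed are hypothetical names, and fetch_feed returns a placeholder instead of performing a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_feed(url, timeout):
    # Hypothetical stand-in: a real fetcher would open the URL with the
    # given timeout and parse the response into a feed object.
    return (url, 'feed for ' + url, None)


def crawl_sketch(feed_urls, pool_size, timeout=10):
    """Fetch every URL on a pool of pool_size threads; return a set of results."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(fetch_feed, url, timeout) for url in feed_urls]
        # result() blocks until the corresponding fetch has finished.
        return set(future.result() for future in futures)


results = crawl_sketch(['http://example.com/a.xml',
                        'http://example.com/b.xml'],
                       pool_size=2)
crawled_urls = sorted(url for url, feed, hints in results)
```

Because the results are collected into a set, their order is not tied to the order of feed_urls; callers that need to match results to inputs can use the url field of each result.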