libearth.crawler — Crawler
Crawl feeds.
-
libearth.crawler.DEFAULT_TIMEOUT = 10
(numbers.Integral) The default timeout for connection attempts: 10 seconds.
New in version 0.3.0.
-
exception libearth.crawler.CrawlError(feed_uri, *args, **kwargs)
Error raised when crawling the given URL fails.
New in version 0.3.0: Added the feed_uri parameter and the corresponding feed_uri attribute.
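A minimal sketch of how the feed_uri attribute can be used when handling failures. It does not import libearth: the CrawlError class below is a stand-in with the documented constructor signature (its base class is an assumption), and fetch() is a hypothetical helper that always fails so the handler has something to catch.

```python
class CrawlError(Exception):
    """Stand-in for libearth.crawler.CrawlError (base class is assumed)."""

    def __init__(self, feed_uri, *args, **kwargs):
        # Extra keyword arguments are ignored in this sketch.
        super().__init__(*args)
        self.feed_uri = feed_uri


def fetch(feed_uri):
    """Hypothetical fetch step that always fails, to exercise the handler."""
    raise CrawlError(feed_uri, 'connection timed out')


failed = []
for uri in ['http://example.com/a.xml', 'http://example.com/b.xml']:
    try:
        fetch(uri)
    except CrawlError as e:
        # feed_uri tells us which feed the failure belongs to.
        failed.append((e.feed_uri, str(e)))
```

Recording feed_uri next to the message makes it possible to retry or report individual feeds without parsing the error text.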
-
class libearth.crawler.CrawlResult(url, feed, hints, icon_url=None)
The result of each crawl of a feed.
It mimics a triple of (url, feed, hints) for backward compatibility with versions below 0.3.0, so you can still get these values using tuple unpacking, though that is no longer the recommended way to get them.
New in version 0.3.0.
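The triple-like behaviour can be illustrated with a small stand-in class. This is a sketch under assumptions, not libearth's actual implementation: it builds on collections.abc.Sequence to show how an object with named attributes can still support legacy tuple unpacking. The URLs and the placeholder feed value are made up for the example.

```python
from collections.abc import Sequence


class CrawlResult(Sequence):
    """Stand-in that behaves like a (url, feed, hints) triple."""

    def __init__(self, url, feed, hints, icon_url=None):
        self.url = url
        self.feed = feed
        self.hints = hints
        self.icon_url = icon_url

    def __len__(self):
        # Only the three legacy fields take part in sequence behaviour.
        return 3

    def __getitem__(self, index):
        return (self.url, self.feed, self.hints)[index]


result = CrawlResult(
    'http://example.com/feed.xml',
    '<placeholder feed object>',
    None,
    icon_url='http://example.com/favicon.ico',
)

# Legacy style: tuple unpacking still works.
url, feed, hints = result

# Preferred style: attribute access, which also exposes icon_url.
preferred = (result.url, result.feed, result.hints, result.icon_url)
```

Note that icon_url is reachable only as an attribute here, which is one reason attribute access is the better habit.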
-
add_as_subscription(subscription_set)
Add it as a subscription to the given subscription_set.
Parameters: subscription_set (SubscriptionSet) – a subscription list or category to add a new subscription to
Returns: the created subscription object
Return type: Subscription
-
hints = None
(collections.Mapping) The extra hints for the crawler, e.g. skipHours, skipMinutes, skipDays. It might be None.
-
-
libearth.crawler.crawl(feed_urls, pool_size, timeout=10)
Crawl the feeds in the given list using threads.
Parameters:
- feed_urls – feed urls to crawl
- pool_size (numbers.Integral) – the number of concurrent workers
- timeout (numbers.Integral) – optional timeout for connection attempts. DEFAULT_TIMEOUT is used if omitted
Returns: a set of CrawlResult objects
Return type: collections.Iterable
Changed in version 0.3.0: It now returns a set of CrawlResults instead of tuples.
Changed in version 0.3.0: The parameter feeds was renamed to feed_urls.
New in version 0.3.0: Added the optional timeout parameter.