libearth.crawler — Crawler
Crawl feeds.
libearth.crawler.DEFAULT_TIMEOUT = 10
   (numbers.Integral) The default timeout for connection attempts: 10 seconds.

   New in version 0.3.0.
exception libearth.crawler.CrawlError(feed_uri, *args, **kwargs)
   Error raised when crawling the given URL fails.

   New in version 0.3.0: Added the feed_uri parameter and the corresponding feed_uri attribute.
class libearth.crawler.CrawlResult(url, feed, hints, icon_url=None)
   The result of each crawl of a feed.

   It mimics a triple of (url, feed, hints) for backward compatibility with versions below 0.3.0, so you can still take these values using tuple unpacking, though that is no longer the recommended way to get them.

   New in version 0.3.0.
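The described triple behavior can be sketched with a tuple subclass. The class below is an illustration of the pattern, not libearth's actual implementation; only the constructor signature mirrors the documented one.

```python
from collections import namedtuple

class CrawlResultSketch(namedtuple('CrawlResultSketch', 'url feed hints')):
    """Illustrative sketch: unpacks like the old (url, feed, hints)
    triple while also exposing named attributes, as CrawlResult does
    for pre-0.3.0 compatibility."""

    def __new__(cls, url, feed, hints, icon_url=None):
        self = super(CrawlResultSketch, cls).__new__(cls, url, feed, hints)
        self.icon_url = icon_url  # extra attribute outside the triple
        return self

result = CrawlResultSketch('http://example.com/feed.xml', '<feed/>',
                           {'skipHours': [3]})

# Recommended style: attribute access
assert result.url == 'http://example.com/feed.xml'

# Legacy style: tuple unpacking still works
url, feed, hints = result
assert hints == {'skipHours': [3]}
```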
   add_as_subscription(subscription_set)
      Add it as a subscription to the given subscription_set.

      Parameters: subscription_set (SubscriptionSet) -- a subscription list or category to add the new subscription to
      Returns: the created subscription object
      Return type: Subscription
   hints = None
      (collections.Mapping) The extra hints for the crawler, e.g. skipHours, skipMinutes, skipDays. It might be None.
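Because hints may be None, lookups should tolerate a missing mapping. A minimal defensive pattern, using a sample dict rather than a real crawl result:

```python
def skip_hours(hints):
    """Return the skipHours hint, tolerating a None mapping."""
    return (hints or {}).get('skipHours', [])

# hints as documented: a mapping of polling hints, or None
hints = {'skipHours': [0, 1, 2], 'skipDays': ['Saturday']}

assert skip_hours(hints) == [0, 1, 2]
assert skip_hours(None) == []  # hints might be None
```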
libearth.crawler.crawl(feed_urls, pool_size, timeout=10)
   Crawl the feeds in the given feed list using threads.

   Parameters:
   - feed_urls -- feed URLs to crawl
   - pool_size (numbers.Integral) -- the number of concurrent workers
   - timeout (numbers.Integral) -- optional timeout for connection attempts. DEFAULT_TIMEOUT is used if omitted

   Returns: a set of CrawlResult objects
   Return type: collections.Iterable

   Changed in version 0.3.0: It now returns a set of CrawlResults instead of tuples.

   Changed in version 0.3.0: The parameter feeds was renamed to feed_urls.

   New in version 0.3.0: Added optional timeout parameter.
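The pool_size/timeout semantics can be sketched in plain Python with a thread pool. fetch() below is a hypothetical stand-in for the library's actual feed fetching; the function is a conceptual illustration, not libearth's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_TIMEOUT = 10  # seconds, mirroring libearth.crawler.DEFAULT_TIMEOUT

def fetch(url, timeout):
    # Hypothetical stand-in for the real HTTP fetch;
    # returns a (url, feed) pair instead of a CrawlResult.
    return url, '<feed from {0}>'.format(url)

def crawl_sketch(feed_urls, pool_size, timeout=DEFAULT_TIMEOUT):
    """Crawl feed_urls concurrently with pool_size worker threads
    and collect the results into a set."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return set(pool.map(lambda url: fetch(url, timeout), feed_urls))

results = crawl_sketch(['http://a.example/feed', 'http://b.example/feed'],
                       pool_size=2)
assert len(results) == 2
```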