Settings¶
This middleware adds several settings that configure how it works with Crawlera.
CRAWLERA_URL¶
Default: 'http://proxy.crawlera.com:8010'
Crawlera instance URL. It varies depending on whether you have acquired a private or dedicated instance. If Crawlera didn’t provide you with a private instance URL, you don’t need to specify it.
CRAWLERA_MAXBANS¶
Default: 400
Number of consecutive bans from Crawlera necessary to stop the spider.
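A minimal sketch of how a consecutive-ban counter could behave. It reads "consecutive" to mean that any non-banned response resets the count. The class name `BanCounter` and its method are hypothetical and are not part of the middleware, and whether the real middleware stops at exactly this count or one past it is not stated here.

```python
class BanCounter:
    """Hypothetical helper: counts consecutive bans against a limit."""

    def __init__(self, maxbans=400):
        self.maxbans = maxbans  # mirrors the CRAWLERA_MAXBANS default
        self.bans = 0

    def record(self, banned):
        """Record one response; return True once the limit is reached."""
        if banned:
            self.bans += 1
        else:
            self.bans = 0  # assumption: a non-banned response breaks the streak
        return self.bans >= self.maxbans


# Small limit so the behavior is easy to follow.
counter = BanCounter(maxbans=3)
results = [counter.record(b) for b in (True, True, False, True, True, True)]
```

In this run only the last response pushes the streak to three, so only the last entry of `results` is true.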
CRAWLERA_DOWNLOAD_TIMEOUT¶
Default: 190
Timeout for processing Crawlera requests. It overrides Scrapy’s DOWNLOAD_TIMEOUT.
CRAWLERA_PRESERVE_DELAY¶
Default: False
If False, sets Scrapy’s DOWNLOAD_DELAY to 0, making the spider crawl faster. If set to True, the DOWNLOAD_DELAY configured in Scrapy is respected.
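The rule above can be sketched as a small function. The name `effective_download_delay` is hypothetical and only illustrates the described behavior; it is not the middleware's own code.

```python
def effective_download_delay(download_delay, preserve_delay=False):
    """Return the delay that applies, following the description above.

    download_delay: the value of Scrapy's DOWNLOAD_DELAY.
    preserve_delay: the value of CRAWLERA_PRESERVE_DELAY.
    """
    # With the default (False) the delay is dropped to 0;
    # with True the configured delay is kept as-is.
    return download_delay if preserve_delay else 0


fast = effective_download_delay(2.5)                       # default: False
polite = effective_download_delay(2.5, preserve_delay=True)
```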
CRAWLERA_DEFAULT_HEADERS¶
Default: {}
Default headers added only to Crawlera requests. Headers defined in DEFAULT_REQUEST_HEADERS will take precedence as long as the CrawleraMiddleware is placed after the DefaultHeadersMiddleware. Headers set on the requests have precedence over the two settings.
This is the default behavior: the default priority of DefaultHeadersMiddleware is 400, and we recommend a priority of 610 for CrawleraMiddleware.
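The precedence order described above can be illustrated with plain dictionaries. This is only a sketch: `merge_headers` is a hypothetical helper, the header names and values are example data, and plain dicts are case-sensitive, which real HTTP header handling is not.

```python
def merge_headers(crawlera_defaults, default_request_headers, request_headers):
    """Combine the three header sources, lowest precedence first."""
    merged = {}
    merged.update(crawlera_defaults)        # CRAWLERA_DEFAULT_HEADERS
    merged.update(default_request_headers)  # DEFAULT_REQUEST_HEADERS
    merged.update(request_headers)          # headers set on the request itself
    return merged


headers = merge_headers(
    {"X-Crawlera-Profile": "desktop", "Accept-Language": "en"},
    {"Accept-Language": "de", "Accept": "text/html"},
    {"Accept": "application/json"},
)
```

Here `Accept-Language` ends up with the DEFAULT_REQUEST_HEADERS value and `Accept` with the per-request value, while the header that only appears in the Crawlera defaults is kept unchanged.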
CRAWLERA_BACKOFF_STEP¶
Default: 15
Step size used for calculating exponential backoff according to the formula: random.uniform(0, min(max, step * 2 ** attempt)).
CRAWLERA_BACKOFF_MAX¶
Default: 180
Max value for exponential backoff, as shown in the formula above.
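The backoff formula can be tried out directly with the default values of both settings. The function names below are hypothetical, and the sketch assumes that attempts are counted from 0, which the text does not state.

```python
import random


def backoff_ceiling(attempt, step=15, max_delay=180):
    """Upper bound of the delay: min(max, step * 2 ** attempt)."""
    return min(max_delay, step * 2 ** attempt)


def backoff_delay(attempt, step=15, max_delay=180):
    """Random delay drawn from the range [0, ceiling], as in the formula."""
    return random.uniform(0, backoff_ceiling(attempt, step, max_delay))


# Ceilings for the first five attempts with the default settings.
ceilings = [backoff_ceiling(a) for a in range(5)]
sample = backoff_delay(3)
```

With these defaults the ceiling doubles from 15 until it would exceed 180, after which CRAWLERA_BACKOFF_MAX caps it. The actual delay is random, so it can be anywhere between 0 and the ceiling.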
CRAWLERA_FORCE_ENABLE_ON_HTTP_CODES¶
Default: []
List of HTTP response status codes that warrant enabling Crawlera for the corresponding domain.
When a response with one of these HTTP status codes is received after a request that did not go through Crawlera, the request is retried with Crawlera, and any new request to the same domain is also sent through Crawlera.
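A rough sketch of the bookkeeping this behavior implies, independent of Scrapy. `ForceEnableTracker` and its methods are hypothetical names, and treating the URL's hostname as "the domain" is an assumption.

```python
from urllib.parse import urlparse


class ForceEnableTracker:
    """Hypothetical helper: remembers domains that must use Crawlera."""

    def __init__(self, force_codes):
        # Values from CRAWLERA_FORCE_ENABLE_ON_HTTP_CODES.
        self.force_codes = set(force_codes)
        self.enabled_domains = set()

    def uses_crawlera(self, url):
        """True if requests to this URL's domain should go through Crawlera."""
        return urlparse(url).hostname in self.enabled_domains

    def should_retry_with_crawlera(self, url, status, went_through_crawlera):
        """Decide whether a response triggers a retry through Crawlera."""
        if went_through_crawlera or status not in self.force_codes:
            return False
        # Remember the domain so later requests to it also use Crawlera.
        self.enabled_domains.add(urlparse(url).hostname)
        return True


tracker = ForceEnableTracker([403, 503])
retry = tracker.should_retry_with_crawlera("http://example.com/a", 403, False)
same_domain = tracker.uses_crawlera("http://example.com/b")
other_domain = tracker.uses_crawlera("http://example.org/")
```

After the 403 response, the first request is flagged for a retry and every later request to `example.com` is routed through Crawlera, while other domains are unaffected.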