Settings¶

This Middleware adds some settings to configure how to work with Crawlera.

CRAWLERA_APIKEY¶

Default: None

Unique Crawlera API Key provided for authentication.

CRAWLERA_URL¶

Default: 'http://proxy.crawlera.com:8010'

Crawlera instance url, it varies depending on adquiring a private or dedicated instance. If Crawlera didn’t provide you with a private instance url, you don’t need to specify it.

CRAWLERA_MAXBANS¶

Default: 400

Number of consecutive bans from Crawlera necessary to stop the spider.

CRAWLERA_DOWNLOAD_TIMEOUT¶

Default: 190

Timeout for processing Crawlera requests. It overrides Scrapy’s DOWNLOAD_TIMEOUT.

CRAWLERA_PRESERVE_DELAY¶

Default: False

If False Sets Scrapy’s DOWNLOAD_DELAY to 0, making the spider to crawl faster. If set to True, it will respect the provided DOWNLOAD_DELAY from Scrapy.

CRAWLERA_DEFAULT_HEADERS¶

Default: {}

Default headers added only to crawlera requests. Headers defined on DEFAULT_REQUEST_HEADERS will take precedence as long as the CrawleraMiddleware is placed after the DefaultHeadersMiddleware. Headers set on the requests have precedence over the two settings.

This is the default behavior, DefaultHeadersMiddleware default priority is 400 and we recommend CrawleraMiddleware priority to be 610

CRAWLERA_BACKOFF_STEP¶

Default: 15

Step size used for calculating exponential backoff according to the formula: random.uniform(0, min(max, step * 2 ** attempt)).

CRAWLERA_BACKOFF_MAX¶

Default: 180

Max value for exponential backoff as showed in the formula above.

CRAWLERA_FORCE_ENABLE_ON_HTTP_CODES¶

Default: []

List of HTTP response status codes that warrant enabling Crawlera for the corresponding domain.

When a response with one of these HTTP status codes is received after a request that did not go through Crawlera, the request is retried with Crawlera, and any new request to the same domain is also sent through Crawlera.