Python3, urllib3: Fast, server-friendly requests to a single site, on the order of 100,000 requests?

Sounds like the server might be throttling/banning your IP for making too many requests too frequently.

First, I'd suggest checking the site's robots.txt for any guidance on automated request frequency. If there is none, you could ask the owner of the website how best to crawl it. Failing that, you may need to determine an acceptable request rate experimentally.
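
If you want to automate that first check, Python's standard library ships urllib.robotparser, which can read the Crawl-delay and Request-rate directives a site may publish. A minimal sketch, with a placeholder domain:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

# Both return None when robots.txt has no such directive
print(rp.crawl_delay("*"))     # seconds to wait between requests
print(rp.request_rate("*"))    # named tuple (requests, seconds)

# can_fetch() tells you whether a given path is allowed at all
print(rp.can_fetch("*", "https://example.com/some/page"))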

To throttle your requests, you can use something like apiclient.RateLimiter*. It would look something like this:

from apiclient import RateLimiter
from urllib3 import PoolManager

# Allow at most 30 requests per rolling 60-second window;
# acquire() blocks once that budget is used up.
lock = RateLimiter(max_messages=30, every_seconds=60)
http = PoolManager()
# ... build crawl_list ...

for url in crawl_list:
    lock.acquire()                   # wait for permission to send
    r = http.request("GET", url)
Another thing you could do is crawl a cached version of the site, if one is available through Google's cache or archive.org.
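
For archive.org specifically, the Wayback Machine has an availability endpoint that returns the closest archived snapshot of a URL. A rough sketch of looking one up and fetching the cached copy (the page URL is a placeholder):

import json
from urllib3 import PoolManager

http = PoolManager()

# Ask the Wayback Machine for the closest snapshot of the page
r = http.request("GET", "https://archive.org/wayback/available",
                 fields={"url": "example.com/some/page"})
data = json.loads(r.data.decode("utf-8"))

snapshot = data.get("archived_snapshots", {}).get("closest")
if snapshot and snapshot.get("available"):
    # Hit the archive instead of the origin server
    cached = http.request("GET", snapshot["url"])
    print(cached.status, snapshot["url"])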

[*] Disclaimer: I also wrote apiclient a long time ago. It's not super-well documented. I suspect there are other similar modules that you can use if you find it lacking, but the source should be reasonably easy to understand and extend.
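
If you'd rather avoid the dependency, the idea behind RateLimiter is small enough to hand-roll. Here is a minimal stand-in with a similar interface; the class below is my sketch, not apiclient's actual implementation:

import threading
import time

class SimpleRateLimiter:
    # Let at most max_messages acquire() calls through per every_seconds.
    def __init__(self, max_messages, every_seconds):
        self.max_messages = max_messages
        self.every_seconds = every_seconds
        self._sent = []                     # send times inside the window
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            # Forget sends that have aged out of the window
            self._sent = [t for t in self._sent
                          if now - t < self.every_seconds]
            if len(self._sent) >= self.max_messages:
                # Sleep until the oldest send leaves the window
                time.sleep(self._sent[0] + self.every_seconds - now)
            self._sent.append(time.monotonic())

It drops straight into the loop above in place of the apiclient import.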
