Friday, 09 October 2015 11:15
- StackOverflow: What’s the best way of scraping data from a website? - has some good thoughts, particularly about Rate Limiting and Bot behavior.
- See Scrapy on this site.
Rate Limiting and Bot Behavior
From the StackOverflow link above:
- The general consensus is to limit your page requests to one every 2-5 seconds.
- Identify your requests with a user agent string that identifies your bot.
- Have a webpage for your bot explaining its purpose. This URL goes in the user agent string.
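
The three points above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: the bot name `ExampleBot/1.0`, the info page `https://example.com/bot.html`, and the `PoliteFetcher` class are all hypothetical placeholders, and the `(+URL)` form of the user agent string is just a common convention assumed here.

```python
import random
import time
import urllib.request

# Hypothetical bot name and info page; replace with your own.
BOT_NAME = "ExampleBot/1.0"
BOT_INFO_URL = "https://example.com/bot.html"


def build_user_agent(name=BOT_NAME, info_url=BOT_INFO_URL):
    """Return a user agent string that names the bot and links to its info page."""
    return f"{name} (+{info_url})"


class PoliteFetcher:
    """Fetch pages no faster than one request every min_delay..max_delay seconds."""

    def __init__(self, min_delay=2.0, max_delay=5.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep
        self._clock = clock
        self._last_request = None  # time of the previous request, if any

    def wait_turn(self):
        """Sleep until a randomly chosen delay has passed since the last request."""
        if self._last_request is not None:
            delay = random.uniform(self.min_delay, self.max_delay)
            remaining = delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url):
        """Wait for our turn, then download url with the identifying user agent."""
        self.wait_turn()
        request = urllib.request.Request(
            url, headers={"User-Agent": build_user_agent()}
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
```

Calling `PoliteFetcher().fetch(url)` in a loop would then space requests 2-5 seconds apart. The delay is randomized within that range rather than fixed, which is a design choice of this sketch and not something the StackOverflow answer prescribes.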
Friday, 09 October 2015 10:46
It's fast, lightweight, and stateless, which makes it easy to distribute.