- StackOverflow: What’s the best way of scraping data from a website? - has some good thoughts, particularly about Rate Limiting and Bot behavior.
- See Scrapy on this site.
Rate Limiting and Bot Behavior
From the StackOverflow link above:
- General consensus is to limit your page requests to one every 2-5 seconds.
- Send a user agent string with every request that identifies your bot.
- Have a webpage for your bot explaining its purpose. This URL goes in the user agent string.
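
The points above could be sketched roughly as follows, using only the Python standard library. The bot name, the info-page URL, and the helper names (`build_request`, `polite_delay`, `fetch_all`) are all made up for illustration, and picking a random pause inside the 2-5 second window is an assumption of this sketch rather than something the StackOverflow thread prescribes.

```python
import random
import time
import urllib.request

# Hypothetical bot name and info page; replace with your own.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

# Delay window in seconds, taken from the 2-5 second guideline above.
MIN_DELAY = 2.0
MAX_DELAY = 5.0


def build_request(url):
    """Return a Request that identifies the bot via its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_delay():
    """Pick a pause between MIN_DELAY and MAX_DELAY seconds (assumed: random)."""
    return random.uniform(MIN_DELAY, MAX_DELAY)


def fetch_all(urls, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(polite_delay())
        with opener(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

Calling `fetch_all(["https://example.com/a", "https://example.com/b"])` would fetch the two pages a few seconds apart. The `opener` and `sleep` parameters exist only so the function can be exercised without touching the network or actually waiting.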