Web Scraping Notes

General

Rate Limiting and Bot Behavior

From the StackOverflow link above:

  • General consensus is to limit page requests to one every 2-5 seconds.
  • Send a user agent string that identifies your bot.
  • Have a webpage for your bot explaining its purpose. This URL goes in the user agent string.
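
The three points above can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: the bot name, the info URL, and the PoliteFetcher class are hypothetical names made up for illustration, and the randomized pause within the 2-5 second window is one possible reading of the guideline.

```python
import random
import time
import urllib.request

# Hypothetical bot identity (assumption): replace with your own bot name
# and the URL of the page that explains what the bot does.
USER_AGENT = "example-bot/0.1 (+https://example.com/bot-info)"


class PoliteFetcher:
    """Fetch pages with a randomized 2-5 second gap between requests."""

    def __init__(self, min_delay=2.0, max_delay=5.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep          # injectable so the class can be tested
        self._clock = clock
        self._last_request = None    # time of the previous request, if any

    def wait_time(self):
        """Return how many seconds to pause before the next request."""
        if self._last_request is None:
            return 0.0               # first request: no need to wait
        delay = random.uniform(self.min_delay, self.max_delay)
        elapsed = self._clock() - self._last_request
        return max(0.0, delay - elapsed)

    def build_request(self, url):
        """Build a request that carries the identifying user agent."""
        return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

    def fetch(self, url):
        """Pause if needed, then download the page body as bytes."""
        self._sleep(self.wait_time())
        self._last_request = self._clock()
        with urllib.request.urlopen(self.build_request(url), timeout=30) as resp:
            return resp.read()
```

Usage would look like creating one PoliteFetcher() per target site and calling fetch(url) for each page, so the delay is tracked across requests to that site.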

Scrapy

Add Ons

Splash

Splash is a JavaScript rendering service with an HTTP API. It's a lightweight browser implemented in Python using Twisted and Qt.
It's fast, lightweight, and stateless, which makes it easy to distribute.
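
As a rough illustration of the HTTP API, the sketch below asks Splash's render.html endpoint for the rendered HTML of a page. It assumes a Splash instance is already running locally on port 8050 (the usual default); the helper names splash_render_url and fetch_rendered_html are hypothetical, and the 0.5 second wait is an arbitrary choice.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: Splash is running locally on its usual default port.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=0.5, base=SPLASH_BASE):
    """Build a render.html URL asking Splash to load and render page_url.

    `wait` is the number of seconds Splash waits after the page loads,
    which gives scripts on the page time to run.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5):
    """Return the JavaScript-rendered HTML of page_url as text."""
    with urllib.request.urlopen(splash_render_url(page_url, wait), timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Because each call is a plain, self-contained HTTP request, the same code works unchanged against any Splash instance by changing the base URL, which is what makes the service easy to distribute.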
