How to crawl using proxies with Python
We will see two options, one for using a single proxy, and another for randomly rotating a set of proxies.
You need to have advertools installed for the single proxy case.
How to use a single proxy while crawling
import advertools as adv
adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    meta={"proxy": "http://proxy_server:port"},
)
The proxy URL can also include a username and password, like this: http://username:password@proxy_server:port
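If the username or password contains characters such as `@`, `:` or `/`, they generally need to be percent-encoded before being placed in the proxy URL. The following is a minimal sketch using only the standard library; the helper name `build_proxy_url`, the port number, and the credentials are hypothetical and not part of advertools.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme="http"):
    """Return a proxy URL, percent-encoding the credentials if given.

    Hypothetical helper for illustration; not part of advertools.
    """
    if username is None:
        return f"{scheme}://{host}:{port}"
    # safe="" makes quote() also encode "/", so nothing in the
    # credentials can be mistaken for a URL delimiter.
    user = quote(username, safe="")
    pwd = quote(password or "", safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


# Placeholder values, not real credentials
proxy_url = build_proxy_url("proxy_server", 8080, "user", "p@ss:word")
print(proxy_url)
```

The resulting string can then be passed as the value of `"proxy"` in the `meta` dictionary.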
How to use and rotate multiple proxies while crawling
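For this you need to install an additional package, scrapy-rotating-proxies (it provides the rotating_proxies middlewares used below), for example with pip install scrapy-rotating-proxies.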
This package handles the proxy rotation for you, in addition to retries, so you don’t need to worry about those details.
Get a list of proxies and save them in a text file, one proxy per line
These can be obtained from any proxy service provider. Save the list of proxies in a text file with the template:
https://username:password@IPADDRESS:PORT
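Your file would look like this (placeholder credentials and addresses):
https://username:password@123.456.789.101:8893
https://username:password@123.456.789.101:8894
https://username:password@123.456.789.101:8895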
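Before starting a long crawl, it can help to check that every line in the proxies file follows the template. The sketch below is one possible check using the standard library; the function name `check_proxy_lines` and the sample entries are hypothetical, and this is not the validation that the rotation package itself performs.

```python
from urllib.parse import urlsplit


def check_proxy_lines(lines):
    """Split proxy lines into (valid, invalid) lists.

    A line counts as valid here if it parses as a URL with an http or
    https scheme, a host, and a numeric port. Illustrative check only.
    """
    valid, invalid = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        try:
            parts = urlsplit(line)
            ok = (
                parts.scheme in ("http", "https")
                and parts.hostname is not None
                and parts.port is not None
            )
        except ValueError:  # e.g. a non-numeric or out-of-range port
            ok = False
        (valid if ok else invalid).append(line)
    return valid, invalid


# Placeholder entries standing in for the contents of proxies.txt
sample = [
    "https://username:password@10.0.0.1:8893",
    "username:password@10.0.0.2:8894",  # missing scheme
    "https://username:password@10.0.0.3",  # missing port
]
valid, invalid = check_proxy_lines(sample)
print(f"{len(valid)} valid, {len(invalid)} invalid")
```

To check a real file, the lines can be read with `open("proxies.txt").read().splitlines()` and passed to the same function.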
Set a few custom_settings in the crawl function
We will use two custom settings: DOWNLOADER_MIDDLEWARES, which enables the proxy rotation and ban detection middlewares, and ROTATING_PROXY_LIST_PATH, which points to the proxies file.
Pass them to the crawl function through custom_settings, and you’re done:
adv.crawl(
    "https://example.com",
    "output.jsonl",
    follow_links=True,
    custom_settings={
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
        "ROTATING_PROXY_LIST_PATH": "proxies.txt", # or the full path to where you stored the file
    },
)
Reading the crawl file and seeing the proxies
import pandas as pd
crawldf = pd.read_json("output.jsonl", lines=True)
crawldf.filter(regex="proxy").head()  # get columns that contain "proxy"

|  | proxy | _rotating_proxy | request_headers_proxy-authorization | proxy_retry_times |
|---|---|---|---|---|
| 0 | https://123.456.789.101:8893 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan | 
| 1 | https://123.456.789.101:8894 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan | 
| 2 | https://123.456.789.101:8895 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan | 
| 3 | https://123.456.789.101:8896 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan | 
| 4 | https://123.456.789.101:8897 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan | 
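To see how evenly the crawl was spread across proxies, the rows can be counted per value of the proxy column. With the DataFrame above, `crawldf["proxy"].value_counts()` does this directly. The sketch below shows the same idea using only the standard library; the function name `count_requests_per_proxy` and the sample rows are made up for illustration.

```python
import json
from collections import Counter


def count_requests_per_proxy(jsonl_lines):
    """Count crawled rows per value of the "proxy" field.

    `jsonl_lines` is any iterable of JSON strings, for example an open
    crawl output file. Rows without a "proxy" key are counted under None.
    """
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():
            row = json.loads(line)
            counts[row.get("proxy")] += 1
    return counts


# Made-up rows standing in for lines of output.jsonl
sample_lines = [
    '{"url": "https://example.com/a", "proxy": "https://10.0.0.1:8893"}',
    '{"url": "https://example.com/b", "proxy": "https://10.0.0.2:8894"}',
    '{"url": "https://example.com/c", "proxy": "https://10.0.0.1:8893"}',
]
for proxy, n in count_requests_per_proxy(sample_lines).most_common():
    print(proxy, n)
```

For a real crawl, the open file object can be passed in directly, for example `with open("output.jsonl") as f: counts = count_requests_per_proxy(f)`.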
There are a few other settings that you might want to check out, which are listed on the library’s documentation page.
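As a rough illustration of how such extra settings could be combined with the two used in this guide, here is a small sketch that builds the custom_settings dictionary. The helper `rotating_proxy_settings` is hypothetical, and the optional setting names `ROTATING_PROXY_PAGE_RETRY_TIMES` and `ROTATING_PROXY_CLOSE_SPIDER` are assumptions that should be verified against the documentation of the rotating_proxies package before use.

```python
def rotating_proxy_settings(proxy_list_path, page_retry_times=None, close_spider=None):
    """Build a custom_settings dict for rotating proxies.

    Hypothetical helper. The optional setting names are assumed from the
    rotation package's documentation and should be verified there.
    """
    settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
        "ROTATING_PROXY_LIST_PATH": proxy_list_path,
    }
    # Only include optional settings that were explicitly requested,
    # so the package defaults apply otherwise.
    if page_retry_times is not None:
        settings["ROTATING_PROXY_PAGE_RETRY_TIMES"] = page_retry_times
    if close_spider is not None:
        settings["ROTATING_PROXY_CLOSE_SPIDER"] = close_spider
    return settings


custom_settings = rotating_proxy_settings("proxies.txt", page_retry_times=3)
print(sorted(custom_settings))
```

The returned dictionary can be passed as the `custom_settings` argument of the crawl function.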