How to crawl using proxies with Python

Crawling
advertools
proxy
scrapy
A Python script showing how the advertools crawl function can easily incorporate a proxy. We can also use an additional library to automatically rotate a set of proxies.

We will see two options, one for using a single proxy, and another for randomly rotating a set of proxies.

You need to have advertools installed for the single proxy case.
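
If you don't have it yet, you can install it with pip:

$ pip install advertools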

How to use a single proxy while crawling

import advertools as adv
adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    meta={"proxy": "http://proxy_server:port"})

If your proxy requires authentication, you can include the username and password in the URL, like this: http://username:password@proxy_server:port
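
For example, a minimal sketch with hypothetical credentials (replace them with your own):

import advertools as adv

# the username, password, IP address, and port below are placeholders
adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    meta={"proxy": "http://user123:password123@12.34.56.78:1111"})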

How to use and rotate multiple proxies while crawling

For this, you will need to install an additional library, which is straightforward:

$ pip install scrapy-rotating-proxies

This package handles the proxy rotation for you, in addition to retries, so you don’t need to worry about those details.

Get a list of proxies and save them in a text file, one proxy per line

These can be obtained from any proxy service provider. Save the list of proxies in a text file with the template:

https://username:password@IPADDRESS:PORT

Your file would look like this:

proxies.txt
https://user123:password123@12.34.56.78:1111
https://user123:password123@12.34.56.78:1112
https://user123:password123@12.34.56.78:1113
https://user123:password123@12.34.56.78:1114

Set a few custom_settings in the crawl function

We will be using these custom settings: DOWNLOADER_MIDDLEWARES and ROTATING_PROXY_LIST_PATH.

Then, you need to set a few custom_settings in the crawl function, and you’re done:

adv.crawl(
    "https://example.com",
    "output.jsonl",
    follow_links=True,
    custom_settings={
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
        "ROTATING_PROXY_LIST_PATH": "proxies.txt", # or the full path to where you stored the file
    },
)

Reading the crawl file and seeing the proxies

import pandas as pd

crawldf = pd.read_json("output.jsonl", lines=True)
crawldf.filter(regex="proxy").head()  # get columns whose names contain "proxy"
   proxy                         _rotating_proxy  request_headers_proxy-authorization  proxy_retry_times
0  https://123.456.789.101:8893                1  Basic b3VzY214dHg6ODlld29rMGRsdfgt                 NaN
1  https://123.456.789.101:8894                1  Basic b3VzY214dHg6ODlld29rMGRsdfgt                 NaN
2  https://123.456.789.101:8895                1  Basic b3VzY214dHg6ODlld29rMGRsdfgt                 NaN
3  https://123.456.789.101:8896                1  Basic b3VzY214dHg6ODlld29rMGRsdfgt                 NaN
4  https://123.456.789.101:8897                1  Basic b3VzY214dHg6ODlld29rMGRsdfgt                 NaN
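
As a quick check, you can also count how many requests went through each proxy (assuming your crawl output contains the proxy column shown above):

crawldf["proxy"].value_counts()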

There are a few other settings that you might want to check out, which are available at the library's documentation page.
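
For example, the package provides settings such as ROTATING_PROXY_PAGE_RETRY_TIMES and ROTATING_PROXY_BACKOFF_BASE, which you could add to the custom_settings dictionary shown above. The sketch below uses illustrative values, so verify the names and defaults against the documentation before relying on them:

extra_proxy_settings = {
    # how many times to retry a page with different proxies before giving up on it
    "ROTATING_PROXY_PAGE_RETRY_TIMES": 5,
    # base backoff time (in seconds) for proxies that appear to be dead
    "ROTATING_PROXY_BACKOFF_BASE": 300,
}
# merge these into the custom_settings dict passed to adv.crawl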