How to crawl using proxies with Python
We will see two options, one for using a single proxy, and another for randomly rotating a set of proxies.
You need to have advertools installed for the single proxy case.
How to use a single proxy while crawling
import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    meta={"proxy": "http://proxy_server:port"},
)
The proxy can also include a username and password, like this: http://username:password@proxy_server:port
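If the username or password contains characters such as "@" or ":", the proxy URL becomes ambiguous unless those characters are percent-encoded. The following is a minimal sketch of one way to build the URL. The helper name build_proxy_url and the example credentials are hypothetical, and it assumes the downloader decodes percent-encoded credentials, which is the usual URL convention.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme="http"):
    """Return a proxy URL, percent-encoding the credentials if any are given.

    Hypothetical helper for illustration; not part of advertools.
    """
    if username is None:
        return f"{scheme}://{host}:{port}"
    # safe="" makes quote() encode "@", ":" and "/" as well
    user = quote(username, safe="")
    pwd = quote(password or "", safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


proxy = build_proxy_url("proxy_server", 8080, "user", "p@ss:word")
print(proxy)  # http://user:p%40ss%3Aword@proxy_server:8080
```

The resulting string can then be passed as the value of meta={"proxy": ...} in the crawl call.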
How to use and rotate multiple proxies while crawling
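For this you will need to install scrapy-rotating-proxies, the package that provides the rotating_proxies middlewares used in this section. Installation is straightforward:
pip install scrapy-rotating-proxies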
This package handles the proxy rotation for you, in addition to retries, so you don’t need to worry about those details.
Get a list of proxies and save them in a text file, one proxy per line
These can be obtained from any proxy service provider. Save the list of proxies in a text file with the template:
https://username:password@IPADDRESS:PORT
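Your file would look like this:
https://username:password@123.456.789.101:8893
https://username:password@123.456.789.101:8894
https://username:password@123.456.789.101:8895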
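Before crawling, it can help to check that every line in the file follows the template. The sketch below is one possible check, not something the rotation library requires. The function name check_proxy_lines and the sample lines are hypothetical, and it is stricter than necessary because it only accepts lines with an http or https scheme, a host, and a numeric port.

```python
from urllib.parse import urlsplit


def check_proxy_lines(lines):
    """Split an iterable of lines into (valid, invalid) proxy URLs."""
    valid, invalid = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        try:
            parts = urlsplit(line)
            ok = bool(
                parts.scheme in ("http", "https")
                and parts.hostname
                and parts.port  # raises ValueError if the port is not numeric
            )
        except ValueError:
            ok = False
        (valid if ok else invalid).append(line)
    return valid, invalid


# Hypothetical sample standing in for the contents of proxies.txt
sample_lines = [
    "https://username:password@10.0.0.1:8893\n",
    "\n",
    "https://username:password@IPADDRESS:PORT\n",  # template left unfilled
    "10.0.0.2:8894\n",  # no scheme
]
valid, invalid = check_proxy_lines(sample_lines)
print(len(valid), "valid,", len(invalid), "invalid")
```

To check a real file, pass the open file object instead of the sample list, for example with open("proxies.txt") as f: check_proxy_lines(f).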
Set a few custom_settings in the crawl function
We will be using these custom settings: DOWNLOADER_MIDDLEWARES and ROTATING_PROXY_LIST_PATH.
Set them through custom_settings in the crawl function, and you’re done:
adv.crawl(
    "https://example.com",
    "output.jsonl",
    follow_links=True,
    custom_settings={
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
        "ROTATING_PROXY_LIST_PATH": "proxies.txt",  # or the full path to where you stored the file
    },
)
Reading the crawl file and seeing the proxies
import pandas as pd

crawldf = pd.read_json("output.jsonl", lines=True)
crawldf.filter(regex="proxy").head()  # get columns that contain "proxy"
.. | proxy | _rotating_proxy | request_headers_proxy-authorization | proxy_retry_times |
---|---|---|---|---|
0 | https://123.456.789.101:8893 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan |
1 | https://123.456.789.101:8894 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan |
2 | https://123.456.789.101:8895 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan |
3 | https://123.456.789.101:8896 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan |
4 | https://123.456.789.101:8897 | 1 | Basic b3VzY214dHg6ODlld29rMGRsdfgt | nan |
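To see how evenly the requests were spread across proxies, you can count the rows per proxy value. The sketch below does this with the standard library only, reading JSON-lines records like those in the crawl file. The function name count_requests_per_proxy and the sample records are hypothetical, and it assumes each record stores the proxy under the "proxy" key, as in the table above.

```python
import json
from collections import Counter


def count_requests_per_proxy(jsonl_lines):
    """Count crawled pages per proxy from an iterable of JSON-lines strings."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Rows with no "proxy" key are grouped under None
        counts[record.get("proxy")] += 1
    return counts


# Hypothetical records standing in for lines of output.jsonl
sample = [
    '{"url": "https://example.com/a", "proxy": "https://123.456.789.101:8893"}',
    '{"url": "https://example.com/b", "proxy": "https://123.456.789.101:8894"}',
    '{"url": "https://example.com/c", "proxy": "https://123.456.789.101:8893"}',
    '{"url": "https://example.com/d"}',
]
counts = count_requests_per_proxy(sample)
for proxy, n in counts.most_common():
    print(proxy, n)
```

With the real file, pass the open file object: with open("output.jsonl") as f: count_requests_per_proxy(f). If the crawl is already loaded in pandas, crawldf["proxy"].value_counts() gives the same kind of summary.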
There are a few other settings that you might want to check out, which are available at the library’s documentation page.
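As an illustration of those other settings, here is a hedged sketch that builds the custom_settings dictionary from a Python list instead of a file. It assumes the rotating_proxies package reads ROTATING_PROXY_LIST (proxies given inline) and ROTATING_PROXY_PAGE_RETRY_TIMES (how many proxies to try for a page before giving up). Please confirm both names and their defaults against the library's documentation. The helper name rotating_proxy_settings is hypothetical.

```python
def rotating_proxy_settings(proxies, page_retry_times=5):
    """Build a custom_settings dict for proxy rotation (illustrative sketch)."""
    return {
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
        # Assumed setting: proxies passed inline rather than via a file path
        "ROTATING_PROXY_LIST": list(proxies),
        # Assumed setting: number of different proxies to try per page
        "ROTATING_PROXY_PAGE_RETRY_TIMES": page_retry_times,
    }


settings = rotating_proxy_settings(
    [
        "https://username:password@123.456.789.101:8893",
        "https://username:password@123.456.789.101:8894",
    ],
    page_retry_times=3,
)
print(sorted(settings))
```

The returned dictionary can be passed as custom_settings=settings in the adv.crawl call, in place of the ROTATING_PROXY_LIST_PATH version.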