How to download images in bulk with Python
All you need is to have advertools installed, plus a list of URLs whose images you want to download.
You can use advertools to obtain website URLs in various ways, such as crawling a website or downloading and parsing an XML sitemap.
Create a list of URLs
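As a minimal sketch, the list can simply be written by hand. The URLs below are placeholders (an assumption, not pages used in this article), and url_list is an illustrative name.

```python
# Placeholder URLs (assumptions for illustration); replace them with the
# pages whose images you want to download.
url_list = [
    "https://example.com/articles/first-post",
    "https://example.com/articles/second-post",
    "https://example.com/articles/first-post",  # accidental duplicate
]

# Remove duplicates while keeping the original order, so each page is
# requested only once.
url_list = list(dict.fromkeys(url_list))
print(url_list)
```

If the URLs come from an XML sitemap instead, the advertools sitemap_to_df function returns a DataFrame whose loc column can be turned into a list in the same way.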
Download images from a list of URLs
You can set a minimum height and/or width for the images that you want to download using the min_height and min_width parameters. This is useful either to avoid downloading tiny images such as logos or navigational icons, or when you know that the images of interest have a certain height/width combination.
You can also use the include_img_regex parameter to get only the images that match a regular expression. For example, you might learn that the images you want all have “-product-” in them, so you can get only those.
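A minimal sketch of the download step, assuming the advertools crawl_images function accepts start_urls, output_dir, min_width, min_height, and include_img_regex arguments. The wrapper name download_images, the nyt_images default folder, and the 50-pixel defaults are illustrative choices, so check the advertools documentation for the exact signature in your installed version.

```python
def download_images(urls, output_dir="nyt_images", min_width=50,
                    min_height=50, include_img_regex=None):
    """Download the images found on `urls` into `output_dir`.

    Hypothetical convenience wrapper: it only forwards its arguments
    to advertools.
    """
    # Imported inside the function so that defining the wrapper does not
    # require advertools; calling it does (pip install advertools).
    import advertools as adv

    options = {"min_width": min_width, "min_height": min_height}
    if include_img_regex is not None:
        # Only pass the regex when one was given, e.g. r"-product-".
        options["include_img_regex"] = include_img_regex

    adv.crawl_images(start_urls=list(urls), output_dir=output_dir, **options)
```

Calling download_images with a list of page URLs would then skip images smaller than 50 pixels in either dimension, and adding include_img_regex=r"-product-" would restrict the download to matching images.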
Previewing the downloaded images
from IPython.display import Image, display
import os

# Show every downloaded image, skipping the .jl summary file
for img in os.listdir('nyt_images/'):
    if img.endswith(".jl"):
        continue
    display(Image(f"nyt_images/{img}"))
We ended up downloading author profile pictures here. If this is not what you want, you can set a higher minimum height/width than the 50 pixels we used in this process.
How to get downloaded image metadata
In the output_dir that you specified, there is a file called image_summary.jl containing further information about the images.
Columns:
image_urls: A list of URLs of the images (the image src attribute)
image_location: The URL where those images are located
images: Further details about the images (URL, path, checksum, and status)
import pandas as pd
img_summary = pd.read_json('nyt_images/image_summary.jl', lines=True)
img_summary
| | image_urls | image_location | images |
|---|---|---|---|
| 0 | [https://static01.nyt.com/images/2024/12/23/mu... | https://www.nytimes.com/2024/12/24/business/ho... | [{'url': 'https://static01.nyt.com/images/2024... |
| 1 | [https://static01.nyt.com/images/2024/12/20/mu... | https://www.nytimes.com/2024/12/20/technology/... | [{'url': 'https://static01.nyt.com/images/2024... |
How to get a mapping of URLs and the images they contain
Because image_urls contains lists of URLs, all we have to do is explode this column.
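A minimal sketch of that step with the pandas DataFrame.explode method. The small DataFrame below is a stand-in with placeholder URLs (an assumption, not the real summary data), and the names img_summary_sample and page_images are illustrative.

```python
import pandas as pd

# Tiny stand-in for the summary table (placeholder URLs, not real data).
img_summary_sample = pd.DataFrame({
    "image_urls": [
        ["https://example.com/img/a.jpg", "https://example.com/img/b.png"],
        ["https://example.com/img/c.jpg"],
    ],
    "image_location": [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ],
})

# One row per (page, image) pair: explode repeats the index and the
# page URL for every image found on that page.
page_images = img_summary_sample[["image_location", "image_urls"]].explode("image_urls")
print(page_images)
```

With the real data, the same column selection and explode call would be applied to the DataFrame read from image_summary.jl.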
| | image_location | image_urls |
|---|---|---|
| 0 | https://www.nytimes.com/2024/12/24/business/ho... | https://static01.nyt.com/images/2024/12/23/mul... |
| 0 | https://www.nytimes.com/2024/12/24/business/ho... | https://static01.nyt.com/images/2022/01/20/rea... |
| 1 | https://www.nytimes.com/2024/12/20/technology/... | https://static01.nyt.com/images/2024/12/20/mul... |
| 1 | https://www.nytimes.com/2024/12/20/technology/... | https://static01.nyt.com/images/2018/11/26/mul... |
Note that the index values, like the URLs in the image_location column, are duplicated so that each image is represented on its own row.
How to extract further image details
We parse the JSON in the images column using the pandas.json_normalize function.
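A minimal sketch of that parsing step. The Series below stands in for the images column, where each cell holds a list of dictionaries. All values are placeholders (assumptions, not real download results), and the names images_sample, records, and image_details are illustrative.

```python
import pandas as pd

# Stand-in for the `images` column: each cell holds a list of dicts.
# All values are placeholders, not real download results.
images_sample = pd.Series([
    [
        {"url": "https://example.com/img/a.jpg", "path": "a.jpg",
         "checksum": "placeholder-1", "status": "downloaded"},
        {"url": "https://example.com/img/b.png", "path": "b.png",
         "checksum": "placeholder-2", "status": "downloaded"},
    ],
    [
        {"url": "https://example.com/img/c.jpg", "path": "c.jpg",
         "checksum": "placeholder-3", "status": "downloaded"},
    ],
], name="images")

# Flatten the lists to one dict per image, drop pages without images,
# then expand each dict's keys into columns.
records = images_sample.explode().dropna().tolist()
image_details = pd.json_normalize(records)
print(image_details)
```

Converting the exploded values to a plain list before calling json_normalize is a cautious choice here: it gives a fresh 0-based index with one row per image.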
| | url | path | checksum | status |
|---|---|---|---|---|
| 0 | https://static01.nyt.com/images/2024/12/23/mul... | AUTO-DEALS-04-cbmf-articleLarge.jpg | dcd0fe020e1d86d174a83ea80b64d2fb | downloaded |
| 1 | https://static01.nyt.com/images/2022/01/20/rea... | author-neal-e-boudette-thumbLarge-v2.png | 9057e2462ec5f0e35380971b03fa60b6 | downloaded |
| 2 | https://static01.nyt.com/images/2024/12/20/mul... | 20openai-bplz-articleLarge.jpg | 6a1c0054c126e45188cf2d5bd0433010 | downloaded |
| 3 | https://static01.nyt.com/images/2018/11/26/mul... | author-cade-metz-thumbLarge.png | 3f4e09b6eec6a63cc131a297cc548ce2 | downloaded |