How to download images in bulk with Python

Python

advertools

crawling

images

automation

Intermediate

A Python script for downloading images in bulk from a set of given URLs. You can also set constraints such as a minimum width and/or height, and download only the images whose file names match a regex.

All you need here is advertools installed and a set of URLs whose images you want to download.

You can use advertools to obtain website URLs in various ways like crawling a website or downloading and parsing an XML sitemap.
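To make the sitemap route concrete, here is a minimal sketch of what parsing an XML sitemap produces. It deliberately uses only the standard library on an inline, hypothetical sitemap so that it runs offline; it is not the advertools implementation, and the `example.com` URLs, `sitemap_locs`, and `sample_start_urls` are illustrative names, not part of the original.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/articles/first-post.html</loc></url>
  <url><loc>https://example.com/articles/second-post.html</loc></url>
  <url><loc>https://example.com/about.html</loc></url>
</urlset>
"""

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


# Keep only the article pages as candidate start URLs.
sample_start_urls = [
    url for url in sitemap_locs(SAMPLE_SITEMAP) if "/articles/" in url
]
print(sample_start_urls)
```

Whichever way the URLs are obtained, the result is a plain list of page URLs that can be passed on as `start_urls`.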

Create a list of URLs

start_urls = [
    "https://www.nytimes.com/2024/12/24/business/honda-nissan-auto-merger-deals.html",
    "https://www.nytimes.com/2024/12/20/technology/openai-new-ai-math-science.html"
]

Download images from a list of URLs

You can set a minimum height and/or width for the images that you want to download using the min_height and min_width parameters. This is useful either to skip tiny images such as logos and navigational icons, or when you know that the images of interest have a certain height/width combination.

Another option is the include_img_regex parameter, which restricts the download to images matching that regex. For example, you might learn that the images you want all have “-product-” in them, so you can get only those.
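Before crawling, it can help to preview which images a pattern would select. The sketch below uses only the standard library; the image URLs are hypothetical, and it assumes the pattern is searched anywhere in the string, which may differ from how advertools applies it internally.

```python
import re

# Hypothetical image src URLs, for illustration only.
candidate_img_urls = [
    "https://example.com/img/blue-product-shoe.jpg",
    "https://example.com/img/site-logo.png",
    "https://example.com/img/red-product-bag.jpg",
    "https://example.com/img/icon-menu.svg",
]

# Assumption: the pattern may match anywhere in the URL.
pattern = re.compile(r"-product-")

matching = [url for url in candidate_img_urls if pattern.search(url)]
for url in matching:
    print(url)
```

Once the pattern selects what you expect, the same string can be passed as include_img_regex in the crawl_images call.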

import advertools as adv

adv.crawl_images(
    start_urls=start_urls,
    output_dir="nyt_images/",
    min_height=50,
    min_width=50)

Previewing the downloaded images

import os
from IPython.display import Image, display

for img in os.listdir('nyt_images/'):
    # Skip the JSON Lines summary file; everything else is an image.
    if img.endswith(".jl"):
        continue
    display(Image(f"nyt_images/{img}"))

We ended up downloading author profile pictures here. If this is not what you want, you can set a higher minimum height/width than the 50 pixels we used in this process.

How to get downloaded image metadata

In the output_dir that you specified, there is a file called image_summary.jl containing further information about the images.

Columns:

image_urls: A list of URLs of the images (the image src attribute)
image_location: The URL where those images are located
images: Further details about the images (URL, path, checksum, and status)
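The .jl extension stands for JSON Lines: one JSON object per line. As a rough sketch of that structure, the snippet below writes two hypothetical records with the three fields listed above to a temporary file and reads them back with the standard library. The record values and the `read_json_lines` helper are illustrative, not real crawl output.

```python
import json
import os
import tempfile

# Hypothetical records mimicking the three columns described above.
sample_records = [
    {
        "image_urls": ["https://example.com/img/a.jpg"],
        "image_location": "https://example.com/page-1.html",
        "images": [{"url": "https://example.com/img/a.jpg", "path": "a.jpg",
                    "checksum": "placeholder-1", "status": "downloaded"}],
    },
    {
        "image_urls": ["https://example.com/img/b.png"],
        "image_location": "https://example.com/page-2.html",
        "images": [{"url": "https://example.com/img/b.png", "path": "b.png",
                    "checksum": "placeholder-2", "status": "downloaded"}],
    },
]


def read_json_lines(path):
    """Parse a JSON Lines file into a list of dictionaries."""
    parsed = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                parsed.append(json.loads(line))
    return parsed


with tempfile.TemporaryDirectory() as tmp_dir:
    jl_path = os.path.join(tmp_dir, "sample_summary.jl")
    # Write one JSON object per line.
    with open(jl_path, "w", encoding="utf-8") as f:
        for record in sample_records:
            f.write(json.dumps(record) + "\n")
    records = read_json_lines(jl_path)

for record in records:
    print(record["image_location"], len(record["image_urls"]))
```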

import pandas as pd
img_summary = pd.read_json('nyt_images/image_summary.jl', lines=True)
img_summary

	image_urls	image_location	images
0	[https://static01.nyt.com/images/2024/12/23/mu...	https://www.nytimes.com/2024/12/24/business/ho...	[{'url': 'https://static01.nyt.com/images/2024...
1	[https://static01.nyt.com/images/2024/12/20/mu...	https://www.nytimes.com/2024/12/20/technology/...	[{'url': 'https://static01.nyt.com/images/2024...

How to get a mapping of URLs and the images they contain

Because image_urls contains lists of URLs, all we have to do is explode this column.

img_summary[['image_location', 'image_urls']].explode('image_urls')

	image_location	image_urls
0	https://www.nytimes.com/2024/12/24/business/ho...	https://static01.nyt.com/images/2024/12/23/mul...
0	https://www.nytimes.com/2024/12/24/business/ho...	https://static01.nyt.com/images/2022/01/20/rea...
1	https://www.nytimes.com/2024/12/20/technology/...	https://static01.nyt.com/images/2024/12/20/mul...
1	https://www.nytimes.com/2024/12/20/technology/...	https://static01.nyt.com/images/2018/11/26/mul...

Note that the index values, like the URLs in the image_location column, are duplicated so that each image is represented on its own row.
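If the repeated index labels get in the way, for example when selecting rows by label, explode can renumber the rows for you. This is a small sketch on a hypothetical DataFrame with the same two columns; the `toy_summary` data is made up for illustration.

```python
import pandas as pd

# Hypothetical summary data with the same shape as the two columns above.
toy_summary = pd.DataFrame({
    "image_location": [
        "https://example.com/page-1.html",
        "https://example.com/page-2.html",
    ],
    "image_urls": [
        ["https://example.com/img/a.jpg", "https://example.com/img/b.png"],
        ["https://example.com/img/c.jpg"],
    ],
})

# Default behavior: the original index label repeats for each list element.
exploded = toy_summary.explode("image_urls")
print(exploded.index.tolist())  # [0, 0, 1]

# ignore_index=True gives each image row its own label.
page_image_map = toy_summary.explode("image_urls", ignore_index=True)
print(page_image_map.index.tolist())  # [0, 1, 2]
```

Keeping the repeated labels is handy when you want to trace rows back to the original page; renumbering is simpler when each image should be addressed independently.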

How to extract further image details

We parse the JSON in the images column using the pandas.json_normalize function.

dfs = []
for row in img_summary['images'].values:
    dfs.append(pd.json_normalize(row))

pd.concat(dfs, ignore_index=True)

	url	path	checksum	status
0	https://static01.nyt.com/images/2024/12/23/mul...	AUTO-DEALS-04-cbmf-articleLarge.jpg	dcd0fe020e1d86d174a83ea80b64d2fb	downloaded
1	https://static01.nyt.com/images/2022/01/20/rea...	author-neal-e-boudette-thumbLarge-v2.png	9057e2462ec5f0e35380971b03fa60b6	downloaded
2	https://static01.nyt.com/images/2024/12/20/mul...	20openai-bplz-articleLarge.jpg	6a1c0054c126e45188cf2d5bd0433010	downloaded
3	https://static01.nyt.com/images/2018/11/26/mul...	author-cade-metz-thumbLarge.png	3f4e09b6eec6a63cc131a297cc548ce2	downloaded
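The table above no longer shows which page each image came from. One possible way to keep that link is to explode the images column first and then normalize it, sketched here on hypothetical data. The sketch assumes every page has at least one entry in images; `toy_summary` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical summary data with the same shape as the real columns.
toy_summary = pd.DataFrame({
    "image_location": [
        "https://example.com/page-1.html",
        "https://example.com/page-2.html",
    ],
    "images": [
        [
            {"url": "https://example.com/img/a.jpg", "path": "a.jpg",
             "checksum": "placeholder-1", "status": "downloaded"},
            {"url": "https://example.com/img/b.png", "path": "b.png",
             "checksum": "placeholder-2", "status": "downloaded"},
        ],
        [
            {"url": "https://example.com/img/c.jpg", "path": "c.jpg",
             "checksum": "placeholder-3", "status": "downloaded"},
        ],
    ],
})

# One row per image dictionary, keeping the page URL alongside it.
per_image = toy_summary[["image_location", "images"]].explode(
    "images", ignore_index=True
)

# Expand each dictionary into url/path/checksum/status columns.
details = pd.json_normalize(per_image["images"].tolist())

# Both frames share the same 0..n-1 index, so they line up row by row.
image_details = pd.concat([per_image[["image_location"]], details], axis=1)
print(image_details)
```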