How to download images in bulk with Python

Python
advertools
crawling
images
automation
Intermediate
A Python script for downloading images in bulk from a given set of URLs. You can also set constraints such as a minimum width and/or height, and match image file names against a regex.

All you need is advertools installed and a list of URLs whose images you want to download.

You can use advertools to obtain website URLs in various ways, such as crawling a website or downloading and parsing an XML sitemap.

Create a list of URLs

start_urls = [
    "https://www.nytimes.com/2024/12/24/business/honda-nissan-auto-merger-deals.html",
    "https://www.nytimes.com/2024/12/20/technology/openai-new-ai-math-science.html"
]

Download images from a list of URLs

You can set a minimum height and/or width for the images you want to download using the min_height and min_width parameters. This is useful either to skip tiny images such as logos and navigational icons, or when you know that the images of interest have a certain height/width combination.

Another option is the include_img_regex parameter, which downloads only the images matching that regex. For example, you might find that the images you want all have “-product-” in their file names, so you can download only those.

import advertools as adv

adv.crawl_images(
    start_urls=start_urls,
    output_dir="nyt_images/",
    min_height=50,
    min_width=50)
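
Before running a crawl with include_img_regex, it can help to try the pattern on a few image URLs first. The following is a minimal sketch using only the standard library. The URLs are made up for illustration, and it assumes the pattern is applied as a search anywhere in the image URL, which may differ from how advertools applies it internally.

```python
import re

# Hypothetical image URLs, made up for illustration only.
sample_img_urls = [
    "https://example.com/img/blue-product-shoe.jpg",
    "https://example.com/img/site-logo.png",
    "https://example.com/img/red-product-bag.webp",
]

# The pattern you would later pass as include_img_regex.
pattern = re.compile(r"-product-")

# Assumption: a match anywhere in the URL counts as a match.
matching = [url for url in sample_img_urls if pattern.search(url)]

for url in matching:
    print(url)
```

If the list of matches looks right, the same pattern string can then be passed to the crawl.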

Previewing the downloaded images

import os

from IPython.display import Image, display

# Display every downloaded file except the .jl summary file
for img in os.listdir('nyt_images/'):
    if img.endswith(".jl"):
        continue
    display(Image(f"nyt_images/{img}"))

We ended up downloading author profile pictures here as well. If this is not what you want, you can set a minimum height/width higher than the 50 pixels used in this example.

How to get downloaded image metadata

In the output_dir that you specified, there is a file called image_summary.jl containing further information about the images.

Columns:

  • image_urls: A list of URLs of the images (the image src attribute)
  • image_location: The URL where those images are located
  • images: Further details about the images (URL, path, checksum, and status)
import pandas as pd
img_summary = pd.read_json('nyt_images/image_summary.jl', lines=True)
img_summary
   image_urls                                         image_location                                     images
0  [https://static01.nyt.com/images/2024/12/23/mu...  https://www.nytimes.com/2024/12/24/business/ho...  [{'url': 'https://static01.nyt.com/images/2024...
1  [https://static01.nyt.com/images/2024/12/20/mu...  https://www.nytimes.com/2024/12/20/technology/...  [{'url': 'https://static01.nyt.com/images/2024...
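
Since a .jl file is JSON lines (one JSON object per line), it can also be read without pandas. The sketch below is one way to do that with the standard library. It writes two made-up records to a temporary file so that it runs without a crawl; the helper name read_jsonlines and all sample values are hypothetical, and only the three column names come from the description above.

```python
import json
import os
import tempfile

# Made-up records that mimic the three columns described above.
sample_lines = [
    {
        "image_urls": ["https://example.com/a.jpg", "https://example.com/b.png"],
        "image_location": "https://example.com/page-1",
        "images": [
            {"url": "https://example.com/a.jpg", "path": "a.jpg",
             "checksum": "made-up-checksum-a", "status": "downloaded"},
            {"url": "https://example.com/b.png", "path": "b.png",
             "checksum": "made-up-checksum-b", "status": "downloaded"},
        ],
    },
    {
        "image_urls": ["https://example.com/c.jpg"],
        "image_location": "https://example.com/page-2",
        "images": [
            {"url": "https://example.com/c.jpg", "path": "c.jpg",
             "checksum": "made-up-checksum-c", "status": "downloaded"},
        ],
    },
]


def read_jsonlines(path):
    """Return one dict per non-empty line of a JSON lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write the sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    summary_path = os.path.join(tmp_dir, "image_summary.jl")
    with open(summary_path, "w", encoding="utf-8") as f:
        for record in sample_lines:
            f.write(json.dumps(record) + "\n")
    records = read_jsonlines(summary_path)

# Print each page URL with the number of images found on it.
for record in records:
    print(record["image_location"], len(record["image_urls"]))
```

With a real crawl, read_jsonlines would be pointed at the image_summary.jl file inside your output_dir instead of the temporary file.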

How to get a mapping of URLs and the images they contain

Because image_urls contains lists of URLs, all we have to do is explode this column.

img_summary[['image_location', 'image_urls']].explode('image_urls')
   image_location                                     image_urls
0  https://www.nytimes.com/2024/12/24/business/ho...  https://static01.nyt.com/images/2024/12/23/mul...
0  https://www.nytimes.com/2024/12/24/business/ho...  https://static01.nyt.com/images/2022/01/20/rea...
1  https://www.nytimes.com/2024/12/20/technology/...  https://static01.nyt.com/images/2024/12/20/mul...
1  https://www.nytimes.com/2024/12/20/technology/...  https://static01.nyt.com/images/2018/11/26/mul...

Note that the index values, like the URLs in the image_location column, are repeated so that each image is represented on its own row.
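
To make the effect of exploding more concrete, here is a plain-Python sketch of the same reshaping. It is not how pandas implements explode; it only illustrates the result, using made-up page and image URLs.

```python
# Made-up rows: one page URL and the list of image URLs found on it.
rows = [
    {
        "image_location": "https://example.com/page-1",
        "image_urls": ["https://example.com/a.jpg", "https://example.com/b.png"],
    },
    {
        "image_location": "https://example.com/page-2",
        "image_urls": ["https://example.com/c.jpg"],
    },
]

# One (index, page URL, image URL) tuple per image. The index and page URL
# repeat for every image on the same page, as in the exploded table.
exploded = [
    (i, row["image_location"], img_url)
    for i, row in enumerate(rows)
    for img_url in row["image_urls"]
]

for item in exploded:
    print(item)
```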

How to extract further image details

We parse the JSON in the images column using the pandas.json_normalize function.

dfs = []
for row in img_summary['images'].values:
    dfs.append(pd.json_normalize(row))
pd.concat(dfs, ignore_index=True)
   url                                                path                                      checksum                          status
0  https://static01.nyt.com/images/2024/12/23/mul...  AUTO-DEALS-04-cbmf-articleLarge.jpg       dcd0fe020e1d86d174a83ea80b64d2fb  downloaded
1  https://static01.nyt.com/images/2022/01/20/rea...  author-neal-e-boudette-thumbLarge-v2.png  9057e2462ec5f0e35380971b03fa60b6  downloaded
2  https://static01.nyt.com/images/2024/12/20/mul...  20openai-bplz-articleLarge.jpg            6a1c0054c126e45188cf2d5bd0433010  downloaded
3  https://static01.nyt.com/images/2018/11/26/mul...  author-cade-metz-thumbLarge.png           3f4e09b6eec6a63cc131a297cc548ce2  downloaded
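
The concatenated table no longer shows which page each image came from. If you need that, one option is to attach image_location to every image record while flattening. The sketch below does this in plain Python with made-up records; the field names follow the output above, and the values are hypothetical.

```python
# Made-up summary rows with the same field names as image_summary.jl.
summary = [
    {
        "image_location": "https://example.com/page-1",
        "images": [
            {"url": "https://example.com/a.jpg", "path": "a.jpg",
             "checksum": "made-up-checksum-a", "status": "downloaded"},
            {"url": "https://example.com/b.png", "path": "b.png",
             "checksum": "made-up-checksum-b", "status": "downloaded"},
        ],
    },
    {
        "image_location": "https://example.com/page-2",
        "images": [
            {"url": "https://example.com/c.jpg", "path": "c.jpg",
             "checksum": "made-up-checksum-c", "status": "downloaded"},
        ],
    },
]

# Flatten to one dict per image, keeping the page URL next to the details.
detailed = [
    {"image_location": row["image_location"], **img}
    for row in summary
    for img in row["images"]
]

for record in detailed:
    print(record["image_location"], record["path"], record["status"])
```

The resulting list of dictionaries can be passed to pandas.DataFrame if you prefer to work with a table.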