
Digital Scholarship at the Library of Congress: A Research Guide

Tutorials - How Do I...?

This section of the guide presents a series of hands-on tutorials for making use of the Library's digital collections. If you have further questions or do not see your question represented here, get in touch with a reference librarian through Ask a Librarian.


One common way to represent data is to create a visualization: a chart, table, graph, map, or a new form of your own. Whatever type of data visualization you're creating, you should ask yourself some basic questions about your goal, audience, and the information you intend to display. The following tips can help you get started. The process has three basic steps:

  1. Get your data in a spreadsheet
  2. Decide on the kind of visualization you’re making
  3. Decide on the tool to use

Step 1: Get your data in a spreadsheet

  • Find the item(s) or collection(s) you’d like to use.
  • Decide on the kinds of information you’d like to include about the items in your spreadsheet. This may include metadata like title, author, place and year of publication, media. It may also include information like past ownership, related works, editions, etc.
  • Build your spreadsheet either by using the API or by manually entering items across collections of interest.
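
If you build the spreadsheet with code rather than by hand, the result can be as simple as one row per item and one column per piece of metadata. The following is a minimal Python sketch, not an official loc.gov client: the field names (title, date, url) and the sample items are assumptions chosen for illustration, so swap in whichever fields you decided on above.

```python
import csv
import io

# Assumed column names; replace them with the metadata you chose to record.
FIELDS = ["title", "date", "url"]


def items_to_csv(items, fields=FIELDS):
    """Return CSV text with one row per item and one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for item in items:
        row = {}
        for field in fields:
            # A missing field becomes an empty cell.
            value = item.get(field, "")
            # A list of values is joined into a single cell.
            if isinstance(value, list):
                value = "; ".join(str(v) for v in value)
            row[field] = value
        writer.writerow(row)
    return buffer.getvalue()


# Invented sample data (placeholder URLs, not real items).
sample_items = [
    {"title": "Sample item A", "date": "1900",
     "url": "https://www.loc.gov/item/example-a/"},
    {"title": "Sample item B",
     "url": "https://www.loc.gov/item/example-b/"},
]

print(items_to_csv(sample_items))
```

Saving the returned text to a file ending in .csv gives you something any spreadsheet program can open.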

Step 2: Decide on the kind of visualization you’re making

Some examples of visualizations include maps, timelines, networks, and frequency charts.

Things to take into consideration when deciding on a visualization:

  • What information does the visual element convey that another medium cannot? Does your visualization most effectively convey this information?
  • Is your visualization easy to grasp quickly?
  • What kinds of trends or patterns are you trying to communicate? (e.g., over time or spread across space, between individual actors or words?)
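
Before committing to a tool, it can help to check whether your data actually shows a pattern. As one hedged illustration, the Python sketch below counts items per decade and prints a rough text-only frequency chart. The list of years is invented sample data, and the function names are hypothetical helpers, not part of any library.

```python
from collections import Counter


def count_by_decade(years):
    """Group publication years into decades and count the items in each."""
    counts = Counter((int(year) // 10) * 10 for year in years)
    return dict(sorted(counts.items()))


def text_bar_chart(counts):
    """Draw one line per decade, with one '#' per item."""
    return "\n".join(
        f"{decade}s {'#' * n} ({n})" for decade, n in counts.items()
    )


# Invented sample data standing in for a "year" column in your spreadsheet.
sample_years = [1895, 1900, 1901, 1907, 1912]

print(text_bar_chart(count_by_decade(sample_years)))
```

If the rough chart already answers your question, a simple frequency chart may be the right visualization; if not, a map, timeline, or network may suit the data better.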

Step 3: Decide on the tool to use

Things to take into consideration when deciding on a tool:

  • Will this tool require you to learn a new programming language or install new software?
  • Is it free to use?
  • Where will your end product live? (e.g., will it live on the program’s website, can you download it, will you own the domain?)
  • Is it easy to share your end product? (By embedding it into a paper or on a website, sharing a link, etc.)

See a list of visualization tools in the Further Reading section of this guide.

We often receive the question: "How do I download all the images belonging to a single item on loc.gov (like the scanned pages of a book) at one time?"

Some items have a button that allows you to download all the images as a combined package, like a PDF. If you don't see that option, try the method described below, which uses a web browser, the command line, a spreadsheet program, and a text editor.

Before you do, please note that you can only use this method for images that are made fully available on loc.gov and not under any copyright restrictions.

And if you're not sure if this method suits your needs, explore the examples included in the fifth tutorial on this page.

Overview

  • Step 1: retrieve image urls and save them in a text file. (Described in this tutorial.)
  • Step 2: download images to your computer using the command line. (Described in the next tutorial "How do I download images once I have their urls?")

If you are downloading 100 or fewer images, you can use this relatively lightweight method for which all you need is:

  • A web browser
  • A spreadsheet software program (such as Microsoft Excel)
  • A text editor (Notepad is built into Windows; TextEdit is built into OS X)
  • A command line (Windows PowerShell, or Terminal on OS X or Linux)

If you are downloading more than 100 images, or if the data you get back from the API is very complicated, you may want to try the Python script laid out in the Jupyter Notebook lesson on the LC for Robots website.

Retrieve and clean the image urls

NOTE: if you already know the urls of the images you’d like to download, you should compile them in a single text file titled “urls.txt” and skip this step!

Retrieve Image URLs on LOC.gov

  • Find the item you want to download
  • Make sure you are in gallery view
    • this ensures you are displaying all the items on the page or else you won’t get all the URLs
  • If all of your images won't display on a single page, then add &c=[NUMBER OF IMAGES IN ITEM] to the end of the URL in your browser's address bar.
    • For example, for an item with 150 images in sequence, add &c=150
    • For an item with 239 images in sequence, add &c=250
screen capture of a grid of scanned pages of a book
Screen capture of the gallery view of "My Mother as I Recall Her" by Rosetta Douglass Sprague, Frederick Douglass Papers at the Library of Congress, Manuscript Division.

Retrieve Image URLs—still on loc.gov but looking at the underlying JSON data

  • Add &fo=json&at=segments to the end of the URL in your browser's address bar
    • Note: if you are operating in Firefox, to view the raw JSON before doing a ctrl+a, you will need to click on "Raw Data."
    • Retain any additional parameters you had included. For example, for an item with 239 images, your url should end in &c=250&fo=json&at=segments
  • You will get something that looks like this:
screen capture of JSON data displayed in a browser
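
The URL changes described above can also be assembled in code. The Python sketch below only builds the address; it does not contact loc.gov. The function name is a hypothetical helper, and the parameters (st=gallery, c, fo=json, at=segments) are the ones named in this tutorial.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def segments_json_url(item_url, image_count):
    """Add the gallery, count, and JSON parameters to an item URL."""
    parts = urlsplit(item_url)
    # Keep any parameters that were already on the URL.
    params = dict(parse_qsl(parts.query))
    params.update({
        "st": "gallery",          # gallery view, so every image is listed
        "c": str(image_count),    # number of images to show on one page
        "fo": "json",             # return the underlying JSON data
        "at": "segments",         # limit the response to the segments
    })
    return urlunsplit(parts._replace(query=urlencode(params)))


print(segments_json_url("https://www.loc.gov/resource/mfd.02007/", 25))
```

Pasting the printed address into a browser should show the same JSON data as adding the parameters by hand.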

Using an online JSON-csv converter

  • Select all (ctrl-a) and then copy (ctrl-c) to copy the JSON data displayed in your browser.
  • Find an online JSON to CSV converter.
  • Paste (ctrl-v) your JSON data into the converter.
  • Download the csv file.
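
If you would rather not paste your data into an online converter, a few lines of Python can pull the image urls straight out of the JSON you copied. This is a sketch under assumptions: it presumes the urls are stored under keys named image_url (as the column names in the next step suggest), and the sample JSON here is invented and far smaller than a real response.

```python
import json


def collect_image_urls(node):
    """Recursively gather every string stored under an "image_url" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "image_url":
                if isinstance(value, str):
                    found.append(value)
                elif isinstance(value, list):
                    found.extend(v for v in value if isinstance(v, str))
            else:
                found.extend(collect_image_urls(value))
    elif isinstance(node, list):
        for child in node:
            found.extend(collect_image_urls(child))
    return found


# Invented sample; in practice, paste or load the JSON from your browser.
sample_json = """
{"segments": [
  {"image_url": ["//tile.loc.gov/example/0001.gif#h=150&w=100",
                 "//tile.loc.gov/example/0001.jpg#h=1024&w=700"]},
  {"image_url": ["//tile.loc.gov/example/0002.jpg#h=1024&w=700"]}
]}
"""

urls = collect_image_urls(json.loads(sample_json))
print("\n".join(urls))
```

The output still needs the same cleaning as the spreadsheet route: choosing one url per image and removing the dimensions after the # sign.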

Clean the URLs in a spreadsheet editor

1. Open the csv file

  • Open the csv file in a spreadsheet viewer and editor; search for “image_url” → these are the image urls you will use to download your images.
    • Note: you may find multiple columns that contain the term “image_url.” Look for the column that contains a series of urls. It often will begin with “segments”. If in doubt, copy and paste the first url in your browser to see if it returns the first image you’d like to download.
  • If you are given the choice between tile and cdn, choose cdn. If you have the choice between jpg and gif, choose jpg.
    • Note: if your image_url begins with //tile then you should make sure you are selecting the largest file size for download
screen capture of a spreadsheet with the column image-urls highlighted in yellow

2. Create a new sheet

  • Copy the column of image urls into a new sheet on your csv file/spreadsheet.
screen capture of a spreadsheet displaying a single column of image urls with their dimensions appended to the end following a pound sign.

3. Clean the image urls

  • Clean dataset:
    • Split the image url into two columns: one for the url (ending in .jpg or .tif) and one for the text that followed the #, which contains the height and width dimensions of the image
      • If operating in Microsoft Excel, one way to do so is navigating to "Data" then "Text to Columns." Select "Delimited," then check "Other" and enter # as the delimiter. This will split the url wherever the program finds "#".
        • Note: Make sure you have highlighted the entire column!
    • If relevant: Add http: to all image urls by finding the start of your url (e.g., //tile) and replacing all with http://tile
screen capture of a spreadsheet with two columns. The left-most columns displays image urls ending in .jpg and the second column displays their dimensions.
  • Now you have clean image URLs! Proceed to the tutorial on "How do I download images once I have their URLs?".
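
The same cleaning can be expressed in a few lines of Python, which may be handy if you have many urls. This is a minimal sketch of the two steps above (split at the # sign, then add http: where the url starts with //); the function name is hypothetical and the sample url is invented.

```python
def clean_image_url(raw_url):
    """Split off the dimensions after '#' and add 'http:' if it is missing.

    Returns a (url, dimensions) pair; dimensions is '' when there is no '#'.
    """
    url, _, dimensions = raw_url.partition("#")
    if url.startswith("//"):
        url = "http:" + url
    return url, dimensions


# Invented sample value in the shape described above.
print(clean_image_url("//tile.loc.gov/example/0001.jpg#h=1024&w=700"))
```

Only the first value of each pair, the clean url, goes into your urls.txt file.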

Download the images to your computer using the urls

This step requires using a command line interface. Two common ways of using the command line are through Windows PowerShell (if operating a PC) and Terminal (if operating a Mac or a Linux machine).

If operating in Windows Powershell:

  • Create a new folder for your images. For best results, create this folder in "downloads."
  • Copy your image urls from your spreadsheet into a text editor (such as Notepad or Notepad++) as plain text and save in the folder you just made as a .txt file
    • Remember: save the txt file into the same folder you created for your images
screen capture of a text editor displaying a list of clean image urls.
  • Open Windows Powershell:
    • Navigate to that folder by using the command:
      • cd [top level folder name]
      • cd [your folder name]
    • Example:
      • cd downloads
      • cd Douglass
  • Write the command to pull all images from your .txt file:
    • In Powershell: gc urls.txt | % {iwr $_ -outf $(split-path $_ -leaf)}
    • Note: Leaf means that each file is named after only the last part of its URL. If several URLs end in the same file name, each download will overwrite the one before it. If you are having this issue, see the documentation under Troubleshooting.

Voila! Your images should all be saved in a single folder on your computer.

screen capture of a folder containing 31 images that are scanned pages from a book by Rosetta Douglass Sprague.

If operating in OS X:

  • Create a new folder for your images
  • Install wget on your machine
  • Copy your image_urls from your spreadsheet into a text editor (such as TextEdit) as plain text and save in the folder you just made as a .txt file
    • Remember: save the txt file in the same folder you created for your images
  • Open the command line
    • Navigate to that folder by using the command:
      • cd [top level folder name]
      • cd [your folder name]
  • Write the command to pull images one at a time:
    • wget "image url"
  • Write the command to pull all images from your .txt file:
    • In Linux or OS X: wget -i file_name.txt (-i is short for --input-file)

Voila! Your images should all be saved in a single folder on your computer.
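
If you have Python installed, the same download can be done on any operating system. The sketch below is an assumption-laden alternative to the commands above, not a replacement for them: it expects a file called urls.txt with one clean url per line, names each file after the last part of its url, and stops before downloading if two urls would produce the same file name. The function names are hypothetical helpers.

```python
import os
import posixpath
import urllib.request
from collections import Counter
from urllib.parse import urlsplit


def leaf_name(url):
    """Return the last part of the url path, e.g. '0001.jpg'."""
    return posixpath.basename(urlsplit(url).path)


def duplicate_leaf_names(urls):
    """List file names that more than one url would be saved under."""
    counts = Counter(leaf_name(url) for url in urls)
    return sorted(name for name, n in counts.items() if n > 1)


def download_all(url_file="urls.txt", folder="."):
    """Download every url listed in url_file into folder."""
    with open(url_file, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    clashes = duplicate_leaf_names(urls)
    if clashes:
        # Stop early rather than silently overwriting files.
        raise ValueError(f"Repeated file names: {clashes}")
    for url in urls:
        urllib.request.urlretrieve(url, os.path.join(folder, leaf_name(url)))
```

Calling download_all() from inside the folder that holds urls.txt starts the download. If it reports repeated file names, the Troubleshooting steps that follow describe one way to give each file its own name.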


Troubleshooting: If you are having trouble because your image URLs all end with “default.jpg” (so every download is saved under the same file name), take the following steps:

  1. Take all your clean http:// image urls ending in a file extension (in this case: .jpg)
  2. Create a new column in your spreadsheet. Title it outfile
  3. Create two additional new columns to the right of outfile
screen capture of a spreadsheet with two columns. the first is titled url and contains image urls. The title of the second is highlighted in yellow and titled outfile.
  4. Populate one column with numbers, starting at 0 and filling one row for each image you are downloading, using Autofill
  5. Populate the other column with your file extension (in this case: jpg)
screen capture of a spreadsheet with the outfile highlighted in yellow. Next to it are two columns populated with the numbers 0-31 and the letters jpg.
  6. Copy and paste the text from these two columns into a text editor
screen capture of a text editor displaying the numbers 0-31 next to the letters jpg.
  7. Use find and replace to replace the whitespace between each number and the file extension with a period (.)
  8. Now you should have data that looks like this:
screen capture of a text editor displaying the numbers 0-31 followed immediately by .jpg
  9. Copy it back into the column titled outfile
  10. Column A should be titled url and column B should be titled outfile
screen capture of a spreadsheet whose far right column contains the numbers 0-31 followed by .jpg
  11. Make sure your file is saved as a .csv and titled urls
  12. Use the following command in PowerShell to download every image url listed in the csv file*:
    $csvData = Import-Csv urls.csv
    foreach($row in $csvData)
    {Invoke-WebRequest -Uri $row.url -OutFile $row.outfile}
    *Press Shift+Enter to move down to a new line, and make sure each of these commands is written on a separate line.

Voila! Your images should all be saved in a single directory on your computer.

screen capture of a folder of scanned images on a computer
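
The url and outfile columns from the troubleshooting steps can also be generated in Python instead of with Autofill and find and replace. This is a sketch under the same assumptions as the steps above: files are numbered from 0 and all share one extension. The function names and sample urls are invented for illustration.

```python
import csv
import io


def numbered_outfile_rows(urls, extension="jpg", start=0):
    """Pair each url with a numbered file name such as '0.jpg'."""
    return [
        (url, f"{number}.{extension}")
        for number, url in enumerate(urls, start=start)
    ]


def rows_to_csv(rows):
    """Return CSV text with the url and outfile columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "outfile"])
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample urls that would otherwise overwrite each other.
sample_urls = [
    "http://tile.loc.gov/example/page-a/default.jpg",
    "http://tile.loc.gov/example/page-b/default.jpg",
]

print(rows_to_csv(numbered_outfile_rows(sample_urls)))
```

Saving the printed text as urls.csv produces a file in the shape the PowerShell command above expects.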

Once you have your images saved as .jpg or .tif files in a folder on your computer, you may wish to combine them into a single file, like a PDF.

  • For 25 or fewer images: consider using web-based JPG to PDF converters.
  • For many images, you may wish to use built-in functions available on PCs and Macintosh computers:
    • Microsoft Print to PDF: Microsoft Windows users can select all images in the directory; right click and select “print”; when prompted, select the “Microsoft Print to PDF” function.
    • Mac Preview to Print: Mac users may right click selected images, select “preview,” arrange them in the proper order, then select “print to pdf.”

Example 1: Abdul Hamid II collection

www.loc.gov/collections/abdul-hamid-ii-books/about-this-collection
323 books and periodicals of varying lengths (40-802 pages)

Rights and access: The contents of the Library of Congress Sultan Abdul-Hamid II Collection are in the public domain and are free to use and reuse. Credit Line: Library of Congress, African and Middle East Division, Sultan Abdul-Hamid II Collection.

To download all 323 books in this collection

Use this Jupyter Notebook available on GitHub

This method works for situations in which you need a batch of images from one collection. However, if your images are from more than one collection, you might end up with files with the same name and end up overwriting them. And either way, these filenames don't tell you anything about the item. You might need to look up further metadata. So, here's an alternative approach that renames the file with the identifier used on the loc.gov website. We'll first re-fetch the image URL for each item and download the file, renaming it using the identifier.

To download and install Jupyter Notebook you need:

  • Python: While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook.
  • The Jupyter Notebook documentation recommends using the Anaconda distribution to install Python and Jupyter. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.

To download all the high quality images of one book at a time

*Note: because you are downloading 802 larger images, the download may take many minutes. For example, this download of 802 images took 30 minutes.


Example 2: My Mother As I Recall Her by Rosetta Douglass Sprague

  • Title: https://www.loc.gov/resource/mfd.02007/?st=gallery 
  • Query: https://www.loc.gov/resource/mfd.02007/?st=gallery&fo=json
  • Rights and Access: The contents of The Frederick Douglass Papers at the Library of Congress are in the public domain and are free to use and reuse. Credit Line: Library of Congress, Manuscript Division, The Frederick Douglass Papers at the Library of Congress
  • 25 pages
  • Followed steps outlined in the tutorial:
    • Used browser to get JSON data
    • Transformed JSON data into a CSV opened in Excel
    • Cleaned image_urls
    • Copied clean image_urls into a .txt file
    • Made a new folder on computer
    • Used the gc urls.txt | % {iwr $_ -outf $(split-path $_ -leaf)} command to get them all into a folder

Example 3: All photographs from Gottlieb collection

  • Title: https://www.loc.gov/collections/jazz-photography-of-william-p-gottlieb/
  • Query: https://www.loc.gov/collections/jazz-photography-of-william-p-gottlieb/?fo=json&c=1000
  • Rights and Access: In accordance with the wishes of William Gottlieb, the photographs in this collection entered into the public domain on February 16, 2010, but rights of privacy and publicity may apply. Privacy and publicity rights protect the interests of the person(s) who may be the subject(s) of the work or intellectual creation. Users of photographs in the Gottlieb collection are responsible for clearing any privacy or publicity rights associated with the use of the images.

    The following items are included in the William P. Gottlieb Collection with permission as noted:

    Articles from Down Beat magazine by William Gottlieb and others, Ed Enright, Editor, Down Beat magazine, 102 North Haven Road, Elmhurst, IL 60126-3370.

    "The Faces of Jazz," by W. Royal Stokes, Civilization, vol. 2, no. 5, September-October 1995, Civilization, Attn. Managing Editor, 666 Pennsylvania Ave. SE, Suite 303, Washington, D.C. 20003. Reproduced by permission. All rights reserved.

    Credit Line: William P. Gottlieb/Ira and Leonore S. Gershwin Fund Collection, Music Division, Library of Congress.

  • 1,607 images
  • Need to get all 1,607 not just 25 at a time
  • Changed the count in the API query to c=1000 to grab them 1000 at a time.
  • Two options:
    • JSON-CSV requires you to pay to download a file that large so you could try using kinbot json-csv to convert the JSON data for all 1,607 images at a time
    • Or you can more easily pull all 1,607 images from this collection using the Jupyter Notebook
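
Because 1,607 images do not fit in a single request of 1,000, the collection has to be fetched in more than one page. The Python sketch below only works out how many requests are needed and what their addresses might look like. It assumes that the sp parameter selects the page number; that parameter is not mentioned in this tutorial, so check it against the API documentation before relying on it.

```python
import math
from urllib.parse import urlencode


def page_urls(collection_url, total_items, per_page=1000):
    """Build one JSON query url per page of results."""
    pages = math.ceil(total_items / per_page)
    return [
        # sp (page number) is an assumed parameter name.
        f"{collection_url}?{urlencode({'fo': 'json', 'c': per_page, 'sp': page})}"
        for page in range(1, pages + 1)
    ]


gottlieb = "https://www.loc.gov/collections/jazz-photography-of-william-p-gottlieb/"
for url in page_urls(gottlieb, 1607):
    print(url)
```

For this collection the sketch produces two addresses, one for each batch of up to 1,000 results.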

Example 4: Woodrow Wilson papers

  • Title: https://www.loc.gov/item/mss4602900001
  • Query: https://www.loc.gov/resource/mss46029.mss46029-001_0019_0792/?st=gallery&c=800&fo=json
  • Rights and Access: The Library of Congress believes that most of the papers in the Woodrow Wilson collection are in the public domain or have no known copyright restrictions. All manuscripts authored by President Wilson himself are in the public domain and are free to use and reuse. Researchers should watch for modern documents (for example, published in the United States less than 95 years ago, or unpublished and the author died less than 70 years ago) that may be copyrighted. Responsibility for making an independent legal assessment of an item and securing any necessary permission ultimately rests with persons desiring to use the item.
  • 774 images
  • Two options:
    • Jupyter Notebook: If you want to download more quickly and/or if you have experience with Python, the Jupyter Notebook may be for you.
    • Browser and command line: Try following the normal steps outlined in the tutorial (make the request in browser, copy the JSON data into a spreadsheet, clean the URLs and then use the command line) but use a JSON to CSV converter that can handle a larger amount of data for free.

If you are searching for a specific term on loc.gov (for example: https://www.loc.gov/photos/?q=Adams+Morgan) and would like to download the first image for each result of this search, use the following instructions. They closely mirror the instructions in "How do I retrieve the urls for images corresponding to an item on LOC.gov?" with several small differences.

Step 1: Retrieve image urls

  • When constructing your search URL, add "&at=results&fo=json" to the end (instead of just "&fo=json").
  • After you convert the JSON to CSV and download the CSV, it may look like you have a lot of empty rows, but fear not! When you open the spreadsheet in Excel, the first thing you should do is sort it by "image_url_001". This will move the irrelevant rows to the bottom.
  • There will likely be several image_url columns, e.g., "image_url_001", "image_url_002", etc. These may be offering different types of file formats (e.g., gif vs jpeg), and different sizes. Simply copy and paste the URLs into your browser to see which column you would like.
    • Note: if you are looking for PDF documents (rather than images), links to PDF documents will not be found in the image_url_ columns; instead, they'll be found in the "resources__pdf" column.

Please note:

  • TIFF files and certain other formats will not be available for download this way.
  • If any search results have multiple pages, you'll only be able to download the first page using this method.
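
Choosing one image_url column per search result, as described in Step 1, can be sketched in Python as well. The code assumes the JSON has a results list whose entries each hold a list under image_url, which is how columns such as "image_url_001" and "image_url_002" would arise. The function name and sample data are invented for illustration.

```python
def pick_image_urls(search_json, position=0):
    """Take the url at one position from each result that has enough urls.

    position 0 corresponds to the first image_url column, 1 to the second,
    and so on. Different positions may hold different sizes or formats.
    """
    urls = []
    for result in search_json.get("results", []):
        images = result.get("image_url") or []
        if len(images) > position:
            urls.append(images[position])
    return urls


# Invented sample shaped like a small search response.
sample_response = {
    "results": [
        {"title": "A", "image_url": ["//tile.loc.gov/example/a_150px.jpg",
                                     "//tile.loc.gov/example/a.gif"]},
        {"title": "B", "image_url": []},   # a result with no image
        {"title": "C"},                    # a result without the key
        {"title": "D", "image_url": ["//tile.loc.gov/example/d_150px.jpg"]},
    ]
}

print(pick_image_urls(sample_response))
```

Results without an image are skipped, which plays the same role as sorting the spreadsheet to push empty rows to the bottom.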

Step 2: Download images

  • Once you have your image urls: follow the instructions in "How do I download images once I have their urls?" to download them all at once to your computer.