The Library of Congress acquires, preserves, and provides enduring access to fixed datasets selected by subject experts. Datasets provide material for the emergent data science community to build upon, and the Library strives to cultivate a broad collection that is useful to researchers interested in a variety of topics, including open citizen science, machine learning, digital humanities, and government. Alongside more traditional content, the Library prioritizes preserving datasets that qualify as at-risk born-digital content. Consult our Selected Datasets Collection for information about datasets in the Library's collection.
This guide provides information about the collection of datasets at the Library of Congress, suggests tools for researchers, considers how datasets can be used for research, and provides guidance for locating datasets that may be sources for data science and machine learning projects. It is not intended to be comprehensive; rather, the goal of this guide is to provide credible starting points.
Consult Digital Scholarship at the Library of Congress: A Resource Guide for more information on ways to access digital materials for scholarship, classroom instruction, the exploration of hobbies or passions, and more.
The Signal is a collaborative blog of the Digital Strategy Directorate and the Digital Content Management Section at the Library of Congress, initiated in 2011 to share information about digital preservation efforts. It contains a number of posts focused on datasets. The Teaching with the Library of Congress blog also has articles on datasets.