Chronicling America: A Guide for Researchers

Improved Machine-Readable Text for Newspapers

Background 

Chronicling America provides access to historic newspapers digitized under the National Digital Newspaper Program (NDNP). Sponsored by the Library of Congress and the National Endowment for the Humanities (NEH), the NDNP began in 2005 and continues to this day. In anticipation of the NDNP’s 20th year, the Library launched an effort to make the digitized newspaper data more accessible to users by re-processing select newspaper content digitized prior to 2012 to improve its machine-readable text.

Machine-readable text is created by a technology called Optical Character Recognition (OCR). Using the Tesseract Open Source OCR Engine and custom post-processing scripts, the Library created this new OCR pipeline specifically for NDNP data. More information about the technologies and processes used in this OCR reprocessing effort is coming to this page soon.

For questions, please contact [email protected].  

 

What is OCR? 

OCR is an automated process that converts the visual image of text into machine-readable text. Computer software can then search the OCR-generated text for words, phrases, numbers, or other characters. Although errors in the process are unavoidable, OCR is still a powerful tool for making text-based items searchable. For example, important concept words often appear more than once within an article. If OCR misreads one instance of a keyword in a passage but correctly reads a second instance, a full-text search will still find the passage. OCR technology has advanced significantly since the NDNP began, and those advances prompted this reprocessing initiative.
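The keyword redundancy described above can be illustrated with a minimal Python sketch. The sample passage, the misreading "Tnbune", and the function name `keyword_hits` are hypothetical and chosen only for illustration; this is not how Chronicling America's search is implemented.

```python
import re


def keyword_hits(ocr_text: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences of keyword in OCR text."""
    words = re.findall(r"[a-z0-9]+", ocr_text.lower())
    return words.count(keyword.lower())


# Hypothetical OCR output: the first "Tribune" was misread as "Tnbune",
# while the second instance was read correctly.
ocr_passage = "The Tnbune reported yesterday that the Tribune office would close."

# One correct instance is enough for the passage to match a search.
passage_is_found = keyword_hits(ocr_passage, "tribune") > 0

# A passage in which every instance was misread would be missed.
missed_passage = "The Tnbune reported yesterday."
missed_is_found = keyword_hits(missed_passage, "tribune") > 0
```

Here `passage_is_found` is true because one of the two instances survived OCR, while `missed_is_found` is false because the only instance was misread. This is why improving OCR accuracy matters most for words that appear only once in a passage.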

 

NDNP-Open-OCR

NDNP-Open-OCR is an open-source project developed by the Library of Congress for re-processing OCR of NDNP data. More information is coming soon.

 

Reprocessed Batches List 

Newspapers are added to Chronicling America in the form of batches. See Recent Additions to Chronicling America for more information. The following batches have been re-processed to improve the machine-readable, searchable text and are now available on Chronicling America.

| Date Reprocessed Batch Added | Contributor | Batch Name | Page Count | Content on Batch |
| --- | --- | --- | --- | --- |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_alice_ver02 | 5276 | New-York Daily Tribune (sn83030213) 1860-1861 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_basic_ver02 | 5177 | New-York Daily Tribune (sn83030213) 1862-1863 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_cobol_ver02 | 5166 | New-York Daily Tribune (sn83030213) 1864-1865 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_delphi_ver03 | 5035 | New-York Daily Tribune (sn83030213) 1866; New-York Tribune (sn83030214) 1866-1867 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_euclid_ver04 | 5332 | New-York Tribune (sn83030214) 1866-1869 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_grass_ver02 | 5436 | New-York Tribune (sn83030214) 1872-1873 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_hugo_ver02 | 4861 | New-York Tribune (sn83030214) 1874-1875 |
| 2024-12-09 | LC- Library of Congress, Washington, DC | dlc_inform_ver02 | 5155 | New-York Tribune (sn83030214) 1875-1877 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_java_ver02 | 5286 | New-York Tribune (sn83030214) 1877-1879 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_lisp_ver02 | 930 | New-York Tribune (sn83030214) 1879 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_kite_ver04 | 910 | New-York Tribune (sn83030214) 1877 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_airy_ver02 | 6236 | New-York Daily Tribune (sn83030213) 1845, 1852-1854 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_buttery_ver02 | 4614 | New-York Daily Tribune (sn83030213) 1847-1850 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_crunchy_ver02 | 4340 | New-York Daily Tribune (sn83030213) 1849-1852 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_dry_ver02 | 5278 | New-York Daily Tribune (sn83030213) 1854-1856 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_eggy_ver02 | 5276 | New-York Daily Tribune (sn83030213) 1852, 1856-1858 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_flavory_ver02 | 4182 | New-York Daily Tribune (sn83030213) 1842, 1858-1859 |
| 2025-01-08 | LC- Library of Congress, Washington, DC | dlc_gritty_ver02 | 6395 | New-York Tribune (sn83030212) 1841-1842; New-York Daily Tribune (sn83030213) 1842-1846 |
 

Notes:

  • There may be a lag between the date a batch is accepted and the date it becomes available in Chronicling America.
  • The number of pages may include extra microfilm target images or duplicate images on the batch that are not counted in Chronicling America.
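
Researchers who track reprocessed batches programmatically may want to split a batch name into its parts. The following Python sketch assumes a pattern inferred only from the batch names listed above (a contributor code, a name, and a version number, as in dlc_alice_ver02); it is not an official specification, and `BatchName` and `parse_batch_name` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class BatchName(NamedTuple):
    contributor: str  # e.g. "dlc"
    name: str         # e.g. "alice"
    version: int      # e.g. 2


# Assumed pattern, inferred from the batch names in the list above.
_BATCH_PATTERN = re.compile(
    r"^(?P<contributor>[a-z0-9]+)_(?P<name>[a-z0-9]+)_ver(?P<version>\d+)$"
)


def parse_batch_name(batch_name: str) -> BatchName:
    """Split a batch name such as 'dlc_alice_ver02' into its parts."""
    match = _BATCH_PATTERN.match(batch_name)
    if match is None:
        raise ValueError(f"Unrecognized batch name: {batch_name!r}")
    return BatchName(
        contributor=match["contributor"],
        name=match["name"],
        version=int(match["version"]),
    )


example = parse_batch_name("dlc_euclid_ver04")
```

Names that do not fit the assumed pattern raise a `ValueError` rather than being guessed at, so unexpected formats surface early.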