Lifting liberally from an email from LoC today, since I thought others might be interested. cc: @adamcox
- The library has published a handful of web mapping applications, all of which are accompanied by a CSV spreadsheet of the mapped data including coordinates (linked from the info box on each app). The underlying feature layers are also public via ArcGIS Online, but I believe they are not configured to allow the public to export them in alternative formats. Generally, each of these web mapping apps falls into one of three categories:
  - Georeferenced point data for collection items (such as entire directories, rather than addresses within directories), e.g., Chronicling America Newspapers, United States City Directories, 1861-1960, Criss Cross Directories, 1930-2005, and Panoramic Maps.
  - Collection datasets. One is currently available as a webmap, the Climatological Database for the World’s Oceans, 1750-1850, which is particularly popular.
  - Sanborn fire insurance atlases: selected volume polygons, all of which can be reached from the Sanborn Atlas Volume Finder.
- Public geospatial datasets: Externally-produced geospatial datasets that have been collected by the library and are public domain or openly licensed can be found at Available Online, Geospatial Data | Library of Congress.
- The library’s upcoming exploratory data packages, hosted by LC Labs and featured in this Friday’s event that Josh mentioned (https://data.labs.loc.gov/), also include a number of georeferenced library collections. The metadata and coordinates will generally be available as CSV spreadsheets. Brian and the other Labs folks should be able to share more as these data packages are made public!
- Via the loc.gov API, several collections have coordinates, although usually you’ll find that coordinates are available for only a portion of each collection. One of the most popular is the set of U.S. National Park Service surveys of historical buildings, engineering records, and landscapes at About this Collection | Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey | Digital Collections | Library of Congress. The coordinates provided by the API for this collection cover approximately 17% of the collection and were supplied by NPS (note that this data has some errors and has not been reviewed for quality by the Library of Congress). This demonstration Python notebook has information about how to access those coordinates. Interestingly, about 8% of the collection also has coordinates in Wikidata (such as Christmas Valley Air Force Station, which is on loc.gov at Over-the-Horizon Backscatter Radar Network, Christmas Valley Radar Site Transmit Sector Four Antenna Array, On unnamed road west of Lost Forest Road, Christmas Valley, Lake County, OR | Library of Congress).
- Library of Congress’s digitized U.S. Telephone Directory Collection. Thinking generally about OCR text from directories, I wonder what the quality of results might be if you experimented with using current large language models (LLMs) to parse out addresses and names? Our colleague Matt Miller ran a personal experiment and wrote an analysis about the pros and cons of using LLMs to parse unstructured library data, at Using GPT on Library Collections.
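For anyone wanting to try this with OHM, here is a rough sketch (mine, not from the LoC email) of turning records with coordinates into GeoJSON points while skipping the ones that lack usable coordinates. It assumes each record has a combined "lat,lon" text field named `latlong` and a `title` field; those names and the sample rows are made up for illustration, so check the actual CSV headers or API response before relying on them. The real CSVs may well use separate latitude and longitude columns instead.

```python
import csv
import io
import json


def parse_latlong(value):
    """Parse a 'lat,lon' string into (lat, lon) floats, or None if unusable."""
    try:
        lat_text, lon_text = str(value).split(",")
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return None
    # Reject values outside valid latitude/longitude ranges (also rejects NaN).
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


def records_to_geojson(records, title_key="title", latlong_key="latlong"):
    """Build a GeoJSON FeatureCollection from dict-like records.

    title_key and latlong_key are assumed field names, not confirmed ones.
    Records without parseable coordinates are skipped, since only part of
    each collection is expected to have them.
    """
    features = []
    for record in records:
        point = parse_latlong(record.get(latlong_key))
        if point is None:
            continue
        lat, lon = point
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"title": record.get(title_key, "")},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows standing in for a downloaded CSV.
SAMPLE_CSV = '''title,latlong
Example site A,"43.25,-120.5"
Example site B,
Example site C,"not,a number"
'''

collection = records_to_geojson(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(json.dumps(collection, indent=2))
```

The same function should work on a list of dicts parsed from an API response, provided the field names are adjusted to match what the response actually contains.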
Please let me know if you have a particular interest in working with any of these datasets or tools and OHM. I know the LoC Labs team is excited when others leverage their data for cool projects!