Green Book project discussion - now with wiki project page

Hi all -

I’m moving the discussion of the Green Book datasets from chat to this forum thread. I originally intended the chat post as a “Hey, check out what I’m playing around with” comment b/c I wasn’t ready for a more thorough post, and look where that approach got us. :cowboy_hat_face:

Originally sent in General
jeffmeyer

FYI - I've just uploaded some data from NYPL for the 1947 Negro Motorist Green Book. It's sort of a test for a business directory import. Forum post to come, but fyi: https://www.openhistoricalmap.org/relation/2806825

FYI - just updated some NYPL data for the 1947 Negro Motorist Green Book… project page and forum post to come, but fyi: https://www.openhistoricalmap.org/relation/2806825

Minh_Nguyen

Btw, the dates are all invalid:

jeffmeyer

Yes, plenty of errors, including that one. Will take it down and figure out what's up…

That is a simple error… the other stuff is more troubling…

Ugh… somehow, every field became offset by one row after the name fields…

Ok - I've improved the data significantly. Still not perfect afaict, but much better. You can see many states are still missing. Up for illustration now & I'll need to qa again later.

1947 Negro Motorist Green Book entries

That 1947 data is highly suspect and is only left up for now as a temporary measure.

The 1941 Green Book data, on the other hand, I feel very good about… :sweat: …and it has much more plausible coverage -->

Can't wait to get some roads to go along with these data points… : )

Minh_Nguyen

Looks like all the POIs got duplicated: node 2118175000 vs. node 2118172141 (Sky Pharmacy).

I spot-checked a few in Cincinnati. Hard to tell where they should all be due to street name changes over the years. That Sky Pharmacy is at least 7 blocks off, but some other POIs are at the right cross street at least.

Your end_date:edtf tags should not include the arbitrary 10-year offset. The range operator you used is good enough. You can also add ? if you want to emphasize the freewheeling nature of this dating.

jeffmeyer

Yeah, not sure how those dupes got in there. Will fix today.

so… end_date=[..1957?]?

The locations were based off a geocoding lookup and many of the addresses are incomplete, so there are bound to be errors, although most should be fairly close.

You know this, but the beauty of this dataset is that it can be modified based on more accurate / updated info, which isn't true of any of the Green Book sites I've looked at.

See: Big Buster Cafe, Fayetteville, NC - updated with information from the Big Buster page on the North Carolina African American Heritage Commission's Green Book site: Oasis Places

Minh_Nguyen

Our geocoder, or something else that only considers modern-day names? If there isn’t a good match, does the geocoder try to fall back somehow or does it give up, and how do you handle it? Apologies if you’re planning to write this all up. It’s the kind of thing we could learn from by discussing an import before it takes place. Not as a requirement in every instance, but at least when we’re about to do something novel.

I do like how I can quickly click on the image in the inspector to confirm the POI that got added. Reminds me of the “proofreading” workflow in Wikisource.

Speaking of which, what if we get this book uploaded to Commons, if it isn’t up there already, and start a proofreading project on Wikisource so that we can link individual entries to OHM?

jeffmeyer

Yeah… this kind of quickly went from a proof-of-concept test to more of a project. I'm going to write it up now that I've done a little work. I just used the Google geocoding API through a Google Sheets plugin. Unfortunately, that doesn't provide any additional data about no match, confidence, etc. I feel good that the 1941 data (questionable geocoding aside) is pretty review-worthy. What is the best way of/format for posting to Wikisource?

Minh_Nguyen

Can we find another method to geocode these POIs? This approach brings Google Maps data into OHM, which is not great.

Help:Beginner's guide to adding texts - Wikisource, the free online library summarizes the process. The first step is easy because the LOC’s exhibits are already uploaded to Commons: File:Cover The Negro Motorist Green Book 1947.jpg - Wikimedia Commons


Wikisource is more than an image annotation service: the goal is to digitize the full text of any book, similar to the Internet Archive but with volunteers collaboratively proofreading the text. I’m sure there would be interest in digitizing the Green Book series. This would be a nice way to increase OHM’s visibility outside the historical geography community.

You’re already working with an OCR’d copy of this edition, but in case you need to repeat the process in the future, Wikisource integrates with multiple OCR providers to streamline the proofreading process. If you’re interested, we can create a template for linking an entry in the digitized text to OHM, then come up with a workflow for automatically applying these templates based on the POIs you upload to OHM.


@Minh_Nguyen - what’s your go-to / recommended tool for geocoding? Nominatim? Photon? The mmgis plugin, pointed at Nominatim? Your questions about confidence, fallback rules, etc., are good ones, and it would be nice to include that information in the dataset.

I expect that once I get the project page up, this data will be pulled down, cleaned up a bit and reuploaded.

For other readers, can you add some detail on why having Google-derived data is suboptimal (because it is)?


Ok… the Green Book project page is now live. Lots of areas open for discussion. Please give it a once over and let me know what additional questions you have!


I don’t see swimming pools on that list. I’ve never seen the Green Book and only heard of its existence 2 weeks ago (thank you HBO).


Well, I haven’t found any swimming pool categories & based on a quick search I’ve done, there don’t appear to be any included in any of the directories. But… once we get all this information into OHM, we’ll be able to know for sure!

We should avoid relying on the Google Geocoding API because it draws from Google Maps data. Even if somehow it wouldn’t infringe on their copyright or terms of service, it would undermine OHM’s reputation as a repository of open geodata. Wikipedia largely relied on Google Maps to geotag its articles, and this decision has come to haunt Wikidata many years later. It’s the number one issue that drives antipathy toward Wikidata among OSM contributors. It triggers mappers even when Wikidata’s coordinate data isn’t relevant to a discussion.

To be frank, there are no good solutions. No existing geocoder covers the entire United States and intentionally knows about roads and addresses that no longer exist. I must point out that the absence of a decent historical geocoder is a superb reason to contribute to OHM! :smirk:

Most geocoders will still try very hard to find you a location, no matter how poorly it matches your query. This silent failure is particularly problematic for us, because we wouldn’t even know what to fix. Whole cities may fail to geocode due to renumbered street grids in the intervening years, but the output would appear to be just fine.

That said, if you just need a ballpark estimate with all the caveats around anachronisms, the Census Bureau offers two geocoding services based on public domain TIGER data.
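To make that concrete, here’s a minimal sketch of querying the Census geocoder’s single-address endpoint with Python’s requests library. The endpoint, the benchmark value, and the response field names reflect my understanding of the current public API; the address is just a placeholder, so double-check the parameters against the Census documentation before relying on this.

import requests

def census_geocode(address):
    """Return (lon, lat, matched address) for the best match, or None when nothing matches."""
    resp = requests.get(
        "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress",
        params={
            "address": address,
            "benchmark": "Public_AR_Current",  # current TIGER-based benchmark
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    matches = resp.json()["result"]["addressMatches"]
    if not matches:
        return None  # an explicit "no match" beats a silently wrong point
    best = matches[0]
    return best["coordinates"]["x"], best["coordinates"]["y"], best["matchedAddress"]

# Placeholder address, not an actual Green Book entry.
print(census_geocode("123 Main St, Cincinnati, OH"))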

It’s also possible to set up a local TIGER-based geocoding service in PostGIS.
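For the curious, here’s a rough sketch of what a lookup looks like once the postgis_tiger_geocoder extension and the TIGER data for the relevant states are loaded. The connection string and address below are placeholders; geocode() returns candidate matches with a rating, where 0 is an exact match.

import psycopg2

conn = psycopg2.connect("dbname=tiger_geocoder")  # placeholder connection string
with conn, conn.cursor() as cur:
    # geocode() normalizes the address and returns candidates with a point geometry.
    cur.execute(
        """
        SELECT pprint_addy(addy), ST_X(geomout), ST_Y(geomout), rating
        FROM geocode(%s, 1)
        """,
        ("123 Main St, Cincinnati, OH",),  # placeholder address
    )
    for normalized, lon, lat, rating in cur.fetchall():
        print(normalized, lon, lat, rating)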

Yeah, yeah, TIGER and all, but it seems like the potential error in this dataset is greater than the usual TIGER hallucinations anyway. TIGER could actually outperform OSM-based geocoders here. Looking around, many of the addresses are given only as cross streets, and Nominatim doesn’t support cross-street lookups. Even if the POI still exists and has been mapped in OSM with a cross street as its address, Nominatim won’t find it.[1]

Outdated streets and street names are normally a bad thing, but in this case they’re probably exactly what we need. Many of the POIs featured in these guides were located in predominantly Black neighborhoods, many of which were demolished in later years for freeway construction and urban renewal projects. TIGER has gone through multiple rounds of quality assurance in recent years, but if you load in one of the earliest editions of TIGER, you might still find these streets undisturbed.[2]


  1. Nominatim’s developer has suggested an alternative tagging scheme that so far hasn’t taken off. ↩︎

  2. I’ve proposed comparing annual TIGER datasets in order to come up with mappable streets. ↩︎


How do we separate metadata about the entity from metadata about the source? The Green Books often only included a partial name, leaving off the descriptive parts of names. “The Saint James Hotel” would be listed as “The Saint James” under the category “Hotels.”

This happens a lot and isn’t specific to these directories. For example, most of our railway stations lack “station” at the end, even though one might append that word in a non-map context.

I don’t think we should automatically add “Hotel” or “Cafe” to the POI name, because there are other possibilities, such as “Motel” or “Restaurant”. We would need to go back and research the names on a case-by-case basis. In the meantime, the abridged name doesn’t seem problematic to me, especially since we’d be tagging the type for a renderer or geocoder to present to the user alongside the name.

If, during conflation, you encounter contradictory names for the same POI and are sure the POI wasn’t renamed in reality, then use alt_name=* along with alt_name:source=*. iD doesn’t have an alt_name=* field yet, but it’ll come someday.
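A hypothetical example, assuming one source gives the abridged name and another the full one:

name=The Saint James
alt_name=The Saint James Hotel
alt_name:source=<the source that uses the longer form>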

Each entry might have 8 or more source tags. How many is enough?

How do we identify every map or every source where an entity might be depicted? Do we?

We could, but overcitation seems to be frowned upon in some circles. I think at least we’d tag the earliest mention as start_date:source=* and the latest mention as end_date:source=*. The better way to track this sort of information is in a repository of source texts, such as Wikisource or Wikimedia Commons, both of which have annotation functionality built in. Besides, manually annotating every label on a map is already an outdated practice except as a special case, now that we have access to scalable technologies such as the David Rumsey Map Collection’s Text on Maps search engine.
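For instance, an entry whose earliest mention we’ve found is the 1941 edition and whose latest is the 1947 edition (years here are purely illustrative) might carry:

start_date:source=The Negro Motorist Green Book, 1941 edition
end_date:source=The Negro Motorist Green Book, 1947 edition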

But! Because entries may be listed in more than 1 year, each entry might reasonably be tagged as:

start_date=<first directory year for that entry>
start_date:edtf=[..<first directory year for that entry>]
end_date=<last directory year for that entry>+1
end_date:edtf=[<last directory year for that entry>..]
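
So, to make that concrete with illustrative years, an entry first listed in the 1941 edition and last listed in the 1947 edition would come out as:

start_date=1941
start_date:edtf=[..1941]
end_date=1948
end_date:edtf=[1947..]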

NOTE: The offset of “+1” for the end_date tagging is purely a rendering convention.

We shouldn’t fudge dates arbitrarily like this. If start_date=* and end_date=* are identical at year precision, the renderer will already show the feature for a whole year; there’s no need to extend it to two years by incrementing the end_date=*.
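In other words, a hypothetical entry attested only in the 1947 edition needs no more than this to stay on the map through the end of 1947:

start_date=1947
end_date=1947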
