First of all, I can’t stress enough how big a deal this import will be for OHM in the U.S. I remember chatting with you and other OHM mappers a few years ago, before I got involved with the project, about how these boundaries would be critical for bringing in contributors. Well, I guess I proved myself wrong because I’m here already.
Still, I think this will be for OHM what the TIGER import was for OSM – in a good way.
I have a few suggestions regarding the translation from the dataset’s attributes to OHM tags. None of them is necessarily a blocker, but I think we’d want to avoid having to revise all the boundaries multiple times in response to systemic feedback, if we can help it. Those revisions will only get harder over time, as mappers manually fix factual errors in the imported data. For example, I hope to redraw all the Ohio counties along Lake Erie to include the lake, which the Newberry dataset intentionally ignores.
The Newberry dataset’s CHANGE field has been translated to start_date:cause. I think this would be a good opportunity to migrate to start_event. The current popularity of start_date:cause doesn’t really matter here: given the sheer scale of this import, whichever key it uses will become the more common one.
source:cite strikes me as odd considering the conversation we’re having about source tagging. I agree that we should include the original citation, but could we make that a source:2 instead?
Along the same lines, I’d encourage you to establish a newberry_ahcb:* namespace for anything dataset-specific that you want to “carry along” in OHM, rather than coining single-use keys under source:* that may conflict with unrelated citations in the future. This is a good practice in general that avoids the need for linguistic gymnastics and mitigates the risk of skunking a tag.
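To make the retagging suggestions above concrete, here is a rough sketch of how the renames could be applied to each feature’s tags before upload. The function name, the `dataset_specific_keys` parameter, and the `source:example` key are all made up for illustration; I don’t know how the actual import script is structured, and the target keys are only the ones proposed in this post.

```python
# Renames proposed in this post; these are suggestions, not settled OHM conventions.
KEY_RENAMES = {
    "start_date:cause": "start_event",
    "source:cite": "source:2",
}

# Proposed namespace for anything that only makes sense for this dataset.
NAMESPACE = "newberry_ahcb"


def retag(tags, dataset_specific_keys=()):
    """Return a copy of `tags` with the proposed key changes applied.

    `dataset_specific_keys` lists the keys the importer considers
    dataset-specific; which keys belong there is the importer's call.
    """
    out = {}
    for key, value in tags.items():
        if key in KEY_RENAMES:
            new_key = KEY_RENAMES[key]
        elif key in dataset_specific_keys:
            # e.g. "source:example" becomes "newberry_ahcb:example"
            suffix = key.split(":", 1)[-1]
            new_key = f"{NAMESPACE}:{suffix}"
        else:
            new_key = key
        if new_key in out:
            raise ValueError(f"two keys would both map to {new_key!r}")
        out[new_key] = value
    return out


# "source:example" is a hypothetical single-use key, not one from the import.
before = {
    "name": "Knox County",
    "start_date:cause": "Created",
    "source:cite": "Original citation text",
    "source:example": "123",
}
after = retag(before, dataset_specific_keys={"source:example"})
```

Raising on a collision rather than silently overwriting seems safer at this scale, since a clash would mean the mapping itself needs another look.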
The license=CC0 / public domain tag is fine for now (modulo the unresolved issue of how to spell the key), but I’d encourage us to consider supplementing these human-readable strings with a machine-readable SPDX license identifier to facilitate safe reuse of OHM data.
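As a sketch of what that supplement might look like: the SPDX identifier for CC0 is `CC0-1.0`. The `license:spdx` key below is purely a placeholder I made up, since the spelling of the license key itself is still unresolved, and the lookup table only covers the strings I’d guess this import uses.

```python
# Human-readable license strings (lowercased) mapped to SPDX identifiers.
# Only CC0 is covered; a bare "public domain" has no SPDX identifier of its own.
SPDX_BY_LICENSE_STRING = {
    "cc0": "CC0-1.0",
    "cc0 / public domain": "CC0-1.0",
}


def add_spdx(tags, spdx_key="license:spdx"):
    """Return a copy of `tags` with an SPDX identifier added when one is known.

    `spdx_key` is a hypothetical key name, not an established OHM key.
    """
    out = dict(tags)
    raw = tags.get("license", "").strip().lower()
    spdx = SPDX_BY_LICENSE_STRING.get(raw)
    if spdx is not None:
        out[spdx_key] = spdx
    return out


tagged = add_spdx({"license": "CC0 / public domain"})
untouched = add_spdx({"license": "some other terms"})
```

Keeping the human-readable string and adding the identifier alongside it means nothing already tagged has to change.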
I appreciate the note about creating chronology relations out of these boundaries. Once they’re created, we should add OpenHistoricalMap relation ID (P8424) statements to the corresponding Wikidata items. Conversely, the wikipedia and wikidata tags should be moved from the individual boundary relations to the higher-order chronology relations.
Do I understand correctly that there’s only one place node per county, even if the county’s centroid has shifted over time due to changing boundaries? This will result in very unexpected labeling in some cases. For example, Hamilton County’s averaged centroid wound up in present-day Montgomery County, two counties to the north. Similarly, Knox County, Indiana, is all the way over in Putnam County. This will tend to affect any of a state’s original counties, from which smaller counties were carved out.
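Here is a toy illustration of why an averaged centroid drifts outside the later boundary. It assumes simple, non-self-intersecting rings in planar coordinates, which real county multipolygons in longitude and latitude are not, and the two squares are made-up shapes rather than real counties.

```python
def ring_centroid(ring):
    """Centroid of a simple polygon ring, using the shoelace formula.

    `ring` is a list of (x, y) vertices in order, without repeating the first.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate ring with zero area")
    # The area is twice_area / 2, so the 1/(6 * area) factor is 1/(3 * twice_area).
    return (cx / (3 * twice_area), cy / (3 * twice_area))


# An original county (large square) and what is left of it after
# smaller counties were carved out (small square in one corner).
original = [(0, 0), (10, 0), (10, 10), (0, 10)]
remnant = [(0, 0), (2, 0), (2, 2), (0, 2)]

per_version = [ring_centroid(original), ring_centroid(remnant)]
averaged = (
    sum(x for x, _ in per_version) / len(per_version),
    sum(y for _, y in per_version) / len(per_version),
)
```

The per-version centroids sit inside their own squares, at (5, 5) and (1, 1), but the average lands at (3, 3), outside the remnant entirely. That is the same effect as Hamilton County’s node ending up in Montgomery County.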
I’m glad I didn’t have to deal with this issue in the San José boundary import, since the place node for that city could be located at a fixed spot with real significance rather than an arbitrary centroid. We did end up with multiple place nodes due to name changes, though not one for each iteration of the boundary. I don’t have a good short-term solution for the county place nodes. We should prioritize synthesizing centroids automatically when generating tiles and deprecate and delete these manual centroids as soon as possible after the import. (Needless to say, I find the label and label_id keys to be superfluous.)
Just as the place nodes’ name tags omit the years, I think the boundary relations’ name tags should also omit the years. Software should be able to append the dates when appropriate, according to the user’s preferred date format (which is not necessarily equivalent to the English format, even for a CE year range). I realize this will cause some temporary inconvenience when perusing Nominatim results, but conversely, the current tagging results in clutter in iD. Given the time it would take to deploy a fix to ohm-website’s Nominatim integration code, this could be a follow-up edit.
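For what it’s worth, appending the dates at display time doesn’t need to be complicated. This is only a sketch of the idea, not how iD or ohm-website actually builds labels: the function names and the template parameter are invented, the example tags are made up, and it assumes plain CE dates in YYYY, YYYY-MM, or YYYY-MM-DD form.

```python
def year_of(date_string):
    """Return the year part of a CE date like "1803", "1803-03" or "1803-03-01"."""
    return date_string.split("-", 1)[0]


def display_name(tags, template="{name} ({start}-{end})"):
    """Build a label from a year-free name tag plus start_date and end_date.

    `template` stands in for the user's preferred format; a real client
    would pick it based on locale rather than take it as an argument.
    """
    name = tags["name"]
    start = tags.get("start_date")
    end = tags.get("end_date")
    if not start and not end:
        return name
    return template.format(
        name=name,
        start=year_of(start) if start else "",
        end=year_of(end) if end else "",
    )


# Made-up feature; the dates are illustrative, not taken from the dataset.
tags = {"name": "Example County", "start_date": "1800-01-01", "end_date": "1810-06-30"}

english = display_name(tags)
other = display_name(tags, template="{start}/{end} {name}")
open_ended = display_name({"name": "Example County", "start_date": "1810-07-01"})
```

The point is that the name tag stays clean, and each consumer decides how, or whether, to show the range.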
Finally, what’s the vintage of the newest data in the Newberry dataset? My understanding is that the dataset is no longer actively maintained as far as present-day changes are concerned, but there have been a handful of significant changes in recent years, such as the renaming of Shannon County to Oglala Lakota County in 2015. If we know the dataset’s vintage, that’ll make it easier for us to coordinate a more manual update of the counties up to the present.