I saw that #30 tried to add this, but it relied too heavily on magic field names. This is a more in-depth stab at bringing nicer coordinate matching to csvdedupe. (This also includes the commit from #82, since nothing involving config files will run without it.)
To achieve this, it's necessary both to transform the data we read from the CSV and to make an internal copy of the field definitions that we can bend to match what dedupe expects.
For one field containing both latitude and longitude, the JSON config now supports both `LatLong` and `LongLat` types. They expect a field with the coordinates separated by some (any) non-numeric characters, so formats like `-122.23,46.42`, `-122.23, 46.42`, and `-122.23 46.42` will all work.
Behind the scenes, we split the field to the tuple of floats that dedupe expects. For the `LongLat` type, we reverse it to `LatLong` order and swap the internal type to `LatLong`.
For latitude and longitude in separate fields, we add `Latitude` and `Longitude` convenience types. Internally, these are coalesced into a single `LatLong` field called `__LatLong`.
One side effect of this change is that the training UI will now show this combined field instead of the original field names, but I think that's a fairly minor tradeoff.
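
For reviewers, here is a rough sketch of the two transformations described above. It is an illustration only, not the code in this PR: the function names `parse_coordinates` and `coalesce_lat_long`, the regex, returning `None` for malformed values, and dropping the original columns after coalescing are all assumptions made for the example.

```python
import re

# Hypothetical pattern: a signed decimal number such as -122.23 or 46.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_coordinates(value, field_type="LatLong"):
    """Split a combined coordinate string into a (lat, long) tuple of floats.

    Anything between the two numbers counts as a separator.
    Returns None (an assumption) when exactly two numbers are not found.
    """
    numbers = _NUMBER.findall(value)
    if len(numbers) != 2:
        return None
    first, second = float(numbers[0]), float(numbers[1])
    if field_type == "LongLat":
        # Reverse into LatLong order, as described for the LongLat type.
        return (second, first)
    return (first, second)


def coalesce_lat_long(row, lat_field, long_field, target="__LatLong"):
    """Return a copy of `row` with two separate columns merged into `target`.

    Dropping the original columns is an assumption of this sketch.
    """
    merged = dict(row)
    lat = merged.pop(lat_field, None)
    lng = merged.pop(long_field, None)
    try:
        merged[target] = (float(lat), float(lng))
    except (TypeError, ValueError):
        # Missing or non-numeric values become None.
        merged[target] = None
    return merged


# Example usage with made-up rows.
print(parse_coordinates("46.42, -122.23"))            # (46.42, -122.23)
print(parse_coordinates("-122.23 46.42", "LongLat"))  # (46.42, -122.23)
print(coalesce_lat_long({"name": "a", "lat": "46.42", "lng": "-122.23"}, "lat", "lng"))
```

Matching the numbers rather than splitting on a fixed delimiter is one way to accept any non-numeric separator; the actual implementation may split differently.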