Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csvjson: Support types other than Point #867

Closed
wants to merge 1 commit into from

Conversation

jayvdb
Copy link
Contributor

@jayvdb jayvdb commented Jul 24, 2017

Parses and removes 'geojson' from the properties,
putting the coordinate into the GeoJSON feature geometry.

Fixes #866

@jayvdb
Copy link
Contributor Author

jayvdb commented Jul 24, 2017

I'm not super happy about the global, and otherwise happy to revise patch as requested, or split it into several commits.

@coveralls
Copy link

coveralls commented Jul 24, 2017

Coverage Status

Coverage increased (+0.1%) to 87.592% when pulling 6b7f119 on jayvdb:csvjson-shapes into da1f330 on wireservice:master.

@jayvdb jayvdb mentioned this pull request Jul 24, 2017
@jpmckinney
Copy link
Member

jpmckinney commented Jul 24, 2017

Re: global, Python 3 has nonlocal, but we can't use it since we need to support Python 2.

What if you make them class properties (self.min_lon, etc.)?


for i, c in enumerate(row):
if i == lat_column:
if not c:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that if c is None, then float(c) raises TypeError instead of ValueError. Can you make a separate PR that fixes the issue of c being None, by catching TypeError? It's separate from the rest of this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, correct.
However, this only occurs when the type is not Point, and fixing this exception only causes the tool to generate a .geojson which is broken in other ways. Fixing only the exception doesnt make the tool better; in fact, I would say that the exception is a good thing, as it prevents people from believing that their input was correctly parsed.

But, happy to split to separate it to a separate commit - other improvements needed anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nevermind, I confused this with a different comment you made about lat/lon being empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if a '0' is read as an integer 0 instead of the string '0', then we should check if c is None instead.

update_boundary_lon(coords[0])
update_boundary_lat(coords[1])
else:
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in what scenario the coords from the geojson will fail to iterate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt find any. Removing.
(I did a quick check of types that I dont use using https://github.com/node-geojson/geojson-flatten/tree/master/test/fixture , resulting in #871 ). If there are still some types which might cause an exception, we need exceptions here to catch and fix them.

@jayvdb
Copy link
Contributor Author

jayvdb commented Jul 25, 2017

What if you make them class properties (self.min_lon, etc.)?

My main reason to avoid this is efficiency. Any non 'local' name results in a lot of extra name resolution work for large files, and this computation of the bbox is IMO not always desirable (see #868) so I am reticent to make it also inefficient .

@jpmckinney
Copy link
Member

Do you have a real speed comparison?

@jayvdb
Copy link
Contributor Author

jayvdb commented Jul 25, 2017

Happy to provide it this evening (WIB) when I get back into this project.

@coveralls
Copy link

coveralls commented Jul 26, 2017

Coverage Status

Coverage increased (+0.2%) to 87.701% when pulling 8e53286 on jayvdb:csvjson-shapes into f63a654 on wireservice:master.

@jayvdb
Copy link
Contributor Author

jayvdb commented Jul 26, 2017

On real data for the bbox computation, using cProfile, csvjson.py:119(update_boundary_coords) is the 24th most significant component by cumulative time. This amount of time is quite insignificant to the total time, however much of the time is spent in csv reading.

Sorting by total time instead, it is the 10th most significant component.

Ordered by number of calls, number 11-13 are:

  • csvjson.py:119(update_boundary_coords) : 287562/1691
  • csvjson.py:111(update_boundary_lon) : 285911
  • csvjson.py:103(update_boundary_lat) : 285911

That is a lot of calls!

I do not have analysis of switching from local vars to instance variables, as that is going to need a more involved benchmark/profile rig to avoid the CSV read/JSON write phases. (c.f. #872 ) But as seen by the number of calls, this is not trivial due to the recursive nature, as using instance variables is going to require at least 2 * 2 * 285911 extra getattr for my dataset. I am not sure how well that is optimised internally by cPython , but it could be optimised given that these are very tight loops without any other alterations to the self attribute dictionary.

Parses and removes 'geojson' from the properties,
putting the coordinate into the GeoJSON feature geometry.

Fixes wireservice#866
@coveralls
Copy link

coveralls commented Jul 26, 2017

Coverage Status

Coverage increased (+0.2%) to 87.701% when pulling 04b443a on jayvdb:csvjson-shapes into f63a654 on wireservice:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 87.701% when pulling 04b443a on jayvdb:csvjson-shapes into f63a654 on wireservice:master.

@jayvdb
Copy link
Contributor Author

jayvdb commented Aug 4, 2017

Do you still want me to get rid of the globals?

@jpmckinney jpmckinney mentioned this pull request Aug 4, 2017
@jpmckinney
Copy link
Member

Considering reading/writing take up so much time (and are absolutely necessary to do anything with csvkit), I think it's unnecessary optimization to make that loop more efficient, at the cost of potentially hard-to-fix bugs in the future due to the use of globals. So, removing globals would be best.

@jayvdb
Copy link
Contributor Author

jayvdb commented Aug 5, 2017

ok

@jpmckinney jpmckinney added this to the 1.0.3 milestone Jan 28, 2018
@jpmckinney jpmckinney mentioned this pull request Jan 28, 2018
jpmckinney added a commit that referenced this pull request Jan 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants