Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: read_ply replace pandas.read_csv engine=python with c; improve read_off header-parsing robustness #352

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

YodaEmbedding
Copy link

@YodaEmbedding YodaEmbedding commented Aug 8, 2023

UPDATE: I have rebased this PR on top of the latest commit. The revised changes are:

  • perf: Speed up reading of ASCII PLY files.
  • feat: improve robustness for OFF headers on e.g. ModelNet40
  • perf: reuse already open file for reading instead of opening it twice
  • style: renamed variables for clarity (e.g. color -> has_color; and count -> n_header)

In particular, ModelNet40 has faulty headers:

$ head -n 1 ModelNet40/chair/train/chair_0856.off
OFF6586 5534 0

For reference, the correct format is:

OFF
6586 5534 0

Nonetheless, it is still valuable to parse the faulty header.


(Original text before #353 was merged)

Big performance improvement by removing the need to use the slow engine="python" by reading the sliced file from an in-memory StringIO buffer.

Also fixes bug where OFF files containing more lines than num_points + num_faces tries to read potential edges as faces!

As Wikipedia says, the OFF file may contain:

  • points
  • faces (optional)
  • edges (optional)

Of course, this still does not encompass all possible OFF file variants described by Wikipedia, but it's an improvement.

pyntcloud/io/off.py Fixed Show fixed Hide fixed
@YodaEmbedding YodaEmbedding force-pushed the perf/pandas_c_engine branch 2 times, most recently from fa353ac to 4c5211b Compare December 24, 2023 12:31
Improve robustness of header parsing a bit.

In particular, ModelNet40 has faulty headers:
```bash
$ head -n 1 ModelNet40/chair/train/chair_0856.off
OFF6586 5534 0
```

For reference, the correct format is:
```
OFF
6586 5534 0
```

Nonetheless, it is still valuable to parse the faulty header.

Also, reuse already open file for reading instead of opening it twice.
@YodaEmbedding YodaEmbedding changed the title perf: read_off replace pandas.read_csv engine=python with c perf: read_ply replace pandas.read_csv engine=python with c; improve read_off header-parsing robustness Dec 24, 2023
@YodaEmbedding
Copy link
Author

YodaEmbedding commented Dec 24, 2023

Both this PR and #353 improved pandas performance for *.OFF files with engine=c. Therefore, I rebased this PR on top of #353. This PR still contains some other useful changes, listed above.


Future work:

Once this is reviewed/accepted, I can look into improving compatibility with Wikipedia's description of the *.OFF file format. Of course, perfect compatibility is too slow, but there's still some missing features:

  • "C" in the header should not be needed to detect the presence of color (see Wikipedia's example).
  • Edges, and edge colors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant