Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented rle_fast Extension for Real Time Encoding #1

Closed
wants to merge 5 commits into from

Conversation

AshishS-1123
Copy link

Changes Made

  • Created a C Extension to Python to implement Run-Length Encoding
  • Added tests for testing this extension
  • Updated setup.py file to build and install the extension

Why?

While the algorithm for performing encoding and decoding operations is pretty efficient, one of its major drawbacks is that it is written is Python. Python is not a very fast language.
As such, python provides a C API for users to write extensions, which are written in C, but can be run from Python.
The extension I wrote, uses a similar algorithm to the one in rle package. But the speed has increased by almost 5x.

How?

The extension I built is present in the folder rle_fast.

Wrapper For Extension

   1. The file rle_fast/rle_fast_extension.c contains the wrapper code for the extension. It contains the module definitions, method declarations, and two wrapper methods- one for encode and decode each.
   2. These methods are namely- encode_c and decode_c.
   3. The first step in these functions is to get and parse the arguments.
   4. The next step is to check the correctness of the given arguments and raise appropriate errors if any.
   5. After that, we call a function from the header file rle_utils.h

Utility Functions for Encode and Decode

   1. The functions in the file, rle_fast/rle_utils.h are responsible for the actual encoding and decoding operations.
   2. The algorithm used in these functions is the same as the one in rle/init.py file. The only difference being how various operations are being performed.

Different functions are used for performing operations like creating a new empty list, getting iterators, etc. These functions can be found on the official python site

Docstrings

No software can be complete without documentation. The docstrings for various methods and modules in C extensions need to be added at module definitions. Here, the module definitions are present in rle_fast_extension.c
But the docstrings can be found in the file rle_fast/rle_docs.h.
You can either view the docs from this file or use the help function after installing the extension.

Installation

To install the extension, the code has already been added to setup.py

Run the following commands from the terminal.

To build the package
python setup.py build

To install the package
python setup.py install

TO-DO

The extension doesn't support encoding operations on sequences containing strings or characters. This is because the code assumes that we are operating on numbers only. When such an input is given, it raises a NotImplementedError.
So, I have commented out the test that fails in file tests/test_encode_rlefast.py

Also, the README file needs to be updated with the extension.
And since this seems like major improvement to the package, maybe we can release this as a major version?

@tnwei
Copy link
Owner

tnwei commented Dec 27, 2020

Hello @AshishS-1123 , this is wonderful! Any speed up is certainly welcome. Hope you don't mind me getting back to you in a week or so when I'm able to take a closer look after vacation. Thank you and happy new year!

@tnwei
Copy link
Owner

tnwei commented Jan 4, 2021

Coming back to take a closer look at this.

I'm thinking of pointing this PR to a separate develop branch where further improvements can be introduced without affecting the master branch, which currently houses code for the stable build. My reasoning is as below:

  • Packaging wheels for multiple distributions has yet to be set up for this library. The existing code running on pure Python can be packaged into platform-agnostic wheels that work on all distributions. Incorporating C code will require setting up a workflow to build and release platform-specific wheels. Releasing source distributions will negate having to build these wheels for distributions, however users who don't have the build tools available will be in for a surprise: a small library like this that does something rather simple will not be ready for use after a quick pip install.
  • The code in this pull request that performs encoding only works on sequences of integers at the moment. (Very simple test cases were written to test the logic in Python, but with the introduction of a C extension, these tests will need to be extended to cover multiple data types.) I believe it is more appropriate to have the aforementioned code implemented as its own fast encoding function specific for ints, so that users will understand what to expect from the output when they choose to use this function.
  • Given their type-specific nature, checks and graceful error handling will need to be built-in to these C extensions when the user passes incompatible input. I am particularly in favour of wrapping all logic under the generic encode / decode functions, where a graceful fallback to the Python implementation can be utilized when necessary.

I cannot redirect PRs, so I will need to close this one. If you are still interested, you are welcome to re-submit a PR to the newly created develop branch where we can safely figure out how to address the items above bit by bit. We can start with a speed benchmark to get an idea of the performance gains we are looking to achieve by going through all of the above.

Once again, thanks and appreciate your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants