-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New API for Dice Coefficient #121
Conversation
:param datasets: A length 2 sequence of datasets. A dataset is a | ||
sequence of bitarrays. | ||
:param threshold: Pairs whose similarity is above this value may be | ||
a match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment for threshold
could be clearer. Perhaps: "Pairs whose similarity is below this value will be discarded"
if matches < 0: | ||
raise RuntimeError('bad key length') | ||
elif matches > 0: # TODO: Things may be faster without this check | ||
result_sims.extend(c_scores[0:matches]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you look into using the array
object from within match_one_against_many_dice_k_top
? https://docs.python.org/3/library/array.html#array.array.buffer_info
I figured it wouldn't have been straight forward with a Python list, but as an array is already a contiguous single typed data structure it might be easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is theoretically possible but has big caveats.
Without including the CPython headers (which we don’t want to do to maintain support for PyPy), we cannot extend the array to admit extra elements. Instead, we would have to reserve k
elements in the array from within Python code for the C++ code to write to.
The only way to reserve this space from Python is to write some data (by convention zeros). However, this doesn’t actually prevent any copying as you still have to copy all the zeros…
This leaves us with four two options:
1. Cythonise dice_x86.py and reserve space in the arrays using low-level functions. (This should be straightforward, but I’m only superficially familiar with Python, so I estimate it would take me a day.)
2. Much like 1, but provide callbacks to C++ code to permit it to expand the array itself when necessary. This would eliminate the need for k
.
3. Drop PyPy support for the accelerated Dice coefficient code and access the Python headers. PyPy users would still have access to the pure Python implementation. In terms of code maintainability, this may be the best option.
4. Leave everything as is.
Edit: Cython on PyPy doesn’t support low-level access to arrays. https://bitbucket.org/pypy/pypy/issues/2807/arrayarray-fails-to-import-in-cython
* Blocking functions * Minor docstring * Types for Mypy * Hamming similarity * Refactor tests * Minor refactor * Require NumPy * Multiparty greedy solver * Candidate generation. Still missing some features * Some optimisations * More optimisation * Still optimising * Ready for PR * Hamish PR comments * Fix Python 3.5 bug * Serialization * Serialisation of candidate pairs! * Update typechecking. * Fix typing * Fix tests on big numbers * Oops. Minor. * Minor comment * Brian PR comments * Bug fix + more Pythonic type checking * Only test on >=Python3.6 * Travis does not yet have Python 3.7... * GPU 1 does not yet have Python 3.7 * Hamish PR comments * Revert release to Python 3.6 * I forgot to commit this before merging… (#119) * Candidate generation refactor (#122) * Candidate generation * Minor docstring * Typo in docstring * Hamish PR Comments * New API for Dice Coefficient (#121) * New API for Dice Coefficient * Docstrings * Minor PEP8 * Typing * Make tests pass… * Brian PR comments * Mypy typing + make API nice and unified (#123) * Type hints * Bring Hamming similarity in line with rest of API * Make solver tests pass * Ok, solving tests should finally work now... * Brian PR comments * Make typechecking.Record importable * Add dependency on mypy extensions * Blocking style (#125) * Blocking style * Minor commenting * Improve coverage on edge cases + misc. testing (#127) * Type hints * Bring Hamming similarity in line with rest of API * Make solver tests pass * Ok, solving tests should finally work now... * Edge cases and tests * Restore old similarity tests * Minor: I don't know how that typo got there * Make Dice similarity tests pass with new Clkhash * Minor: unused imports * Style and docstrings (#130) * Style and comments * More style and comments * Mypy * Minor: Fix link * Brian PR comments
Change API for Dice coefficient function. Makes it fit within the multi-party linkage API.
(Also, if the C++ code has not been compiled, we fall back to the Python implementation.)