Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New API for Dice Coefficient #121

Merged
merged 6 commits into from
Jul 9, 2018
Merged

New API for Dice Coefficient #121

merged 6 commits into from
Jul 9, 2018

Conversation

nbgl
Copy link
Contributor

@nbgl nbgl commented Jul 6, 2018

Change API for Dice coefficient function. Makes it fit within the multi-party linkage API.

(Also, if the C++ code has not been compiled, we fall back to the Python implementation.)

@nbgl nbgl added this to the Sprint 2018-07-13 milestone Jul 6, 2018
@nbgl nbgl self-assigned this Jul 6, 2018
@nbgl nbgl requested a review from hardbyte July 6, 2018 04:06
:param datasets: A length 2 sequence of datasets. A dataset is a
sequence of bitarrays.
:param threshold: Pairs whose similarity is above this value may be
a match.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment for threshold could be clearer. Perhaps: "Pairs whose similarity is below this value will be discarded"

if matches < 0:
raise RuntimeError('bad key length')
elif matches > 0: # TODO: Things may be faster without this check
result_sims.extend(c_scores[0:matches])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you look into using the array object from within match_one_against_many_dice_k_top? https://docs.python.org/3/library/array.html#array.array.buffer_info

I figured it wouldn't have been straight forward with a Python list, but as an array is already a contiguous single typed data structure it might be easy.

Copy link
Contributor Author

@nbgl nbgl Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is theoretically possible but has big caveats.

Without including the CPython headers (which we don’t want to do to maintain support for PyPy), we cannot extend the array to admit extra elements. Instead, we would have to reserve k elements in the array from within Python code for the C++ code to write to.

The only way to reserve this space from Python is to write some data (by convention zeros). However, this doesn’t actually prevent any copying as you still have to copy all the zeros…

This leaves us with four two options:
1. Cythonise dice_x86.py and reserve space in the arrays using low-level functions. (This should be straightforward, but I’m only superficially familiar with Python, so I estimate it would take me a day.)
2. Much like 1, but provide callbacks to C++ code to permit it to expand the array itself when necessary. This would eliminate the need for k.
3. Drop PyPy support for the accelerated Dice coefficient code and access the Python headers. PyPy users would still have access to the pure Python implementation. In terms of code maintainability, this may be the best option.
4. Leave everything as is.

Edit: Cython on PyPy doesn’t support low-level access to arrays. https://bitbucket.org/pypy/pypy/issues/2807/arrayarray-fails-to-import-in-cython

@nbgl nbgl merged commit 6430e80 into api-new Jul 9, 2018
nbgl added a commit that referenced this pull request Aug 1, 2018
* Blocking functions

* Minor docstring

* Types for Mypy

* Hamming similarity

* Refactor tests

* Minor refactor

* Require NumPy

* Multiparty greedy solver

* Candidate generation. Still missing some features

* Some optimisations

* More optimisation

* Still optimising

* Ready for PR

* Hamish PR comments

* Fix Python 3.5 bug

* Serialization

* Serialisation of candidate pairs!

* Update typechecking.

* Fix typing

* Fix tests on big numbers

* Oops. Minor.

* Minor comment

* Brian PR comments

* Bug fix + more Pythonic type checking

* Only test on >=Python3.6

* Travis does not yet have Python 3.7...

* GPU 1 does not yet have Python 3.7

* Hamish PR comments

* Revert release to Python 3.6

* I forgot to commit this before merging… (#119)

* Candidate generation refactor (#122)

* Candidate generation

* Minor docstring

* Typo in docstring

* Hamish PR Comments

* New API for Dice Coefficient (#121)

* New API for Dice Coefficient

* Docstrings

* Minor PEP8

* Typing

* Make tests pass…

* Brian PR comments

* Mypy typing + make API nice and unified (#123)

* Type hints

* Bring Hamming similarity in line with rest of API

* Make solver tests pass

* Ok, solving tests should finally work now...

* Brian PR comments

* Make typechecking.Record importable

* Add dependency on mypy extensions

* Blocking style (#125)

* Blocking style

* Minor commenting

* Improve coverage on edge cases + misc. testing (#127)

* Type hints

* Bring Hamming similarity in line with rest of API

* Make solver tests pass

* Ok, solving tests should finally work now...

* Edge cases and tests

* Restore old similarity tests

* Minor: I don't know how that typo got there

* Make Dice similarity tests pass with new Clkhash

* Minor: unused imports

* Style and docstrings (#130)

* Style and comments

* More style and comments

* Mypy

* Minor: Fix link

* Brian PR comments
@wilko77 wilko77 deleted the api-new-dice branch June 23, 2020 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants