Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

okascore.c misuses Python C API, leading to odd failures #18

Closed
tseaver opened this issue May 15, 2024 · 1 comment · Fixed by #19
Closed

okascore.c misuses Python C API, leading to odd failures #18

tseaver opened this issue May 15, 2024 · 1 comment · Fixed by #19

Comments

@tseaver
Copy link
Member

tseaver commented May 15, 2024

E.g., see the following test:

    def test__search_wids_non_empty_wids_multiple_docs(self):

        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()

        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )

        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)

        wids = [index._lexicon._wids[x] for x in TEXT.split()]

        relevances = index._search_wids(wids)

        self.assertEqual(len(relevances), len(wids))
        for relevance in relevances:
            self.assertTrue(isinstance(relevance[0], index.family.IF.Bucket))
            self.assertEqual(len(relevance[0]), NUM_DOCS)
            self.assertTrue(isinstance(relevance[0][1], float))
            self.assertTrue(isinstance(relevance[1], int))

When run with PURE_PYTHON set in the environment, this test passes, but without it, I see the following errors:

____ OkapiIndexTest32Impure.test__search_wids_non_empty_wids_multiple_docs _____

self = <hypatia.text.tests.test_okapiindex.OkapiIndexTest32Impure testMethod=test__search_wids_non_empty_wids_multiple_docs>

    def test__search_wids_non_empty_wids_multiple_docs(self):
    
        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()
    
        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )
    
        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)
    
        wids = [index._lexicon._wids[x] for x in TEXT.split()]
    
>       relevances = index._search_wids(wids)

hypatia/text/tests/test_okapiindex.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypatia.text.okapiindex.OkapiIndex object at 0x7fe3cc5301a0>
wids = [1, 2, 3]

    def _search_wids(self, wids):
        if not wids:
            return []
        N = float(self.indexed_count())  # total # of docs
        try:
            doclen = self._totaldoclen()
        except TypeError:
            # _totaldoclen has not yet been upgraded
            doclen = self._totaldoclen
        meandoclen = doclen / N
        #K1 = self.K1
        #B = self.B
        #K1_plus1 = K1 + 1.0
        #B_from1 = 1.0 - B
    
        #                           f(D, t) * (k1 + 1)
        #   TF(D, t) =  -------------------------------------------
        #               f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))
    
        L = []
        docid2len = self._docweight
        for t in wids:
            d2f = self._wordinfo[t] # map {docid -> f(docid, t)}
            idf = inverse_doc_frequency(len(d2f), N)  # an unscaled float
            result = self.family.IF.Bucket()
>           score(result, _items(d2f), docid2len, idf, meandoclen)
E           KeyError: 0

hypatia/text/okapiindex.py:341: KeyError
____ OkapiIndexTest64Impure.test__search_wids_non_empty_wids_multiple_docs _____
TypeError: 'float' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

self = <hypatia.text.tests.test_okapiindex.OkapiIndexTest64Impure testMethod=test__search_wids_non_empty_wids_multiple_docs>

    def test__search_wids_non_empty_wids_multiple_docs(self):
    
        NUM_DOCS = 23   # get past the DICT_CUTOFF size
        TEXT = 'one two three'
        family = self._getBTreesFamily()
        index = self._makeOne()
    
        for i in range(NUM_DOCS):
            texts = [TEXT] * ((i % 5) + 1)
            index.index_doc(i, " ".join(texts) )
    
        d2f = index._wordinfo[1]
        assert isinstance(d2f, family.IF.BTree)
    
        wids = [index._lexicon._wids[x] for x in TEXT.split()]
    
>       relevances = index._search_wids(wids)

hypatia/text/tests/test_okapiindex.py:142: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <hypatia.text.okapiindex.OkapiIndex object at 0x7fe3cc54e430>
wids = [1, 2, 3]

    def _search_wids(self, wids):
        if not wids:
            return []
        N = float(self.indexed_count())  # total # of docs
        try:
            doclen = self._totaldoclen()
        except TypeError:
            # _totaldoclen has not yet been upgraded
            doclen = self._totaldoclen
        meandoclen = doclen / N
        #K1 = self.K1
        #B = self.B
        #K1_plus1 = K1 + 1.0
        #B_from1 = 1.0 - B
    
        #                           f(D, t) * (k1 + 1)
        #   TF(D, t) =  -------------------------------------------
        #               f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))
    
        L = []
        docid2len = self._docweight
        for t in wids:
            d2f = self._wordinfo[t] # map {docid -> f(docid, t)}
            idf = inverse_doc_frequency(len(d2f), N)  # an unscaled float
            result = self.family.IF.Bucket()
>           score(result, _items(d2f), docid2len, idf, meandoclen)
E           SystemError: <built-in function score> returned a result with an exception set

hypatia/text/okapiindex.py:341: SystemError
@tseaver
Copy link
Member Author

tseaver commented May 15, 2024

Note that the new test passes on Python3 < 3.10.

tseaver added a commit that referenced this issue May 15, 2024
Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON'
is defined.

Add 'py310-pure' environment to build / run tests with that env var set.

Note that the new test fails under 'py310', but passes under 'py310-pure'
tseaver added a commit that referenced this issue May 16, 2024
Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON'
is defined.

Add 'py310-pure' environment to build / run tests with that env var set.

Note that the new test fails under 'py310', but passes under 'py310-pure'
tseaver added a commit that referenced this issue May 16, 2024
* tests: demonstrate #18

Fix 'setup.py' to prevent building 'okascore' module if 'PURE_PYTHON'
is defined.

Add 'py310-pure' environment to build / run tests with that env var set.

Note that the new test fails under 'py310', but passes under 'py310-pure'

* fix: use 'PyFloat_AsDouble' for upcasts

- We know that the second element of each 'd2fitems' *should* be a float
  already (we store it, after all, in an IF BTree).

- The upcast in the second change is safe, and more regular.

* ci: bump checkout, setup-python action versions

- Use released Python 3.12

* ci: work around GHA macos-latest messes:

- actions/setup-python#850
- actions/setup-python#860

---------

Co-authored-by: Peter Wilkinson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant