Bug using words like 'Sube' at beginning #18

Open
JavierBJ opened this issue Dec 5, 2019 · 2 comments

JavierBJ commented Dec 5, 2019

I'm using spacy-affixes as part of the spaCy pipeline, as explained in the usage guide. It worked properly until I tried the sentence "Sube el paro". Calling nlp("Sube el paro.") raises the following error:

Traceback (most recent call last):
  File "/home/usuario/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-21-751769ff6949>", line 1, in <module>
    nlp("Sube el paro.")
  File "/home/usuario/.local/lib/python3.6/site-packages/spacy/language.py", line 435, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/usuario/.local/lib/python3.6/site-packages/spacy_affixes/main.py", line 163, in __call__
    self.apply_rules(retokenizer, token, rule)
  File "/home/usuario/.local/lib/python3.6/site-packages/spacy_affixes/main.py", line 140, in apply_rules
    token, [*rule["affix_text"], token_sub], heads
  File "_retokenize.pyx", line 88, in spacy.tokens._retokenize.Retokenizer.split
ValueError: [E117] The newly split tokens must match the text of the original token. New orths: subSube. Old text: Sube.

From my tests, the bug occurs with texts like:

nlp("Sube el paro.")
nlp("Sube")
nlp("Subir")
nlp("Subiendo")

But not with texts like:

nlp("sube el paro.")
nlp("sube")
nlp("Subasta")
nlp("Subimos")

Given the error (new orths "subSube" versus old text "Sube"), the matching of the prefix "sub" seems to be what goes wrong.
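
To illustrate this hypothesis, here is a minimal sketch in plain Python (no spaCy required) of the condition behind E117: the pieces passed to the retokenizer must concatenate back to the original token text. The helper names (orths_match, naive_split, case_preserving_split) are hypothetical and are not part of spacy-affixes or spaCy. The idea that the prefix is stripped with a case-sensitive substitution while a fixed lowercase affix is prepended is only a guess based on the error message, not a confirmed reading of the library's code.

```python
import re

# The Freeling-style prefix rule discussed in this issue.
PREFIX_PATTERN = r"^sub"


def orths_match(text, orths):
    """Mirror the E117 condition: the split pieces must rebuild the original text."""
    return "".join(orths) == text


def naive_split(text, affix_text="sub"):
    """Hypothetical split (assumption): fixed lowercase affix plus a
    case-sensitive substitution that strips the prefix from the token."""
    remainder = re.sub(PREFIX_PATTERN, "", text)  # "Sube" is left unchanged
    return [affix_text, remainder]


def case_preserving_split(text):
    """Hypothetical alternative: slice the original text so its casing is kept."""
    match = re.match(PREFIX_PATTERN, text, flags=re.IGNORECASE)
    if match is None:
        return [text]
    return [text[:match.end()], text[match.end():]]


for word in ("sube", "Sube"):
    naive = naive_split(word)
    safe = case_preserving_split(word)
    print(word, naive, orths_match(word, naive), safe, orths_match(word, safe))
```

Under these assumptions, the naive split of "Sube" produces pieces that join to "subSube", the same string reported in the traceback, while slicing the original text keeps the capital "S". The sketch does not explain why "Subasta" and "Subimos" pass; other conditions in the rule presumably decide whether a split is attempted at all.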

My configuration

  • Ubuntu 18.04.3 LTS
  • Python 3.6.9
  • spacy-affixes 0.1.4
  • spacy 2.2.3

alvp self-assigned this Dec 5, 2019
JavierBJ changed the title from "Bug using words like 'sube' at beginning" to "Bug using words like 'Sube' at beginning" Dec 5, 2019

versae (Contributor) commented Dec 5, 2019

Thanks for reporting, @JavierBJ!

In our experience, prefix splitting can cause more trouble than it is worth. We're looking at the problematic Freeling rule (^sub) to figure out a solution. In the meantime, you could try using only the suffix rules (e.g., clitics) if that fits your scenario. We use something like this in other projects:

import spacy
from spacy_affixes import AffixesMatcher
from spacy_affixes.utils import AFFIXES_SUFFIX
from spacy_affixes.utils import load_affixes

nlp = spacy.load("es")

# Keep only the suffix rules (e.g., clitics); prefix rules such as ^sub are left out
suffixes = {k: v for k, v in load_affixes().items()
            if k.startswith(AFFIXES_SUFFIX)}
affixes_matcher = AffixesMatcher(nlp, split_on=["VERB"], rules=suffixes)
# Run the matcher before the tagger
nlp.add_pipe(affixes_matcher, name="affixes", before="tagger")

JavierBJ (Author) commented

Thank you very much for your workaround, @versae; it solved the problems mentioned. I'll keep an eye out for any solution you find for the Freeling rule issue.

Kind regards
