
Decompound adds letters #6

Open
marbleman opened this issue Jan 8, 2014 · 5 comments


@marbleman

Hi,

I just got stuck with some "FetchPhaseExecutionException" when using the highlighting and the decomp filter:

InvalidTokenOffsetsException[Token verzinnte exceeds length of provided text sized 83]

Drilling down into that was a little tricky since the words causing the Exceptions did not occur in the indexed text! After a while I found the following:

Using decompound adds some words to the index that are longer than the original:

e.g. for "Kupferleiter, verzinnt" it adds "verzinnt" AND "verzinnte"
I have no clue what "verzinnte" is good for, but it sounds to me like a plural form. However, since it is the last word in the text, highlighting fails because it exceeds the end of the text.

Here is an example analysis of "verzinnt":

{
  "tokens": [
    {
      "token": "verzinnt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "",
      "position": 1
    },
    {
      "token": "verzinnte",
      "start_offset": 0,
      "end_offset": 9,
      "type": "",
      "position": 1
    }
  ]
}

My guess: the end_offset of 9 is the problem here, because the analyzed text is just 8 characters long. So when it comes to highlighting, the highlighter probably tries to highlight "verzinnte" as well, which leads to the exception...
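To illustrate the guess above, here is a minimal Python sketch of the offset check that appears to trip here. This is a hypothetical simplification, not the actual Lucene highlighter code, and it assumes the analyzed text is just the single word "verzinnt":

```python
# Simplified model of the highlighter's offset check (hypothetical:
# the real check lives inside Lucene's highlighter, not in this form).
text = "verzinnt"  # the analyzed field value, 8 characters long

# Token stream as returned by the analyze API above.
tokens = [
    {"token": "verzinnt",  "start_offset": 0, "end_offset": 8},  # ok: 8 <= len(text)
    {"token": "verzinnte", "start_offset": 0, "end_offset": 9},  # bad: 9 > len(text)
]

for t in tokens:
    if t["end_offset"] > len(text):
        # Corresponds to: InvalidTokenOffsetsException[Token verzinnte
        # exceeds length of provided text sized ...]
        print(f"Token {t['token']} exceeds length of provided text sized {len(text)}")
```

Only the synthetic token "verzinnte" fails the check; every token produced directly from the text fits within its bounds.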

@jprante
Owner

jprante commented Jan 8, 2014

Good catch. Decompound uses a probabilistic method, so it is not 100% reliable. "verzinnt" looks like it was not in the training set, so the algorithm fails.

Maybe it helps to reduce or increase the threshold parameter a little bit.
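For anyone trying this, a settings sketch showing where such a threshold would go in the index configuration. The filter and analyzer names are made up, and the exact value is purely illustrative; check the plugin's README for the real parameter name and range:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound",
          "threshold": 0.51
        }
      },
      "analyzer": {
        "german_decomp": {
          "tokenizer": "standard",
          "filter": ["lowercase", "decomp"]
        }
      }
    }
  }
}
```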

@marbleman
Author

Unfortunately, changing the threshold, even by much more than a little bit, does not seem to have any effect at all...

Is there a way to train the decompounder? Especially when it comes to technical vocabulary, compounding sometimes becomes really insane, such as "Aluminiumtiefziehteile", which should split into "Aluminium", "tiefziehen" and "Teil", and not into "Aluminium", "tief" and "ziehteile". In this case "tiefziehen" is itself still a compound word but must not be split into parts; otherwise the context/original meaning gets lost.

I mean, it is already amazing to see the decomp and baseform filters in action together, splitting words like "Straßenbahnschienenritzenreiniger". However, as in the example above, it would be great to train the decompounder to accept "Straßenbahn" as a word that must not be decompounded any further into "Straße" and "Bahn".

@jprante
Owner

jprante commented Jan 9, 2014

I started to rewrite the original trainer tool to let it run from the command line, but I ran short on time.

The original tool is "ASV Toolbox Baseform" with a GUI-based trainer, available at http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm

Because I just copied the trees in binary form, I don't know the original training set. By dumping the tree files, the training set could be reconstructed. If you can spare the time, maybe you will find a way to enhance the existing solution; I would appreciate it.

@marbleman
Author

I'd love to drill deeper here and enhance the solution, but I am in doubt about choosing the right strategy. It is hard to tell whether the tool could use some enhancement or whether I just lack experience with Elasticsearch, e.g. am not applying the right filters.

For example, I cannot judge whether training the decompounder is the way to go, or whether it would be better to have a dictionary of compound words that must not be decompounded. I also noticed that there is a huge difference in the search results when I decompound the words myself before searching. It seems to me that "default_operator": "AND" does not apply to the words decompounded automatically; instead I get results containing part A OR part B of a decompounded word. Maybe this is a real issue, or maybe I just missed some analysis tweaks...
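One possible explanation for the AND behaviour: in the analyze output above, both "verzinnt" and "verzinnte" are emitted at the same position (position 1), and tokens sharing a position are typically treated like synonyms, i.e. as alternatives that are OR'd together even when the operator is AND. The AND then applies across positions, not across the decompounded parts. A query sketch for comparison (the field name "description" is hypothetical):

```json
{
  "query": {
    "match": {
      "description": {
        "query": "Kupferleiter verzinnt",
        "operator": "and"
      }
    }
  }
}
```

Here "Kupferleiter" and "verzinnt" must both match (different positions), but any same-position variants that the decompound filter emits for one of these words would still match as alternatives.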

Right now I am preparing a list of issues and examples showing what could improve the results from my point of view. Maybe you can comment on that when I am done, so we can locate the issues to be addressed with some further investigation/coding/training.

@Pictor13

Pictor13 commented Mar 6, 2015

I am having the same issue with the plugin.

"InvalidTokenOffsetsException[Token l-ops exceeds length of provided text sized 93]"

Since the nature of the bug is rather unpredictable, I cannot forecast or prevent the exception (I also have no control over the data being indexed).

Can you do something about this?

It is such a pity not to use this plugin just because of a few exceptions; it usually works really well!
Please let us know. Thanks.
