-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eliminate_duplicate_disjuncts(): Return the discarded disjunct count #1518
Conversation
d = eliminate_duplicate_disjuncts(d, false); | ||
unsigned int dnum1 = count_disjuncts(d); | ||
eliminate_duplicate_disjuncts(d, false); | ||
unsigned int dnum1 = dnum0 - eliminate_duplicate_disjuncts(d, false); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm reading this correctly, eliminate_duplicate_disjuncts()
is being called twice. Is that right?
Elsewhere, you call it twice, with arguments false
and true
but here it is false
both times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You caught a bug. The intention was to call it only once.
The idea here is to provide information on the number of duplicate disjuncts in the expression of the given word. With two calls, it would never report any duplicates. I will force-push a fix.
Without this fix:
linkparser> !!was.w-d//
Token "was.w-d" matches:
was.w-d 448 disjuncts
Token "was.w-d" disjuncts:
was.w-d 448/448 disjuncts
was.w-d: [0] 1.000= @E- Ss- <> Xc+ Vv+ VC+
was.w-d: [1] 0.000= @E- Ss- dIV- <> Xc+ Vv+ VC+
was.w-d: [2] 0.000= @E- Ss- dIV- <> Xc+ Vv+ VC+ VC+
...
With the fix:
linkparser> !!was.w-d//
Token "was.w-d" matches:
was.w-d 448 disjuncts
Token "was.w-d" disjuncts:
was.w-d 352/448 disjuncts
was.w-d: [0] 1.000= @E- Ss- <> Xc+ Vv+ VC+
...
Regarding the calls with false
and true,
this is done on a wildcard word to eliminate duplicate disjuncts with identical categories and then condense all the remaining duplicate disjuncts (disregarding the category). However, there is a code rot there, seemingly because I changed the implementation w/o fixing this code. It is inefficient, and when I look at this code again from a distance, I find nasty bugs, which may explain some strange things. I already rewrote part of the generation code for speed (and I know how to fix finding unused disjuncts), which I still need to continue. Instead, I'm now working on rewriting the counting stuff because I saw how to speed it up by a significant factor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and when I look at this code again from a distance, I find nasty bugs, which may explain some strange things.
To clarify, I referred to generation mode. I didn't find bugs regarding parsing mode. However, the speed can be improved (future PR).
No need to return the disjunct list because the deletion is implemented "in place", and the first disjunct is never deleted. Thus the given argument of disjunct list still points to the list after duplicate removal. Returning the count is more useful since it, e.g., enables to find out the total disjuncts removed in the whole sentence.
Just for eliminating IDE errors.
I created this branch to keep the number of disjuncts up-to-date by adding or substructing it every time a disjunct is added/discarded. The idea was to avoid wasting time on CPU cache misses while iterating long disjunct lists. However, it became very cumbersome, so I discarded this code.
This change in
eliminate_duplicate_disjuncts()
and the addition of the number of discarded disjuncts to- verbosity=2
remained.