Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phonological search: Add option to list up target segments separately in summary results #801

Closed
9 tasks done
Tracked by #800
stannam opened this issue Apr 1, 2022 · 14 comments
Closed
9 tasks done
Tracked by #800

Comments

@stannam
Copy link
Member

stannam commented Apr 1, 2022

image

Currently, the phonological search result window repeats the user input. But it will be better off by adding an option to view the summary results in the old way, i.e., each row by segments.

Need to implement/solve/decide

  • when the search has no hits, PCT should not crash.
    • I remember this error existed in the old way of result summary. (can't find among the closed issues, though.)
  • what should PCT do when the user redoes searches but changes the summarize_by_segments settings?
    • Currently, it follows the most recent value e.g., if True then False, the summary window repeats the user input.
    • → Decided to keep it this way
  • make sure the numbers add up, i.e., the results are true to the data.
  • Option wording: “List target segments and environments separately in summary results.”

Test the function

  • with very complicated searches (especially, multiple segments in the environments)
    • For this, I searched for {i OR a} proceded by {ʃ OR g} and PCT returned correct results. Specifically, the parameters I used were (env: {ʃ, g}_; target: i, a), on the example corpus. Therefore, PCT should find four sequences of ʃi, gi, ʃa, ga. Four words in the corpus, ɡaɡa, ʃaʃi, ʃi, ʃisota contain one of those sequences.
  • with the syllabified example corpus
  • input by both 'add segments' and 'add features'
  • duplicating a search (see [ENC] Duplicate phonological searches should be blocked #765, and [BUG] Duplicated search too strict #762)

Update documentation

  • Make sure the documentation is updated to make the functionality clear. In particular, make it clear that (under the current functionality), if results are listed separately, every combination is considered separately, but different parts of the same environment are still combined (e.g. {#, i} {s, ʃ} {i, a} results in four environments for each of the two target segments): #_i, #_a, i_i, and i_a.
@stannam stannam mentioned this issue Apr 1, 2022
18 tasks
stannam added a commit that referenced this issue Apr 5, 2022
stannam added a commit that referenced this issue Apr 5, 2022
stannam added a commit that referenced this issue Apr 5, 2022
GUI changes. #801 (neg/pos) search type overrides the present summary option.
@kchall
Copy link
Member

kchall commented Apr 6, 2022

  1. If you ask PCT to list the target results separately, it does so initially in the summary results, but when you then go to the individual results and back to the summary results, the summary results revert to the combined targets. The behaviour should be static.

  2. When I try to replicate the example search about type frequency you showed earlier, @stannam, I get a different result (and more expected!). E.g., if I search for [ʃ] in the example corpus, then the summary results show a type frequency of 6, not 9, and that matches the six individual words found, regardless of the number of target segments each word contains:

image

So, that’s alarming that our machines are giving us different results! But you also seem to have deleted the comment that showed that example, so maybe you discovered something in the meantime?

  1. Furthermore, the way that type frequency is being treated seems to be different depending on whether the targets are listed separately or not! If the search is for [s, ʃ] in the example corpus, and the default / combined view is used, then I get the expected result that matches what I got above, i.e., each word counts once, regardless of the number of target segments it contains, for a total type frequency of 8, in 8 unique words:

image

BUT, if I do the same search but ask for the results to be separated out by targets, I get 5 for [s] and 9 for [ʃ]. While it’s fine in this case for words that have both [s] AND [ʃ] to get ‘double counted’ (one in each result row), that should give the result of 4 for [s] and 6 for [ʃ]. But instead, it’s doing the double-counting targets by word that you had originally reported, @stannam, so that e.g. [sasi] counts as two ‘types’ for [s] instead of 1.

image

Note that if I do now go back to the summary results, it reverts completely to the not-separated-by-target option (as mentioned in point (1) above), i.e., that entails re-combining the type frequencies and not double-counting within words, i.e., I get 8 again and not 14 (which would be 9+5):

image


I think the desired behaviour should be:

  1. results display should be static, so if a user has asked for results to be split out by target, they should always get that, even if they toggle back and forth between ‘summary’ and ‘individual’ results;
  2. a word should count only once toward type frequency (and its token frequency similarly only used once), regardless of the number of target segments it contains.

[As a side note, I’ll just comment to remind myself: I did check to see whether the behaviour was the same if the targets were specified with features instead of segments, i.e., [+continuant, -syllabic], and the results are identical to those described above.]

@stannam
Copy link
Member Author

stannam commented Apr 19, 2022

  1. The settings for the result display should be static
  • That is strange. I could not replicate this issue on the virtual MacOS environment (macOS Monterey).

ezgif com-gif-maker

  1. A word should count only once toward type frequency
  • I'm working on this issue... To make sure that I'm on the same page, searching for [s, ʃ] in the example corpus should yield something like the following.

image

because in the individual results:
image

@stannam
Copy link
Member Author

stannam commented Apr 20, 2022

I think I fixed the type frequency issue.

I tested on the following two types of use case scenarios. It worked as expected.

1. Use case 1: Searching for [s, ʃ] (see above)

  • desired results:
symbol type F token F
s 4 179
ʃ 6 275

.
.

2. Use case 2: Searching for [s, ʃ], but the environment is specified as [_a, _i]

  • i.e.,
    image

  • in this case, 'sasi' should be counted twice, once for [sa] and another for [si].

  • desired results:
    image

  • because in the individual results:
    image

@kchall Can you update to the latest codes and test on your machine whether the summary setting is static and the results are correct?

stannam added a commit that referenced this issue Apr 25, 2022
This should fix the issue that Kathleen reported.
@stannam
Copy link
Member Author

stannam commented Apr 27, 2022

cf. #356

@kchall
Copy link
Member

kchall commented May 2, 2022

Thanks, Stanley! This mostly seems to be fixed, but there are two separate follow-ups. I'll list them in two separate comments for clarity.

(But more generally, I can confirm that the combined searches all seem to be working for me as expected, i.e., searching for [s, ʃ] gets me a result of 8 types and 446 tokens; asking PCT to separating the results out by target segments gives me 4 types and 179 tokens for [s], and 6 types and 275 tokens for [ʃ], and the results are now 'sticky' / 'static' rather than reverting to unseparated out when toggling between individual and summary results).

@kchall
Copy link
Member

kchall commented May 2, 2022

First, the more serious bug:

Leaving aside doing ‘combined’ searches for the moment, I tried just doing two individual searches, one for [s] and one for [ʃ].

Weirdly, if I do these searches literally one at a time (a search for [s] and then, separately, a search for [ʃ]), everything is fine, even if I add the results to the current results table.

Results if I do the searches separately:
image

BUT if I define two separate environments at the same time (which should give the same result), the results for the first environment ([s] in this case) are correct, but the ones for the second ([ʃ]) are not — it seems to have somehow added in one of the [s] words into the total counts for [ʃ] (specifically, the numbers make sense if it’s counting [sasi] as an [ʃ] word!). (And similarly, it gets [ʃ] correct but [s] wrong if the order is switched.)

image

Results for the ‘combined’ search of two separate environments:
image

@kchall
Copy link
Member

kchall commented May 2, 2022

The second issue is more about clarity of what we mean when we someone selects "List target segments separately in summary results."

Currently, what is actually happening (and thanks for pointing this out through your second test case above!) is that both the targets AND the environments are getting split out. For example, if the basic search definition is to search for [s, ʃ] before [i, a], and the results are separated, PCT currently separates out both the targets and the environments, yielding four results rows (for [si], [sa], [ʃi], and [ʃa]) (and this seems to be working correctly).

But with the way the option is worded, I would have expected it to separate out the targets and NOT the environments, so that you get just two results rows (for [s] before [i, a] and [ʃ] before [i, a]).

I think we should proceed as follows:

  1. Change the text of the current option button to say “List target segments and environments separately in summary results.”

  2. Check with the person who requested this functionality to see whether she specifically wanted one or both of these options. If she just wants the one we're currently doing, fine, we'll leave it as is.

  3. If she wanted both or she only cared about the one we're NOT currently doing, let's ADD a new option button that does say “List target segments separately in summary results" and indeed has that functionality.

  4. Either way, make sure the documentation is updated to make the functionality clear. In particular, make it clear that (under the current functionality), if results are listed separately, every combination is considered separately, but different parts of the same environment are still combined (e.g. {#, i} {s, ʃ} {i, a} results in four environments for each of the two target segments): #_i, #_a, i_i, and i_a.

stannam added a commit that referenced this issue May 3, 2022
This change partially undoes f359d5e where I made the summary result to have a 0 frequency row when the search has no hit. The introduction of 0 frequency rows was due to the new way of summary presentation. It is not relevant to the old way.
@stannam
Copy link
Member Author

stannam commented May 5, 2022

Thanks, Kathleen for looking into this.

re: One search with separate environments gives wrong numbers

I think I fixed this error now. I confirmed that the numbers are correct. It turned out that a conditional I added to the summary function (f359d5e) overapplies to some words. I simply made changes to bypass this conditional if the user selected the 'split' option.

Previously when I was working on the summary results, I added a feature that makes 'zero frequency rows' if some conditions are met. For example, if we search for a readily illegal sequence like [gst] on example without the 'split' option, PCT still returns a row, rather than an empty window. It depends on whether the main phonological search function returned None ('not found'). The logic is: if there is a value, then a search must have happened with that parameters, but None means nothing found --- so add a row but give zero values for token and type frequencies.

However, for some reason, the phonological search function wrongly adds redundant None values if more than one environment is used. These redundant None values are wrongly translated by the summary side as a need for adding zero frequencies, messing the frequency counts for the second environment.

For now, I do not understand why this is happening in the main function. Someone should investigate this eventually, but I think it does not need to be a top priority since the final results are not affected either on the individual or summary window.

re: Splitting the environment segments

I emailed the user to ask if the other option would be needed. Once I hear back, I will make changes accordingly.
edit: we don't need to have the additional option. I have just changed the wording to "List target segments and environments separately in summary results."

stannam referenced this issue May 5, 2022
added "and environments" Also inserted a linebreak
@kchall
Copy link
Member

kchall commented May 5, 2022

Thanks, @stannam! But, weirdly, this doesn't seem to have fixed the error on my Mac -- I synced to the latest version and see your commit in the history, but when I do the two searches together, I still get the wrong result for the second one:

image

image

(Interestingly, this does work correctly if I check the option for separating out the targets in the results!)

@kchall
Copy link
Member

kchall commented May 5, 2022

I have updated the https://github.com/PhonologicalCorpusTools/CorpusTools/blob/master/docs/source/phonological_search.rst documentation, including updating the existing search interface images, to include the "List separately" option. So far, though, it doesn't seem to be actually syncing with https://corpustools.readthedocs.io/en/latest/phonological_search.html -- the build is failing. But I'm not savvy enough to understand what the actual issue is.

@stannam
Copy link
Member Author

stannam commented May 8, 2022

1. [s] [ʃ] positive search

Two parameters.
i. list by seg True / False
ii. single env / separate envs

Individual results -- the summary function should summarize the following results according to the parameter settings.
image

Desired outputs:
All combinations should return the same results as the screenshot below.... except for 'single env, list by seg False.' In that case, all results should be smooshed into one row (Type F of 8; Token F of 446, or 5+139+43+2+3+126+96+32)

Picture 1:
image

Passed tests?

single env separate envs
by seg True ✅ Picture 1 ✅ Picture 1
by seg False ✅ One row ✅ Picture 1

//

2. [s] [ʃ] negative search

Two parameters
i. list by seg True / False (negative search should not allow listing by segments)
ii. single env / separate envs (negative search should not allow two envs)

Individual results -- the summary function should summarize the following results.
image

Desired outputs: There should only be one combination of the two parameters. It should return the results as below.

  • Type freq: 8
  • Token freq: 139 (= 11 + 11 + 6 + 2 + 2 + 67 + 7 + 33 )

Passed tests?

  • one env seg negative

//

3. [sɑ] [ʃɑ] Syllable search (No negative option for syllable search)

Two parameters
i. list by seg True / False
ii. single env / separate envs

Individual results
image

  • 3.1 listing by seg True // single env

parameter setting:
image

desired: two rows

type F token F
[sɑ] syllable 3 174 (=139+3+32)
[ʃɑ] syllable 1 43 ('shashi')
  • 3.2 listing by seg True // separate envs

parameter setting:

image

desired: two rows

type F token F
[sɑ] syllable 3 174 (=139+3+32)
[ʃɑ] syllable 1 43 ('shashi')
  • 3.3 listing by seg False // single env

desired: one row

type F token F
[{ s,ʃ } ɑ] syllable 4 217 (=139+43+3+32)
  • 3.4 listing by seg False // separate envs

desired: two rows

type F token F
[sɑ] syllable 3 174 (=139+3+32)
[ʃɑ] syllable 1 43 ('shashi')

//

stannam added a commit that referenced this issue May 9, 2022
Passed Tests 1, 2, 3.1, 3.2, and 3.3
Need more work to pass test 3.3, i.e., syllable search with the listing by seg set as False and [sa] and [ʃa] in a single env.
@stannam
Copy link
Member Author

stannam commented May 9, 2022

@kchall could you synced to the latest codes and see if problems are all fixed?

re: documentation build failing (#801 (comment))
Readthedocs automatically imports the latest version of Sphinx, which is a package for rendering rst files to html. But that latest version seems buggy, making the builds fail (sometimes, the newest might not be the best one). So I forced the version that worked for PCT previously.

re: fixing the PS summary function
I think I fixed the summary function for good. I tested various use case scenarios, including the negative search and syllable search (a lengthy comment is here ). Can you check if everything is correct on your end?

stannam added a commit that referenced this issue May 12, 2022
There are issues that have been unnoticed because they didn't raise errors (raised warnings). In the API reference part, couple of subsections are missing for example:
https://corpustools.readthedocs.io/en/v1.5.0/apireference.html#textgrids
https://corpustools.readthedocs.io/en/v1.5.0/apireference.html#functional-load

That parts were not updated when the corresponding codes were changed, causing 'warnings'.

Well, the warnings have now evolved (?) into errors and newer version of sphinx just refuses to build any new commits. This was the cause of the problem mentioned in this comment #801 (comment)
@kchall
Copy link
Member

kchall commented May 16, 2022

I can confirm that all of the above seems to be working correctly now on my end as well! With the (exception?) that in part 3 above, "[sɑ] [ʃɑ] Syllable search," I am able to do a negative search as long as I have the environments combined into a single set and don't try to separate out the results. My syllabified example corpus doesn't have [ga.ga] in it, but the results for the search do seem to be correct for my corpus, i.e., with type frequency of 11 and token of 362 for "words that do not contain either [sɑ] or [ʃɑ]." So, I think it is all working correctly...yay!

@stannam
Copy link
Member Author

stannam commented May 19, 2022

I think I was wrong in Part 3 (syllable search) about the negative search option.

The documentation does not exclude the option to do a negative search with syllables, and I can also confirm that the negative search works correctly. However, selecting the 'syllables' button wrongly greys out 'negative.' The option reopens when 'New environment' is clicked.

I think this is a very strange behaviour, so I'll disable it. After that, new executables will be created. yay!

/

Notes to (future) myself: why that behaviour?

if btn.isChecked(): # sylMode is checked
self.mode = 'sylMode'
self.resultTypeGroup.widgets[1].setEnabled(False) # disable 'negative search' in result type

Out of concern that any changes would not end well, I dug into my notes trying to find why I added the codes in the first place.

As shown above, I did intentionally disable neg syllables searches in 2021, but that edit should have been reverted already. Disabling neg + syll was a solution to this problem #746 (comment) (part 4) where the target and environment columns became empty in the negative syllable search results. I think I just removed the problem as a makeshift (how ostrich!). When issue #746 was fixed, the codes above were not needed and I should have deleted them. Instead, I added more codes that reenact the disabled option 🤦‍♂️. So when changing from the seg mode to syllable mode, the negative option got disabled -> enabled -> and then disabled again (with more than 1 env)..

stannam added a commit that referenced this issue May 19, 2022
@stannam stannam closed this as completed Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants