Phonological search: Add option to list up target segments separately in summary results #801

stannam · 2022-04-01T00:24:37Z

Currently, the phonological search result window repeats the user input. But it will be better off by adding an option to view the summary results in the old way, i.e., each row by segments.

Need to implement/solve/decide

when the search has no hits, PCT should not crash.
- I remember this error existed in the old way of result summary. (can't find among the closed issues, though.)
what should PCT do when the user redoes searches but changes the summarize_by_segments settings?
- Currently, it follows the most recent value e.g., if True then False, the summary window repeats the user input.
- → Decided to keep it this way
make sure the numbers add up, i.e., the results are true to the data.
Option wording: “List target segments and environments separately in summary results.”

Test the function

with very complicated searches (especially, multiple segments in the environments)
- For this, I searched for {i OR a} proceded by {ʃ OR g} and PCT returned correct results. Specifically, the parameters I used were (env: {ʃ, g}_; target: i, a), on the example corpus. Therefore, PCT should find four sequences of ʃi, gi, ʃa, ga. Four words in the corpus, ɡaɡa, ʃaʃi, ʃi, ʃisota contain one of those sequences.
with the syllabified example corpus
input by both 'add segments' and 'add features'
duplicating a search (see [ENC] Duplicate phonological searches should be blocked #765, and [BUG] Duplicated search too strict #762)

Update documentation

Make sure the documentation is updated to make the functionality clear. In particular, make it clear that (under the current functionality), if results are listed separately, every combination is considered separately, but different parts of the same environment are still combined (e.g. {#, i} {s, ʃ} {i, a} results in four environments for each of the two target segments): #_i, #_a, i_i, and i_a.

re: issue #801

GUI changes. #801 (neg/pos) search type overrides the present summary option.

kchall · 2022-04-06T19:06:45Z

If you ask PCT to list the target results separately, it does so initially in the summary results, but when you then go to the individual results and back to the summary results, the summary results revert to the combined targets. The behaviour should be static.
When I try to replicate the example search about type frequency you showed earlier, @stannam, I get a different result (and more expected!). E.g., if I search for [ʃ] in the example corpus, then the summary results show a type frequency of 6, not 9, and that matches the six individual words found, regardless of the number of target segments each word contains:

So, that’s alarming that our machines are giving us different results! But you also seem to have deleted the comment that showed that example, so maybe you discovered something in the meantime?

Furthermore, the way that type frequency is being treated seems to be different depending on whether the targets are listed separately or not! If the search is for [s, ʃ] in the example corpus, and the default / combined view is used, then I get the expected result that matches what I got above, i.e., each word counts once, regardless of the number of target segments it contains, for a total type frequency of 8, in 8 unique words:

BUT, if I do the same search but ask for the results to be separated out by targets, I get 5 for [s] and 9 for [ʃ]. While it’s fine in this case for words that have both [s] AND [ʃ] to get ‘double counted’ (one in each result row), that should give the result of 4 for [s] and 6 for [ʃ]. But instead, it’s doing the double-counting targets by word that you had originally reported, @stannam, so that e.g. [sasi] counts as two ‘types’ for [s] instead of 1.

Note that if I do now go back to the summary results, it reverts completely to the not-separated-by-target option (as mentioned in point (1) above), i.e., that entails re-combining the type frequencies and not double-counting within words, i.e., I get 8 again and not 14 (which would be 9+5):

I think the desired behaviour should be:

results display should be static, so if a user has asked for results to be split out by target, they should always get that, even if they toggle back and forth between ‘summary’ and ‘individual’ results;
a word should count only once toward type frequency (and its token frequency similarly only used once), regardless of the number of target segments it contains.

[As a side note, I’ll just comment to remind myself: I did check to see whether the behaviour was the same if the targets were specified with features instead of segments, i.e., [+continuant, -syllabic], and the results are identical to those described above.]

stannam · 2022-04-19T23:56:25Z

The settings for the result display should be static

That is strange. I could not replicate this issue on the virtual MacOS environment (macOS Monterey).

A word should count only once toward type frequency

I'm working on this issue... To make sure that I'm on the same page, searching for [s, ʃ] in the example corpus should yield something like the following.

because in the individual results:

re: issue #801

stannam · 2022-04-20T02:33:04Z

I think I fixed the type frequency issue.

I tested on the following two types of use case scenarios. It worked as expected.

1. Use case 1: Searching for [s, ʃ] (see above)

desired results:

symbol	type F	token F
s	4	179
ʃ	6	275

.
.

2. Use case 2: Searching for [s, ʃ], but the environment is specified as [_a, _i]

i.e.,
in this case, 'sasi' should be counted twice, once for [sa] and another for [si].
desired results:
because in the individual results:

@kchall Can you update to the latest codes and test on your machine whether the summary setting is static and the results are correct?

This should fix the issue that Kathleen reported.

stannam · 2022-04-27T20:51:18Z

cf. #356

kchall · 2022-05-02T18:38:18Z

Thanks, Stanley! This mostly seems to be fixed, but there are two separate follow-ups. I'll list them in two separate comments for clarity.

(But more generally, I can confirm that the combined searches all seem to be working for me as expected, i.e., searching for [s, ʃ] gets me a result of 8 types and 446 tokens; asking PCT to separating the results out by target segments gives me 4 types and 179 tokens for [s], and 6 types and 275 tokens for [ʃ], and the results are now 'sticky' / 'static' rather than reverting to unseparated out when toggling between individual and summary results).

kchall · 2022-05-02T18:40:17Z

First, the more serious bug:

Leaving aside doing ‘combined’ searches for the moment, I tried just doing two individual searches, one for [s] and one for [ʃ].

Weirdly, if I do these searches literally one at a time (a search for [s] and then, separately, a search for [ʃ]), everything is fine, even if I add the results to the current results table.

Results if I do the searches separately:

BUT if I define two separate environments at the same time (which should give the same result), the results for the first environment ([s] in this case) are correct, but the ones for the second ([ʃ]) are not — it seems to have somehow added in one of the [s] words into the total counts for [ʃ] (specifically, the numbers make sense if it’s counting [sasi] as an [ʃ] word!). (And similarly, it gets [ʃ] correct but [s] wrong if the order is switched.)

Results for the ‘combined’ search of two separate environments:

kchall · 2022-05-02T18:47:32Z

The second issue is more about clarity of what we mean when we someone selects "List target segments separately in summary results."

Currently, what is actually happening (and thanks for pointing this out through your second test case above!) is that both the targets AND the environments are getting split out. For example, if the basic search definition is to search for [s, ʃ] before [i, a], and the results are separated, PCT currently separates out both the targets and the environments, yielding four results rows (for [si], [sa], [ʃi], and [ʃa]) (and this seems to be working correctly).

But with the way the option is worded, I would have expected it to separate out the targets and NOT the environments, so that you get just two results rows (for [s] before [i, a] and [ʃ] before [i, a]).

I think we should proceed as follows:

Change the text of the current option button to say “List target segments and environments separately in summary results.”
Check with the person who requested this functionality to see whether she specifically wanted one or both of these options. If she just wants the one we're currently doing, fine, we'll leave it as is.
If she wanted both or she only cared about the one we're NOT currently doing, let's ADD a new option button that does say “List target segments separately in summary results" and indeed has that functionality.
Either way, make sure the documentation is updated to make the functionality clear. In particular, make it clear that (under the current functionality), if results are listed separately, every combination is considered separately, but different parts of the same environment are still combined (e.g. {#, i} {s, ʃ} {i, a} results in four environments for each of the two target segments): #_i, #_a, i_i, and i_a.

This change partially undoes f359d5e where I made the summary result to have a 0 frequency row when the search has no hit. The introduction of 0 frequency rows was due to the new way of summary presentation. It is not relevant to the old way.

stannam · 2022-05-05T01:18:32Z

Thanks, Kathleen for looking into this.

re: One search with separate environments gives wrong numbers

I think I fixed this error now. I confirmed that the numbers are correct. It turned out that a conditional I added to the summary function (f359d5e) overapplies to some words. I simply made changes to bypass this conditional if the user selected the 'split' option.

Previously when I was working on the summary results, I added a feature that makes 'zero frequency rows' if some conditions are met. For example, if we search for a readily illegal sequence like [gst] on example without the 'split' option, PCT still returns a row, rather than an empty window. It depends on whether the main phonological search function returned None ('not found'). The logic is: if there is a value, then a search must have happened with that parameters, but None means nothing found --- so add a row but give zero values for token and type frequencies.

However, for some reason, the phonological search function wrongly adds redundant None values if more than one environment is used. These redundant None values are wrongly translated by the summary side as a need for adding zero frequencies, messing the frequency counts for the second environment.

For now, I do not understand why this is happening in the main function. Someone should investigate this eventually, but I think it does not need to be a top priority since the final results are not affected either on the individual or summary window.

re: Splitting the environment segments

I emailed the user to ask if the other option would be needed. Once I hear back, I will make changes accordingly.
edit: we don't need to have the additional option. I have just changed the wording to "List target segments and environments separately in summary results."

added "and environments" Also inserted a linebreak

kchall · 2022-05-05T22:07:59Z

Thanks, @stannam! But, weirdly, this doesn't seem to have fixed the error on my Mac -- I synced to the latest version and see your commit in the history, but when I do the two searches together, I still get the wrong result for the second one:

(Interestingly, this does work correctly if I check the option for separating out the targets in the results!)

kchall · 2022-05-05T22:53:16Z

I have updated the https://github.com/PhonologicalCorpusTools/CorpusTools/blob/master/docs/source/phonological_search.rst documentation, including updating the existing search interface images, to include the "List separately" option. So far, though, it doesn't seem to be actually syncing with https://corpustools.readthedocs.io/en/latest/phonological_search.html -- the build is failing. But I'm not savvy enough to understand what the actual issue is.

stannam · 2022-05-08T09:24:43Z

1. [s] [ʃ] positive search

Two parameters.
i. list by seg True / False
ii. single env / separate envs

Individual results -- the summary function should summarize the following results according to the parameter settings.

Desired outputs:
All combinations should return the same results as the screenshot below.... except for 'single env, list by seg False.' In that case, all results should be smooshed into one row (Type F of 8; Token F of 446, or 5+139+43+2+3+126+96+32)

Picture 1:

Passed tests?

	single env	separate envs
by seg True	✅ Picture 1	✅ Picture 1
by seg False	✅ One row	✅ Picture 1

//

2. [s] [ʃ] negative search

Two parameters
i. list by seg ~~True~~ / False (negative search should not allow listing by segments)
ii. single env / ~~separate envs~~ (negative search should not allow two envs)

Individual results -- the summary function should summarize the following results.

Desired outputs: There should only be one combination of the two parameters. It should return the results as below.

Type freq: 8
Token freq: 139 (= 11 + 11 + 6 + 2 + 2 + 67 + 7 + 33 )

Passed tests?

one env seg negative

//

3. [sɑ] [ʃɑ] Syllable search (No negative option for syllable search)

Two parameters
i. list by seg True / False
ii. single env / separate envs

Individual results

3.1 listing by seg True // single env

parameter setting:

desired: two rows

	type F	token F
[sɑ] syllable	3	174 (=139+3+32)
[ʃɑ] syllable	1	43 ('shashi')

3.2 listing by seg True // separate envs

parameter setting:

desired: two rows

	type F	token F
[sɑ] syllable	3	174 (=139+3+32)
[ʃɑ] syllable	1	43 ('shashi')

3.3 listing by seg False // single env

desired: one row

	type F	token F
[{ s,ʃ } ɑ] syllable	4	217 (=139+43+3+32)

3.4 listing by seg False // separate envs

desired: two rows

	type F	token F
[sɑ] syllable	3	174 (=139+3+32)
[ʃɑ] syllable	1	43 ('shashi')

//

Passed Tests 1, 2, 3.1, 3.2, and 3.3 Need more work to pass test 3.3, i.e., syllable search with the listing by seg set as False and [sa] and [ʃa] in a single env.

stannam · 2022-05-09T10:09:45Z

@kchall could you synced to the latest codes and see if problems are all fixed?

re: documentation build failing (#801 (comment))
Readthedocs automatically imports the latest version of Sphinx, which is a package for rendering rst files to html. But that latest version seems buggy, making the builds fail (sometimes, the newest might not be the best one). So I forced the version that worked for PCT previously.

re: fixing the PS summary function
I think I fixed the summary function for good. I tested various use case scenarios, including the negative search and syllable search (a lengthy comment is here ). Can you check if everything is correct on your end?

There are issues that have been unnoticed because they didn't raise errors (raised warnings). In the API reference part, couple of subsections are missing for example: https://corpustools.readthedocs.io/en/v1.5.0/apireference.html#textgrids https://corpustools.readthedocs.io/en/v1.5.0/apireference.html#functional-load That parts were not updated when the corresponding codes were changed, causing 'warnings'. Well, the warnings have now evolved (?) into errors and newer version of sphinx just refuses to build any new commits. This was the cause of the problem mentioned in this comment #801 (comment)

kchall · 2022-05-16T21:28:19Z

I can confirm that all of the above seems to be working correctly now on my end as well! With the (exception?) that in part 3 above, "[sɑ] [ʃɑ] Syllable search," I am able to do a negative search as long as I have the environments combined into a single set and don't try to separate out the results. My syllabified example corpus doesn't have [ga.ga] in it, but the results for the search do seem to be correct for my corpus, i.e., with type frequency of 11 and token of 362 for "words that do not contain either [sɑ] or [ʃɑ]." So, I think it is all working correctly...yay!

stannam · 2022-05-19T00:32:41Z

I think I was wrong in Part 3 (syllable search) about the negative search option.

The documentation does not exclude the option to do a negative search with syllables, and I can also confirm that the negative search works correctly. However, selecting the 'syllables' button wrongly greys out 'negative.' The option reopens when 'New environment' is clicked.

I think this is a very strange behaviour, so I'll disable it. After that, new executables will be created. yay!

/

Notes to (future) myself: why that behaviour?

CorpusTools/corpustools/gui/psgui.py

Lines 694 to 696 in adfb9a7

    
           if btn.isChecked():     # sylMode is checked 
        
               self.mode = 'sylMode' 
        
               self.resultTypeGroup.widgets[1].setEnabled(False)  # disable 'negative search' in result type

Out of concern that any changes would not end well, I dug into my notes trying to find why I added the codes in the first place.

As shown above, I did intentionally disable neg syllables searches in 2021, but that edit should have been reverted already. Disabling neg + syll was a solution to this problem #746 (comment) (part 4) where the target and environment columns became empty in the negative syllable search results. I think I just removed the problem as a makeshift (how ostrich!). When issue #746 was fixed, the codes above were not needed and I should have deleted them. Instead, I added more codes that reenact the disabled option 🤦‍♂️. So when changing from the seg mode to syllable mode, the negative option got disabled -> enabled -> and then disabled again (with more than 1 env)..

re: #801 ( #801 (comment) )

stannam mentioned this issue Apr 1, 2022

After v.1.5.1 #800

Open

18 tasks

stannam added a commit that referenced this issue Apr 5, 2022

[PS] separate summary result rows by target segments

48db2e0

re: issue #801

stannam added a commit that referenced this issue Apr 5, 2022

[PS] wording #801

a8e208a

stannam added a commit that referenced this issue Apr 5, 2022

[PS] 'summarize by seg' and 'neg' can't go together!

19c9879

GUI changes. #801 (neg/pos) search type overrides the present summary option.

stannam added a commit that referenced this issue Apr 20, 2022

[PS] summarize skips a word if (seg,env) pair duplicates

3f1b04d

re: issue #801

stannam added a commit that referenced this issue Apr 25, 2022

Additional fix for #801

5388a42

This should fix the issue that Kathleen reported.

stannam referenced this issue May 5, 2022

[PS] summary result option wording

654c832

added "and environments" Also inserted a linebreak

stannam added a commit that referenced this issue May 9, 2022

[PS] summary results re: #801

b05c9de

Passed Tests 1, 2, 3.1, 3.2, and 3.3 Need more work to pass test 3.3, i.e., syllable search with the listing by seg set as False and [sa] and [ʃa] in a single env.

stannam added a commit that referenced this issue May 19, 2022

negative search radio button

845a0f9

re: #801 ( #801 (comment) )

stannam closed this as completed Jun 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phonological search: Add option to list up target segments separately in summary results #801

Phonological search: Add option to list up target segments separately in summary results #801

stannam commented Apr 1, 2022 •

edited

Loading

kchall commented Apr 6, 2022

stannam commented Apr 19, 2022

stannam commented Apr 20, 2022 •

edited

Loading

stannam commented Apr 27, 2022

kchall commented May 2, 2022 •

edited

Loading

kchall commented May 2, 2022

kchall commented May 2, 2022

stannam commented May 5, 2022 •

edited

Loading

kchall commented May 5, 2022

kchall commented May 5, 2022

stannam commented May 8, 2022 •

edited

Loading

stannam commented May 9, 2022

kchall commented May 16, 2022 •

edited

Loading

stannam commented May 19, 2022 •

edited

Loading

Phonological search: Add option to list up target segments separately in summary results #801

Phonological search: Add option to list up target segments separately in summary results #801

Comments

stannam commented Apr 1, 2022 • edited Loading

kchall commented Apr 6, 2022

stannam commented Apr 19, 2022

stannam commented Apr 20, 2022 • edited Loading

stannam commented Apr 27, 2022

kchall commented May 2, 2022 • edited Loading

kchall commented May 2, 2022

kchall commented May 2, 2022

stannam commented May 5, 2022 • edited Loading

kchall commented May 5, 2022

kchall commented May 5, 2022

stannam commented May 8, 2022 • edited Loading

stannam commented May 9, 2022

kchall commented May 16, 2022 • edited Loading

stannam commented May 19, 2022 • edited Loading

stannam commented Apr 1, 2022 •

edited

Loading

stannam commented Apr 20, 2022 •

edited

Loading

kchall commented May 2, 2022 •

edited

Loading

stannam commented May 5, 2022 •

edited

Loading

stannam commented May 8, 2022 •

edited

Loading

kchall commented May 16, 2022 •

edited

Loading

stannam commented May 19, 2022 •

edited

Loading