Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for header row #73

Merged
merged 20 commits into from
Oct 6, 2022
Merged

check for header row #73

merged 20 commits into from
Oct 6, 2022

Conversation

DerMoehre
Copy link
Contributor

I used your code to implement a small test for a header row
Also updated the code in transcriptprocessor to check for the header row and jump to the next row.
Because sometimes, there are no rows following, the next() has also to be in a try:except block

Copy link
Collaborator

@chrisbrickhouse chrisbrickhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good! One small point: test files should be named after the module they test so that it's easier to track down where the failures are. If test_readFile.py fails, we need to go into the test to figure out what it's testing before we can go fix it. If test_transcriptprocessor.py fails when we know exactly where the issue is: fave/align/transcriptprocessor.py. So test_readFile.py should be added to test_transcriptprocessor.py instead of its own file.

Besides that top-level point, my code comments cover two main points:

  1. You seem to have a good grasp of the testing basics, so I've given you some feedback on how to improve your test writing so that you can write them more easily in the future. You've duplicated a lot of code, but with a few tricks you can actually simplify a lot of what you need to write.
  2. Your changes to transcriptprocessor.py are good, and my main complaint is how the try-except blocks are scoped. Except blocks can cause a lot of problems if they catch exceptions we didn't predict, so a few comments on how to improve that.

lines = self.replace_smart_quotes(f.readlines())
self.lines = lines
try:
lines = self.replace_smart_quotes(f.readlines())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this outside try block. If there is a ValueError in replace_smart_quotes, we don't want to catch that.

@@ -33,6 +33,7 @@
import re
import sys
import os
import csv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're not using CSV, so we can remove this import.

float(lines[0].split('\t')[2])
except ValueError:
# Log a warning about having detected a header row
print("Header row was detected")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use python's logging library instead of print because the logger gives users more control over how much information the program puts out. This would probably be a "warning" level because things are still working, but the program is making an assumption (a header row exists) that might lead to errors later if the assumption is wrong. You can read more about what the log levels mean in this article.

The logger is accessed using self.logger and using this format, self.logger.warning('Message text'), you can convert the print statement to a log entry.

Copy link
Contributor Author

@DerMoehre DerMoehre Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, never used this :-)

Comment on lines 256 to 262
# check, if there is a next line
try:
# jump to the next line of the file
next(f)
except:
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. This approach is too cautious. Try-except should be used when we expect a specific situation could arise and want to handle that case. If we cannot handle that situation, we should let the function exit with the error code and let the parent function deal with it (and so on up the chain until it is handled or the program exits with an error). If something goes wrong with next() we want that to error to "bubble up" to the top of the program. See exception handling best practices 4 and 9.
  2. You don't want to use next(f) here. You can see in line 249/251 that we've already read the file into memory as lines so we should be working with that instead of the file. I'm realizing now that next() is probably the wrong approach. This function sets self.lines and it should be a list. If we use next() the type will be an iterator and not a list. So we should just remove the first entry from the list, del lines[0] should do the trick.

Copy link
Contributor Author

@DerMoehre DerMoehre Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that would also make it easier to check, if there is another line. This is why I added the extra try:except block, because I ran in some errors because there were no next lines

edit: ah now I understand. We dont have to check for another line, if we just delete the first row, right? :-)

Comment on lines 32 to 58
for test_case in provide_value_error_file():
test_text = test_case[0]
flags = test_case[1]
expected = test_case[2]
tmp_file.write_text(test_text)
tp_obj = transcriptprocessor.TranscriptProcessor(
tmp_file,
cmu_dict,
**flags
)
tp_obj.read_transcription_file()

assert tp_obj.lines == expected

for test_case in provide_no_error_file():
test_text = test_case[0]
flags = test_case[1]
expected = test_case[2]
tmp_file.write_text(test_text)
tp_obj = transcriptprocessor.TranscriptProcessor(
tmp_file,
cmu_dict,
**flags
)
tp_obj.read_transcription_file()

assert tp_obj.lines == expected
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can (and should) combine these. Notice that these loops iterate over the return of the provider functions, and notice that the provider functions return a list of lists. If you add new test cases to the list of lists returned by the provider function, you only need one of these loops. This is known as the data provider pattern.

Comment on lines 61 to 86
def provide_value_error_file():
return [
[
"Style\tSpeaker\tBeginning\tEnd\tDuration\nFoo\tBar\t0.0\t3.2\t3.2",
{
'prompt': "IDK what this is -CJB",
'check' : '',
'verbose': logging.DEBUG
},
['Style\tSpeaker\tBeginning\tEnd\tDuration\n', 'Foo\tBar\t0.0\t3.2\t3.2']
]
]

# this does not raise a ValueError
def provide_no_error_file():
return [
[
"Foo\tBar\t0.0\t3.2\t3.2\nTest\t1.0\4.5\t3.5",
{
'prompt': "IDK what this is -CJB",
'check' : '',
'verbose': logging.DEBUG
},
['Foo\tBar\t0.0\t3.2\t3.2\n', 'Test\t1.0\4.5\t3.5']
]
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on the above comment, notice that the parameters for the test are a list:

[
  "foo\tbar...\n...4.5\t3.5", # test_text
  {
    ... # flags
  },
 ['foo...', 'Test...'] # expected
]

Meanwhile, that list is an entry in a list returned by the function, and each element of that list of lists is iterated over and tested. So instead of having multiple providers for a test, we can just add new test cases as entries in the list:

def provide_foo():
    return [
        [ 'Test case 1'],
        [ 'Test case 2'],
        [ 'Test case 3']
    ]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :) I will read your comments and try to learn from them.

@DerMoehre
Copy link
Contributor Author

OKay, I hope everything went right.
Something was off, when I tried to git push the directory for the pull request.

In the end it worked, but maybe it added to much files.

Copy link
Collaborator

@chrisbrickhouse chrisbrickhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A few typos, but once those are fixed I can merge.

['Foo\tBar\t0.0\t3.2\t3.2']
],
[ # test with one line
"Foo\tBar\t0.0\t3.2\t3.2\nTest\t1.0\4.5\t3.5",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot a tab: 1.0\4.5 should be 1.0\t4.5. Same typo in the expected field (which is why the test passes) and in the following test cases.

},
['Foo\tBar\t0.0\t3.2\t3.2\n', 'Test\t1.0\4.5\t3.5']
],
[ # test with more lines
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. Happens twice in this line and the expected value.

@chrisbrickhouse chrisbrickhouse merged commit ce4ee71 into JoFrhwld:dev Oct 6, 2022
Copy link
Collaborator

@chrisbrickhouse chrisbrickhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, merged

chrisbrickhouse added a commit to chrisbrickhouse/FAVE that referenced this pull request Mar 13, 2024
Resolves JoFrhwld#65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously.

Squashed commit of DerMoehre's PR JoFrhwld#73

Co-authored-by: JoFrhwld <[email protected]>
Co-authored-by: Christian Brickhouse <[email protected]>
chrisbrickhouse added a commit that referenced this pull request Mar 29, 2024
Resolves #65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously.

Squashed commit of DerMoehre's PR #73

Co-authored-by: JoFrhwld <[email protected]>
Co-authored-by: Christian Brickhouse <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants