Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignment fails if the transcript file has a header row #65

Closed
Tracked by #62
chrisbrickhouse opened this issue Aug 29, 2022 · 9 comments
Closed
Tracked by #62

Alignment fails if the transcript file has a header row #65

chrisbrickhouse opened this issue Aug 29, 2022 · 9 comments
Labels
bug Something broke good first issue Easy to do and suitable for new developers
Milestone

Comments

@chrisbrickhouse
Copy link
Collaborator

At line 175 in aligner.align() the program loops through the lines of the transcript TSV. If the file has a header row, the alignment fails at line 192 because the headers (strings) cannot be coerced into time stamps (floats).

  • This failure mode should be avoided if possible. We should detect whether a header row is present, and skip it if so. This check-and-skip functionality should probably be added in TranscriptProcessor, and might already be built-in to the csv reader.
  • Logging and error reporting could be improved here. Currently the error message says that something couldn't be converted to a float, but where and what it is are not reported. FAVE should report the line and contents which would make identification of transcript problems easier.
@JoFrhwld
Copy link
Owner

issue

@chrisbrickhouse chrisbrickhouse added this to the Version 2.1 milestone Sep 23, 2022
@chrisbrickhouse chrisbrickhouse added bug Something broke good first issue Easy to do and suitable for new developers labels Sep 23, 2022
@DerMoehre
Copy link
Contributor

DerMoehre commented Sep 30, 2022

In the csv, there is a Sniffer class with the function has_header implemented.
It returns TRUE, if the first column seems to be a column

https://docs.python.org/3/library/csv.html

I would like to contribute to this issue, although I dont know where to start :D I never worked with such a big project. Could you give me some guidance?

@chrisbrickhouse
Copy link
Collaborator Author

First, thanks, that documentation link alone is a great help! I'd be happy to orient you to the code.

The program is divided into two main parts: FAVE-align and FAVE-extract. These correspond to two subfolders of the fave director: align and extract. This issue corresponds to align, which has a number of python files inside. Each file contains one class (or, at least, should contain only one class, I think there are a couple that don't follow that convention) which has the same name as the file. FAVE-align is mainly implemented by the Aligner class specified in aligner.py. The aligner takes a transcript file, technically a TSV (tab seperated value) file which is processed by the TranscriptProcessor class in fave/align/transcriptprocessor.py. Together, these two files are the source of the problem in this issue.

We can trace back from the error to the transcript which should give you a sense of what functions will need modified.

  1. A TranscriptProcessor class instance is stored as Aligner.transcript
  2. Aligner.read_transcription_file() calls TranscriptProcessor.read_transcription_file() which reads the file into memory and stores a list of strings in TranscriptProcessor.lines which can be accessed through Aligner.transcript.lines (see 1)
  3. TranscriptProcessor.check_transcription_file() populates TranscriptProcessor.trans_lines attribute as a list of strings.
  4. Aligner.align() at line 175 defines a loop over these lines (and some other stuff). The first item of this loop is sometimes a header row.
  5. When the item is a header row, Aligner.align() fails at line 192 because it tries to cast a string (the header) to a floating point number which is impossible. The program crashes.

The most immediate problem you will face is that we do not use the csv library anywhere in this process (I thought we did). You can either:

  • Modify TranscriptProcessor.read_transcription_file() to use csv instead of our current f.readlines() approach. Best long-term plan, but might require some changes to functions outside the traceback I gave.
  • Keep our non-csv approach and modify Aligner.align() so that it does its own header check. Easier short-term fix, but will be harder to maintain in the future.

I'd suggest implementing the csv library, but either would work as a bug fix.

@DerMoehre
Copy link
Contributor

DerMoehre commented Oct 1, 2022

Wow! Thank you for this detailled description. I will try to get the structure and see, how I can implement an improvement.

Is there a chance to test this before? Or should I just make some csv files and test it with the dummy files?

I played a bit around with has_header(). To identify the header, the datatypes should be different in the header and the following rows. Is this also in your csv? If so, this should not be a problem :)

@DerMoehre
Copy link
Contributor

Maybe we can fix this error with one of the following:

  • We add a try:except block at the position, where the division is made. If the error occurs, the line will be skipped
  • We start with adding the csv. The places I identified would be in the transscriptprocessor at line 246. Maybe it would be enough to check there for header, and give this information somehow to align

The thing I don't know is, how to start in such a large project :D I can add some line of code here and there, but how could I test this? Could you help me with this? :-)

@chrisbrickhouse
Copy link
Collaborator Author

chrisbrickhouse commented Oct 1, 2022

Is there a chance to test this before?

We use pytest (tutorial) to test our code and you can find them in the tests directory. Inside the project directory, if you run python -m pytest tests/ then you should be able to test all of the code on your own computer; it's the same command we run in our workflow, so if the tests pass on your machine they will probably pass on github.

Since this issue involves file read-write, you will want to look at test_cmudictionary.py in the dev branch which also deals with file read-write functions. You'll notice that it's a little sloppy and somewhat complex. I will write some tests for this issue which you can use, so don't feel like you need to figure out the testing suite alongside the project itself.

Looking at test_cmudictionary.py you'll see that the function definitions take an argument: tmp_path. This lets pytest know that we will be using a temporary folder to read and write files in this test. Using that tmp_path object, we create a temporary file and then write the testing text. Then we call the function with the temporary file and see if everything works as intended.

To identify the header, the datatypes should be different in the header and the following rows

Yes, this is the case. Headers will be strings, the data will be a mix of floating point numbers and strings.

We add a try:except block at the position, where the division is made. If the error occurs, the line will be skipped

Possible, but risky. We get a ValueError which is a really common error type in python. If an error occurs for some other reason, we risk more bugs. For example, we're focused on headers right now, but what if a user accidentally puts data in the wrong column? A try-except block will skip over it, and the program will go on like nothing is wrong when, in fact, there's an error that the user needs to fix. So we can wind up skipping more than just headers.

Maybe it would be enough to check there for header, and give this information somehow to align

I think this is on the right track, but instead of passing that information along, could TranscriptProcessor.read_transcription_file() just remove the header row? I don't think any other functions expect a header row. It's just there to make the document more human readable. If I remember correctly, I think using next() on a csv.reader object skips the first line which might be easier than passing along a variable or checking an attribute.

chrisbrickhouse added a commit to chrisbrickhouse/FAVE that referenced this issue Oct 1, 2022
Adds test for JoFrhwld#65 also adds more code debt for turning the
CMU dictionary excerpt into a fixture.
@DerMoehre
Copy link
Contributor

Hi Chris,
thank you again, for your detailed information :)
I played a bit around with the pytest and tried to write a small test for the read_transcription_file(). I could pass the example csv file, but it stopped because of the missing cmudictionary.
I think, I understand the code a bit better now, so still some progress ;-)

my idea to fix the header alignment right now is:

  • import csv and use it only in the transcriptprocessor.py in the function read_transcription_file()
  • with the opened file, it will check for header
  • if it has a header, another csv will be opened and all but the first line will be written in this file and saved
  • after that, it will override the old csv with the new
  • with the new csv, the original code will run through
    (it worked for a local csv file on my computer)

I will try to read a bit more into pytest, to understand better what is going on :-D

@chrisbrickhouse
Copy link
Collaborator Author

I could pass the example csv file, but it stopped because of the missing cmudictionary.

Yeah, fixing how we test code which relies on the cmudictionary is on my to-do list. I wrote a similar test over on my branch for this issue which will show you how to incorporate the cmu dictionary into the test. My current solution is to copy and paste some code from the cmudictionary tests, but that's not the best long term solution.

if it has a header, another csv will be opened and all but the first line will be written in this file and saved

This will work, but remember that file read-write can be very slow. Our users work with data sets that can be very large, and this would effectively double the run time.

I'm just thinking of this now, but maybe we don't need to implement the csv library. If all it does is look at the data types of the 2nd and later rows to see if they match the first, maybe we can just implement our own test? The 3rd element of a line will always be a float, so in read_transcription_file() we could just test to see if float(lines[0].split('\t')[2]) raises an error. This is similar to your idea of using a try-except block, but instead of putting it in Aligner.align(), we move it earlier in the chain. Something like:

def read_transcription_file(self):
    ...
    try:
        float(lines[0].split('\t')[2])
    except ValueError:
        # Log a warning about having detected a header row
        next(lines) # skips header row, see https://stackoverflow.com/questions/4796764
    self.lines = lines

after that, it will override the old csv with the new

Generally, we should not overwrite files without a user's explicit permission. If they don't have a backup, we could wind up causing data loss, and we should avoid creating that situation.

@DerMoehre
Copy link
Contributor

okay this is great :)
I will try to implement the code of yours into the file and see, how it works :)

The override would be neccessary to make the new file (without the header) the old one. But you are right, I also did not think about the extra time it will take.

chrisbrickhouse added a commit that referenced this issue Oct 6, 2022
Resolves #65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously.

Squashed commit of DerMoehre's PR #73

Co-authored-by: JoFrhwld <[email protected]>
Co-authored-by: Christian Brickhouse <[email protected]>
chrisbrickhouse added a commit to chrisbrickhouse/FAVE that referenced this issue Mar 13, 2024
Resolves JoFrhwld#65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously.

Squashed commit of DerMoehre's PR JoFrhwld#73

Co-authored-by: JoFrhwld <[email protected]>
Co-authored-by: Christian Brickhouse <[email protected]>
chrisbrickhouse added a commit that referenced this issue Mar 29, 2024
Resolves #65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously.

Squashed commit of DerMoehre's PR #73

Co-authored-by: JoFrhwld <[email protected]>
Co-authored-by: Christian Brickhouse <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something broke good first issue Easy to do and suitable for new developers
Projects
None yet
Development

No branches or pull requests

3 participants