Skip to content
This repository was archived by the owner on Apr 9, 2022. It is now read-only.

fix bug when seqIDs are numbers. #1

Merged
merged 7 commits into from
Nov 28, 2016
Merged

fix bug when seqIDs are numbers. #1

merged 7 commits into from
Nov 28, 2016

Conversation

cduvallet
Copy link
Collaborator

When sequence IDs are numbers, pandas reads index as int64 dtype (even if you specify dtype in read_table() - index_col=0 overrides it and the index is read as int64 no matter what). fasta sequence IDs are always read as strings, and so all calls to "seq_id in seq_table.index" returns False (e.g. line 129: missing_ids = [seq_id for seq_id in self.seq_abunds.index if seq_id not in self.records]).

Looks like this is still an open issue: pandas-dev/pandas#9435

When sequence IDs are numbers, pandas reads index as int64 dtype (even
if you specify dtype - index_col=0 overrides specification and index is read
as int64 no matter what). Open issue: pandas-dev/pandas#9435
@cduvallet
Copy link
Collaborator Author

Also imported python3 print function explicitly, for python2 compatibility.

@swo
Copy link
Owner

swo commented Nov 23, 2016

re: Python 2 compatibility. Sounds good. Confirm that this is the only problem with running in Python 2?

re: sequence IDs as integers. Also sounds good, but I'd like to see a unit test for it before merging the pull request.

@cduvallet
Copy link
Collaborator Author

re: python2 compatibility - confirmed! All tests pass, and running dbotu.py on real data seems to work fine.

@swo
Copy link
Owner

swo commented Nov 27, 2016

One last request: Don't include the table_test.counts and .fasta files, instead make them into strings that you "read" in via StringIO. (test_dbotu line 10 gives an example).

The thing with pandas index columns is funny. I think your solution is good except for some weird edge cases (e.g., if the ID is "001" then the string is "001" but the int is 1 and the int-to-string conversion will give "1", which doesn't match the original "001"). Maybe write a test for that and then we can put it as a caveat/warning in the docs? I think it's reasonable to ask users to edit their IDs if they fall into one of these weird edge cases.

@cduvallet
Copy link
Collaborator Author

re: files - done.

re: edge case - I changed the way the sequence table is read in dbotu.py to handle this case as well. It's not very pretty, but it should work. I also changed the test I wrote to read in a sequence table with these kinds of sequence IDs.

@swo swo merged commit c10f608 into swo:master Nov 28, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants